Predicting Protein Structures and Structural Features on a Genomic Scale Pierre Baldi School of Information and Computer Sciences Institute for Genomics and Bioinformatics University of California, Irvine

1
Predicting Protein Structures and Structural Features on a Genomic Scale
Pierre Baldi
School of Information and Computer Sciences
Institute for Genomics and Bioinformatics
University of California, Irvine
2
UNDERSTANDING INTELLIGENCE

Human intelligence (inverse problem)
AI (direct problem)
Choice of specific problems is key
Protein structure prediction is a good problem

3
PROTEINS

[Diagram: polypeptide backbone chain with atom labels N, Ca, Cß and side chains R1, R2, R3]

4
(No Transcript)
5
Utility of Structural Information
(Baker and Sali, 2001)
6
CAVEAT
7
REMARKS

Structure/Folding
Backbone/Full Atom
Homology Modeling
Fold Recognition (Threading)
Ab Initio (Physical Potentials/Molecular
Dynamics, Statistical Mechanics/Lattice Models)
Statistical/Machine Learning (Training Sets, SS
prediction)
Mixtures: ab initio with statistical potentials,
machine learning with profiles, etc.

8
PROTEIN STRUCTURE PREDICTION (ab initio)
9
(No Transcript)
10
Helices

1GRJ (GreA Transcript Cleavage Factor from
Escherichia coli)

11
Antiparallel β-sheets

1MSC (Bacteriophage MS2 Unassembled Coat Protein
Dimer)

12
Parallel β-sheets

1FUE (Flavodoxin)

13
Contact map
14
Secondary structure prediction
15
GRAPHICAL MODELS: BAYESIAN NETWORKS

$X_1, \dots, X_n$: random variables associated with the vertices of a DAG (Directed Acyclic Graph).
The local conditional distributions $P(X_i \mid X_j : j \text{ parent of } i)$ are the parameters of the model. They can be represented by look-up tables (costly) or other more compact parameterizations (Sigmoidal Belief Networks, XOR, etc.).
The global distribution is the product of the local characteristics: $P(X_1, \dots, X_n) = \prod_i P(X_i \mid X_j : j \text{ parent of } i)$
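
The factorization above can be illustrated with a minimal sketch. The three binary variables A, B, C, their DAG, and the look-up table values below are invented purely for illustration; they do not come from the slides.

```python
from itertools import product

# Hypothetical DAG: A -> B, A -> C, B -> C (variable names are illustrative).
parents = {"A": (), "B": ("A",), "C": ("A", "B")}

# Look-up tables: cpt[var][parent values] = P(var = 1 | parents).
cpt = {
    "A": {(): 0.3},
    "B": {(0,): 0.2, (1,): 0.7},
    "C": {(0, 0): 0.1, (0, 1): 0.5, (1, 0): 0.4, (1, 1): 0.9},
}

def joint(assignment):
    """Global probability as the product of the local conditional distributions."""
    p = 1.0
    for var, pa in parents.items():
        p_one = cpt[var][tuple(assignment[q] for q in pa)]
        p *= p_one if assignment[var] == 1 else 1.0 - p_one
    return p

# Summing the joint over all 2^3 assignments should give 1.
total = sum(
    joint(dict(zip(parents, values)))
    for values in product((0, 1), repeat=len(parents))
)
```

The look-up table for C already needs four entries for two parents, which hints at why tables become costly as the number of parents grows.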

16
(No Transcript)
17
(No Transcript)
18
(No Transcript)
19
(No Transcript)
20
(No Transcript)
21
DATA PREPARATION

Starting point: PDB database.
Remove sequences not determined by X-ray diffraction.
Remove sequences where DSSP crashes.
Remove proteins with physical chain breaks (neighboring AA having distances exceeding 4 Angstroms).
Remove sequences with resolution worse than 2.5 Angstroms.
Remove chains with fewer than 30 AA.
Remove redundancy (Hobohm's algorithm, Smith-Waterman, PAM 120, etc.).
Build multiple alignments (BLAST, PSI-BLAST, etc.).
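
The per-chain filters listed above can be sketched as a single predicate. The record layout (keys `method`, `dssp_ok`, `max_neighbor_distance`, `resolution`, `sequence`) is a hypothetical one chosen for this example; redundancy removal and alignment building are separate steps and are not shown.

```python
def keep_chain(chain, max_resolution=2.5, min_length=30, max_gap=4.0):
    """Return True if a chain record passes the per-chain filters from the slide.

    `chain` is a dict with hypothetical keys (not a real PDB parser output):
      method                - experimental method string
      dssp_ok               - True if DSSP ran without crashing
      max_neighbor_distance - largest distance between neighboring residues, in Angstroms
      resolution            - resolution in Angstroms
      sequence              - amino acid sequence string
    """
    return (
        chain["method"] == "X-RAY DIFFRACTION"
        and chain["dssp_ok"]
        and chain["max_neighbor_distance"] <= max_gap   # no physical chain break
        and chain["resolution"] <= max_resolution       # not worse than 2.5 A
        and len(chain["sequence"]) >= min_length        # at least 30 AA
    )

example = {
    "method": "X-RAY DIFFRACTION",
    "dssp_ok": True,
    "max_neighbor_distance": 3.8,
    "resolution": 1.9,
    "sequence": "A" * 45,
}
kept = [c for c in [example] if keep_chain(c)]
```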

22
SECONDARY STRUCTURE PROGRAMS

DSSP (Kabsch and Sander, 1983): works by assigning potential backbone hydrogen bonds (based on the 3D coordinates of the backbone atoms) and subsequently identifying repetitive bonding patterns.
STRIDE (Frishman and Argos, 1995): uses dihedral angles in addition to hydrogen bonds.
DEFINE (Richards and Kundrot, 1988): uses difference distance matrices to evaluate the match of interatomic distances in the protein to those from idealized SS.

23
SECONDARY STRUCTURE ASSIGNMENTS

DSSP classes
H: alpha helix
E: sheet
G: 3-10 helix
S: kind of turn
T: beta turn
B: beta bridge
I: pi-helix (very rare)
C: the rest
CASP (harder) assignment
α: H and G
β: E and B
coil: the rest
Alternative assignment
α: H
β: B
coil: the rest
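
The two reductions from eight DSSP classes to three can be sketched as simple look-up tables. The output letters H, E, and C for the α, β, and remaining classes are a notational choice made for this example.

```python
# Three-class reductions of the DSSP alphabet, following the slide.
# Output letters H/E/C (alpha / beta / rest) are a convention assumed here.
CASP_ASSIGNMENT = {"H": "H", "G": "H", "E": "E", "B": "E"}
ALTERNATIVE_ASSIGNMENT = {"H": "H", "B": "E"}

def reduce_dssp(dssp_string, assignment=CASP_ASSIGNMENT):
    """Map each DSSP symbol to a three-class symbol; anything unmapped becomes C."""
    return "".join(assignment.get(symbol, "C") for symbol in dssp_string)

casp = reduce_dssp("HGEBSTIC")
alternative = reduce_dssp("HGEBSTIC", ALTERNATIVE_ASSIGNMENT)
```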

24
ENSEMBLES
25
(No Transcript)
26
(No Transcript)
27
(No Transcript)
28
FUNDAMENTAL LIMITATIONS

100% CORRECT RECOGNITION IS PROBABLY IMPOSSIBLE FOR SEVERAL REASONS:
SOME PROTEINS DO NOT FOLD SPONTANEOUSLY OR MAY NEED CHAPERONES
QUATERNARY STRUCTURE: BETA-STRAND PARTNERS MAY BE ON A DIFFERENT CHAIN
STRUCTURE MAY DEPEND ON OTHER VARIABLES: ENVIRONMENT, PH
DYNAMICAL ASPECTS
FUZZINESS OF DEFINITIONS AND ERRORS IN DATABASES

29
(No Transcript)
30
(No Transcript)
31
BB-RNNs
32
2D RNNs
33
2D INPUTS

AA at positions i and j
Profiles at positions i and j
Correlated profiles at positions i and j
Secondary Structure, Accessibility, etc.

34
(No Transcript)
35
PERFORMANCE (%)
               6Å     8Å     10Å    12Å
non-contacts   99.9   99.8   99.2   98.9
contacts       71.2   65.3   52.2   46.6
all            98.5   97.1   93.2   88.5
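
The column headings above are distance thresholds for calling two residues "in contact". As a minimal sketch, assuming contacts are defined by the distance between one representative atom per residue (the slides do not say which atom is used), a contact map can be computed as follows.

```python
import math

def contact_map(coords, cutoff=8.0):
    """Binary contact map: entry (i, j) is 1 if residues i and j are closer than cutoff.

    coords holds one (x, y, z) point per residue, in Angstroms. Which atom
    represents a residue is an assumption left to the caller.
    The diagonal is set to 0 by convention in this sketch.
    """
    n = len(coords)
    return [
        [1 if i != j and math.dist(coords[i], coords[j]) < cutoff else 0
         for j in range(n)]
        for i in range(n)
    ]

# Four points spaced 3.8 Angstroms apart along a line (illustrative coordinates).
toy_coords = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (11.4, 0.0, 0.0)]
map_8 = contact_map(toy_coords, cutoff=8.0)
map_6 = contact_map(toy_coords, cutoff=6.0)
```

Larger thresholds mark more pairs as contacts, so the four columns of the table correspond to four different target maps for the same structure.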
36
Protein Reconstruction
Using predicted secondary structure and predicted contact map
PDB ID: 1HCR, chain A
Sequence: GRPRAINKHEQEQISRLLEKGHPRQQLAIIFGIGVSTLYRYFPASSIKKRMN
True SS:  CCCCCCCCHHHHHHHHHHHCCCCHHHHHHHCECCHHHHHHHCCCCCCCCCCC
Pred SS:  CCCCCCCHHHHHHHHHHHHCCCCHHHHEEHECHHHHHHHHCCCHHHHHHHCC
PDB ID 1HCR Chain A (52 residues)
Model 147, RMSD 3.47 Å
37
Protein Reconstruction
Using predicted secondary structure and predicted contact map
PDB ID: 1BC8, chain C
Sequence: MDSAITLWQFLLQLLQKPQNKHMICWTSNDGQFKLLQAEEVARLWGIRKNKPNMNYDKLSRALRYYYVKNIIKKVNGQKFVYKFVSYPEILNM
True SS:  CCCCCCHHHHHHHHCCCHHHCCCCEECCCCCEEECCCHHHHHHHHHHHHCCCCCCHHHHHHHHHHHHHHCCEEECCCCCCEEEECCCCHHHCC
Pred SS:  CCCHHHHHHHHHHHHHCCCCCCEEEEECCCEEEEECCHHHHHHHHHHHCCCCCCCHHHHHHHHHHHHHCCCEEECCCCEEEEEEECCHHHHCC
PDB ID 1BC8 Chain C (93 residues)
Model 1714, RMSD 4.21 Å
38
CASP6 Self Assessment
Evaluation based on GDT_TS of first submitted model
GDT_TS: Global Distance Test Total Score
GDT_TS = (GDT_P1 + GDT_P2 + GDT_P4 + GDT_P8) / 4
Pn: percentage of residues under distance cutoff n
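
The GDT_TS formula can be sketched as below. This is a simplification: it assumes the model has already been superimposed on the native structure once and that per-residue distances (in Å) are given, whereas the full GDT procedure searches over superpositions for each cutoff.

```python
def gdt_ts(distances, cutoffs=(1.0, 2.0, 4.0, 8.0)):
    """Average, over the cutoffs, of the percentage of residues within each cutoff.

    distances: per-residue model-to-native distances in Angstroms, computed
    after a single fixed superposition (a simplifying assumption of this sketch).
    """
    n = len(distances)
    percentages = [
        100.0 * sum(d <= cutoff for d in distances) / n for cutoff in cutoffs
    ]
    return sum(percentages) / len(cutoffs)

# Five illustrative residue distances: 20%, 40%, 60%, 80% fall within 1, 2, 4, 8 A.
score = gdt_ts([0.5, 1.5, 3.0, 6.0, 10.0])
```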
39
Hard Target Summary

Top 10 groups displayed, of 65 registered servers
Assessment on 25 new fold and fold recognition
analogous target domains

N: number of targets predicted
Av.R.: average rank
sumZ: sum of Z scores on all targets in set
sumZpos: sum of Z scores for predictions with positive Z score

group                N    Av.R.   sumZ    sumZpos
BAKER-ROBETTA        25    9.12   27.81   27.94
baldi-group-server   24   10.04   20.47   22.33
Rokky                25   11.60   17.56   18.46
Pmodeller5           20   12.55   14.35   16.11
ZHOUSPARKS2          25   15.00   12.67   15.91
ACE                  25   13.08   11.63   14.89
Pcomb2               24   16.04   10.82   13.58
RAPTOR               24   16.33    9.81   13.21
zhousp3              25   15.72    9.30   12.90
PROTINFO-AB          19   16.74    8.62   12.59
40
Hard Target Summary

Top 10 groups displayed, of 65 registered servers
Assessment on 19 new fold and fold recognition
analogous target domains less than 120 residues

N: number of targets predicted
Av.R.: average rank
sumZ: sum of Z scores on all targets in set
sumZpos: sum of Z scores for predictions with positive Z score

group                N    Av.R.   sumZ    sumZpos
baldi-group-server   19    6.74   20.61   20.61
BAKER-ROBETTA        19    9.11   20.44   20.57
Rokky                19   12.11   12.30   13.20
PROTINFO-AB          16   12.63   11.53   12.59
ZHOUSPARKS2          19   15.32    8.91   11.87
Pcomb2               18   15.39    9.54   11.48
Pmodeller5           15   14.47    9.25   11.00
PROTINFO             18   16.22    8.66   10.56
ACE                  19   14.21    6.98   10.24
RAPTOR               18   17.00    7.22    9.64
41
Target T0281: Detailed Target Analysis

Target Information
Length: 70 amino acids
Resolution: 1.52 Å
PDB code: 1WHZ
Description: Hypothetical Protein From Thermus Thermophilus Hb8
Domains: single domain

Assessment
GDT_TS server rank of our 1st model: 2
GDT_TS: 51.07
RMSD to native: 6.15

42
Target T0281: Contact Map Comparison
Note: true map is lower left
True Map vs. Predicted Map
True Map vs. Recovered Map
43
Target T0281: Structure Comparison
true structure
predicted structure
44
Target T0281: Structure Comparison, Superposition
True structure: thick trace. Predicted structure: thin trace.
45
Target T0280_2: Detailed Target Analysis

Target Information
Length: 51 amino acids
Resolution: 2.00 Å
PDB code: 1WD5
Description: Putative phosphoribosyl transferase, T. thermophilus
Domains: 2nd domain, residues 53-103 of 208 AA sequence

Assessment
GDT_TS server rank of our 1st model: 1 (also 1st among human groups)
GDT_TS: 54.41
RMSD to native: 5.81

46
Target T0280_2: Contact Map Comparison
Note: true map is lower left
True Map vs. Predicted Map
True Map vs. Recovered Map
47
Target T0281: Structure Comparison
true structure
predicted structure
48
Target T0281: Structure Comparison, Superposition
True structure: thick trace. Predicted structure: thin trace.
49
THE SCRATCH SUITE

www.igb.uci.edu
DOMpro: domains
DISpro: disordered regions
SSpro: secondary structure
SSpro8: secondary structure (8 classes)
ACCpro: accessibility
CONpro: contact number
DI-pro: disulphide bridges
BETA-pro: beta partners
CMAP-pro: contact map
CCMAP-pro: coarse contact map
CON23D-pro: contact map to 3D
3D-pro: 3D structure (homology, fold recognition, ab initio)

50
(No Transcript)
51

SISQQTVWNQMATVRTPLNFDSSKQSFCQFSVDLLGGGISVDKTGDWITL
VQNSPISNLL
CCCECCCCCCEEEECCCCCCCCCCCCEEEEEEECCCCEEEECCCCCCEEE
EECCHHHHHH
CCCEEEEECEEEEECCCCCCCTCCCCEEEEEEEETCSEEEECTTTTEEEE
EECCHHHHHH
[rows consisting only of dashes; original layout not recoverable]
eeeeee---e--e-e-eee-ee-eee---------e-e--eeeeee----
----------
RVAAWKKGCLMVKVVMSGNAAVKRSDWASLVQVFLTNSNSTEHFDACRWT
KSEPHSWELI
HHHHHHCCCEEEEEEEEEECCEEECCCCCEEEEEEEECCCCCCCCCEEEE
EECCCCCCCC
HHHHHHTTCEEEEEEEEEEEEEEECCCCCEEEEEEEECCCTTCCCEEEEE
EECCTCCEEE
[rows consisting only of dashes; original layout not recoverable]
-----ee---e-------e-e-ee-e-e-e-----e--eeee--e-----
--e-e-ee-e

52
Advantage of Machine Learning

Pitfalls of traditional ab-initio approaches
Machine learning systems take time to train (weeks).
Once trained, however, they can predict structures almost faster than proteins can fold.
Predict or search protein structures on a genomic or bioengineering scale.

53
DAG-RNNs APPROACH

Two steps:
1. Build relevant DAG to connect inputs, outputs, and hidden variables
2. Use a deterministic (neural network) parameterization together with appropriate stationarity assumptions/weight sharing; the overall model remains probabilistic
Process structured data of variable size, topology, and dimensions efficiently
Sequences, trees, d-lattices, graphs, etc.
Convergence theorems
Other applications

54
(No Transcript)
55
Convergence Theorems

Posterior Marginals
sBN → dBN in distribution
sBN → dBN in probability (uniformly)
Belief Propagation
sBN → dBN in distribution
sBN → dBN in probability (uniformly)

56
Structural Databases

PPDB Poxvirus Proteomic Database
ICBS Inter Chain Beta Sheet Database

57
(No Transcript)
58
(No Transcript)
59
(No Transcript)
60
Strategies for drug design

Block, modulate, mediate β-sheet interactions
Covalent modification of a chain to prevent β-sheet formation

61
(No Transcript)
62
(No Transcript)
63
Three-Stage Prediction of Protein Beta-Sheets
Using Neural Networks, Alignments, and Graph
Algorithms

Jianlin Cheng and Pierre Baldi
School of Info. and Computer Sci.
University of California Irvine

64
Beta-Sheet Architecture
65
Importance of Predicting Beta-Sheet Structure

AB-Initio Structure Prediction
Fold Recognition
Model Refinement
Protein Design
Protein Stability

66
Previous Work

Methods
Statistical potential approach for strand alignment (Hubbard, 1994; Zhu and Braun, 1999).
Statistical potentials to improve beta-sheet secondary structure prediction (Asogawa, 1997).
Information theory approach for strand alignment (Steward and Thornton, 2000).
Neural networks for beta-residue contacts (Baldi et al., 2000).
Shortcomings
Focus on one single aspect; do not utilize structural contexts and evolutionary information; do not exploit constraints enough; not publicly available.

67
Three-Stage Prediction of Beta-Sheets

Stage 1
Predict beta-residue pairings using
2D-Recursive Neural Networks (2D-RNN).
Stage 2
Align beta-strands using alignment algorithms.
Stage 3
Predict beta-strand pairs and beta-sheet
architecture using graph algorithms.

68
Dataset and Statistics
Num
Chains 916
Beta residues 48,996
Residue Pairs 31,638
Beta Strand 10,745
Strand Pairs 8,172
Beta Sheet 2,533
69
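Stage 1: Prediction of Beta-Residue Pairings Using 2D-RNN
Input Matrix I (m×m); Target / Output Matrix (m×m)
2D-RNN: O = f(I)
T_ij = 0/1 (target); O_ij = pairing probability (output)
I_ij: residues i-2, i-1, i, i+1, i+2 and j-2, j-1, j, j+1, j+2, plus |i-j|
Per residue: 20 profiles, 3 SS, 2 SA
Total: 251 inputs (10 residues × 25 features + 1 for |i-j|)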
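
The input count for one residue pair (i, j) can be checked with a small sketch that assembles the feature vector. Zero-padding at the sequence ends and the use of the raw separation |i-j| as the last input are assumptions of this example, not details given on the slide.

```python
WINDOW = (-2, -1, 0, 1, 2)          # positions i-2 .. i+2 (and likewise for j)
N_PROFILE, N_SS, N_SA = 20, 3, 2    # per-residue feature sizes from the slide
PER_RESIDUE = N_PROFILE + N_SS + N_SA

def pair_features(profiles, ss, sa, i, j):
    """Concatenate window features around i and j, then append |i - j|.

    profiles[k], ss[k], sa[k] are lists of length 20, 3, and 2 for residue k.
    Window positions that fall off the sequence are zero-padded (an assumption).
    """
    length = len(profiles)
    features = []
    for center in (i, j):
        for offset in WINDOW:
            k = center + offset
            if 0 <= k < length:
                features.extend(profiles[k])
                features.extend(ss[k])
                features.extend(sa[k])
            else:
                features.extend([0.0] * PER_RESIDUE)
    features.append(float(abs(i - j)))
    return features

# Toy protein of 12 residues with constant placeholder features.
L = 12
toy_profiles = [[0.05] * N_PROFILE for _ in range(L)]
toy_ss = [[1.0, 0.0, 0.0] for _ in range(L)]
toy_sa = [[0.0, 1.0] for _ in range(L)]
vector = pair_features(toy_profiles, toy_ss, toy_sa, 1, 9)
```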
70
An Example Target
Protein 1VJG
Beta-Residue Pairing Map (Target Matrix)
71
An Example Output
72
Stage 2: Beta-Strand Alignment

Use output probability matrix as scoring matrix
Dynamic programming
Disallow gaps and use simplified searching algorithms

[Diagram: anti-parallel alignment (strand 1..m against strand n..1) and parallel alignment (strand 1..m against strand 1..n)]
Total number of alignments: 2(m+n-1)
73
Strand Alignment and Pairing Matrix

The alignment score (Pseudo Binding Energy) is
the sum of the probabilities of paired residues.
The best alignment is the alignment with maximum
score.
Strand Pairing Matrix.

Strand Pairing Matrix of 1VJG
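
A gapless strand alignment search can be sketched as a scan over all relative offsets in both directions, scoring each alignment by the sum of pairing probabilities. This brute-force enumeration is an illustrative assumption; the slides only state that gaps are disallowed and that simplified searching is used.

```python
def gapless_alignments(strand_a, strand_b):
    """Yield (direction, offset, pairs) for every gapless alignment of two strands.

    strand_a and strand_b are lists of residue indices. For lengths m and n
    there are m + n - 1 offsets per direction, hence 2(m + n - 1) alignments.
    """
    m = len(strand_a)
    for direction, b in (("parallel", list(strand_b)),
                         ("antiparallel", list(strand_b)[::-1])):
        n = len(b)
        for offset in range(-(m - 1), n):
            pairs = [(strand_a[k], b[k + offset])
                     for k in range(m) if 0 <= k + offset < n]
            yield direction, offset, pairs

def alignment_score(prob, pairs):
    """Pseudo binding energy: sum of pairing probabilities of the paired residues."""
    return sum(prob[i][j] for i, j in pairs)

def best_alignment(prob, strand_a, strand_b):
    """Return the alignment with maximum score."""
    return max(gapless_alignments(strand_a, strand_b),
               key=lambda alignment: alignment_score(prob, alignment[2]))

# Toy 10x10 probability matrix (values invented for illustration).
prob = [[0.0] * 10 for _ in range(10)]
prob[0][9], prob[1][8], prob[2][7] = 0.9, 0.8, 0.7
direction, offset, pairs = best_alignment(prob, [0, 1, 2], [7, 8, 9])
```

The maximum score found for each strand pair is the kind of value that would fill one entry of a strand pairing matrix.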
74
Stage 3: Prediction of Beta-Strand Pairings and Beta-Sheet Architecture
Strand Pairing Constraints
75
Minimum Spanning Tree Like Algorithm
Strand Pairing Graph (SPG)
Goal: Find a set of connected subgraphs that maximize the sum of pseudo-energy and satisfy the constraints.
Algorithm: Minimum Spanning Tree Like Algorithm.
76
Example of MST Like Algorithm
Assembly of beta-strands
Step 1: Pair strand 4 and 5
Strand Pairing Matrix of 1VJG
     1    2    3    4    5    6    7
1    0
2  1.3    0
3  .94  .37    0
4  .02  .02  .04    0
5  .02  .02  .03  1.9    0
6  .10  .05  .74  .04  .04    0
7  .02  .02  .03  .02  .02  .20    0
[Diagram: assembled strands 4-5]
77
Example of MST Like Algorithm
Assembly of beta-strands
Step 2: Pair strand 1 and 2
Strand Pairing Matrix of 1VJGA (same matrix as in Step 1)
[Diagram: assembled strands 4-5 and 2-1, N terminus marked]
78
Example of MST Like Algorithm
Assembly of beta-strands
Step 3: Pair strand 1 and 3
Strand Pairing Matrix of 1VJGA (same matrix as in Step 1)
[Diagram: assembled strands 4-5 and 2-1-3, N terminus marked]
79
Example of MST Like Algorithm
Assembly of beta-strands
Step 4: Pair strand 3 and 6
Strand Pairing Matrix of 1VJGA (same matrix as in Step 1)
[Diagram: assembled strands 4-5 and 2-1-3-6, N terminus marked]
80
Example of MST Like Algorithm
Assembly of beta-strands
Step 5: Pair strand 6 and 7
Strand Pairing Matrix of 1VJGA (same matrix as in Step 1)
[Diagram: assembled strands 4-5 and 2-1-3-6-7, N and C termini marked]
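
The five steps of the worked example can be reproduced with a Kruskal-style greedy sketch. The specific rules used here (take pairs in decreasing pseudo-energy, skip a pair that would close a cycle, stop once every strand has at least one partner) are an assumption chosen because it matches the example; the slides do not spell out the actual pairing constraints.

```python
def mst_like_pairing(lower_rows):
    """Greedy strand pairing on a lower-triangular pseudo-energy matrix.

    lower_rows[b][a] is the score of strands a+1 and b+1 (a < b), with the
    diagonal included as the last entry of each row. Returns the accepted
    pairs in the order they were chosen.
    """
    n = len(lower_rows)
    edges = [(lower_rows[b][a], a + 1, b + 1) for b in range(n) for a in range(b)]
    edges.sort(key=lambda edge: edge[0], reverse=True)

    parent = list(range(n + 1))          # union-find over strands 1..n

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    paired, chosen = set(), []
    for score, a, b in edges:
        if len(paired) == n:             # every strand has a partner: stop
            break
        root_a, root_b = find(a), find(b)
        if root_a == root_b:             # would close a cycle: skip
            continue
        parent[root_a] = root_b
        chosen.append((a, b))
        paired.update((a, b))
    return chosen

# Strand pairing matrix of 1VJG, as shown on the slide.
matrix_1vjg = [
    [0],
    [1.3, 0],
    [.94, .37, 0],
    [.02, .02, .04, 0],
    [.02, .02, .03, 1.9, 0],
    [.10, .05, .74, .04, .04, 0],
    [.02, .02, .03, .02, .02, .20, 0],
]
steps = mst_like_pairing(matrix_1vjg)
# steps == [(4, 5), (1, 2), (1, 3), (3, 6), (6, 7)]
```

Under these rules the pair 2-3 (score .37) is skipped because strands 1, 2, and 3 are already connected, and the result is two connected subgraphs: 4-5 and 2-1-3-6-7.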
81
Beta-Residue Pairing Results

Sensitivity = Specificity = 41%
Baseline: 2.3%. Ratio of improvement: 17.8.
ROC area: 0.86
At 5% FPR, TPR is 58%
CMAPpro
Spec. and Sens. are 27%. ROC area = 0.8.
TPR = 42% at 5% FPR.

82
Strand Pairing Results

Naïve algorithm of pairing all adjacent strands
Specificity: 42%
Sensitivity: 50%
MST like algorithm
Specificity: 53%
Sensitivity: 59%
>20% of correctly predicted strand pairs are non-adjacent strand pairs

83
Strand Alignment Results
On the correctly predicted pairs
           Pairing Direction   Align. All   Align. Anti-P   Align. Para.   Align. Bridge
Acc. (%)   93                  72           69              71             88
On all native pairs
           Pairing Direction   Align. All   Align. Anti-P   Align. Para.   Align. Bridge
Acc. (%)   84                  66           63              66             73

Pairing direction accuracy is 15% higher than that of a random algorithm.
Alignment accuracy is improved by >15%.

84
Application and Future Work

New methods for beta-residue pairings (e.g.
Linear Programming, SVM), and strand alignment
and pairings. More inputs (Punta and Rost, 2005).
Applications
AB-Initio Structure Sampling (beta-sheet)
Fold Recognition (conservation of beta-sheets)
Contact Map
Model Refinement (pairing direction/alignment)
Web server and dataset
http://www.ics.uci.edu/baldig/betasheet.html

85
A New Fold Example (CASP6)

1S12 (T0201, 94 residues)

True SS
CEEEEECCCEEEEECCCCCHHHHHHHHHHHHHHHHHHHHCCCEEEEEECC
EEEEEECCCCHHHHHHHHHHHHHHHHHHHHCCCCEEEEECCCCCC
Predicted SS
CEEEEEECCEEEECCCCCCCCHHHHHHHHHHHHHHHHHHHHHHHEHHCCC
CEEEEHHHHHHHHHHHHHHHHHHHHHHHHHCCCCEEEEEEECCC
True: 1-2, 2-4, 3-4, 1-5
Strand Pairing Matrix
     1     2     3     4     5
1    0  1.71   .05   .29   .33
2          0   .06   .41   .12
3                0   .22   .04
4                      0   .53
5                            0
Predicted: 1-2, 2-4, 3-4, 4-5
[Figure: structure with strands labeled 5, 1, 2, 4, 3. Rendered in RasMol]
86
ACKNOWLEDGMENTS

UCI
Gianluca Pollastri, Pierre-Francois Baisnee,
Michal Rosen-Zvi
Arlo Randall, S. Joshua Swamidass, Jianlin Cheng,
Yimeng Dou, Yann Pecout, Mike Sweredoski,
Alessandro Vullo, Lin Wu
James Nowick, Luis Villareal
DTU Soren Brunak
Columbia Burkhard Rost
U of Florence Paolo Frasconi
U of Bologna Rita Casadio, Piero Fariselli
www.igb.uci.edu/
www.ics.uci.edu/pfbaldi

87
1DFN Defensin
88
A Perfectly Predicted Example
Sequence with cysteine positions identified
MSNHTHHLKFKTLKRAWKASKYFIVGLSC29LYKFNLKSLVQTALST
LAMITLTSLVITAIIYISVGNAKAKPTSKPTIQQTQQPQNHTSPFFTEHN
YKSTHTSIQSTTLSQLLNIDTTRGITYGHSTNETQNRKIKGQSTLPATRK
PPINPSGSIPPENHQDHNNFQTLPYVPC173STC176EGNLAC182LSLC186HIETERAPSRAPTITLKKTPKPKTTKKPTKTTIHHRT
SPETKLQPKNNTATPQQGILSSTEHHTNQSTTQI
Length: 257. Total number of cysteines: 5.
Four bonded cysteines form two disulfide bonds:
173 ------- 186 (red cysteine pair)
176 ------- 182 (blue cysteine pair)
Prediction results from DIpro (http://contact.ics.uci.edu/bridge.html)
Predicted bonded cysteines: 173, 176, 182, 186
Predicted disulfide bonds:
Bond_Index   Cys1_Position   Cys2_Position
1            173             186
2            176             182
Prediction accuracy for both bond state and bond pair is 100%.
89
A Hard Example with Many Non-Bonded Cysteines
Sequence with cysteine positions identified
MTLGRRLAC9LFLAC14VLPALLLGGTALASEIVGGRRARPHAWP
FMVSLQLRGGHFC55GATLIAPNFVMSAAHC71VANVNVRAVRVVL
GAHNLSRREPTRQVFAVQRIFENGYDPVNLLNDIVILQLNGSATINANVQ
VAQLPAQGRRLGNGVQC151LAMGWGLLGRNRGIASVLQELNVTVVTS
LC181RRSNVC187TLVRGRQAGVC198FGDSGSPLVC208N
GLIHGIASFVRGGC223ASGLYPDAFAPVAQFVNWIDSIIQRSEDNPC254PHPRDPDPASRTH
Length: 267. Total number of cysteines: 11.
Eight bonded cysteines form four disulfide bonds:
55 ----- 71 (red)
151 ----- 208 (blue)
181 ----- 187 (green)
198 ----- 223 (purple)
Prediction results from DIpro (http://contact.ics.uci.edu/bridge.html)
Predicted bonded cysteines: 9, 14, 55, 71, 181, 187, 223, 254
Predicted disulfide bonds:
Bond_Index   Cys1_Position   Cys2_Position
1            55              71              (correct)
2            9               14              (wrong)
3            223             254             (wrong)
4            181             187             (correct)
Bond state recall: 5/8 = 0.625. Bond state precision: 5/8 = 0.625.
Pair recall: 2/4 = 0.5. Pair precision: 2/4 = 0.5.
Bond number is predicted correctly.
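
The four figures for this example can be recomputed from the true and predicted bonds listed on the slide. The helper name `precision_recall` is introduced here only for illustration.

```python
def precision_recall(predicted, true):
    """Precision and recall of a predicted set against a true set."""
    predicted, true = set(predicted), set(true)
    hits = len(predicted & true)
    return hits / len(predicted), hits / len(true)

# Disulfide bonds from the slide, as (Cys1_Position, Cys2_Position).
true_bonds = {(55, 71), (151, 208), (181, 187), (198, 223)}
predicted_bonds = {(55, 71), (9, 14), (223, 254), (181, 187)}

# Bond state: which individual cysteines are bonded.
true_bonded = {c for bond in true_bonds for c in bond}
predicted_bonded = {c for bond in predicted_bonds for c in bond}

state_precision, state_recall = precision_recall(predicted_bonded, true_bonded)
pair_precision, pair_recall = precision_recall(predicted_bonds, true_bonds)
# state_precision == state_recall == 0.625; pair_precision == pair_recall == 0.5
```

Cysteine 223 counts as a correct bond-state prediction even though its predicted partner (254) is wrong, which is why bond state scores are higher than pair scores.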
90
Prediction Accuracy on SP51 Dataset on All Cysteines
Bond Num   Bond State Recall (%)   Bond State Precision (%)   Pair Recall (%)   Pair Precision (%)
1          91                      46                         74                39
2          93                      77                         61                51
3          90                      74                         54                45
4          77                      87                         52                59
5          71                      86                         33                42
6          65                      84                         27                34
7          63                      85                         36                55
8          66                      89                         27                41
9          60                      83                         23                35
10         55                      86                         30                45
11         62                      86                         34                47
12         67                      97                         17                23
15         50                      94                         27                50
16         82                      99                         11                13
17         61                      96                         22                33
18         50                      82                         6                 9
19         47                      90                         11                20
Overall bond state recall: 78%. Overall bond state precision: 74%. Bond number prediction accuracy: 53%. Average difference between true bond number and predicted bond number: 1.1.
91
CURRENT WORK