Title: Voids and pockets in proteins: packing, folding, evolution, and biological functions
1 Voids and pockets in proteins: packing, folding, evolution, and biological functions
- Jie Liang
- Dept. of Bioengineering
- University of Illinois at Chicago
2 Central Dogma of Molecular Biology
- DNA: Genetic blueprint of life
- Fully sequenced: Human genome and hundreds of other genomes
- Proteins: Working molecules
DNA → RNA → Protein
3 Second Central Dogma
- Knowledge of protein structures is important.
4 The Universe of Protein Structures
- Human genome: 3 billion nucleotides
- Number of genes: 20,000-25,000
- Protein families: 10,000-30,000
- Number of folds: 1,000-4,000
- Currently in PDB: < 1,000 folds
- Comparative modeling needs a structural template with sequence identity > 30-35%
- e.g. 50% of ORFs and 18% of residues of the S. cerevisiae genome
- Structural Genomics: populating each fold with 4-5 structures
- One for each superfamily at 30-35% sequence identity.
- Fold of a novel gene can be identified
- Its structure can then be interpolated by comparative modeling.
5 - Main chain folds
- Rich information about evolution of protein structures.
- May not directly lead to understanding of function
(SCOP) All beta proteins; alpha and beta proteins; Ig-like beta sandwich; HPr fold
(from Jaroszewski & Godzik, ISMB 00)
6 Predicting and characterizing protein functions
- Important, but challenging tasks
- Function inference from sequence homology needs > 60-70% sequence identity.
- Fold prediction: > 20-30% sequence identity.
- Proteins from structural genomics often are of unknown function.
- Sequence homologs are often hypothetical proteins.
(Rost, 02, JMB; Tian & Skolnick, 03, JMB)
7 Functional Voids and Pockets
GDP binding pockets: Ras p21 and FtsZ
8 Outline
- Delaunay triangulation of proteins and protein folding.
- Voids and pockets in proteins
- Distribution and scaling properties
- Origin of voids and pockets
- Biologically important voids and pockets
- Assessing similarity: sequence, shape, orientation, and significance.
- Evolutionary pattern of voids and pockets
- Bayesian Markov chain Monte Carlo.
- Predicting protein functions
- Orphan proteins from structural genomics
9 1. Protein folding and Delaunay triangulation
- Protein folding problem.
- A protein sequence automatically folds to its native shape.
- (Anfinsen, 70s)
- Time scale of protein motion.
10 Protein-folding rates from a simple model
- What factors determine whether a protein will be a fast or slow folder?
- Native state conformations
- (Plaxco and Baker, 97, J Mol Biol)
11 Simple Model (Plaxco and Baker)
- Relative Contact Order (CO)
- L: protein length
- N: number of contacts < 6 Å
- ΔNij: loop length between contacting residues
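The quantities on this slide combine into the relative contact order as usually defined by Plaxco and Baker, CO = (1 / (L · N)) Σ ΔNij, summed over all contacting pairs. The sketch below is a minimal illustration under stated assumptions: each residue is reduced to a single point (for example its C-alpha atom), whereas the published definition works on atom contacts, and the function name and the `min_separation` parameter are hypothetical.

```python
from math import dist

def relative_contact_order(coords, cutoff=6.0, min_separation=1):
    """Relative contact order from one 3D point per residue.

    coords: list of (x, y, z) tuples, one per residue (an assumption;
            the original definition uses atom-atom contacts).
    cutoff: contact distance in angstroms (6 A on the slide).
    min_separation: smallest sequence separation counted as a contact
            (hypothetical parameter, not from the slide).
    """
    length = len(coords)
    n_contacts = 0
    total_separation = 0
    for i in range(length):
        for j in range(i + min_separation, length):
            if dist(coords[i], coords[j]) < cutoff:
                n_contacts += 1
                total_separation += j - i  # loop length between i and j
    if n_contacts == 0:
        return 0.0
    return total_separation / (length * n_contacts)

# Toy example: four residues on the corners of a 4 A square.
square = [(0, 0, 0), (4, 0, 0), (4, 4, 0), (0, 4, 0)]
co = relative_contact_order(square)
```

A low value means most contacts are local in sequence, which the simple model associates with fast folding.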
12 Folding rates of both 2-state and 3-state proteins
- Contact order does not work when a large number of proteins are examined.
(Figure panels: 2-state folders; 3-state folders)
13 Alpha Shape of Protein Structure
- Identify contacts by alpha edges.
- Not by distance cut-off.
14 Contact order by alpha shape (Zheng Ouyang)
- L: protein length
- N: total number of residues with neighbors
- n: the number of neighbors of residue i
- Other models also work, e.g. the zipper model of Dill et al.
15 Folding dynamics in lattice model with physical movement
- Lattice: 2D HP models
- Enumerating sequences and conformations.
- Exact thermodynamics.
- Exact effects of sequence variation.
- Folding dynamics
- Exact folding dynamics.
(Lau and Dill, 1989)
(Cieplak et al, 98; Banu and Dill, 00)
(Sëma Kachalo, Hsiao-Mei Lu, and Jie Liang, Phys Rev Lett, 2006, 96: 058105.1-4)
16 Sequences and conformations
- Chain length: 16
- 802,075 conformations, $2^{16}$ = 65,536 sequences
- 1,539 HP sequences fold to unique ground states.
- 456 structural families
- from 1 (low designability) to 26 (high designability) sequences
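The conformation count can be checked by exhaustive enumeration. The sketch below counts self-avoiding walks on the 2D square lattice that are distinct up to translation, rotation, and reflection; it is an illustrative reimplementation, not the authors' code, and the function name is hypothetical. For a 16-bead chain it should reproduce the 802,075 conformations quoted above, at the cost of visiting about 1.6 million walks.

```python
def count_conformations(n_beads):
    """Count 2D square-lattice self-avoiding chains of n_beads,
    unique up to translation, rotation and reflection."""
    if n_beads < 2:
        return 1
    moves = ((1, 0), (-1, 0), (0, 1), (0, -1))

    def extend(x, y, visited, remaining, turned):
        if remaining == 0:
            return 1
        total = 0
        for dx, dy in moves:
            # Reflection symmetry: the first turn off the x-axis must go up.
            if not turned and dy == -1:
                continue
            nxt = (x + dx, y + dy)
            if nxt in visited:
                continue  # self-avoidance
            visited.add(nxt)
            total += extend(nxt[0], nxt[1], visited,
                            remaining - 1, turned or dy != 0)
            visited.remove(nxt)
        return total

    # Translation symmetry: first bead at the origin.
    # Rotation symmetry: first bond points along +x.
    visited = {(0, 0), (1, 0)}
    return extend(1, 0, visited, n_beads - 2, False)

# Short chains are cheap to enumerate.
small_counts = [count_conformations(n) for n in range(2, 8)]
```

Pairing each conformation with all HP sequences and scoring H-H contacts would then give the exact thermodynamics the slides refer to; that step is omitted here.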
17 Physical movement
Allowed moves are physically realizable on a 2D square lattice.
18 Kinetics from master equation
(Cieplak et al, 98; Banu and Dill, 00)
(Diagram: transitions between conformations i and j)
General solution
where $\lambda_i$ is the i-th eigenvalue of matrix M and $n_i$ the corresponding eigenvector. $\lambda_0 = 0$, $n_0$ is the Boltzmann distribution, and $\lambda_1$, the smallest non-zero eigenvalue in magnitude, is taken as the protein folding rate.
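For reference, a master equation of this kind and its general solution have the standard form below. This is a textbook reconstruction rather than a copy of the slide: the expansion coefficients $c_i$ (fixed by the initial distribution $P(0)$) are notation introduced here, and $M$ is assumed to be diagonalizable.

```latex
\frac{dP(t)}{dt} = M\,P(t),
\qquad
P(t) = e^{Mt}\,P(0) = \sum_{i} c_i\, n_i\, e^{\lambda_i t}
```

Because all non-zero $\lambda_i$ are negative, every term except the $\lambda_0 = 0$ term decays, and the slowest decay, set by $\lambda_1$, governs the approach to the Boltzmann distribution $n_0$.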
19 What can we measure?
- Thermodynamic properties
- Ground state energy, energy gap.
- Heat capacity
- Folding temperature (50% of protein in native state)
- Collapse temperature
- Collapse cooperativity
- Folding kinetics
- Folding rates.
(Klimov & Thirumalai, 1998; Chan & Dill, 1993)
20 Folding rates and contact energy
- Folding rates
- Correlated somewhat with energy
- R = -0.84
- But wide range of folding rates for sequences of
- Same ground state energy.
- Same energy gap.
- Same fold.
(Legend: structures with few sequences, low designability; structures with many sequences, high designability; Go model)
21 Folding rate and cooperativity
- Folding rates
- Correlated somewhat with cooperativity
- R = -0.62
- Thermodynamics does not define folding rates.
(Legend: high designability cluster; Go model)
22 Landscape properties: number of local minima
- Landscape roughness
- Number of local minima
- Folding rates
- Excellent correlation.
- R = -0.92
(Legend: high designability cluster)
23 Time Evolution of Conformation Concentration
- Matrix exponential for 85,000 by 85,000 matrix M
- Analytical solution of master equation
- $P(t) = e^{Mt} P(0)$
- Approximate in Krylov subspace
- $K_m(Mt, P(0)) = \mathrm{span}\{P(0), (Mt)P(0), \ldots, (Mt)^{m}P(0)\}$
- m small
- Solve matrix exponential for small dense m × m matrix
- Padé rational functions.
- 10 million time steps.
(Stochastic roadmap: selecting conformations. Latombe, Kavraki, Amato)
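The Krylov idea can be sketched in a few lines: project M onto a small orthonormal basis built from P(0), exponentiate the small projected matrix, and map the result back. The code below is a minimal pure-Python illustration, not the authors' implementation. It swaps in a scaled Taylor series for the Padé approximant named on the slide, uses dense lists where a real 85,000-state problem would need sparse storage, and all function names are hypothetical.

```python
import math

def matvec(M, v):
    """Dense matrix-vector product."""
    return [sum(row[k] * v[k] for k in range(len(v))) for row in M]

def small_expm(A, terms=20):
    """exp(A) for a small dense matrix: scaling and squaring with a
    Taylor series (a stand-in for Pade rational functions)."""
    n = len(A)
    norm = max(sum(abs(x) for x in row) for row in A)
    squarings = max(0, math.ceil(math.log2(norm)) + 1) if norm > 0 else 0
    B = [[x / 2.0 ** squarings for x in row] for row in A]

    def matmul(X, Y):
        return [[sum(X[i][k] * Y[k][j] for k in range(n))
                 for j in range(n)] for i in range(n)]

    E = [[float(i == j) for j in range(n)] for i in range(n)]
    term = [row[:] for row in E]
    for k in range(1, terms + 1):
        term = [[x / k for x in row] for row in matmul(term, B)]
        E = [[E[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    for _ in range(squarings):
        E = matmul(E, E)
    return E

def krylov_expm_multiply(M, p0, t, m=10):
    """Approximate exp(M t) p0 in an m-dimensional Krylov subspace."""
    n = len(p0)
    beta = math.sqrt(sum(x * x for x in p0))
    V = [[x / beta for x in p0]]           # orthonormal basis vectors
    H = [[0.0] * m for _ in range(m)]      # projected (Hessenberg) matrix
    k = m
    for j in range(m):
        w = matvec(M, V[j])
        for i in range(j + 1):             # modified Gram-Schmidt
            H[i][j] = sum(w[r] * V[i][r] for r in range(n))
            w = [w[r] - H[i][j] * V[i][r] for r in range(n)]
        h_next = math.sqrt(sum(x * x for x in w))
        if j + 1 == m or h_next < 1e-12:   # subspace full or invariant
            k = j + 1
            break
        H[j + 1][j] = h_next
        V.append([x / h_next for x in w])
    E = small_expm([[H[i][j] * t for j in range(k)] for i in range(k)])
    # P(t) is approximately beta * V_k * exp(H_k t) * e_1
    return [beta * sum(V[i][r] * E[i][0] for i in range(k)) for r in range(n)]

# Two-state toy system: columns of M sum to zero, so probability is conserved.
M = [[-2.0, 1.0], [2.0, -1.0]]
p_half = krylov_expm_multiply(M, [1.0, 0.0], t=0.5)
```

For the two-state example the Krylov subspace spans the whole space, so the result matches the closed form p0(t) = 1/3 + (2/3) e^{-3t} up to round-off. In a production setting, long time spans are usually covered by many short steps with a small m, consistent with the 10 million time steps quoted on the slide.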
24 Transient states
Transiently accumulating states demonstrate intermediate-like behavior during the time evolution. Many local minima are transiently accumulating; however, they are not obligatory steps along the folding pathway.
(Time scale spans 9 orders of magnitude; conformations of local minima)
25 Results from our study
- Dramatically different folding rates.
- Even for sequences that fold to the same native structure.
- Natural proteins are biased by evolution.
- (Scalley-Kim and Baker, 2004)
- Folding rates
- Not determined by native structure properties,
- Weakly related to length and thermodynamic properties
- Ground state energy, energy gap, and collapse cooperativity.
- Sequences of same thermodynamics can have very different folding rates.
- Kinematic energy landscape and folding kinetics.
- Roughness of such landscape is strongly correlated with folding rate.
(Sëma Kachalo, Hsiao-Mei Lu, and Jie Liang, Phys Rev Lett, 2006, 96: 058105.1-4)
26 2. Voids and pockets in proteins: Computation
Shape library
(Mucke and Edelsbrunner, ACM Trans. Graphics, 1994; Edelsbrunner, Disc Comput Geom, 1995; Edelsbrunner, Facello, and Liang, Discrete Applied Math, 1998)
(Binkowski, Adamian, and Liang, J. Mol. Biol. 332:505-526, 2003)
27 Voids and Pockets in Soluble Proteins
- Protein interior is solid-like, tightly packed like a jig-saw puzzle
- High packing density (Richards, 1977)
- Low compressibility (Gavish, Gratton, and Harvey, 1983)
- Many voids and pockets.
- Large enough to hold at least 1 water molecule: 15 per 100 residues.
(Liang & Dill, 2001, Biophys J)
28 Scaling relationship
- Volume and area scaling
- For a solid sphere, $V = 4\pi r^3/3$ and $A = 4\pi r^2$, therefore we should have
- $V \propto A^{3/2}$
- Protein has linear scaling
- Clustered random spheres with mixed radii (Lorenz et al, 1993).
- Lattice models of simple clusters (Stauffer, 1985)
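The 3/2 exponent follows from eliminating r between the two sphere formulas. The short derivation below spells out that step; it assumes an ideal solid sphere and adds nothing beyond the two formulas on the slide.

```latex
A = 4\pi r^{2} \;\Rightarrow\; r = \left(\frac{A}{4\pi}\right)^{1/2},
\qquad
V = \frac{4\pi}{3}\, r^{3}
  = \frac{4\pi}{3}\left(\frac{A}{4\pi}\right)^{3/2}
  = \frac{A^{3/2}}{6\sqrt{\pi}}
```

A linear relation between V and A, as observed for proteins, therefore departs from what a compact solid would show.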
29 - At percolation threshold, V and R of a cluster of random spheres
- $V \propto R^{D}$, where D = 2.5 (Stauffer, 1983; Lorenz et al 1993)
- $R = \sum_{j=1}^{d} (x_{j,\max} - x_{j,\min}) / 2d$
- Proteins
- $\ln V \propto \ln R$, D = 2.47 ± 0.04 (by nonlinear curve fitting).
- Similar to random spheres near percolation threshold.
By volume-area and volume-size scaling, proteins are packed more like random spheres than solids.
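An exponent such as D can be estimated from (R, V) pairs by fitting a power law. The sketch below uses ordinary least squares on the log-log values, which is a simpler technique than the nonlinear curve fitting named on the slide and may give slightly different numbers on noisy data. The function name and the synthetic input are illustrative only.

```python
import math

def scaling_exponent(radii, volumes):
    """Fit V = c * R**D by least squares on (ln R, ln V).

    Returns (D, c). A log-log linear fit is used here as a stand-in
    for the nonlinear fit mentioned on the slide.
    """
    xs = [math.log(r) for r in radii]
    ys = [math.log(v) for v in volumes]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    prefactor = math.exp(mean_y - slope * mean_x)
    return slope, prefactor

# Synthetic data that follow V = 3 * R**2.5 exactly.
radii = [1.0, 2.0, 4.0, 8.0]
volumes = [3.0 * r ** 2.5 for r in radii]
D, c = scaling_exponent(radii, volumes)
```

On real proteins, R would come from the bounding-box formula above and V from the computed molecular volume.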
30 Simulating Protein Packing with Off-Lattice Chain Polymers
- 32-state off-lattice discrete model
- Sequential Monte Carlo and resampling
- Thousands of conformations of length N = 2,000
(Zhang, Chen, Tang and Liang, 2003, J. Chem. Phys.)
31 - Proteins are not optimized by evolution to eliminate voids.
- Protein packing is dictated by a generic compactness constraint related to nc.
32 Surfaces with unknown functional roles
http://cast.engr.uic.edu
33 3. How to identify biologically important pockets and voids from random ones?
- Assessing Local Sequence and Shape Similarity
(Binkowski, Adamian, Liang, 2003, JMB, 332:505-526)
34 Binding Site Pocket: Sparse Residues, Long Gaps
- ATP Binding: cAMP Dependent Protein Kinase (1cdk)
- Tyr Protein Kinase c-src (2src)
1cdk.A 49 LGTGSFGRVMLVKHKETGNHFAMKILDKQKVVKLKQIEH
TLNEKRILQAVNFPFLVKLEYSFKDNSNLYMVMEYVPGGEMFSHLRRIGR
FSEPHARFYAAQIVLTFEYLHSLDLIYRDLKPENLLIDQQGYIQVTDFGF
AKRVKGRTWTLCGTPEYLAPEIILSKGYNKAVDWWALGVLIYEMAAGYPP
FFADQPIQIYEKIVSGKVRFPSHFSSDLKDLLRNLLQVDLTKRFGNLKDG
VNDIKNHKWFATTDWIAIYQRKVEAPFIPKFKGPGDTSNF 327
1cdk.A_p 49 LGTGSFGRV A K V MEYV E K EN L TD F
2src.m 273 LGQGCFGEVWMGTWNGTTRVAIKTLKPGTMSPEAF
LQEAQVMKKLRHEKLVQLYAVVSEEPIYIV TEYMSKGSLLDFLKGETG
KYLRLPQLVDMAAQIASGMAYVERMNYVHRDLRAANILVGENLVCKVAD 404
2src.m_p 273 LGQGCFGEV A K V TEYM GS D D R AN L AD
Low overall sequence identity: 13%
35 High Sequence Similarity of Pocket Residues
1cdk: cAMP Dependent Protein Kinase
2src: Tyr Protein Kinase c-src
- 1cdk.A LGTGSFGRVAKVMEYV---EKENLTDF 24
- 2src.m LGQGCFGEVAKVTEYMGSDDRANLAD- 26
High sequence identity: 51%
36 Sequence Similarity of Surface Pockets
- Similarity detection
- Dynamic programming: SSEARCH (Pearson, 1998)
- BLOSUM50 scoring matrix (Henikoff, 1994).
- Not identity.
- Order Dependent Sequence Pattern.
- Statistics of Null Model
- Gapless local alignment: Extreme Value Distribution
- (Altschul & Karlin, 90)
- Alignment with gaps (Altschul, Bundschuh, Olsen & Hwa, 01)
→ Statistical Significance!
37 Approximation with EVD distribution (Pearson, 1998, JMB)
- Kolmogorov-Smirnov Test
- Estimate K and λ parameters.
- Estimation of E-value
- Estimate p value of observed Smith-Waterman score by EVD.
- Estimate E-value
(Binkowski, Adamian, Liang, 2003, JMB, 332:505-526)
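Once K and λ are fitted, the p-value and E-value follow from the extreme value form commonly used for local alignment scores, P(S ≥ x) = 1 − exp(−K m n e^{−λx}). The sketch below is an illustration under that standard form, not the exact procedure of the cited paper. The parameter values in the example are invented, and taking the E-value as the p-value times the number of database entries is an assumption about the convention used.

```python
import math

def evd_p_value(score, lam, K, m, n):
    """P(S >= score) under the extreme value distribution for local
    alignment scores of two sequences of lengths m and n."""
    expected_hits = K * m * n * math.exp(-lam * score)
    # 1 - exp(-x), computed stably for very small x
    return -math.expm1(-expected_hits)

def e_value(p_value, database_size):
    """Expected number of chance hits when searching database_size entries."""
    return p_value * database_size

# Hypothetical parameters for a pocket-sequence search.
p = evd_p_value(score=60.0, lam=0.25, K=0.05, m=30, n=30)
e = e_value(p, database_size=2_000_000)
```

A hit would then be reported when the E-value falls below a chosen cutoff; the cutoff itself is not given on the slide.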
38 Shape Similarity Measure
- cRMSD (coordinate root mean square distance)
- oRMSD (orientational RMSD)
- Place a unit sphere $S^2$ at the center of mass $x_0 \in \mathbb{R}^3$
- Map each residue $x \in \mathbb{R}^3$ to a unit vector on $S^2$
- $f: x = (x, y, z)^T \mapsto u = (x - x_0) / \|x - x_0\|$
- Measure the RMSD between the two sets of unit vectors.
(cf. uRMSD by Kedem and Chew, 2002)
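The mapping onto the unit sphere is simple to express in code. The sketch below computes an orientational RMSD for two pockets whose residues are already paired one-to-one and already superimposed. The optimal rotation that a full RMSD calculation would include is deliberately left out, and the unweighted centroid stands in for the center of mass; both are simplifying assumptions, and the function names are hypothetical.

```python
from math import dist, sqrt

def unit_vectors(coords):
    """Project residue positions onto a unit sphere centered at the
    unweighted centroid (used here in place of the center of mass)."""
    n = len(coords)
    center = tuple(sum(c[k] for c in coords) / n for k in range(3))
    vectors = []
    for c in coords:
        norm = dist(c, center)  # a residue exactly at the center would fail here
        vectors.append(tuple((c[k] - center[k]) / norm for k in range(3)))
    return vectors

def ormsd(coords_a, coords_b):
    """RMSD between the unit-vector sets of two paired pockets.
    No optimal superposition is performed in this sketch."""
    ua, ub = unit_vectors(coords_a), unit_vectors(coords_b)
    if len(ua) != len(ub):
        raise ValueError("pockets must have the same number of paired residues")
    return sqrt(sum(dist(p, q) ** 2 for p, q in zip(ua, ub)) / len(ua))

pocket = [(1, 0, 0), (-1, 0, 0), (0, 2, 0), (0, -2, 0)]
scaled = [(3 * x, 3 * y, 3 * z) for x, y, z in pocket]
same_shape = ormsd(pocket, scaled)   # scaling leaves orientations unchanged
```

Because only directions are kept, two pockets of different size but the same arrangement of residues score as identical, which is the point of using oRMSD alongside cRMSD.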
39 Statistical Significance of Shape Similarity
- Estimate the probability p of obtaining a specific cRMSD or oRMSD value for random pockets with Nres residues
- EVD and other parametric distributions are not accurate.
- Randomly select 2 pockets.
- Calculate cRMSD for Nres randomly selected residues
- Also calculate oRMSD
(Binkowski, Adamian, Liang, 2003, JMB, 332:505-526)
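With no parametric form available, the p-value can be read off an empirical null distribution built from random pocket pairs. The sketch below assumes the random cRMSD (or oRMSD) values for a given Nres have already been computed and only shows the final lookup. The function name and the sample values are illustrative.

```python
from bisect import bisect_right

def empirical_p_value(observed_rmsd, null_rmsds):
    """Fraction of random-pocket RMSD values that are as small as,
    or smaller than, the observed value."""
    ordered = sorted(null_rmsds)
    return bisect_right(ordered, observed_rmsd) / len(ordered)

# Hypothetical null sample for one value of Nres.
null_sample = [1.0, 2.0, 3.0, 4.0]
p = empirical_p_value(2.5, null_sample)
```

In practice a separate null sample would be kept for each Nres, since RMSD values grow with the number of residues compared.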
40 Surprising Surface Similarity
- Conserved residues both important in polypeptide binding
- Both pockets undergo conformational changes upon binding
41 4. Evolutionary pattern of voids and pockets
(Tseng and Liang, 2006, Mol Biol Evol, 23:421)
42 Scoring Matrix
- A scoring matrix is critical
- Determines similarity between residues and hence statistical significance.
- Derived from evolutionary history of proteins sharing the same function.
- Existing approach
- PAM and BLOSUM heuristics.
- Position specific weight matrix.
- Entropy/relative entropy for full proteins or domains.
- Our approach
- Evolution: Explicit phylogenetic tree.
- Model: Continuous time Markov process.
- Geometry: Evolution of only residues located in the binding region.
- Bayesian Markov chain Monte Carlo.
43 Model: Continuous time Markov process for substitution
- 20 × 20 rate matrix Q for the instantaneous substitution rates of 20 amino acid residues
- Transition probability matrix can be derived from Q
(Felsenstein, 1983; Yang 1994; Whelan and Goldman, 2000; Tseng and Liang, 2004)
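For a time-homogeneous continuous time Markov process, the step from rates to transition probabilities is the matrix exponential. The relation below is the standard textbook form rather than a transcription of the slide. The entry $p_{ij}(t)$ denotes the probability that residue i is replaced by residue j after time t, and each row of Q is assumed to sum to zero.

```latex
P(t) = e^{Qt} = \sum_{k=0}^{\infty} \frac{(Qt)^{k}}{k!},
\qquad
P(0) = I,
\qquad
\sum_{j} q_{ij} = 0 \;\; \text{for every } i
```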
44 Likelihood function of a given phylogeny
- Given a set of multiple-aligned sequences $S = (x_1, x_2, \ldots, x_s)$ and a phylogenetic tree $T = (V, E)$,
- A column $x_h$ at position h is represented as
- The likelihood function of observing these sequences is
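The expressions that followed these bullets are not present in the text. As a point of reference, the likelihood commonly used for this setting (Felsenstein's formulation) is given below. It is a standard form that may differ in notation from the original: r denotes the root, $\mathcal{I}$ the set of internal nodes, $t_{ij}$ the length of edge (i, j), and $\pi_{x_r}$ the equilibrium frequency of the root residue, all of which are symbols introduced here. This $\pi$ is distinct from the posterior $\pi(Q \mid S, T)$ used on the following slide.

```latex
P(x_h \mid T, Q) =
  \sum_{\{x_i\,:\, i \in \mathcal{I}\}}
  \pi_{x_r} \prod_{(i,j) \in E} p_{x_i x_j}(t_{ij}),
\qquad
P(S \mid T, Q) = \prod_{h} P(x_h \mid T, Q)
```

The product over columns assumes that alignment positions evolve independently.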
45 Estimation of instantaneous rates Q
- Posterior probability of rate matrix given the sequences and tree
- Bayesian estimation of posterior mean of rates in Q
$E_\pi(Q) = \int Q \, \pi(Q \mid S, T) \, dQ$,
- Estimated by Markov chain Monte Carlo.
46 Markov chain Monte Carlo
- Detailed balance: samples the target distribution after convergence.
- Metropolis-Hastings Algorithm
- Collect data from m accepted samples
$E_\pi(Q) \approx \sum_{i=1}^{m} Q_i / m \approx \int Q \, \pi(Q \mid S, T) \, dQ$.
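The sample-average estimate can be illustrated on a one-parameter toy problem. The sketch below runs a random-walk Metropolis-Hastings chain on a single positive rate whose posterior is taken to be a Gamma density with a known mean of 1.5. The target density, step size, and function names are all hypothetical stand-ins for the full 20 × 20 rate matrix problem, and the chain follows the usual convention of repeating the current state when a proposal is rejected.

```python
import math
import random

def log_target(q, shape=3.0, rate=2.0):
    """Unnormalized log density of a toy Gamma(shape, rate) posterior
    over one rate parameter (hypothetical; exact mean is shape / rate)."""
    if q <= 0.0:
        return -math.inf
    return (shape - 1.0) * math.log(q) - rate * q

def metropolis_hastings(log_density, start, n_steps, step_size,
                        seed=0, burn_in=1000):
    """Random-walk Metropolis-Hastings with a symmetric Gaussian proposal."""
    rng = random.Random(seed)
    current = start
    current_lp = log_density(current)
    samples = []
    for step in range(n_steps):
        proposal = current + rng.gauss(0.0, step_size)
        proposal_lp = log_density(proposal)
        delta = proposal_lp - current_lp
        # Accept with probability min(1, target(proposal) / target(current)).
        if delta >= 0 or rng.random() < math.exp(delta):
            current, current_lp = proposal, proposal_lp
        if step >= burn_in:
            samples.append(current)  # rejected moves repeat the current state
    return samples

def posterior_mean(samples):
    """Monte Carlo estimate of the posterior mean: sum(Q_i) / m."""
    return sum(samples) / len(samples)

samples = metropolis_hastings(log_target, start=1.0, n_steps=30000, step_size=1.0)
estimate = posterior_mean(samples)
```

With a symmetric proposal the Hastings correction cancels, so the acceptance test reduces to the ratio of target densities. Asymmetric moves, such as block updates of several rates, would need the proposal ratio as well.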
47 Move Set
- Two types of moves: s1, s2
- Transition matrix between the two types of moves
- Acceptance ratio
- Individual moves: 50-66%
- Block moves: < 10%
48 Validation by simulation
- Generate 16 artificial sequences from a known tree and known rates (JTT model)
- Carboxypeptidase A2 precursor as ancestor, length 147
- Goal: recovering the substitution rates
Phylogenetic tree used to generate 16 sequences
Estimations from two initial conditions are very similar to the true values of residue substitution rates.
Convergence of the Markov chain
49 Quantifying estimation error
(Mayrose et al, 2004, Mol Biol Evol)
- Relative contribution
- Weighted error in contribution
- Weighted mean square error (MSE)
50 Accurate Estimation with > 20 residues and random initial values
Accurate when > 20 residues in length.
Distribution of MSE of estimated rates starting from 50 sets of random initial values. All MSE < 0.00075.
51 Evolutionary rates of binding sites and other regions are different
- Residues on protein functional surface experience different selection pressure.
- Estimated substitution rate matrices of amylase
- Functional surface residues.
- The remaining surface.
- The interior residues.
- All surface residues.
Sij: residue pairs (i, j) that appear in the same column of the MSA are defined as sampled pairs; the Sij are estimated by Bayesian MCMC.
52 Rate matrix Q and residue similarity score
- Transforming rate matrix Q to Altschul-style similarity scoring matrix B
where $m_{ij}(t)$ is the joint probability of observing both residue type i and j at the two nodes separated by time t and λ is a scalar.
- Property of scoring function
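An Altschul-style score is a scaled log-odds ratio, b_ij(t) = (1/λ) log[ m_ij(t) / (π_i π_j) ], with the joint probability m_ij(t) = π_i p_ij(t). This is the standard Karlin-Altschul form and is assumed here to match the transformation on the slide. The sketch below applies it to a two-letter toy alphabet; the function name, the frequencies, and the transition matrix are invented for illustration.

```python
import math

def log_odds_scores(pi, P, lam=1.0):
    """Log-odds scores b[i][j] = (1/lam) * log(m_ij / (pi_i * pi_j)),
    where m_ij = pi_i * P[i][j] is the joint probability of seeing
    residue types i and j at two nodes separated by time t."""
    n = len(pi)
    scores = []
    for i in range(n):
        row = []
        for j in range(n):
            joint = pi[i] * P[i][j]
            row.append(math.log(joint / (pi[i] * pi[j])) / lam)
        scores.append(row)
    return scores

# Toy two-state example with uniform equilibrium frequencies.
pi = [0.5, 0.5]
P_t = [[0.9, 0.1],
       [0.1, 0.9]]
B = log_odds_scores(pi, P_t)
```

Conserved pairs score positive and unlikely substitutions score negative. When the process is time-reversible the matrix is symmetric, as it is in the toy example.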
53 5. Improved functional prediction
54 Example 1: Finding alpha amylase by matching pocket surfaces
- Challenging
- Amylases often have low overall sequence identity (< 25%).
- 1bag, pocket 60: B. subtilis
- 14 sequences, none with structures, 2 are hypothetical
- 1bg9: Barley
- 9 sequences, none with structures.
55 Criteria for declaring similar functional surface to a matched surface
- Search > 2 million surfaces with a template surface.
- Shapes have to be very similar
- p-value for cRMSD < $10^{-3}$.
- Customized scoring matrices of 300 different time intervals.
- The most similar surface has nmax of matrices capable of finding this homologous surface.
- Declare a hit if > 1/3 nmax of matrices give positive results.
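The two criteria combine into a simple filter. The sketch below is one reading of the rule: candidates must first pass the shape test, nmax is taken as the largest number of positive matrices among the candidates that pass, and a hit needs more than one third of nmax. Whether nmax is computed before or after the shape filter is not stated on the slide, so that ordering is an assumption, as are the function name and the input layout.

```python
def declare_hits(candidates, p_cutoff=1e-3, fraction=1 / 3):
    """Select surfaces that match a template.

    candidates: dict mapping a surface name to a tuple
        (crmsd_p_value, n_positive_matrices), where n_positive_matrices
        counts how many of the time-dependent scoring matrices found it.
    Returns the sorted names of declared hits.
    """
    # Shape criterion: cRMSD p-value below the cutoff.
    shape_ok = {name: v for name, v in candidates.items() if v[0] < p_cutoff}
    if not shape_ok:
        return []
    # nmax: matrices supporting the best-supported surface (assumed to be
    # taken among surfaces that already passed the shape criterion).
    n_max = max(n for _, n in shape_ok.values())
    return sorted(name for name, (_, n) in shape_ok.items()
                  if n > fraction * n_max)

# Hypothetical search results: (cRMSD p-value, matrices with a positive result).
results = {
    "surface_a": (1e-5, 240),
    "surface_b": (1e-4, 100),
    "surface_c": (1e-4, 60),
    "surface_d": (0.5, 300),   # fails the shape test despite many matrices
}
hits = declare_hits(results)
```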
56 Probability of enzyme functional class
Probability of query protein belonging to enzyme class i: $P(i) = \sum_t \mathrm{E.C.}_i(t) / N_t$
$\mathrm{E.C.}_i(t)$: number of PDB hits belonging to Enzyme Class i using the matrix of time distance t.
$N_t$: total number of PDB hits with an E.C. number using the matrix of time distance t.
Number of hits and probability of functional class.
57 Results for Amylase
Query: B. subtilis (1bag), Barley (1bg9)
- 1bag found 58 PDB structures.
- 1bg9 found 48 PDB structures.
- Altogether 69
- All belong to amylase (EC 3.2.1.1)
- Comparison
- Annotated enzyme structure database (Thornton): 75.
Hits human 1b2y 1u2y 22 23
58 Cross-reaction profile
- E.C. 3.2.1.1: Glycosidases, i.e. enzymes hydrolyzing O- and S-glycosyl compounds -> Alpha-amylase
- Reaction: Endohydrolysis of 1,4-alpha-glucosidic linkages in oligosaccharides and polysaccharides.
- Acts on starch, glycogen and related polysaccharides and oligosaccharides in a random manner
- E.C. 2.4.1.19: Glycosyltransferases
- -> Hexosyltransferases
- -> Cyclomaltodextrin glucanotransferase
- also known as
- Cyclodextrin glycosyltransferase
- Reaction: Degrades starch to cyclodextrins by formation of a 1,4-alpha-D-glucosidic bond
Cross-reactivity
59 Comparison with others
- Benchmark data
- Enzyme Structure Database (ESD): 75 structures
template   our method (our matrix)   our method (JTT)   psi-blast
1bag       58                        52                 45
1bg9       48                        8                  21
- Psi-blast does not contain information about which surface region, active residues, and geometry
- Contains many uninterpretable false positives.
- We do better!
- Ssearch: 32 structures found.
60 Example 2: UCSF Structure-Function Linkage Database
- A human-curated database
- Links evolutionarily related sequences and structures of enzymes to their chemical reactions.
- Correlates conserved active site residues with specific partial reactions that all members of a superfamily perform.
- Gold standard.
- http://sfld.rbvi.ucsf.edu,
- Patricia Babbitt
61 SFLD protein families
- Generate canonical templates for the four families with > 7 structures
62 Our results
- For all 4 entries in SFLD with > 7 structures
template   Us   SFLD   ESD
1qh9       8    8      8
2ada       23   17     23
1ebh       22   20     22
1kw9       18   16     18
Psi-blast: many false positives, and no information about functional site
63 Example 3: Large Scale Prediction of Protein Functions
Laskowski, Thornton, JMB, 05, 351:614-26
EBI: CATH domain, not function
- 110 protein families
- Each point on the curve corresponds to p-values of various cRMSD cutoffs
- Accuracy: 92% (EBI: 75%)
Helmer-Citterich, M et al (BMC Bioinformat. 2005); Russell RB (JMB 2003); Sternberg MJ; Skolnick, J; Lichtarge, O (JMB 2003); Ben-Tal, N and Pupko, T (ConSurf)
64 Large protein family: Canonical functional surfaces of amylase
- Example: three canonical binding surfaces for amylase
- Can find all amylases in ESD.
65 Large protein family: Serine/Threonine protein kinase
- EC 2.7.1.37: Protein kinase
- Glycogen synthase A kinase.
- Hydroxyalkyl-protein kinase.
- Phosphorylase B kinase kinase.
- 243 PDB entries
- (mixed true positives and false positives)
66 Canonical representatives of Ser/Thr kinase active sites
- 1kv2: 42 true positives
- Average length 50, Average Area 1193.15, Average Volume 1999.11; Nonpolar 0.44, Aromatic 0.08, Polar 0.19, Neg Charged 0.12, Pos Charged 0.16
- 1bmk: 54 true positives
- Average length 45, Average Area 1070.20, Average Volume 1781.42; Nonpolar 0.43, Aromatic 0.09, Polar 0.19, Neg Charged 0.14, Pos Charged 0.15
- 1ol7: 115 true positives
- Average length 41, Average Area 951.40, Average Volume 1525.28; Nonpolar 0.42, Aromatic 0.11, Polar 0.18, Neg Charged 0.16, Pos Charged 0.13
- 1erk: 126 true positives
- Average length 38, Average Area 866.58, Average Volume 1389.57; Nonpolar 0.41, Aromatic 0.10, Polar 0.18, Neg Charged 0.17, Pos Charged 0.14
- 1blx: 64 true positives
- Average length 33, Average Area 744.93, Average Volume 1136.06; Nonpolar 0.41, Aromatic 0.10, Polar 0.16, Neg Charged 0.17, Pos Charged 0.17
67 Canonical representatives
68 Library of functional surfaces of enzymes
- With about 3,000 template surfaces, we can build with confidence a library of 10,000 active site surfaces for 13,877 structures.
- Including protein-protein interactions
- Validated by sensitivity and specificity.
- Often can find all of the binding surfaces of the same function and nothing else.
69 Orphan protein structures without functional annotation
- Orphan proteins from Structural Genomics.
- No known functions.
- Often sequence homologs are hypothetical proteins.
- Our tasks
- Identify the functional pocket for each structure.
- Predict protein function.
70 Inferring biological functions of protein BioH
Protein of unknown function from structural genomics
The candidate binding pocket (CASTp id 35) of BioH (1m33) and a similar functional surface detected from 1p0p (E.C. 3.1.1.8)
The phylogenetic tree of 28 sequences related to BioH. Many are hypothetical genes.
71 BioH
Significant hits predicted for BioH:
- E.C. 3.1.1.8, p = 0.31, cholinesterase
- E.C. 3.4.23.22, p = 0.26, aspartic endopeptidase
- E.C. 3.4.21.7, p = 0.24, serine endopeptidase
- E.C. 3.1.4.17, p = 0.20, phosphodiesterase
Experimental results:
- Carboxylesterase, E.C. 3.1.1.1: high
- Lipase, E.C. 3.1.1.3: low
- Thioesterase, E.C. 3.1.1.5: low
- Aminopeptidase, E.C. 3.4.11.5: low
72 Test data set of structural genomics proteins with known function
- Our dataset: 761 entries.
- Function since discovered: 15; 9 are enzymes.
- Our test set.
- 1B78 STRUCTURAL GENOMICS, PYROPHOSPHATEAS
- 1KS2 STRUCTURAL GENOMICS, ISOMERASE
- 1KTN STRUCTURAL GENOMICS, LYASE
- 1KUT STRUCTURAL GENOMICS, LIGASE
- 1M94 STRUCTURAL GENOMICS, SIGNALING PROTEIN
- 1MZH STRUCTURAL GENOMICS, ALDOLASE
- 1NXU STRUCTURAL GENOMICS, OXIDOREDUCTASE
- 1RLI STRUCTURAL GENOMICS, PROTEIN BINDING
- 1RRZ STRUCTURAL GENOMICS, BIOSYNTHETIC PROTEIN
- 1S7C STRUCTURAL GENOMICS, OXIDOREDUCTASE
- 1SZQ STRUCTURAL GENOMICS, LYASE
- 1T2A STRUCTURAL GENOMICS, LYASE
- 1TIQ STRUCTURAL GENOMICS, TRANSCRIPTION
- 1WU2 STRUCTURAL GENOMICS, BIOSYNTHETIC PROTEIN
- 1XTP STRUCTURAL GENOMICS, TRANSFERASE
73 Result
Answer Sheet
1B78   PYROPHOSPHATEAS   E.C.5.3.1.6    ✓
1KS2   ISOMERASE         E.C.5.3.1.6    ✓
1KTN   LYASE             E.C.4.1.2.4    ✓
1KUT   LIGASE            E.C.6.3.2.6    ✓
1MZH   ALDOLASE          E.C.4.1.2.4    ✓
1NXU   OXIDOREDUCTASE    NA             ✓
1S7C   OXIDOREDUCTASE    E.C.1.2.1.12   ✓
1SZQ   LYASE             E.C.4.2.1.79   ✓
1T2A   LYASE             E.C.4.2.1.47   ✓
1XTP   TRANSFERASE       NA             ✓
74 Future: Comprehensive Structure Based Enzyme Reclassification
- E.C. labels
- E.C. numbers do not directly correlate to function.
- No information about cross-reactivity.
- Can be misleading, as shown by Thornton and others.
- Does not provide key information about active site residues.
- Does not contain evolutionary information of functional site.
- Does not scale up.
- Comprehensive re-classification of all enzymes in PDB database using cross reactivity profiles.
75 Summary
- Delaunay triangulation and protein folding studies.
- Voids and pockets in protein
- Origin.
- Random vs biologically important.
- Finding biologically important voids and pockets
- Sequence, shape and orientation matching.
- Evolutionary pattern of binding pockets
- Continuous Markov process for residue substitution.
- Hypothetical sequences with unknown functions become an important source of information about the evolution of functional sites.
- Bayesian Markov chain Monte Carlo works for residue rates.
- Protein function predictions
- Orphan structures.
→ Acknowledgement!
76 Collaborators
- Zheng Ouyang, Sema Kachalo, Hsiao-Mei Lu
- Jinfeng Zhang (now postdoc at Harvard),
- Andrew Binkowski (now postdoc at Argonne), Jeffrey Tseng
- Ken Dill (UCSF), Rong Chen (UIC), Chao Tang (UCSF)
- NSF CAREER DBI 0133856
- NIH GM68958
- ONR MURI
- Whitaker Foundation
Acknowledgement
Papers www.uic.edu/jliang
77 Identify active sites of enzymes
- Identify canonical surface template of known functional site for homologous structures.
- Binding surface containing annotated key residues from structure with
- Best R-factor, resolution, and pocket has a large number n of residues
- 8 < n < 70
- One protein family can have > 1 canonical template surfaces.
- Different structures may have different surfaces.
- Find homologs to canonical surfaces: alignment of pocket sequences.
- Initially using multiple BLOSUM matrices for aligning pocket surfaces (Binkowski et al, 03).
- Significant improvement with MCMC matrices.
78 Scaling relationship
- Size distribution of voids and pockets
- $n = N_v / \mathrm{Length}$, at different volume v
- $n \propto V^{-\theta} c_0^{V}$; θ depends only on dimension
- Size distribution of voids and pockets
- Protein: θ = 1.67, $c_0$ = 0.9966
- Random spheres: θ = 1, 1.5, and 1.9 for 2, 3, and 4 dimensions
- By volume and area scaling, and by size distribution, proteins are packed more like random spheres.
79 4. How to locate functional sites on protein structures?
- Protein structures with functional annotation
- But no residue information.
- Protein structures without any annotation
- Structural genomics.
80 Current knowledge
- 13,877 structures have E.C. labels.
- But often without knowledge of the active site and key residues.
- Some are mislabeled.
- Some have multiple functions.
- We do not know where the functional sites are.
- A literature search for 13,877 structures is not feasible.
- Structures with pockets containing annotated key residues
- 6,273 PDB structures out of > 30,000 after cleaning-up.
- By Swiss-Prot and PDB records.
(Characteristics of these known functional pockets)
81 Length distribution and composition of functional pockets
- Fig (a). From 6,273 protein active site pockets, 80% have between 8 and 200 a.a.
- The average length: 35 residues.
Fig (b). Compare amino acid composition of functional site pockets with that of JTT protein sequences. Functionally important residues: His (H), Asp (D), Tyr (Y), Trp (W), and Gly (G); Phe (F), Asn (N), and Arg (R).
82 Length ratio of functional pockets and proteins
10 to 80 a.a. vs. full length of 100 to 450 a.a.
83 Volume of functional pocket
The mean volume of functional pockets is 1332.95 Å³. Most functional pockets have a volume of less than 5,000 Å³.
84 Heuristics for initial identification of functional pocket
- Characteristic composition of functional pockets.
- Length ratio
- For a full length protein of 200 a.a., the functional pocket length is likely to be 20-60 a.a.
- Volume
- Mean volume of the functional pocket is 1300 Å³.
(Reduces the amount of expensive computation)
85 Example 3: Large Scale Prediction of Protein Functions
- Enzyme structure database (Thornton)
- Select a set of 100 protein families
- Remove protein-protein interactions
- Candidate pocket has size > 12 residues
- Choose a template surface for each family
- With good R-factor
- Build rate matrix for each template structure.
86 ROC Analysis
- Query with template surface, and collect hits at different thresholds by p-value of cRMSD
- p-value thresholds
- (1.0, 0.1, 0.05, 0.01, 0.005, 0.001, $5 \times 10^{-4}$, $10^{-4}$, $5 \times 10^{-5}$, $10^{-5}$).
- True Positive Rate (Sensitivity)
- TP/(TP+FN)
- False Positive Rate (1-Specificity)
- FP/(TN+FP)
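The ROC points follow directly from these two ratios evaluated at each p-value threshold. The sketch below is a minimal illustration. It assumes a surface counts as a hit when its cRMSD p-value is at or below the threshold, and the function name and the toy data are hypothetical.

```python
THRESHOLDS = (1.0, 0.1, 0.05, 0.01, 0.005, 0.001, 5e-4, 1e-4, 5e-5, 1e-5)

def roc_points(p_values, labels, thresholds=THRESHOLDS):
    """Return (threshold, TPR, FPR) for each p-value threshold.

    p_values: cRMSD p-value of each candidate surface.
    labels:   True if the candidate really has the template's function.
    """
    positives = sum(1 for y in labels if y)
    negatives = len(labels) - positives
    points = []
    for thr in thresholds:
        tp = sum(1 for p, y in zip(p_values, labels) if y and p <= thr)
        fp = sum(1 for p, y in zip(p_values, labels) if not y and p <= thr)
        tpr = tp / positives if positives else 0.0   # TP / (TP + FN)
        fpr = fp / negatives if negatives else 0.0   # FP / (TN + FP)
        points.append((thr, tpr, fpr))
    return points

# Toy data: three true family members and three unrelated surfaces.
p_vals = [1e-6, 2e-4, 0.02, 0.5, 3e-3, 0.2]
truth = [True, True, True, False, False, False]
curve = roc_points(p_vals, truth)
```

Plotting TPR against FPR over all thresholds gives one curve per template. Averaging across families would give a summary comparable to the sensitivity figure quoted for the 100-family benchmark.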
87 ROC Analysis of 100 Protein Families
Ours
Laskowski et al, JMB, 05, 351:614-26
- Close to perfect, by E.C. label
- Random: diagonal. Perfect: upper triangle.
- Average sensitivity > 90% at p = $10^{-3}$.
- Laskowski and Thornton: > 75%, by CATH domain, not function.