1
Voids and pockets in proteins: packing, folding,
evolution, and biological functions

Jie Liang
Dept. of Bioengineering
University of Illinois at Chicago

2
Central Dogma of Molecular Biology

DNA: genetic blueprint of life
Fully sequenced: the human genome and hundreds of
other genomes
Proteins: working molecules

DNA → RNA → Protein
3
Second Central Dogma

Knowledge of protein structures is important.

4
The Universe of Protein Structures

Human genome: 3 billion nucleotides
Number of genes: 20,000-25,000
Protein families: 10,000-30,000
Number of folds: 1,000-4,000
Currently in PDB: < 1,000 folds
Comparative modeling needs a structural template
with sequence identity > 30-35%
e.g. 50% of ORFs and 18% of residues of the S.
cerevisiae genome
Structural genomics: populating each fold with
4-5 structures
One for each superfamily at 30-35% sequence
identity.
Fold of a novel gene can be identified
Its structure can then be interpolated by
comparative modeling.

Main chain folds
Rich information about evolution of protein
structures.
May not directly lead to understanding of
function

(SCOP) All beta proteins
ab proteins Ig like beta sandwich HPr
fold
(from Jaroszewski & Godzik, ISMB 00)
6
Predicting and characterizing protein functions

Important, but challenging tasks
Needs > 60-70% sequence identity.
Fold prediction: > 20-30% sequence identity.
Proteins from structural genomics often are of
unknown functions.
Sequence homologs are often hypothetical
proteins.

(Rost, 02, JMB; Tian & Skolnick, 03, JMB)
7
Functional Voids and Pockets
Ras 21
Fts Z
GDP Binding Pockets
8
Outline

Delaunay triangulation of proteins and protein
folding.
Voids and pockets in proteins
Distribution and scaling properties
Origin of voids and pockets
Biologically important voids and pockets
Assessing similarity: sequence, shape,
orientation, and significance.
Evolutionary pattern of voids and pockets
Bayesian Markov chain Monte Carlo.
Predicting protein functions
Orphan proteins from structural genomics

9
1. Protein folding and Delaunay triangulation

Protein folding problem.
Protein sequences automatically fold to their native
shape.
(Anfinsen, 70s)
Time scale of protein motion.

10

Protein-folding rates from a simple model

What factors determine whether a protein will be
a fast or slow folder?
Native state conformations
(Plaxco and Baker, 97, J Mol Biol)

11

Simple Model (Plaxco and Baker)

Relative Contact Order (CO)
L: protein length
N: number of contacts < 6 Å
$\Delta N_{ij}$: loop length between contacting residues
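A minimal sketch of the relative contact order, assuming the usual Plaxco-Baker form $CO = \frac{1}{L \cdot N} \sum \Delta N_{ij}$ summed over the N contacts. The function name, the use of one representative coordinate per residue (e.g. C-alpha) and the `min_separation` parameter are illustrative assumptions, not taken from the slides.

```python
from math import dist

def relative_contact_order(coords, cutoff=6.0, min_separation=1):
    """Relative contact order from one (x, y, z) point per residue.

    coords: list of 3-D points, one per residue (an assumption; the
    original measure counts atom-atom contacts).
    cutoff: contact distance in angstroms.
    """
    L = len(coords)
    total_separation = 0   # sum of sequence separations of contacting pairs
    n_contacts = 0         # N in the formula
    for i in range(L):
        for j in range(i + min_separation, L):
            if dist(coords[i], coords[j]) < cutoff:
                total_separation += j - i
                n_contacts += 1
    if n_contacts == 0:
        return 0.0
    return total_separation / (L * n_contacts)
```

Low values indicate mostly local contacts; high values indicate many contacts between residues far apart in sequence.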

12
Folding rates of both 2-state and 3-state proteins

Does not work when a large number of proteins are
examined.

(Figure panels: 2-state folders; 3-state folders)
13

Alpha Shape of Protein Structure

Identify contacts by alpha edges.
Not by distance cut-off.

14

(Zheng Ouyang)
Contact order by alpha shape
L: protein length
N: total number of residues with neighbors
n: the number of neighbors of residue i
Other models also work, e.g. the zipper
model of Dill et al.
15
Folding dynamics in lattice model with physical
movement

Lattice 2D HP models
Enumerating sequences and conformations.
Exact thermodynamics.
Exact effects of sequence variation.
Folding dynamics
Exact folding dynamics.

(Lau and Dill, 1989)
(Cieplak et al, 98; Banu and Dill, 00)
(Sema Kachalo, Hsiao-Mei Lu, and Jie Liang, Phys
Rev Lett, 2006, 96:058105.1-4)
16
Sequences and conformations

Chain length 16
802,075 conformations, $2^{16}$ sequences
1,539 HP sequences fold to
unique ground states.
456 structural families
from 1 (low designability)
to 26 (high designability) sequences

17
Physical movement
Allowed moves are physically realizable on 2D
square lattice.
18
Kinetics from master equation
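(Cieplak et al, 98; Banu and Dill, 00)
Master equation
$dP(t)/dt = M P(t)$
General solution
$P(t) = \sum_i c_i \, n_i \, e^{\lambda_i t}$, with coefficients $c_i$ set by P(0),
where $\lambda_i$ is the i-th eigenvalue of matrix M and
$n_i$ the corresponding eigenvector. $\lambda_0 = 0$ and $n_0$ is the
Boltzmann distribution; $\lambda_1$, the smallest non-zero
eigenvalue, is taken as the protein folding rate.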
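The eigenvalue picture above can be sketched on a toy system. This is only an illustration: the four-state chain, the Metropolis rate rule and the function names are assumptions, not the move set or rates used in the study.

```python
import numpy as np

def rate_matrix(energies, neighbors, T=1.0):
    """Build M for dP/dt = M P; M[a, b] is the rate of the move b -> a.

    Rates follow a Metropolis rule (an assumption), which satisfies
    detailed balance with respect to the Boltzmann distribution.
    """
    n = len(energies)
    M = np.zeros((n, n))
    for i, j in neighbors:
        for a, b in ((i, j), (j, i)):
            M[a, b] = min(1.0, np.exp(-(energies[a] - energies[b]) / T))
    M -= np.diag(M.sum(axis=0))   # each column sums to zero
    return M

def folding_rate(M):
    """Return (|lambda_1|, stationary distribution) from the spectrum of M."""
    vals, vecs = np.linalg.eig(M)
    order = np.argsort(-vals.real)            # lambda_0 = 0 comes first
    vals, vecs = vals.real[order], vecs.real[:, order]
    stationary = vecs[:, 0] / vecs[:, 0].sum()
    return -vals[1], stationary
```

For a connected state space the eigenvector of $\lambda_0 = 0$ should reproduce the Boltzmann weights, and the magnitude of $\lambda_1$ sets the slowest relaxation rate.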
19
What can we measure?

Thermodynamic properties
Ground state energy, energy gap.
Heat capacity
Folding temperature (50% of
protein in native state)
Collapse temperature
Collapse cooperativity
Folding kinetics
Folding rates.

(Klimov & Thirumalai, 1998; Chan & Dill, 1993)
20
Folding rates and contact energy

Folding rates
Correlated somewhat with energy
R = -0.84
But wide range of folding rates for sequences of
Same ground state energy.
Same energy gap.
Same fold.

(Figure legend: structures with few sequences, low
designability; structures with many sequences, high
designability; Go model)
21
Folding rate and cooperativity

Folding rates
Correlated somewhat with cooperativity
R = -0.62
Thermodynamics does not define folding rates.

(Figure legend: high designability cluster; Go model)
22
Landscape properties number of local minima

Landscape roughness
Number of local minima
Folding rates
Excellent correlation.
R = -0.92

(Figure legend: high designability cluster)
23
Time Evolution of Conformation Concentration

Matrix exponential for an 85,000 by 85,000 matrix M
Analytical solution of master equation
$P(t) = e^{Mt} P(0)$
Approximate in Krylov subspace
$K_m(Mt, P(0)) = \mathrm{span}\{P(0), (Mt)P(0), \ldots, (Mt)^m P(0)\}$
m small
Solve matrix exponential for the small dense
matrix of size m
Padé rational functions.
10 million time steps.

(Stochastic roadmap: selecting conformations.
Latombe, Kavraki, Amato)
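A hedged sketch of the Krylov idea for $P(t) = e^{Mt} P(0)$: an Arnoldi iteration projects M onto a small subspace, and only the small projected matrix is exponentiated. Function names are illustrative, and the small dense exponential uses a scaled Taylor series as a simple stand-in for the Padé approximant named on the slide.

```python
import numpy as np

def small_expm(A, squarings=10, terms=12):
    """Dense exp(A) by scaling and squaring with a Taylor series
    (a stand-in for Pade rational functions)."""
    A = np.asarray(A, dtype=float) / (2 ** squarings)
    result = np.eye(A.shape[0])
    term = np.eye(A.shape[0])
    for k in range(1, terms + 1):
        term = term @ A / k
        result = result + term
    for _ in range(squarings):
        result = result @ result
    return result

def krylov_expm_multiply(M, p0, t, m=10):
    """Approximate exp(M t) @ p0 in the Krylov subspace built from p0."""
    n = len(p0)
    m = min(m, n)
    beta = np.linalg.norm(p0)
    V = np.zeros((n, m))          # orthonormal basis of the subspace
    H = np.zeros((m, m))          # projection of M onto the subspace
    V[:, 0] = p0 / beta
    for j in range(m):
        w = M @ V[:, j]
        for i in range(j + 1):    # modified Gram-Schmidt
            H[i, j] = V[:, i] @ w
            w = w - H[i, j] * V[:, i]
        if j + 1 < m:
            h = np.linalg.norm(w)
            if h < 1e-12:         # subspace is invariant: stop early
                m = j + 1
                V, H = V[:, :m], H[:m, :m]
                break
            H[j + 1, j] = h
            V[:, j + 1] = w / h
    e1 = np.zeros(m)
    e1[0] = 1.0
    return beta * V @ (small_expm(t * H) @ e1)
```

For a large sparse M only the products `M @ v` touch the full matrix, which is what makes the approach practical at the sizes quoted above.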
24
Transient states
Transiently accumulating states demonstrate
intermediate-like behavior during the time
evolution. Many local minima are transiently
accumulating; however, they are not obligatory
steps along the folding pathway.
(9 orders of time scale conformations of local
minima)
25
Results from our study

Dramatically different folding rates.
Even for sequences that fold to the same native
structure.
Natural proteins are biased by evolution.
(Scalley-Kim and Baker, 2004)
Folding rates
Not determined by native structure properties,
Weakly related to length and thermodynamic
properties
Ground state energy, energy gap, and collapse
cooperativity.
Sequences of same thermodynamics can have very
different folding rates.
Kinematic energy landscape and folding kinetics.
Roughness of such landscape is strongly
correlated with folding rate.

(Sema Kachalo, Hsiao-Mei Lu, and Jie Liang, Phys
Rev Lett, 2006, 96:058105.1-4)
26
2. Voids and pockets in proteins: Computation
Shape library
(Mucke and Edelsbrunner, ACM Trans. Graphics,
1994; Edelsbrunner, Disc Comput Geom, 1995;
Edelsbrunner, Facello, and Liang, Discrete
Applied Math, 1998)
(Binkowski, Adamian, and Liang, J.
Mol. Biol. 332:505-526, 2003)
27
Voids and Pockets in Soluble Proteins

Protein interior is solid-like, tightly packed
like a jig-saw puzzle
High packing density (Richards, 1977)
Low compressibility (Gavish, Gratton, and Hardy,
1983)
Many voids and pockets.
Large enough for at least 1 water molecule: 15 per 100 residues.

(Liang & Dill, 2001, Biophys J)
28
Scaling relationship

Volume and area scaling
$V = 4 \pi r^3 / 3$ and $A = 4 \pi r^2$, therefore we
should have
$V \propto A^{3/2}$
Proteins have linear scaling ($V \propto A$)
Clustered random sphere with mixed radii (Lorenz
et al, 1993).
Lattice models of simple clusters (Stauffer, 1985)

At percolation threshold, V and R of a cluster of
random spheres
$V \propto R^D$, where $D = 2.5$ (Stauffer, 1983; Lorenz et
al., 1993)
$R = \sum_{j=1}^{d} (x_{j,\max} - x_{j,\min}) / 2d$
Proteins
$\ln V$ vs. $\ln R$: $D = 2.47 \pm 0.04$ (by
nonlinear curve fitting).
Similar to random spheres near percolation
threshold.

By volume-area and volume-size scaling, proteins
are packed more like random spheres than solids.
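A minimal sketch of estimating the exponent D in $V \propto R^D$. It uses ordinary least squares on $\ln V$ versus $\ln R$, which is a simpler substitute for the nonlinear curve fitting mentioned above; the function name and the input lists are hypothetical.

```python
import math

def fit_power_law(sizes, volumes):
    """Fit V = c * R**D by linear regression in log-log space.

    sizes:   characteristic sizes R of each protein (positive numbers)
    volumes: corresponding volumes V (positive numbers)
    Returns (D, c).
    """
    xs = [math.log(r) for r in sizes]
    ys = [math.log(v) for v in volumes]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    D = sxy / sxx                        # slope is the scaling exponent
    c = math.exp(mean_y - D * mean_x)    # intercept gives the prefactor
    return D, c
```

A fitted D near 3 would indicate solid-like packing, while a value near 2.5 matches random spheres at the percolation threshold.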
30
Simulating Protein Packing with Off-Lattice Chain
Polymers

32-state off-lattice discrete model
Sequential Monte Carlo and resampling
1,000s of conformations of N = 2,000

(Zhang, Chen, Tang and Liang, 2003, J. Chem.
Phys.)
31

Proteins are not optimized by evolution to
eliminate voids.
Protein packing is dictated by a generic compactness
constraint related to $n_c$.

32
Surfaces with unknown functional roles
http://cast.engr.uic.edu
33
3. How to identify biologically important
pockets and voids from random ones?

Assessing Local Sequence and Shape Similarity

(Binkowski, Adamian, Liang, 2003, JMB,
332:505-526)
34
Binding Site Pocket: Sparse Residues, Long Gaps

ATP binding: cAMP Dependent Protein Kinase (1cdk),
Tyr Protein Kinase c-src (2src)

1cdk.A, residues 49-327:
LGTGSFGRVMLVKHKETGNHFAMKILDKQKVVKLKQIEH
TLNEKRILQAVNFPFLVKLEYSFKDNSNLYMVMEYVPGGEMFSHLRRIGR
FSEPHARFYAAQIVLTFEYLHSLDLIYRDLKPENLLIDQQGYIQVTDFGF
AKRVKGRTWTLCGTPEYLAPEIILSKGYNKAVDWWALGVLIYEMAAGYPP
FFADQPIQIYEKIVSGKVRFPSHFSSDLKDLLRNLLQVDLTKRFGNLKDG
VNDIKNHKWFATTDWIAIYQRKVEAPFIPKFKGPGDTSNF
1cdk.A_p, pocket residues only (from 49):
LGTGSFGRV A K V MEYV E K EN L TD F

2src.m, residues 273-404:
LGQGCFGEVWMGTWNGTTRVAIKTLKPGTMSPEAF
LQEAQVMKKLRHEKLVQLYAVVSEEPIYIVTEYMSKGSLLDFLKGETG
KYLRLPQLVDMAAQIASGMAYVERMNYVHRDLRAANILVGENLVCKVAD
2src.m_p, pocket residues only (from 273):
LGQGCFGEV A K V TEYM GS D D R AN L AD
Low overall sequence identity: 13%
35
High Sequence Similarity of Pocket Residues
1cdk cAMP Dependent Protein Kinase
2src Tyr Protein Kinase c-src

1cdk.A LGTGSFGRVAKVMEYV---EKENLTDF 24
2src.m LGQGCFGEVAKVTEYMGSDDRANLAD- 26
..

High sequence identity: 51%
36
Sequence Similarity of Surface Pockets

Similarity detection
Dynamic programming: SSEARCH (Pearson, 1998)
BLOSUM50 scoring matrix (Henikoff, 1994).
Not identity.
Order-dependent sequence pattern.
Statistics of null model
Gapless local alignment: extreme value
distribution
(Altschul & Karlin, 90)
Alignment with gaps (Altschul, Bundschuh,
Olsen & Hwa, 01)

Statistical significance!
37
Approximation with EVD distribution (Pearson,
1998, JMB)

Kolmogorov-Smirnov Test
Estimate K and $\lambda$ parameters.
Estimation of E-value
Estimate p value of observed Smith-Waterman score
by EVD.
Estimate E-value

(Binkowski, Adamian, Liang, 2003, JMB,
332:505-526)
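A minimal sketch of the two estimation steps, assuming the standard Karlin-Altschul extreme value form $p = 1 - \exp(-K m n e^{-\lambda S})$ for two sequences of lengths m and n, and an E-value obtained by multiplying p by the number of comparisons in the search. The function names and the choice of `n_comparisons` are illustrative assumptions.

```python
import math

def evd_p_value(score, K, lam, m, n):
    """P(best chance score >= observed score) under the extreme value
    distribution with fitted parameters K and lambda."""
    expected = K * m * n * math.exp(-lam * score)
    return -math.expm1(-expected)      # 1 - exp(-expected), numerically stable

def database_e_value(p_value, n_comparisons):
    """Expected number of chance hits at this p-value across a search
    of n_comparisons pocket sequences."""
    return p_value * n_comparisons
```

Because pocket sequences are short, K and $\lambda$ would need to be fitted on pocket-sized random sequences rather than taken from full-length protein statistics.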
38
Shape Similarity Measure

cRMSD (coordinate root mean square distance)
oRMSD (Orientational RMSD)
Place a unit sphere $S^2$ at the center of mass $x_0 \in \mathbb{R}^3$
Map each residue $x \in \mathbb{R}^3$ to a unit vector on $S^2$
$f: x = (x, y, z)^T \mapsto u = (x - x_0) / \|x - x_0\|$
Measuring RMSD between two sets of unit vectors.

(cf. uRMSD by Kedem and Chew, 2002)
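A hedged sketch of an orientational RMSD along the lines described above: residues are mapped to unit vectors about the pocket center, the two sets are superimposed by an optimal rotation (the standard Kabsch SVD method, an assumption about the superposition step), and the RMSD of the unit vectors is reported. The unweighted centroid stands in for the center of mass, and residues are assumed to be already paired in order.

```python
import numpy as np

def unit_vectors(coords):
    """Map each residue position to a unit vector from the centroid x0."""
    x = np.asarray(coords, dtype=float)
    d = x - x.mean(axis=0)
    return d / np.linalg.norm(d, axis=1, keepdims=True)

def ormsd(coords_a, coords_b):
    """Orientational RMSD between two equally sized, paired residue sets."""
    U = unit_vectors(coords_a)
    W = unit_vectors(coords_b)
    # Kabsch: rotation R minimizing sum |R u_i - w_i|^2
    A, _, Bt = np.linalg.svd(U.T @ W)
    d = 1.0 if np.linalg.det(Bt.T @ A.T) >= 0 else -1.0   # avoid reflections
    R = Bt.T @ np.diag([1.0, 1.0, d]) @ A.T
    diff = U @ R.T - W
    return float(np.sqrt((diff ** 2).sum(axis=1).mean()))
```

Since only directions are compared, two pockets that differ by a uniform expansion give an oRMSD of zero, which cRMSD would not.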
39
Statistical Significance of Shape Similarity

Estimate the probability p of obtaining a
specific cRMSD or oRMSD value for random pockets
with Nres
EVD and other parametric distributions not
accurate.
Randomly select 2 pockets.
Calculate cRMSD for Nres randomly selected
residues
Also calculate oRMSD

(Binkowski, Adamian, Liang, 2003, JMB,
332:505-526)
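The sampling procedure above can be sketched as an empirical null distribution. This is an illustration only: `distance_fn` stands for any cRMSD or oRMSD routine, pockets are plain lists of residues, and the add-one correction in the p-value is a common convention rather than something stated on the slide.

```python
import random

def sample_null_distances(pockets, n_res, distance_fn, trials=1000, seed=0):
    """Distances between random n_res-residue subsets of random pocket pairs."""
    rng = random.Random(seed)
    eligible = [p for p in pockets if len(p) >= n_res]
    values = []
    for _ in range(trials):
        a, b = rng.sample(eligible, 2)            # randomly select 2 pockets
        values.append(distance_fn(rng.sample(a, n_res), rng.sample(b, n_res)))
    return values

def empirical_p_value(observed, null_values):
    """Fraction of random distances at least as small as the observed one."""
    count = sum(1 for v in null_values if v <= observed)
    return (count + 1) / (len(null_values) + 1)
```

A separate null distribution would be built for each value of Nres, since smaller residue sets match well by chance more often.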
40
Surprising Surface Similarity

Conserved residues both important in polypeptide
binding
Both pockets undergo conformational changes upon
binding

41
4. Evolutionary pattern of voids and pockets
(Tseng and Liang, 2006, Mol Biol Evo, 23:421)
42
Scoring Matrix

A scoring matrix is critical
Determines similarity between residues and hence
statistical significance.
Derived from evolutionary history of proteins
sharing the same function.
Existing approach
PAM and BLOSUM heuristics.
Position specific weight matrix.
Entropy/relative entropy for full proteins or
domains.
Our approach
Evolution Explicit phylogenetic tree.
Model Continuous time Markov process.
Geometry Evolution of only residues located in
the binding region.
Bayesian Markov chain Monte Carlo.

43
Model: continuous-time Markov process for
substitution
20 × 20 rate matrix Q for the instantaneous
substitution rates of the 20 amino acid
residues

Transition probability matrix can be derived
from Q

(Felsenstein, 1983; Yang, 1994; Whelan and
Goldman, 2000; Tseng and Liang, 2004)
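For a continuous-time Markov process the transition probabilities follow from the rate matrix as $P(t) = e^{Qt}$. A minimal sketch on a toy three-state rate matrix (the real Q is 20 × 20); the function name and the scaled Taylor series used for the exponential are illustrative choices, and rows of Q are assumed to sum to zero.

```python
import numpy as np

def transition_matrix(Q, t, squarings=10, terms=12):
    """P(t) = exp(Q t); entry [i, j] is the probability that residue i
    has been replaced by residue j after time t."""
    A = np.asarray(Q, dtype=float) * t / (2 ** squarings)
    P = np.eye(A.shape[0])
    term = np.eye(A.shape[0])
    for k in range(1, terms + 1):     # Taylor series of the scaled matrix
        term = term @ A / k
        P = P + term
    for _ in range(squarings):        # undo the scaling by repeated squaring
        P = P @ P
    return P
```

Each row of P(t) is a probability distribution, and P(s + t) = P(s) P(t), which is what lets branch lengths on a tree be combined.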
44
Likelihood function of a given phylogeny

Given a set of multiple-aligned sequences
$S = (x_1, x_2, \ldots, x_s)$ and a phylogenetic tree $T = (V, E)$,
A column $x_h$ at position h is represented as

The Likelihood function of observing these
sequences is

45
Estimation of instantaneous rates Q

Posterior probability of rate matrix given the
sequences and tree

Bayesian estimation of posterior mean of rates
in Q

$E_\pi(Q) = \int Q \, \pi(Q \mid S, T) \, dQ$,

Estimated by Markov chain Monte Carlo.

46
Markov chain Monte Carlo

Proposal function

Detailed balance: samples the target distribution
after convergence.

Metropolis-Hastings Algorithm

Collect data from m accepted samples

$E_\pi(Q) \approx \sum_{i=1}^{m} Q_i / m \approx \int Q \, \pi(Q \mid S, T) \, dQ$.
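A minimal Metropolis-Hastings sketch of the posterior-mean estimate above, reduced to a single rate parameter. The target density, the Gaussian random-walk proposal and all names are illustrative assumptions; the study's sampler works on the full rate matrix with its own move set.

```python
import math
import random

def metropolis_hastings(log_target, x0, n_samples, step=1.0, burn_in=1000, seed=0):
    """Random-walk Metropolis sampler for a one-dimensional target."""
    rng = random.Random(seed)
    x, lp = x0, log_target(x0)
    samples = []
    for it in range(burn_in + n_samples):
        y = x + rng.gauss(0.0, step)              # symmetric proposal
        lq = log_target(y)
        if rng.random() < math.exp(min(0.0, lq - lp)):
            x, lp = y, lq                         # accept the move
        if it >= burn_in:
            samples.append(x)                     # current state counts either way
    return samples

def posterior_mean(samples):
    """Monte Carlo estimate of the posterior mean: sum(Q_i) / m."""
    return sum(samples) / len(samples)

def log_target_gamma(q):
    """Toy unnormalized log posterior for a rate q > 0: Gamma(shape 3, rate 2)."""
    return 2.0 * math.log(q) - 2.0 * q if q > 0 else -math.inf
```

For example, `posterior_mean(metropolis_hastings(log_target_gamma, 1.0, 20000))` should land close to the exact mean 1.5 of the toy target.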
47
Move Set

Two types of moves s1, s2

Individual moves s1

Transition matrix between two
types of moves

Block moves s2

Acceptance ratio
Individual moves: 50-66%
Block moves: < 10%

48
Validation by simulation

Generate 16 artificial sequences from a known
tree and known rates (JTT model)
Carboxypeptidase A2 precursor as ancestor, length
147
Goal: recovering the substitution rates

Phylogenetic tree used to generate 16 sequences
Estimations from two initial conditions are very
similar to the true values of residue
substitution rates.
Convergence of the Markov chain
49
Quantifying estimation error
(Mayrose et al, 2004, Mol Biol Evo)

Relative contribution
Weighted error in contribution
Weighted mean square error (MSE )

50
Accurate Estimation with > 20 residues and random
initial values
Accurate when > 20 residues in length.
Distribution of MSE of estimated rates starting
from 50 sets of random initial values. All MSE
< 0.00075.
51
Evolutionary rates of binding sites and other
regions are different

Residues on protein functional surface experience
different selection pressure.
Estimated substitution rate matrices of amylase
Functional surface residues.
The remaining surface,
The interior residues
All surface residues.

$S_{ij}$: (i, j) are residues shown in the same column
of the MSA, defined as sampled pairs; $S_{ij}$ are
estimated by Bayesian MCMC
52
Rate matrix Q and residue similarity score

Transforming rate matrix Q to Altschul-style
similarity scoring matrix B

where $m_{ij}(t)$ is the joint probability of
observing both residue type i and j at the two
nodes separated by time t and $\lambda$ is a scalar.
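The transformation formula itself did not survive on this slide. A sketch under the assumption that it is the usual Altschul-style log-odds score, $b_{ij}(t) = \frac{1}{\lambda} \log \frac{m_{ij}(t)}{\pi_i \pi_j}$ with $m_{ij}(t) = \pi_i \, p_{ij}(t)$, where $\pi$ is the equilibrium residue composition and $p_{ij}(t)$ the transition probability derived from Q. Function and argument names are hypothetical.

```python
import math

def similarity_scores(P, pi, lam=1.0):
    """Log-odds scoring matrix from transition probabilities.

    P:   transition probability matrix at time t (list of rows)
    pi:  equilibrium frequencies of the residue types
    lam: scaling constant lambda
    """
    n = len(pi)
    B = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            joint = pi[i] * P[i][j]               # m_ij(t)
            B[i][j] = math.log(joint / (pi[i] * pi[j])) / lam
    return B
```

Pairs seen together more often than chance get positive scores and rarely exchanged pairs get negative ones; for a reversible process the matrix is symmetric.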

Property of scoring function

53
5. Improved functional prediction
54
Example 1: Finding alpha amylase by matching
pocket surfaces

Challenging
Amylases often have low overall sequence
identity (< 25%).

1bag, pocket 60 B. subtilis
14 sequences, none with structures, 2 are
hypothetical

1bg9 Barley
9 sequences, none with structures.

55
Criteria for declaring similar functional surface
to a matched surface

Search > 2 million surfaces with a template
surface.
Shapes have to be very similar
p-value for cRMSD < $10^{-3}$.
Customized scoring matrices of 300 different time
intervals.
The most similar surface has $n_{max}$ matrices
capable of finding this homologous surface.
Declare a hit if > 1/3 $n_{max}$ of matrices give
positive results.

56
Probability of enzyme functional class
Probability of query protein belonging to enzyme
class i: $P(i) = \sum_t \mathrm{E.C.}_i(t) / N_t$
$\mathrm{E.C.}_i(t)$: number of PDB hits belonging to enzyme class
i using the matrix of time distance t.
$N_t$: total number of PDB hits with an E.C. number using the
matrix of time distance t.
Number of hits and probability of
functional class.
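A small sketch of the class probability, under one reading of the garbled formula: hits are pooled over all time-distance matrices, so that $P(i) = \sum_t \mathrm{E.C.}_i(t) / \sum_t N_t$ and the probabilities sum to one. That pooling, the function name and the input layout are assumptions.

```python
from collections import Counter

def class_probabilities(hits_by_time):
    """hits_by_time maps each time distance t to the list of E.C. labels
    of the PDB hits found with the scoring matrix for t."""
    counts = Counter()
    total = 0
    for labels in hits_by_time.values():
        counts.update(labels)      # E.C._i(t) summed over t
        total += len(labels)       # N_t summed over t
    if total == 0:
        return {}
    return {ec: c / total for ec, c in counts.items()}
```

A query whose hits fall into several classes gets a profile of probabilities rather than a single label, which is what the cross-reaction profile relies on.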
57
Results for Amylase
Query B. subtilis Barley 1bag 1bg9

1bag found 58 PDB structures.
1bg9 found 48 PDB structures.
Altogether 69
All belong to amylase (EC 3.2.1.1)
Comparison
Annotated enzyme structure database (Thornton): 75.

Hits human 1b2y 1u2y 22 23
58
Cross-reaction profile

E.C. 3.2.1.1: Glycosidases, i.e. enzymes
hydrolyzing O- and S-glycosyl -> Alpha-amylase
Reaction: Endohydrolysis of 1,4-alpha-glucosidic
linkages in oligosaccharides and polysaccharides.
Acts on starch, glycogen and related
polysaccharides and oligosaccharides in a random
manner
E.C. 2.4.1.19: Glycosyltransferases
-> Hexosyltransferases
-> Cyclomaltodextrin glucanotransferase,
also known as
cyclodextrin glycosyltransferase
Reaction: Degrades starch to cyclodextrins by
formation of a 1,4-alpha-D-glucosidic bond

Cross-reactivity
59
Comparison with others

Benchmark data
Enzyme Structure Database (ESD): 75 structures
template   our method, our matrix   our method, JTT   psi-blast
1bag       58                       52                45
1bg9       48                       8                 21
Psi-blast does not contain information about
which surface region, active residues, and
geometry
Contains many uninterpretable false positives.
We do better!
Ssearch: 32 structures found.

60
Example 2: UCSF Structure-Function Linkage
Database

A human-curated database
Links evolutionarily related sequences and
structures of enzymes to their chemical
reactions.
Correlates conserved active site residues with
specific partial reactions that all members of a
superfamily perform.
Gold standard.
http://sfld.rbvi.ucsf.edu,
Patricia Babbitt

61
SFLD protein families

Generate canonical templates for the four
families with > 7 structures

62
Our results

For all 4 entries in SFLD with > 7 structures
template   Us   SFLD   ESD
1qh9       8    8      8
2ada       23   17     23
1ebh       22   20     22
1kw9       18   16     18

Psi-blast: many false positives, and no
information about functional site
63
Example 3: Large Scale Prediction of Protein
Functions
Laskowski, Thornton, JMB, 05, 351:614-26
EBI: CATH domain, not function

110 protein families
Each point on the curve corresponds to p-values
of various cRMSD cutoffs
Accuracy: 92% (EBI: 75%)

Helmer-Citterich, M et al (BMC Bioinformat.
2005); Russell RB (JMB 2003); Sternberg MJ;
Skolnick, J; Lichtarge, O (JMB 2003); Ben-Tal, N and
Pupko, T (ConSurf)
64
Large protein family: Canonical functional
surfaces of amylase

Example three canonical binding surfaces for
amylase
Can find all amylases in ESD.

65
Large protein family: Serine/Threonine protein
kinase

EC 2.7.1.37 Protein kinase
Glycogen synthase A kinase.
Hydroxyalkyl-protein kinase.
Phosphorylase B kinase kinase.
243 PDB entries
(mixed true positive false positive )

66
Canonical representatives of Ser/Thr Kinases
active sites

1kv2: 42 true positives
Average length 50, Average Area 1193.15, Average Volume 1999.11
Nonpolar 0.44, Aromatic 0.08, Polar 0.19, Neg Charged 0.12, Pos Charged 0.16
1bmk: 54 true positives
Average length 45, Average Area 1070.20, Average Volume 1781.42
Nonpolar 0.43, Aromatic 0.09, Polar 0.19, Neg Charged 0.14, Pos Charged 0.15
1ol7: 115 true positives
Average length 41, Average Area 951.40, Average Volume 1525.28
Nonpolar 0.42, Aromatic 0.11, Polar 0.18, Neg Charged 0.16, Pos Charged 0.13
1erk: 126 true positives
Average length 38, Average Area 866.58, Average Volume 1389.57
Nonpolar 0.41, Aromatic 0.10, Polar 0.18, Neg Charged 0.17, Pos Charged 0.14
1blx: 64 true positives
Average length 33, Average Area 744.93, Average Volume 1136.06
Nonpolar 0.41, Aromatic 0.10, Polar 0.16, Neg Charged 0.17, Pos Charged 0.17

67
Canonical representatives
68
Library of functional surfaces of enzymes

With about 3,000 template surfaces, we can build
with confidence a library of 10,000 active site
surfaces for 13,877 structures.
Including protein-protein interactions
Validated by sensitivity and specificity.
Often can find all of the binding surfaces of the
same function and nothing else.

69
Orphan protein structures without functional
annotation

Orphan proteins from Structural Genomics.
No known functions.
Often sequences homologs are hypothetical
proteins.
Our tasks
Identify the functional pocket for each
structure.
Predict protein function.

70
Inferring biological functions of protein BioH
Protein of unknown functions from
structural genomics
The candidate binding pocket (CASTp id35) of
BioH (1m33) and a similar functional surface
detected from (1p0p, E.C. 3.1.1.8)
The phylogenetic tree of 28 sequences related to
BioH. Many are hypothetical genes.
71
BioH
Significant hits predicted for BioH
E.C. 3.1.1.8, p = 0.31, cholinesterase
E.C. 3.4.23.22, p = 0.26, aspartic endopeptidase
E.C. 3.4.21.7, p = 0.24, serine endopeptidase
E.C. 3.1.4.17, p = 0.20, phosphodiesterase
Experimental results
Carboxyesterase E.C. 3.1.1.1: high
Lipase E.C. 3.1.1.3: low
Thioesterase E.C. 3.1.1.5: low
Aminopeptidase E.C. 3.4.11.5: low
72
Test data set of structural genomics proteins
with known function

Our dataset 761 entries.
Function since discovered: 15, of which 9 are enzymes.
Our test set.

1B78 STRUCTURAL GENOMICS, PYROPHOSPHATEAS
1KS2 STRUCTURAL GENOMICS, ISOMERASE
1KTN STRUCTURAL GENOMICS, LYASE
1KUT STRUCTURAL GENOMICS, LIGASE
1M94 STRUCTURAL GENOMICS, SIGNALING PROTEIN
1MZH STRUCTURAL GENOMICS, ALDOLASE
1NXU STRUCTURAL GENOMICS, OXIDOREDUCTASE
1RLI STRUCTURAL GENOMICS, PROTEIN BINDING
1RRZ STRUCTURAL GENOMICS, BIOSYNTHETIC PROTEIN
1S7C STRUCTURAL GENOMICS, OXIDOREDUCTASE
1SZQ STRUCTURAL GENOMICS, LYASE
1T2A STRUCTURAL GENOMICS, LYASE
1TIQ STRUCTURAL GENOMICS, TRANSCRIPTION
1WU2 STRUCTURAL GENOMICS, BIOSYNTHETIC PROTEIN
1XTP STRUCTURAL GENOMICS, TRANSFERASE

73
Result
Answer Sheet
1B78 PYROPHOSPHATEAS E.C.5.3.1.6 v
1KS2 ISOMERASE E.C.5.3.1.6 v
1KTN LYASE E.C.4.1.2.4 v
1KUT LIGASE E.C.6.3.2.6 v
1MZH ALDOLASE E.C.4.1.2.4 v
1NXU OXIDOREDUCTASE NA v
1S7C OXIDOREDUCTASE E.C.1.2.1.12 v
1SZQ LYASE E.C.4.2.1.79 v
1T2A LYASE E.C.4.2.1.47 v
1XTP TRANSFERASE NA v
74
Future: Comprehensive Structure-Based Enzyme
Reclassification

E.C. labels
E.C. numbers do not directly correlate to
function.
No information about cross-reactivity.
Can be misleading, as shown by Thornton and
others.
Does not provide key information about active
site residues.
Does not contain evolutionary information of
functional site.
Does not scale up.
Comprehensive re-classification of all enzymes in
PDB database using cross reactivity profiles.

75
Summary

Delaunay triangulation and protein folding
studies.
Voids and pockets in protein
Origin.
Random vs biologically important.
Finding biologically important voids and pockets
Sequence, shape and orientation matching.
Evolutionary pattern of binding pockets
Continuous Markov process for residue
substitution.
Hypothetical sequences with unknown functions
become an important source of information about
the evolution of functional sites.
Bayesian Markov chain Monte Carlo works for
residue rates.
Protein function predictions
Orphan structures.

Acknowledgement!
76
Collaborators

Zheng Ouyang, Sema Kachalo, Hsiao-Mei Lu
Jinfeng Zhang (now postdoc at Harvard),
Andrew Binkowski (now postdoc at Argonne),
Jeffrey Tseng
Ken Dill (UCSF), Rong Chen (UIC), Chao Tang
(UCSF)
NSF CAREER DBI 0133856
NIH GM68958
ONR MURI
Whitaker Foundation

Acknowledgement
Papers: www.uic.edu/jliang
77
Identify active sites of enzymes

Identify canonical surface template of known
functional site for homologous structures.
Binding surface containing annotated key residues
from structure with
Best R-factor, resolution, and pocket has a large
number n of residues
8 < n < 70
One protein family can have > 1 canonical template
surfaces.
Different structures may have different surfaces.
Find homologs to canonical surfaces: alignment of
pocket sequences.
Initially using multiple BLOSUM matrices for
aligning pocket surfaces (Binkowski et al, 03).
Significant improvement with MCMC matrices.

78
Scaling relationship

Size distribution of voids and pockets
$n = N_v / \mathrm{Length}$, at different volume $v$
$n \propto V^{-\theta} c_0^{V}$; $\theta$ depends only on dimension
Protein: $\theta = 1.67$, $c_0 = 0.9966$
Random spheres: $\theta$ = 1, 1.5, and 1.9 for 2, 3, and
4 dimension
By volume and area scaling, and by size
distribution, proteins are packed more like
random spheres.

79
4. How to locate functional sites on protein
structures?

Protein structures with functional annotation
But no residue information.
Protein structures without any annotation
Structural genomics.

80
Current knowledge

13,877 structures have E.C. labels.
But often without knowledge of the active site
and key residues.
Some are mislabeled.
Some have multiple functions.
We do not know where the functional sites are.
13,877 literature searches are not feasible.
Structures with pockets containing annotated key
residues
6,273 PDB structures out of > 30,000 after
cleaning-up.
By Swiss-Prot and PDB records.

(Characteristics of these known functional
pockets)
81
Length distribution and composition of functional
pockets
Fig (a). From 6,273 protein active site pockets,
80% have between 8 and 200 a.a.
The average length is 35 residues.

Fig (b). Compares the amino acid composition of
functional site pockets with that of JTT protein
sequences. Functionally important residues:
His (H), Asp (D), Tyr (Y), Trp (W) and Gly
(G); Phe (F), Asn (N), and Arg (R).

82
Length ratio of functional pockets and
proteins
10 to 80 a.a vs. full length of 100 to 450 a.a.

83
Volume of Functional pocket
The mean volume of functional pockets is 1332.95 Å^3.
Most functional pockets have a volume of less
than 5,000 Å^3
84
Heuristics for initial identification of
functional pocket

Characteristic composition of functional pockets.
Length ratio
For a full-length protein of 200 a.a., the
functional pocket length is likely to be 20-60
a.a.
Volume
Mean volume of the functional pocket is about 1,300 Å^3.

(Reduces the amount of expensive computation)
85
Example 3: Large Scale Prediction of Protein
Functions

Enzyme structure database (Thornton)
Select a set of 100 protein families
Remove protein-protein interactions
Candidate pocket has size > 12 residues
Choose a template surface for each family
With good R-factor
Build rate matrix for each template structure.

86
ROC Analysis

Query with template surface, and collect hits at
different threshold by p-value of cRMSD
p-value thresholds
(1.0, 0.1, 0.05, 0.01, 0.005, 0.001, $5 \times 10^{-4}$,
$10^{-4}$, $5 \times 10^{-5}$, $10^{-5}$).
True Positive Rate (Sensitivity)
TP/(TP+FN)
False Positive Rate (1-Specificity)
FP/(TN+FP)
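A minimal sketch of collecting ROC points over the p-value thresholds above, using the two rate definitions just given. The function name and input layout (one cRMSD p-value and one true/false family label per candidate surface) are illustrative assumptions.

```python
def roc_points(p_values, is_true, thresholds):
    """Return one (false positive rate, true positive rate) pair per threshold.

    p_values: cRMSD p-value of each candidate surface against the template
    is_true:  whether each candidate really belongs to the template's family
    """
    points = []
    for thr in thresholds:
        tp = fp = tn = fn = 0
        for p, truth in zip(p_values, is_true):
            predicted = p <= thr                 # called a hit at this threshold
            if predicted and truth:
                tp += 1
            elif predicted and not truth:
                fp += 1
            elif truth:
                fn += 1
            else:
                tn += 1
        tpr = tp / (tp + fn) if (tp + fn) else 0.0
        fpr = fp / (tn + fp) if (tn + fp) else 0.0
        points.append((fpr, tpr))
    return points
```

Plotting the pairs from the loosest to the strictest threshold traces the ROC curve for one template; averaging over templates gives a family-level curve.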

87
ROC Analysis of 100 Protein Families
Ours
Laskowski et al, JMB, 05, 351:614-26