1
Voids and pockets in proteins: packing, folding,
evolution, and biological functions
  • Jie Liang
  • Dept. of Bioengineering
  • University of Illinois at Chicago

2
Central Dogma of Molecular Biology
  • DNA: Genetic blueprint of life
  • Fully sequenced: Human genome and hundreds of
    other genomes
  • Proteins: Working molecules

DNA → RNA → Protein
3
Second Central Dogma
  • Knowledge of protein structures is important.

4
The Universe of Protein Structures
  • Human genome: 3 billion nucleotides
  • Number of genes: 20,000-25,000
  • Protein families: 10,000-30,000
  • Number of folds: 1,000-4,000
  • Currently in PDB: < 1,000 folds
  • Comparative modeling needs a structural template
    with sequence identity > 30-35%
  • e.g. 50% of ORFs and 18% of residues of the S.
    cerevisiae genome
  • Structural Genomics: populating each fold with
    4-5 structures
  • One for each superfamily at 30-35% sequence
    identity.
  • Fold of a novel gene can be identified
  • Its structure can then be interpolated by
    comparative modeling.

5
  • Main chain folds
  • Rich information about evolution of protein
    structures.
  • May not directly lead to understanding of
    function

(SCOP) All beta proteins; αβ proteins; Ig-like beta sandwich; HPr fold
(from Jaroszewski & Godzik, ISMB 00)
6
Predicting and characterizing protein functions
  • Important, but challenging tasks
  • Needs > 60-70% sequence identity.
  • Fold prediction: > 20-30% sequence identity.
  • Proteins from structural genomics often are of
    unknown functions.
  • Sequence homologs are often hypothetical
    proteins.

(Rost, 02, JMB; Tian & Skolnick, 03, JMB)
7
Functional Voids and Pockets
Ras p21
FtsZ
GDP Binding Pockets
8
Outline
  • Delaunay triangulation of proteins and protein
    folding.
  • Voids and pockets in proteins
  • Distribution and scaling properties
  • Origin of voids and pockets
  • Biologically important voids and pockets
  • Assessing similarity: sequence, shape,
    orientation, and significance.
  • Evolutionary pattern of voids and pockets
  • Bayesian Markov chain Monte Carlo.
  • Predicting protein functions
  • Orphan proteins from structural genomics

9
1. Protein folding and Delaunay triangulation
  • Protein folding problem.
  • A protein sequence folds spontaneously to its native
    shape.
  • (Anfinsen, 1970s)
  • Time scale of protein motion.

10

Protein-folding rates from a simple model
  • What factors determine whether a protein will be
    a fast or slow folder?
  • Native state conformations
  • (Plaxco and Baker, 97, J Mol Biol)

11

Simple Model (Plaxco and Baker)
  • Relative Contact Order (CO): $CO = \frac{1}{L \cdot N} \sum^{N} \Delta N_{ij}$
  • L: protein length
  • N: number of contacts < 6 Å
  • $\Delta N_{ij}$: loop length between contacting residues
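
The relative contact order is commonly written as CO = (1/(L·N)) Σ ΔN_ij, summed over the N contacts. A minimal sketch of that sum follows. It assumes one representative point per residue (the original definition counts atom-level contacts); the function name and the `min_sep` parameter are illustrative, not from the slides.

```python
import math

def relative_contact_order(coords, cutoff=6.0, min_sep=1):
    """Relative contact order from one representative point per residue.

    coords  : list of (x, y, z) tuples, one per residue (an assumption;
              the original definition uses atom-level contacts)
    cutoff  : contact distance in angstroms (6 A, as on the slide)
    min_sep : smallest sequence separation |i - j| counted as a contact
    """
    L = len(coords)
    n_contacts = 0
    total_sep = 0
    for i in range(L):
        for j in range(i + min_sep, L):
            if math.dist(coords[i], coords[j]) < cutoff:
                n_contacts += 1
                total_sep += j - i      # loop length between i and j
    if n_contacts == 0:
        return 0.0
    return total_sep / (L * n_contacts)

# Toy input: a straight chain with 3.8 A spacing, so only
# sequence neighbors are in contact.
chain = [(3.8 * k, 0.0, 0.0) for k in range(5)]
co = relative_contact_order(chain)
```

For the straight toy chain only local contacts exist, so the value is small; compact folds with many long-range contacts give larger values.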

12
Folding rates of both 2-state and 3-state proteins
  • Does not work when a large number of proteins are
    examined.

(Figure panels: 2-state folders; 3-state folders)
13

Alpha Shape of Protein Structure
  • Identify contacts by alpha edges.
  • Not by distance cut-off.

14

(Zheng Ouyang)
Contact order by alpha shape
  • L: protein length
  • N: total number of residues with neighbors
  • n: the number of neighbors of residue i
  • Other models also work, e.g. the zipper model of Dill et al.
15
Folding dynamics in lattice model with physical
movement
  • Lattice 2D HP models
  • Enumerating sequences and conformations.
  • Exact thermodynamics.
  • Exact effects of sequence variation.
  • Folding dynamics
  • Exact folding dynamics.

(Lau and Dill, 1989)
(Cieplak et al, 98; Banu and Dill, 00)
(Sëma Kachalo, Hsiao-Mei Lu, and Jie Liang, Phys
Rev Lett, 2006, 96:058105.1-4)
16
Sequences and conformations
  • Chain length: 16
  • 802,075 conformations, 2^16 sequences
  • 1,539 HP sequences fold to unique ground states.
  • 456 structural families, ranging from 1 sequence
    (low designability) to 26 sequences (high designability)
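
The conformation count can be checked by exhaustive enumeration. The sketch below counts self-avoiding walks on the 2D square lattice, treating rotations and reflections as the same shape. The function name and the symmetry-breaking trick (fix the first bond, force the first turn to one side) are illustrative choices, not taken from the slides.

```python
def count_conformations(n_beads):
    """Count self-avoiding walks of n_beads on the 2D square lattice,
    treating rotations and reflections of a walk as the same shape."""
    if n_beads < 2:
        return 1
    moves = {"R": (1, 0), "L": (-1, 0), "U": (0, 1), "D": (0, -1)}

    def extend(path, occupied, turned):
        if len(path) == n_beads:
            return 1
        x, y = path[-1]
        total = 0
        for name, (dx, dy) in moves.items():
            # Until the first turn, allow only "straight on" or "up":
            # this removes the mirror-image copy of every bent walk.
            if not turned and name in ("L", "D"):
                continue
            nxt = (x + dx, y + dy)
            if nxt in occupied:
                continue
            occupied.add(nxt)
            path.append(nxt)
            total += extend(path, occupied, turned or name == "U")
            path.pop()
            occupied.remove(nxt)
        return total

    # Fix the first bond along +x to remove the four rotations.
    start = [(0, 0), (1, 0)]
    return extend(start, set(start), False)

n_sequences = 2 ** 16   # every bead is either H or P
```

Calling `count_conformations(16)` should reproduce the 802,075 conformations quoted on the slide, although in pure Python the full enumeration takes a while; small chain lengths run instantly.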

17
Physical movement
Allowed moves are physically realizable on 2D
square lattice.
18
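Kinetics from master equation
(Cieplak et al, 98; Banu and Dill, 00)
Master equation over conformations i and j: $dP(t)/dt = M P(t)$
General solution: $P(t) = \sum_i c_i \, n_i \, e^{\lambda_i t}$
where $\lambda_i$ is the i-th eigenvalue of matrix M, and
$n_i$ the corresponding eigenvector. $\lambda_0 = 0$, with $n_0$ the
Boltzmann distribution; the magnitude of $\lambda_1$, the smallest non-zero
eigenvalue, is taken as the protein folding rate.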
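
To make the eigenvalue picture concrete, here is a toy sketch for a three-conformation system. The Metropolis form of the rates, the energies, and the closed-form 3x3 eigenvalue formula are all assumptions chosen for illustration; the actual model on the slides uses a much larger matrix and physically allowed moves.

```python
import math

def metropolis_rate_matrix(energies, temperature=1.0):
    """M[i][j] = rate of going from state j to state i (Metropolis rule).
    Each column sums to zero, so total probability is conserved."""
    n = len(energies)
    M = [[0.0] * n for _ in range(n)]
    for j in range(n):
        for i in range(n):
            if i != j:
                dE = energies[i] - energies[j]
                M[i][j] = min(1.0, math.exp(-dE / temperature))
        M[j][j] = -sum(M[i][j] for i in range(n) if i != j)
    return M

def relaxation_rates_3state(M):
    """Non-zero eigenvalues of a 3x3 rate matrix.

    Because det(M) = 0, the characteristic polynomial reduces to
    lambda * (lambda^2 - tr * lambda + c2) = 0.
    """
    tr = M[0][0] + M[1][1] + M[2][2]
    c2 = (M[0][0] * M[1][1] - M[0][1] * M[1][0]
          + M[0][0] * M[2][2] - M[0][2] * M[2][0]
          + M[1][1] * M[2][2] - M[1][2] * M[2][1])
    disc = math.sqrt(tr * tr - 4.0 * c2)
    return (tr + disc) / 2.0, (tr - disc) / 2.0   # slow mode, fast mode

energies = [0.0, 1.0, -2.0]          # hypothetical toy energies
M = metropolis_rate_matrix(energies)
slow, fast = relaxation_rates_3state(M)
folding_rate = -slow                  # magnitude of smallest non-zero eigenvalue

# The zero-eigenvalue mode is the Boltzmann distribution: M applied to it vanishes.
Z = sum(math.exp(-e) for e in energies)
boltzmann = [math.exp(-e) / Z for e in energies]
residual = [sum(M[i][j] * boltzmann[j] for j in range(3)) for i in range(3)]
```

For larger state spaces a numerical eigensolver would replace the closed-form quadratic.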
19
What can we measure?
  • Thermodynamic properties
  • Ground state energy, energy gap.
  • Heat capacity
  • Folding temperature (50% of
    protein in native state)
  • Collapse temperature
  • Collapse cooperativity
  • Folding kinetics
  • Folding rates.

(Klimov & Thirumalai, 1998; Chan & Dill, 1993)
20
Folding rates and contact energy
  • Folding rates
  • Correlated somewhat with energy
  • R = -0.84
  • But wide range of folding rates for sequences of
  • Same ground state energy.
  • Same energy gap.
  • Same fold.

(Figure legend: structures with few sequences, low
designability; structures with many sequences, high
designability; Go model)
21
Folding rate and cooperativity
  • Folding rates
  • Correlated somewhat with cooperativity
  • R = -0.62
  • Thermodynamics does not define folding rates.

(Figure legend: high designability cluster; Go model)
22
Landscape properties number of local minima
  • Landscape roughness
  • Number of local minima
  • Folding rates
  • Excellent correlation.
  • R = -0.92

(Figure legend: high designability cluster)
23
Time Evolution of Conformation Concentration
  • Matrix exponential for 85,000 by 85,000 matrix M
  • Analytical solution of master equation
  • $P(t) = e^{Mt} P(0)$
  • Approximate in Krylov subspace
  • $K_m(Mt, P(0)) = \mathrm{span}\{P(0), MtP(0), \ldots, (Mt)^m P(0)\}$

  • m small
  • Solve matrix exponential for small dense m × m
    matrix
  • Padé rational functions.
  • 10 million time steps.
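
The Krylov idea can be sketched in a few lines: build an orthonormal basis of the subspace with the Arnoldi process, exponentiate the small projected matrix, and map back. The version below is a pure-Python illustration under two stated assumptions: a truncated Taylor series stands in for the Padé approximant, and the 3x3 rate matrix is a made-up toy, not the 85,000-state matrix from the study.

```python
import math

def matvec(A, v):
    return [sum(a * x for a, x in zip(row, v)) for row in A]

def expm_taylor(A, terms=40):
    """exp(A) for a small dense matrix by truncated Taylor series
    (a stand-in for the Pade approximant named on the slide)."""
    n = len(A)
    result = [[float(i == j) for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]
    for k in range(1, terms):
        term = [[sum(term[i][p] * A[p][j] for p in range(n)) / k
                 for j in range(n)] for i in range(n)]
        result = [[result[i][j] + term[i][j] for j in range(n)]
                  for i in range(n)]
    return result

def krylov_expm_apply(M, p0, t, m):
    """Approximate exp(M t) p0 in an m-dimensional Krylov subspace."""
    beta = math.sqrt(sum(x * x for x in p0))
    V = [[x / beta for x in p0]]          # orthonormal basis vectors
    H = [[0.0] * m for _ in range(m)]     # small Hessenberg matrix
    for j in range(m):
        w = [t * x for x in matvec(M, V[j])]
        for i in range(j + 1):            # modified Gram-Schmidt
            H[i][j] = sum(a * b for a, b in zip(w, V[i]))
            w = [a - H[i][j] * b for a, b in zip(w, V[i])]
        norm = math.sqrt(sum(x * x for x in w))
        if j + 1 == m or norm < 1e-12:
            m = j + 1                     # subspace is complete
            break
        H[j + 1][j] = norm
        V.append([x / norm for x in w])
    Hm = [row[:m] for row in H[:m]]
    E = expm_taylor(Hm)
    # exp(M t) p0 ~= beta * V_m * exp(H_m) * e_1
    return [beta * sum(V[k][i] * E[k][0] for k in range(m))
            for i in range(len(p0))]

# Toy rate matrix (columns sum to zero) and initial distribution.
M_toy = [[-1.0, 0.5, 0.0],
         [1.0, -1.5, 0.2],
         [0.0, 1.0, -0.2]]
p_start = [1.0, 0.0, 0.0]
p_krylov = krylov_expm_apply(M_toy, p_start, 0.5, 3)
```

With m equal to the matrix dimension the result matches the full exponential up to rounding; the practical benefit comes when m is far smaller than the number of conformations.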

(Stochastic roadmap: selecting conformations.
Latombe, Kavraki, Amato)
24
Transient states
Transiently accumulating states demonstrate
intermediate-like behavior during the time
evolution. Many local minima are transiently
accumulating; however, they are not obligatory
steps along the folding pathway.
(9 orders of magnitude in time scale; conformations of local
minima)
25
Results from our study
  • Dramatically different folding rates.
  • Even for sequences that fold to the same native
    structure.
  • Natural proteins are biased by evolution.
  • (Scalley-Kim and Baker, 2004)
  • Folding rates
  • Not determined by native structure properties,
  • Weakly related to length and thermodynamic
    properties
  • Ground state energy, energy gap, and collapse
    cooperativity.
  • Sequences of same thermodynamics can have very
    different folding rates.
  • Kinematic energy landscape and folding kinetics.
  • Roughness of such landscape is strongly
    correlated with folding rate.

(Sëma Kachalo, Hsiao-Mei Lu, and Jie Liang, Phys
Rev Lett, 2006, 96:058105.1-4)
26
2. Voids and pockets in proteins: Computation
Shape library
(Mucke and Edelsbrunner, ACM Trans. Graphics,
1994; Edelsbrunner, Disc Comput Geom, 1995;
Edelsbrunner, Facello, and Liang, Discrete
Applied Math, 1998)
(Binkowski, Adamian, and Liang, J.
Mol. Biol. 332:505-526, 2003)
27
Voids and Pockets in Soluble Proteins
  • Protein interior is solid-like, tightly packed
    like a jig-saw puzzle
  • High packing density (Richards, 1977)
  • Low compressibility (Gavish, Gratton, and Hardy,
    1983)
  • Many voids and pockets.
  • Large enough for at least 1 water molecule: 15 per 100 residues.

(Liang & Dill, 2001, Biophys J)
28
Scaling relationship
  • Volume and area scaling
  • $V = 4 \pi r^3 / 3$ and $A = 4 \pi r^2$, therefore we
    should have
  • $V \propto A^{3/2}$
  • Protein has linear scaling
  • Clustered random spheres with mixed radii (Lorenz
    et al, 1993).
  • Lattice models of simple clusters (Stauffer, 1985)
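
A scaling exponent of this kind is usually read off as the slope of a log-log fit. The sketch below fits ln V against ln A for ideal solid spheres, where the slope is 3/2 by construction; the radii are arbitrary illustrative values. A linear volume-area relation, as reported for proteins, would show up as a slope near 1 instead.

```python
import math

def loglog_slope(xs, ys):
    """Least-squares slope of ln(y) against ln(x)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    mx = sum(lx) / len(lx)
    my = sum(ly) / len(ly)
    num = sum((a - mx) * (b - my) for a, b in zip(lx, ly))
    den = sum((a - mx) ** 2 for a in lx)
    return num / den

radii = [5.0, 10.0, 20.0, 40.0]                    # hypothetical radii
areas = [4.0 * math.pi * r ** 2 for r in radii]
volumes = [4.0 * math.pi * r ** 3 / 3.0 for r in radii]
sphere_exponent = loglog_slope(areas, volumes)     # solid spheres: 3/2
```

The same function applies to any power-law relation, for example volume against linear size.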

29
  • At percolation threshold, V and R of a cluster of
    random spheres
  • $V \propto R^D$, where $D = 2.5$ (Stauffer, 1983; Lorenz et
    al 1993)
  • $R = \sum_{j}^{d} (x_{j,\max} - x_{j,\min}) / 2d$
  • Proteins
  • $\ln V \sim \ln R$, $D = 2.47 \pm 0.04$ (by
    nonlinear curve fitting).
  • Similar to random spheres near percolation
    threshold.

By volume-area and volume-size scaling, proteins
are packed more like random spheres than solids.
30
Simulating Protein Packing with Off-Lattice Chain
Polymers
  • 32-state off-lattice discrete model
  • Sequential Monte Carlo and resampling
  • 1,000s of conformations with N = 2,000

(Zhang, Chen, Tang and Liang, 2003, J. Chem.
Phys.)
31
  • Proteins are not optimized by evolution to
    eliminate voids.
  • Protein packing is dictated by a generic compactness
    constraint related to n_c.

32
Surfaces with unknown functional roles
http://cast.engr.uic.edu
33
3. How to identify biologically important
pockets and voids from random ones?
  • Assessing Local Sequence and Shape Similarity

(Binkowski, Adamian, Liang, 2003, JMB,
332:505-526)
34
Binding Site Pocket: Sparse Residues, Long Gaps
  • ATP Binding: cAMP Dependent Protein Kinase (1cdk)
  • Tyr Protein Kinase c-src (2src)

1cdk.A (residues 49-327):
LGTGSFGRVMLVKHKETGNHFAMKILDKQKVVKLKQIEH
TLNEKRILQAVNFPFLVKLEYSFKDNSNLYMVMEYVPGGEMFSHLRRIGR
FSEPHARFYAAQIVLTFEYLHSLDLIYRDLKPENLLIDQQGYIQVTDFGF
AKRVKGRTWTLCGTPEYLAPEIILSKGYNKAVDWWALGVLIYEMAAGYPP
FFADQPIQIYEKIVSGKVRFPSHFSSDLKDLLRNLLQVDLTKRFGNLKDG
VNDIKNHKWFATTDWIAIYQRKVEAPFIPKFKGPGDTSNF
1cdk.A_p (pocket residues only, starting at 49):
LGTGSFGRV A K V MEYV E K EN L TD F

2src.m (residues 273-404):
LGQGCFGEVWMGTWNGTTRVAIKTLKPGTMSPEAF
LQEAQVMKKLRHEKLVQLYAVVSEEPIYIVTEYMSKGSLLDFLKGETG
KYLRLPQLVDMAAQIASGMAYVERMNYVHRDLRAANILVGENLVCKVAD
2src.m_p (pocket residues only, starting at 273):
LGQGCFGEV A K V TEYM GS D D R AN L AD

Low overall sequence identity: 13%
35
High Sequence Similarity of Pocket Residues
1cdk: cAMP Dependent Protein Kinase
2src: Tyr Protein Kinase c-src
  • 1cdk.A LGTGSFGRVAKVMEYV---EKENLTDF 24
  • 2src.m LGQGCFGEVAKVTEYMGSDDRANLAD- 26

High sequence identity: 51%
36
Sequence Similarity of Surface Pockets
  • Similarity detection
  • Dynamic programming: SSEARCH (Pearson, 1998)
  • BLOSUM50 scoring matrix (Henikoff, 1994).
  • Not identity.
  • Order-dependent sequence pattern.
  • Statistics of Null Model
  • Gapless local alignment: Extreme Value
    Distribution
  • (Altschul & Karlin, 90)
  • Alignment with gaps (Altschul, Bundschuh,
    Olsen & Hwa, 01)

→ Statistical significance!
37
Approximation with EVD (Pearson,
1998, JMB)
  • Kolmogorov-Smirnov Test
  • Estimate K and λ parameters.
  • Estimation of E-value
  • Estimate p-value of observed Smith-Waterman score
    by EVD.
  • Estimate E-value
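
As a rough sketch of these two steps, the code below uses the Karlin-Altschul form of the extreme value distribution, P(S ≥ x) = 1 − exp(−K·m·n·e^(−λx)), and then scales the p-value by the number of database comparisons to get an E-value. The scaling convention and every numeric value are assumptions for illustration; in practice K and λ are fitted to scores from random pocket sequences.

```python
import math

def evd_p_value(score, K, lam, m, n):
    """P(S >= score) under the extreme value distribution for local
    alignment scores of sequences with lengths m and n."""
    expected = K * m * n * math.exp(-lam * score)
    return -math.expm1(-expected)      # equals 1 - exp(-expected), computed stably

def e_value(p_value, database_size):
    """Expected number of chance hits at this p-value over the whole
    database (one common convention, assumed here)."""
    return p_value * database_size

# Hypothetical parameters, not fitted values from the study.
K, lam = 0.1, 0.3
p = evd_p_value(50.0, K, lam, 30, 30)
E = e_value(p, 2_000_000)
```

For small p-values the E-value is the more interpretable number, since it accounts for how many surfaces were searched.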

(Binkowski, Adamian, Liang, 2003, JMB,
332:505-526)
38
Shape Similarity Measure
  • cRMSD (coordinate root mean square distance)
  • oRMSD (Orientational RMSD)
  • Place a unit sphere $S^2$ at center of mass $x_0 \in \mathbb{R}^3$
  • Map each residue $x \in \mathbb{R}^3$ to a unit vector on $S^2$
  • $f: x = (x, y, z)^T \mapsto u = (x - x_0) / \|x - x_0\|$
  • Measuring RMSD between two sets of unit vectors.

(cf. uRMSD by Kedem and Chew, 2002)
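
The orientational measure can be sketched as follows: project each residue position onto the unit sphere around the pocket center, then take an ordinary RMSD between the two sets of unit vectors. The sketch assumes an unweighted centroid as the center of mass, residues already paired one-to-one, and pockets already superimposed; a full implementation would also optimize the rotation. Function names are illustrative.

```python
import math

def to_unit_vectors(points):
    """Project residue positions onto the unit sphere around their centroid."""
    n = len(points)
    center = tuple(sum(p[k] for p in points) / n for k in range(3))
    units = []
    for p in points:
        d = [p[k] - center[k] for k in range(3)]
        norm = math.sqrt(sum(c * c for c in d))   # assumes no residue sits at the center
        units.append(tuple(c / norm for c in d))
    return units

def rmsd(a, b):
    """RMSD between two equal-length, already superimposed point sets."""
    sq = sum(math.dist(p, q) ** 2 for p, q in zip(a, b))
    return math.sqrt(sq / len(a))

def ormsd(points_a, points_b):
    """Orientational RMSD: RMSD between the unit-vector images."""
    return rmsd(to_unit_vectors(points_a), to_unit_vectors(points_b))

# A toy pocket and the same pocket scaled by 2: directions are identical,
# so oRMSD is ~0 while the coordinate RMSD is not.
pocket = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0), (-1.0, -1.0, 0.0)]
scaled = [(2.0 * x, 2.0 * y, 2.0 * z) for x, y, z in pocket]
```

The scaled example shows why the two measures complement each other: cRMSD is sensitive to pocket size, while oRMSD compares only the arrangement of directions.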
39
Statistical Significance of Shape Similarity
  • Estimate the probability p of obtaining a
    specific cRMSD or oRMSD value for random pockets
    with N_res residues
  • EVD and other parametric distributions are not
    accurate.
  • Randomly select 2 pockets.
  • Calculate cRMSD for N_res randomly selected
    residues
  • Also calculate oRMSD

(Binkowski, Adamian, Liang, 2003, JMB,
332:505-526)
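
Because no parametric distribution fits well, the p-value is read from an empirical null distribution. The sketch below builds that null by repeatedly drawing two random pockets and N_res random residues from each, and reports the fraction of null values at least as small as the observed one. The function names, the add-one correction, and the stand-in distance function in the example are assumptions for illustration.

```python
import random

def sample_null(pockets, n_res, rmsd_fn, n_samples, rng):
    """Null distribution of RMSD values between random pocket pairs.
    Every pocket must contain at least n_res residues."""
    values = []
    for _ in range(n_samples):
        a, b = rng.sample(pockets, 2)           # two distinct random pockets
        values.append(rmsd_fn(rng.sample(a, n_res), rng.sample(b, n_res)))
    return values

def empirical_p_value(observed, null_values):
    """Fraction of null values at or below the observed RMSD.

    Smaller RMSD means more similar, so the left tail is the relevant
    one. The +1 terms keep the estimate away from exactly zero.
    """
    hits = sum(1 for v in null_values if v <= observed)
    return (hits + 1) / (len(null_values) + 1)

# Toy usage with numbers standing in for residues and a stand-in distance.
toy_pockets = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]]
toy_distance = lambda a, b: abs(sum(a) - sum(b)) / len(a)
null = sample_null(toy_pockets, 3, toy_distance, 20, random.Random(0))
```

In a real run `rmsd_fn` would be the cRMSD or oRMSD routine, and a separate null would be built for each value of N_res.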
40
Surprising Surface Similarity
  • Conserved residues both important in polypeptide
    binding
  • Both pockets undergo conformational changes upon
    binding

41
4. Evolutionary pattern of voids and pockets
(Tseng and Liang, 2006, Mol Biol Evol, 23:421)
42
Scoring Matrix
  • A scoring matrix is critical
  • Determines similarity between residues and hence
    statistical significance.
  • Derived from evolutionary history of proteins
    sharing the same function.
  • Existing approach
  • PAM and BLOSUM heuristics.
  • Position specific weight matrix.
  • Entropy/relative entropy for full proteins or
    domains.
  • Our approach
  • Evolution: Explicit phylogenetic tree.
  • Model: Continuous time Markov process.
  • Geometry: Evolution of only residues located in
    the binding region.
  • Bayesian Markov chain Monte Carlo.

43
Model: Continuous time Markov process for
substitution
20 × 20 rate matrix Q for the instantaneous
substitution rates of 20 amino acid
residues
  • Transition probability matrix can be derived
    from Q

(Felsenstein, 1983; Yang, 1994; Whelan and
Goldman, 2000; Tseng and Liang, 2004)
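
For a continuous-time Markov process the transition probabilities over time t are given by the matrix exponential P(t) = exp(Qt). The sketch below evaluates it with scaling and squaring plus a short Taylor series. The 2 × 2 rate matrix is a toy stand-in for the 20 × 20 amino acid matrix, chosen because its exact solution is known; function names are illustrative.

```python
import math

def mat_mul(A, B):
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def transition_matrix(Q, t, squarings=10, terms=12):
    """P(t) = exp(Q t) by scaling and squaring with a short Taylor series."""
    n = len(Q)
    scale = t / (2 ** squarings)
    A = [[q * scale for q in row] for row in Q]
    P = [[float(i == j) for j in range(n)] for i in range(n)]
    term = [row[:] for row in P]
    for k in range(1, terms):
        term = [[x / k for x in row] for row in mat_mul(term, A)]
        P = [[P[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    for _ in range(squarings):
        P = mat_mul(P, P)                 # undo the scaling: exp(A)^(2^s)
    return P

# Toy two-state rate matrix; each row sums to zero.
a, b = 1.0, 2.0
Q_toy = [[-a, a],
         [b, -b]]
P_half = transition_matrix(Q_toy, 0.5)
# Exact result for comparison: P00(t) = (b + a * exp(-(a + b) t)) / (a + b)
P00_exact = (b + a * math.exp(-(a + b) * 0.5)) / (a + b)
```

Each row of P(t) sums to 1, and as t grows every row approaches the stationary distribution of Q.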
44
Likelihood function of a given phylogeny
  • Given a set of multiple-aligned sequences $S = (x_1, x_2, \ldots, x_s)$
    and a phylogenetic tree $T = (V, E)$,
  • A column $x_h$ at position h is represented as
  • The likelihood function of observing these
    sequences is

45
Estimation of instantaneous rates Q
  • Posterior probability of rate matrix given the
    sequences and tree
  • Bayesian estimation of posterior mean of rates
    in Q

$E_\pi(Q) = \int Q \, \pi(Q \mid S, T) \, dQ$,
  • Estimated by Markov chain Monte Carlo.

46
Markov chain Monte Carlo
  • Proposal function
  • Detailed balance: samples target distribution
    after convergence.
  • Metropolis-Hastings Algorithm
  • Collect data from m accepted samples

$E_\pi(Q) \approx \sum_{i=1}^{m} Q_i / m \approx \int Q \, \pi(Q \mid S, T) \, dQ$.
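
The estimator above is a plain average over Markov chain samples. The sketch below shows the mechanics on a one-parameter toy problem: a random-walk Metropolis-Hastings chain targets a gamma-shaped density standing in for the posterior of a single rate, and the posterior mean is estimated by averaging the chain after burn-in. The target density, step size, and chain length are illustrative assumptions; the real problem samples a full rate matrix given the sequences and tree.

```python
import math
import random

def metropolis_hastings(log_target, x0, step, n_steps, burn_in, rng):
    """Random-walk Metropolis-Hastings with a symmetric uniform proposal."""
    x = x0
    lp = log_target(x)
    samples = []
    for it in range(n_steps):
        y = x + rng.uniform(-step, step)
        lq = log_target(y)
        # Accept with probability min(1, target(y) / target(x)).
        if lq - lp >= 0 or rng.random() < math.exp(lq - lp):
            x, lp = y, lq
        if it >= burn_in:
            samples.append(x)             # a rejected move repeats the current state
    return samples

def log_gamma_like(q, shape=3.0, rate=1.0):
    """Log of an unnormalized gamma density; rates must be positive."""
    if q <= 0:
        return -math.inf
    return (shape - 1.0) * math.log(q) - rate * q

chain = metropolis_hastings(log_gamma_like, 1.0, 2.0, 20000, 2000, random.Random(1))
posterior_mean = sum(chain) / len(chain)  # the exact mean of this toy target is 3.0
```

Running the chain from two different starting points and comparing the averages is a simple convergence check.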
47
Move Set
  • Two types of moves: s1, s2
  • Individual moves: s1
  • Transition matrix between the two types of moves
  • Block moves: s2
  • Acceptance ratio
  • Individual moves: 50-66%
  • Block moves: < 10%

48
Validation by simulation
  • Generate 16 artificial sequences from a known
    tree and known rates (JTT model)
  • Carboxypeptidase A2 precursor as ancestor, length
    147
  • Goal: recovering the substitution rates

Phylogenetic tree used to generate 16 sequences
Estimations from two initial conditions are very
similar to the true values of residue
substitution rates.
Convergence of the Markov chain
49
Quantifying estimation error
(Mayrose et al, 2004, Mol Biol Evol)
  • Relative contribution
  • Weighted error in contribution
  • Weighted mean square error (MSE )

50
Accurate Estimation with > 20 residues and random
initial values
Accurate when > 20 residues in length.
Distribution of MSE of estimated rates starting
from 50 sets of random initial values. All MSE
< 0.00075.
51
Evolutionary rates of binding sites and other
regions are different
  • Residues on protein functional surface experience
    different selection pressure.
  • Estimated substitution rate matrices of amylase
  • Functional surface residues.
  • The remaining surface,
  • The interior residues
  • All surface residues.

S_ij: (i, j) are residues shown in the same column
of the MSA, defined as sampled pairs; S_ij are
estimated by Bayesian MCMC
52
Rate matrix Q and residue similarity score
  • Transforming rate matrix Q to Altschul-style
    similarity scoring matrix B

where $m_{ij}(t)$ is the joint probability of
observing both residue type i and j at the two
nodes separated by time t and $\lambda$ is a scalar.
  • Property of scoring function
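
The slide's own equation did not survive in this transcript, so the following is only a sketch of the standard log-odds form used for Altschul-style matrices: b_ij = (1/λ) · log(m_ij(t) / (π_i π_j)), with m_ij(t) = π_i P_ij(t). The two-residue alphabet, the transition matrix, and the function name are toy assumptions. One well-known property of such scores is that the expected score of a random pair is negative.

```python
import math

def log_odds_scores(P, pi, lam=1.0):
    """b[i][j] = (1/lam) * log(m_ij / (pi_i * pi_j)), m_ij = pi_i * P[i][j]."""
    n = len(pi)
    scores = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            joint = pi[i] * P[i][j]       # probability of seeing i and j together
            scores[i][j] = math.log(joint / (pi[i] * pi[j])) / lam
    return scores

# Toy reversible two-state model (stationary distribution 2/3, 1/3).
P_toy = [[0.9, 0.1],
         [0.2, 0.8]]
pi_toy = [2.0 / 3.0, 1.0 / 3.0]
B = log_odds_scores(P_toy, pi_toy)

# Expected score of a random (unrelated) residue pair.
expected_random = sum(pi_toy[i] * pi_toy[j] * B[i][j]
                      for i in range(2) for j in range(2))
```

Identical residues score positive and substitutions score negative in this toy case, and the matrix is symmetric because the model is reversible.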

53
5. Improved functional prediction
54
Example 1: Finding alpha amylase by matching
pocket surfaces
  • Challenging
  • amylases often have low overall sequence
    identity (< 25%).
  • 1bag, pocket 60: B. subtilis
  • 14 sequences, none with structures, 2 are
    hypothetical
  • 1bg9: Barley
  • 9 sequences, none with structures.

55
Criteria for declaring similar functional surface
to a matched surface
  • Search > 2 million surfaces with a template
    surface.
  • Shapes have to be very similar
  • p-value for cRMSD < 10^-3.
  • Customized scoring matrices of 300 different time
    intervals.
  • For the most similar surface, n_max matrices are
    capable of finding this homologous surface.
  • Declare a hit if > 1/3 n_max of matrices give
    positive results.

56
Probability of enzyme functional class
Probability of query protein belonging to enzyme
class i: $P(i) = \sum_t \mathrm{E.C.}_i(t) / N_t$
  • $\mathrm{E.C.}_i(t)$: number of PDB hits belonging to enzyme class
    i using matrix of time distance t.
  • $N_t$: total number of PDB hits with E.C. number using matrix
    of time distance t.
(Figure: number of hits and probability of
functional class.)
57
Results for Amylase
Query: B. subtilis (1bag), Barley (1bg9)
  • 1bag found 58 PDB structures.
  • 1bg9 found 48 PDB structures.
  • Altogether: 69
  • All belong to amylase (EC 3.2.1.1)
  • Comparison
  • Annotated enzyme structure database (Thornton): 75.

Hits human 1b2y 1u2y 22 23
58
Cross-reaction profile
  • E.C. 3.2.1.1: Glycosidases, i.e. enzymes
    hydrolyzing O- and S-glycosyl -> Alpha-amylase
  • Reaction: Endohydrolysis of 1,4-alpha-glucosidic
    linkages in oligosaccharides and polysaccharides.
  • Acts on starch, glycogen and related
    polysaccharides and oligosaccharides in a random
    manner
  • E.C. 2.4.1.19: Glycosyltransferases
  • -> Hexosyltransferases
  • -> Cyclomaltodextrin glucanotransferase
  • also known as
  • Cyclodextrin glycosyltransferase
  • Reaction: Degrades starch to cyclodextrins by
    formation of a 1,4-alpha-D-glucosidic bond

Cross-reactivity
59
Comparison with others
  • Benchmark data
  • Enzyme Structure Database (ESD): 75 structures

    template   our method,   our method,   psi-blast
               our matrix    JTT
    1bag       58            52            45
    1bg9       48            8             21

  • Psi-blast does not contain information about
    which surface region, active residues, and
    geometry
  • Contains many uninterpretable false positives.
  • We do better!
  • Ssearch: 32 structures found.

60
Example 2: UCSF Structure-Function Linkage
Database
  • A human-curated database
  • Links evolutionarily related sequences and
    structures of enzymes to their chemical
    reactions.
  • Correlates conserved active site residues with
    specific partial reactions that all members of a
    superfamily perform.
  • Gold standard.
  • http://sfld.rbvi.ucsf.edu,
  • Patricia Babbitt

61
SFLD protein families
  • Generate canonical templates for the four
    families with > 7 structures

62
Our results
  • For all 4 entries in SFLD with > 7 structures

    template   Us   SFLD   ESD
    1qh9       8    8      8
    2ada       23   17     23
    1ebh       22   20     22
    1kw9       18   16     18

Psi-blast: many false positives, and no
information about functional site
63
Example 3: Large Scale Prediction of Protein
Functions
Laskowski, Thornton, JMB, 05, 351:614-26
EBI: CATH domain, not function
  • 110 protein families
  • Each point on the curve corresponds to p-values
    of various cRMSD cutoffs
  • Accuracy: 92% (EBI: 75%)

Helmer-Citterich, M et al (BMC Bioinformat.
2005); Russell RB (JMB 2003); Sternberg MJ;
Skolnick, J; Lichtarge, O (JMB 2003); Ben-Tal, N and
Pupko, T (ConSurf)
64
Large protein family: Canonical functional
surfaces of amylase
  • Example three canonical binding surfaces for
    amylase
  • Can find all amylases in ESD.

65
Large protein family: Serine/Threonine protein
kinase
  • EC 2.7.1.37: Protein kinase
  • Glycogen synthase A kinase.
  • Hydroxyalkyl-protein kinase.
  • Phosphorylase B kinase kinase.
  • 243 PDB entries
  • (mixed true positives and false positives)

66
Canonical representatives of Ser/Thr Kinases
active sites
  • 1kv2: 42 true positives
  • Average length 50, Average Area 1193.15, Average
    Volume 1999.11; Nonpolar 0.44, Aromatic 0.08, Polar
    0.19, Neg Charged 0.12, Pos Charged 0.16
  • 1bmk: 54 true positives
  • Average length 45, Average Area 1070.20, Average
    Volume 1781.42; Nonpolar 0.43, Aromatic 0.09, Polar
    0.19, Neg Charged 0.14, Pos Charged 0.15
  • 1ol7: 115 true positives
  • Average length 41, Average Area 951.40, Average
    Volume 1525.28; Nonpolar 0.42, Aromatic 0.11, Polar
    0.18, Neg Charged 0.16, Pos Charged 0.13
  • 1erk: 126 true positives
  • Average length 38, Average Area 866.58, Average
    Volume 1389.57; Nonpolar 0.41, Aromatic 0.10, Polar
    0.18, Neg Charged 0.17, Pos Charged 0.14
  • 1blx: 64 true positives
  • Average length 33, Average Area 744.93, Average
    Volume 1136.06; Nonpolar 0.41, Aromatic 0.10, Polar
    0.16, Neg Charged 0.17, Pos Charged 0.17

67
Canonical representatives
68
Library of functional surfaces of enzymes
  • With about 3,000 template surfaces, we can build
    with confidence a library of 10,000 active site
    surfaces for 13,877 structures.
  • Including protein-protein interactions
  • Validated by sensitivity and specificity.
  • Often can find all of the binding surfaces of the
    same function and nothing else.

69
Orphan protein structures without functional
annotation
  • Orphan proteins from Structural Genomics.
  • No known functions.
  • Often sequences homologs are hypothetical
    proteins.
  • Our tasks
  • Identify the functional pocket for each
    structure.
  • Predict protein function.

70
Inferring biological functions of protein BioH
Protein of unknown functions from
structural genomics
The candidate binding pocket (CASTp id35) of
BioH (1m33) and a similar functional surface
detected from (1p0p, E.C. 3.1.1.8)
The phylogenetic tree of 28 sequences related to
BioH. Many are hypothetical genes.
71
BioH
Significant hits predicted for BioH:
  • E.C. 3.1.1.8, p = 0.31, cholinesterase
  • E.C. 3.4.23.22, p = 0.26, aspartic endopeptidase
  • E.C. 3.4.21.7, p = 0.24, serine endopeptidase
  • E.C. 3.1.4.17, p = 0.20, phosphodiesterase
Experimental results:
  • Carboxyesterase E.C. 3.1.1.1: high
  • Lipase E.C. 3.1.1.3: low
  • Thioesterase E.C. 3.1.1.5: low
  • Aminopeptidase E.C. 3.4.11.5: low
72
Test data set of structural genomics proteins
with known function
  • Our dataset: 761 entries.
  • Function since discovered: 15; 9 are enzymes.
  • Our test set.
  • 1B78 STRUCTURAL GENOMICS, PYROPHOSPHATASE
  • 1KS2 STRUCTURAL GENOMICS, ISOMERASE
  • 1KTN STRUCTURAL GENOMICS, LYASE
  • 1KUT STRUCTURAL GENOMICS, LIGASE
  • 1M94 STRUCTURAL GENOMICS, SIGNALING PROTEIN
  • 1MZH STRUCTURAL GENOMICS, ALDOLASE
  • 1NXU STRUCTURAL GENOMICS, OXIDOREDUCTASE
  • 1RLI STRUCTURAL GENOMICS, PROTEIN BINDING
  • 1RRZ STRUCTURAL GENOMICS, BIOSYNTHETIC PROTEIN
  • 1S7C STRUCTURAL GENOMICS, OXIDOREDUCTASE
  • 1SZQ STRUCTURAL GENOMICS, LYASE
  • 1T2A STRUCTURAL GENOMICS, LYASE
  • 1TIQ STRUCTURAL GENOMICS, TRANSCRIPTION
  • 1WU2 STRUCTURAL GENOMICS, BIOSYNTHETIC PROTEIN
  • 1XTP STRUCTURAL GENOMICS, TRANSFERASE

73
Result
Answer Sheet
    1B78   PYROPHOSPHATASE   E.C. 5.3.1.6    ✓
    1KS2   ISOMERASE         E.C. 5.3.1.6    ✓
    1KTN   LYASE             E.C. 4.1.2.4    ✓
    1KUT   LIGASE            E.C. 6.3.2.6    ✓
    1MZH   ALDOLASE          E.C. 4.1.2.4    ✓
    1NXU   OXIDOREDUCTASE    NA              ✓
    1S7C   OXIDOREDUCTASE    E.C. 1.2.1.12   ✓
    1SZQ   LYASE             E.C. 4.2.1.79   ✓
    1T2A   LYASE             E.C. 4.2.1.47   ✓
    1XTP   TRANSFERASE       NA              ✓
74
Future: Comprehensive Structure Based Enzyme
Reclassification
  • E.C. labels
  • E.C. numbers do not directly correlate to
    function.
  • No information about cross-reactivity.
  • Can be misleading, as shown by Thornton and
    others.
  • Does not provide key information about active
    site residues.
  • Does not contain evolutionary information of
    functional site.
  • Does not scale up.
  • Comprehensive re-classification of all enzymes in
    PDB database using cross reactivity profiles.

75
Summary
  • Delaunay triangulation and protein folding
    studies.
  • Voids and pockets in protein
  • Origin.
  • Random vs biologically important.
  • Finding biologically important voids and pockets
  • Sequence, shape and orientation matching.
  • Evolutionary pattern of binding pockets
  • Continuous Markov process for residue
    substitution.
  • Hypothetical sequences with unknown functions
    become an important source of information about
    the evolution of functional sites.
  • Bayesian Markov chain Monte Carlo works for
    residue rates.
  • Protein function predictions
  • Orphan structures.

→ Acknowledgement!
76
Collaborators
  • Zheng Ouyang, Sema Kachalo, Hsiao-Mei Lu
  • Jinfeng Zhang (now postdoc at Harvard),
  • Andrew Binkowski (now postdoc at Argonne),
    Jeffrey Tseng
  • Ken Dill (UCSF), Rong Chen (UIC), Chao Tang
    (UCSF)
  • NSF CAREER DBI 0133856
  • NIH GM68958
  • ONR MURI
  • Whitaker Foundation

Acknowledgement
Papers: www.uic.edu/jliang
77
Identify active sites of enzymes
  • Identify canonical surface template of known
    functional site for homologous structures.
  • Binding surface containing annotated key residues
    from structure with
  • Best R-factor and resolution, and a pocket with a large
    number n of residues
  • 8 < n < 70
  • One protein family can have > 1 canonical template
    surfaces.
  • Different structures may have different surfaces.
  • Find homologs to canonical surfaces: alignment of
    pocket sequences.
  • Initially using multiple BLOSUM matrices for
    aligning pocket surfaces (Binkowski et al, 03).
  • Significant improvement with MCMC matrices.

78
Scaling relationship
  • Size distribution of voids and pockets
  • $n = N_v / \mathrm{Length}$, at different volume v
  • $n \propto V^{-\tau} c_0^{V}$; $\tau$ depends only on dimension
  • Size distribution of voids and pockets
  • Protein: $\tau = 1.67$, $c_0 = 0.9966$
  • Random spheres: $\tau = 1$, 1.5, and 1.9 for 2, 3, and
    4 dimensions
  • By volume and area scaling, and by size
    distribution, proteins are packed more like
    random spheres.

79
4. How to locate functional sites on protein
structures?
  • Protein structures with functional annotation
  • But no residue information.
  • Protein structures without any annotation
  • Structural genomics.

80
Current knowledge
  • 13,877 structures have E.C. labels.
  • But often without knowledge of the active site
    and key residues.
  • Some are mislabeled.
  • Some have multiple functions.
  • We do not know where the functional sites are.
  • Literature search for 13,877 structures is not feasible.
  • Structures with pockets containing annotated key
    residues
  • 6,273 PDB structures out of > 30,000 after
    cleaning-up.
  • By Swiss-Prot and PDB records.

(Characteristics of these known functional
pockets)
81
Length distribution and composition of functional
pockets
  • Fig (a). From 6,273 protein active site pockets,
    80% have between 8 and 200 a.a.
  • The average length: 35 residues.
  • Fig (b). Compare amino acid composition of
    functional site pockets with that of JTT protein
    sequences. Functionally important residues:
    His (H), Asp (D), Tyr (Y), Trp (W) and Gly
    (G); Phe (F), Asn (N), and Arg (R).

82
Length ratio of functional pockets and
proteins
10 to 80 a.a. vs. full length of 100 to 450 a.a.

83
Volume of Functional pocket
The mean volume of functional pockets is 1332.95.
Most functional pockets have a volume of less
than 5,000 Å^3
84
Heuristics for initial identification of
functional pocket
  • Characteristic composition of functional pockets.
  • Length ratio
  • For a full length protein of 200 a.a., the
    functional pocket length is likely to be 20-60
    a.a.
  • Volume
  • Mean volume of the functional pocket is 1300.

(Reduces the amount of expensive computation)
85
Example 3: Large Scale Prediction of Protein
Functions
  • Enzyme structure database (Thornton)
  • Select a set of 100 protein families
  • Remove protein-protein interactions
  • Candidate pocket has size > 12 residues
  • Choose a template surface for each family
  • With good R-factor
  • Build rate matrix for each template structure.

86
ROC Analysis
  • Query with template surface, and collect hits at
    different threshold by p-value of cRMSD
  • p-value thresholds
  • (1.0, 0.1, 0.05, 0.01, 0.005, 0.001, 5 × 10^-4,
    10^-4, 5 × 10^-5, 10^-5).
  • True Positive Rate (Sensitivity)
  • TP/(TP+FN)
  • False Positive Rate (1-Specificity)
  • FP/(TN+FP)
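
The ROC points follow directly from these two ratios. The sketch below sweeps a list of p-value thresholds, calls a surface a predicted hit when its cRMSD p-value is at or below the threshold, and reports the false positive rate and true positive rate at each threshold. The function name and the toy data are illustrative.

```python
def roc_points(p_values, is_true, thresholds):
    """Return (threshold, false positive rate, true positive rate) tuples."""
    points = []
    for thr in thresholds:
        tp = fp = tn = fn = 0
        for p, truth in zip(p_values, is_true):
            predicted = p <= thr          # hit if p-value passes the threshold
            if predicted and truth:
                tp += 1
            elif predicted and not truth:
                fp += 1
            elif truth:
                fn += 1
            else:
                tn += 1
        tpr = tp / (tp + fn) if tp + fn else 0.0
        fpr = fp / (fp + tn) if fp + tn else 0.0
        points.append((thr, fpr, tpr))
    return points

# Toy data: two true family members with small p-values, two unrelated surfaces.
toy_p = [1e-5, 2e-4, 0.03, 0.5]
toy_truth = [True, True, False, False]
curve = roc_points(toy_p, toy_truth, [1.0, 1e-3])
```

A threshold of 1.0 accepts everything and lands at the top-right corner; tightening it moves the point toward the top-left when the method separates true hits well.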

87
ROC Analysis of 100 Protein Families
Ours
Laskowski et al, JMB, 05, 351:614-26
  • Close to perfect, by E.C. label
  • Random: diagonal. Perfect: upper triangle.
  • Average sensitivity > 90% at p = 10^-3.
  • Laskowski and Thornton: > 75%, by CATH domain,
    not function.