Department of Computer Science,

1
Deepak Bandyopadhyay
A Geometric Framework for Robust Nearest Neighbor
Analysis of Protein Structure and Function

Department of Computer Science,
University of North Carolina at Chapel Hill

2
Outline
Motivation: Use geometric proximity (Voronoi / Delaunay) to analyze protein structures and gain insight into their function
Methods: Geometric proximity structures have problems with imprecise points. But we can fix this!
Applications: Let's modify existing neighbor analyses of protein structure to make them robust, and design new ones!
Briefly: SNAPP, packing differences, secondary structure, hinges
In detail: structural fingerprints for function inference
3
Nearest Neighbors
4
Geometric structures on point sets

Voronoi Diagram

Delaunay triangulation / tessellation (DT)

Input: Points
Output: Neighbors

5
Delaunay tessellation of proteins

Represent each amino acid by a point: Cα, side-chain centroid, Cβ, ...
Delaunay tetrahedra ⇔ nearest-neighbor quadruplets
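As an illustration of how Delaunay tetrahedra define nearest-neighbor quadruplets, here is a minimal brute-force sketch in Python. It is not the tessellation code used in this work: it simply tests every 4-tuple of points for an empty circumsphere, which is only practical for a handful of points. The function names (`circumsphere`, `delaunay_quadruplets`) and the tolerance are assumptions made for this example.

```python
from itertools import combinations

def det3(m):
    """Determinant of a 3x3 matrix given as three rows."""
    (a, b, c), (d, e, f), (g, h, i) = m
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)

def circumsphere(p0, p1, p2, p3):
    """Return (center, radius squared), or None if the points are coplanar."""
    # The center c satisfies 2 (p - p0) . c = |p|^2 - |p0|^2 for p in p1..p3.
    rows, rhs = [], []
    for p in (p1, p2, p3):
        rows.append([2.0 * (p[k] - p0[k]) for k in range(3)])
        rhs.append(sum(p[k] ** 2 - p0[k] ** 2 for k in range(3)))
    d = det3(rows)
    if abs(d) < 1e-12:
        return None
    center = []
    for col in range(3):  # Cramer's rule, one coordinate at a time
        m = [row[:] for row in rows]
        for r in range(3):
            m[r][col] = rhs[r]
        center.append(det3(m) / d)
    r2 = sum((center[k] - p0[k]) ** 2 for k in range(3))
    return center, r2

def delaunay_quadruplets(points, tol=1e-9):
    """All index 4-tuples whose circumsphere contains no other point."""
    quads = []
    for quad in combinations(range(len(points)), 4):
        sphere = circumsphere(*(points[i] for i in quad))
        if sphere is None:
            continue
        center, r2 = sphere
        if all(sum((points[j][k] - center[k]) ** 2 for k in range(3)) >= r2 - tol
               for j in range(len(points)) if j not in quad):
            quads.append(quad)
    return quads

# Five hypothetical residue positions (e.g. side-chain centroids).
residues = [(0, 0, 0), (1, 0, 0), (0, 1, 0), (0, 0, 1), (2, 2, 2)]
print(delaunay_quadruplets(residues))  # [(0, 1, 2, 3), (1, 2, 3, 4)]
```

Each returned 4-tuple is one nearest-neighbor quadruplet of residues. A real tessellation library would be used for whole proteins.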

6
Delaunay Tessellation Applications
SNAPP, four-body statistical potential for hydrophobic core stability (Carter et al., 2001)
Decoy discrimination (Krishnamoorthy and Tropsha, 2003)
Scoring ligand-receptor binding affinity (Zhang et al., 2004)
Mining frequent substructures in protein families (Huan et al., 2004, 2005)
Structure-based function inference (Bandyopadhyay et al., 2005)
7
Outline
Motivation
Methods: Geometric proximity structures have problems with imprecise points. But we can fix this!
Applications
8
Effect of Imprecision on Delaunay

If point coordinates are imprecise...
What happens to the Delaunay neighbors?
Think of 4 nearly co-circular points in 2D.
Delaunay edges may flip, so the neighbors change.
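The edge flip can be reproduced with four nearly co-circular points. The sketch below is a brute-force 2D illustration, not the method used in this work; the unit-square coordinates and the 0.01 perturbation are made up for the example.

```python
from itertools import combinations

def circumcircle(a, b, c):
    """Return (center, radius squared), or None for collinear points."""
    d = 2.0 * (a[0] * (b[1] - c[1]) + b[0] * (c[1] - a[1]) + c[0] * (a[1] - b[1]))
    if abs(d) < 1e-12:
        return None
    sa, sb, sc = (p[0] ** 2 + p[1] ** 2 for p in (a, b, c))
    ux = (sa * (b[1] - c[1]) + sb * (c[1] - a[1]) + sc * (a[1] - b[1])) / d
    uy = (sa * (c[0] - b[0]) + sb * (a[0] - c[0]) + sc * (b[0] - a[0])) / d
    return (ux, uy), (ux - a[0]) ** 2 + (uy - a[1]) ** 2

def delaunay_edges(points, tol=1e-12):
    """Edges of all triangles whose circumcircle contains no other point."""
    edges = set()
    for tri in combinations(range(len(points)), 3):
        circle = circumcircle(*(points[i] for i in tri))
        if circle is None:
            continue
        (cx, cy), r2 = circle
        if all((points[j][0] - cx) ** 2 + (points[j][1] - cy) ** 2 >= r2 - tol
               for j in range(len(points)) if j not in tri):
            edges.update(combinations(tri, 2))
    return edges

def near_square(delta):
    """Unit square whose fourth corner is moved by delta along y."""
    return [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0 + delta)]

before = delaunay_edges(near_square(+0.01))
after = delaunay_edges(near_square(-0.01))
print(sorted(before - after), sorted(after - before))  # [(0, 2)] [(1, 3)]
```

Moving one point by 0.02 swaps the diagonal from (0, 2) to (1, 3), so two pairs of points gain or lose neighbor status even though the coordinates barely changed.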

9
Which applications are affected by instability of Delaunay?
Frequent subgraphs: qualitative, discretized; worse affected
Voronoi volumes: quantitative, continuous; less affected

When people use Delaunay in analysis of protein
structure, they assume it is robust to
perturbations!

10
Method 1: Almost-Delaunay (AD) tetrahedra

A 4-tuple of points is in AD(ε) if, by perturbing all points in the set by at most ε, its circumscribing sphere can become empty.
The minimum perturbation required, ε, is the AD threshold.

Vertex can move within a sphere of radius ε
Green: Delaunay, in AD(0). Red: in AD(ε)
11
AD tetrahedra for protein 2ACY, 98 residues, Cαs (colored by threshold; DT not shown, for clarity)
AD tetrahedra may overlap; they do not tile space
12
Computing AD thresholds (Bandyopadhyay and Snoeyink, 2004)

Find the spherical shell of minimum width, using a result from computational metrology (Garcia-Lopez et al., 1998)

Given a set of points P, a simplex t is AD(ε) iff its points are contained between 2 concentric spheres such that
the difference in radii is 2ε, the minimum over all such concentric spheres
the inner sphere contains no points of P

2D Example
Code to compute AD edges, triangles, tetrahedra for 3D points, in C/CGAL (with MATLAB interface and utilities) is available from http://www.cs.unc.edu/debug/software
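To make the shell condition concrete, the following Python sketch evaluates the shell half-width for one candidate center and takes the best of a user-supplied list of centers. This is only an illustration of the definition and gives an upper bound on the AD threshold; it is not the minimum-width-shell algorithm from the cited work, and the function names are assumptions.

```python
from math import dist

def shell_half_width(center, simplex, points):
    """Half-width of the thinnest shell around `center` that holds the
    simplex vertices while its inner sphere contains no point of `points`.

    `points` is the whole point set and includes the simplex vertices.
    """
    outer = max(dist(center, v) for v in simplex)   # outer radius
    inner = min(dist(center, p) for p in points)    # inner radius
    return max(0.0, (outer - inner) / 2.0)

def ad_threshold_upper_bound(simplex, points, candidate_centers):
    """Best half-width over the candidate centers (an upper bound only)."""
    return min(shell_half_width(c, simplex, points) for c in candidate_centers)

# 2D example: a triangle and one extra point.
triangle = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
far = triangle + [(2.0, 2.0)]      # extra point outside the circumcircle
near = triangle + [(0.9, 0.9)]     # extra point inside the circumcircle
print(shell_half_width((0.5, 0.5), triangle, far))    # 0.0
print(shell_half_width((0.5, 0.5), triangle, near))
```

With the extra point outside the circumcircle the triangle is Delaunay, so the half-width at the circumcenter is 0. With the extra point inside, a positive perturbation is needed, and searching over other centers can only lower the value.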
13
Method 2: Delaunay Probability

AD(ε) captures worst-case deviation in coordinates
Uncertainty in actual coordinates → probabilistic model
Assume each point has a Gaussian p.d.f.
Prob(sphere empty of p_i) = 1 − ∫(p.d.f. of p_i inside sphere)
Probability that tetrahedron abcd is Delaunay: integrate over all possible spheres defined by a, b, c, d:
prob(sphere) × ∏_{p ∉ {a,b,c,d}} prob(sphere empty of p)
AD algorithm makes Delaunay Probability computation feasible
Delaunay Probability is significant only for tetrahedra with low ε!
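A rough way to get a feel for this quantity is Monte Carlo sampling rather than the integration described above. The sketch below is a 2D stand-in under stated assumptions: every point gets independent isotropic Gaussian noise with the same `sigma`, and the probability is estimated as the fraction of samples in which the triangle's circumcircle is empty. The function names, `sigma`, and the trial count are all hypothetical.

```python
import random

def circumcircle(a, b, c):
    """Return (center, radius squared), or None for collinear points."""
    d = 2.0 * (a[0] * (b[1] - c[1]) + b[0] * (c[1] - a[1]) + c[0] * (a[1] - b[1]))
    if abs(d) < 1e-12:
        return None
    sa, sb, sc = (p[0] ** 2 + p[1] ** 2 for p in (a, b, c))
    ux = (sa * (b[1] - c[1]) + sb * (c[1] - a[1]) + sc * (a[1] - b[1])) / d
    uy = (sa * (c[0] - b[0]) + sb * (a[0] - c[0]) + sc * (b[0] - a[0])) / d
    return (ux, uy), (ux - a[0]) ** 2 + (uy - a[1]) ** 2

def delaunay_probability(triangle, others, sigma, trials=2000, seed=0):
    """Monte Carlo estimate of P(triangle is Delaunay) under Gaussian noise."""
    rng = random.Random(seed)

    def jitter(p):
        return (p[0] + rng.gauss(0.0, sigma), p[1] + rng.gauss(0.0, sigma))

    hits = 0
    for _ in range(trials):
        circle = circumcircle(*(jitter(p) for p in triangle))
        if circle is None:
            continue
        (cx, cy), r2 = circle
        noisy_others = [jitter(q) for q in others]
        if all((q[0] - cx) ** 2 + (q[1] - cy) ** 2 >= r2 for q in noisy_others):
            hits += 1
    return hits / trials

triangle = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0)]
print(delaunay_probability(triangle, [(5.0, 5.0)], sigma=0.01))  # 1.0
print(delaunay_probability(triangle, [(0.0, 1.0)], sigma=0.01))  # close to 0.5
```

A fourth point far from the circumcircle leaves the triangle Delaunay in every sample, while a co-circular fourth point makes the outcome a coin flip, which matches the intuition that the probability is only informative when the AD threshold is small.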

14
Summary of contributions

Algorithmic
Theory and Algorithm for the general framework
Fast and robust implementation for 3D points
Application domain
Nearest neighbor analysis with imprecision
Applications explored
Scoring protein packing with a statistical 4-body
potential (SNAPP)
Quantifying packing differences between proteins
and other structures
Assigning secondary structure from Cαs
Analyzing conformational changes and finding
hinge residues
Finding local packing motifs specific to protein
families, applied to structure classification,
and functional inference for structural genomics

15
Outline
Motivation
Methods
Applications: Let's modify existing neighbor analyses of protein structure to make them robust, and design new ones!
Briefly: SNAPP, packing differences, secondary structure, hinges
In detail: structural fingerprints for function inference
16
Application 1: SNAPP

Simplicial Neighborhood Analysis of Protein
Packing
Carter et al, JMB99
Residues represented by side-chain centroids
Protein structure represented as an aggregate of
space filling, irregular tetrahedra
Unique and objective recognition of nearest
neighbor residues in sets of four (Quadruplets)

17
Likelihood Scores for 8724 Compositions
Tropsha A, Singh R, Vaisman I, Zheng W. Pac Symp Biocomput. 614-23 (1996)
Dunbrack, R. Culled PDB: http://www.fccc.edu/research/labs/dunbrack/culledpdb.html
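The slide does not spell out the likelihood score. A minimal sketch, assuming the log-likelihood form commonly given in the SNAPP literature, q = log(f_observed / p_expected), with p_expected = C · a_i a_j a_k a_l and C the number of distinct orderings of the quadruplet, could look like this. The function names and the toy frequencies are hypothetical.

```python
from collections import Counter
from math import factorial, log10

def permutation_factor(quad):
    """Number of distinct orderings of a 4-residue composition: 4! / prod(t_n!)."""
    denom = 1
    for n in Counter(quad).values():
        denom *= factorial(n)
    return factorial(4) // denom

def snapp_log_likelihood(quadruplets, residue_freq):
    """Log-likelihood score per composition.

    quadruplets: iterable of 4-tuples of residue letters, one per tetrahedron
    residue_freq: dict mapping residue letter to its overall frequency
    """
    observed = Counter(tuple(sorted(q)) for q in quadruplets)
    total = sum(observed.values())
    scores = {}
    for comp, count in observed.items():
        f = count / total                    # observed quadruplet frequency
        p = permutation_factor(comp)         # expected frequency if residues
        for aa in comp:                      # were placed independently
            p *= residue_freq[aa]
        scores[comp] = log10(f / p)
    return scores

print(permutation_factor(("A", "L", "L", "V")))  # 12
print(snapp_log_likelihood([("L", "L", "L", "L")], {"L": 0.5}))
```

A positive score means the composition occurs in Delaunay tetrahedra more often than chance would predict. Summing the scores over a structure's tetrahedra gives a total packing score.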
18
Likelihood Mapped to hydrophobic core
19
Applications

Decoy discrimination (Krishnamoorthy and Tropsha, 2003)
Weighting scheme based on tetrahedron sequence topology
Conformation change on ligand binding (Sherman et al., 2003)
Study of folding simulations (Krishnamoorthy and Tropsha, 2003)
Ligand-receptor binding affinity (Zhang, Golbraikh and Tropsha, 2004)
Contribution of almost-Delaunay
How stable is the SNAPP score computed using Delaunay?
Compute variants of it using AD and Delaunay Probability

20
Results: scoring decoys
Decoy sets: 1. 4state_reduced 2. lattice_ssfit 3. semfold

SNAPP with Delaunay probabilities distinguishes
decoys from native state as well as (even better
than?) Delaunay-based SNAPP.
Hence, the original Delaunay-based score is
stable

21
Results: scoring CASP5 predictions

SNAPP with Delaunay probabilities discriminates
native structures from predictions as well as
Delaunay-based SNAPP (usually even better).
Hence, the original Delaunay-based score is
stable

Z-score (Rank)
22
Application 2: Packing Differences

How does DT change as points are perturbed, for
different point sets?

sidechain centroids
(2cro 4state_reduced)
23
Stability of the DT in Proteins
Right: Number of Delaunay and AD(0.3) tetrahedra for a sample of predictions submitted to CASP5. Notice that the native structures, colored green, have fewer AD tetrahedra for the same number of Delaunay tetrahedra.
Left: The average number of AD tetrahedra at low ε (< 0.5 Å) grows faster for random points than for proteins, as seen in this cumulative histogram. This suggests that the DT is stable for small perturbations in proteins.
24
Application 3: Secondary structure from Cα

AD threshold histogram of α-helix has a unique signature that enables helix assignment from Cαs

25
AD secondary structure assignment

Strong α-helix signal, weaker β-sheet and β-turn signals
Better accuracy than previous work (Wako and Yamato, 1998)
More tolerant to structural and H-bond imperfections than DSSP
1bg5, irregular helix on right
Applications
consensus assignment
structure prediction

Panels: 1bg5 (AD), 1bg5 (DSSP)
Above: Visual comparison of α-helix, β-sheet and β-turn assignments in 1BG5, showing an irregular α-helix detected by AD and not DSSP.
26
Application 4: Conformational Change and Hinges

Analysis of conformational change and detection
of hinges from a few unaligned conformations
using AD tetrahedra

27
Neighbor Changes on Motion

Motion: major rearrangements at a few key residues, the hinges
Model as neighbor changes, rather than large dihedral angle changes
DT contains no conformational change signal; AD tetrahedra do
In neighborhood of hinge region, neighbor
relationships change drastically (quantify by
changes in AD tetrahedra thresholds)
Ovotransferrin, threshold color 0, 0.01-0.1,
0.1-0.5, 0.5-1,1-2
Hinge residues from hinge tetrahedra

1TFA SC apo (open) form
1IEJ SC holo (closed) form
28
Comparison with literature
Ovotransferrin hinges from 3 conformations
TrpRS 8 chains, preTS, MD sim
sidechain centroids

? hinge region
? isolated hinge residue
Labeled residues are known from literature

29
Application 5: Family-Specific Fingerprints

Find residue packing patterns specific to protein
families, using graph representations with DT/AD
edges.
Use for family classification and functional
annotation

30
Graph Representation
Proteins, Small Molecules
Peptide edge, Proximity edge
Node label: amino acid type, chemical properties
Edge label: sequence adjacency or structure proximity, determined by distance
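A minimal Python sketch of this representation follows. It assumes one point per residue and a plain distance cutoff for proximity edges; the 8.5 Å cutoff and the function name are hypothetical choices for the example, and the talk itself uses Delaunay or almost-Delaunay edges to define proximity.

```python
from math import dist

def protein_graph(residues, cutoff=8.5):
    """Build a labeled graph from residues listed in chain order.

    residues: list of (amino_acid_label, (x, y, z))
    Returns (nodes, edges): nodes maps index -> amino acid label,
    edges maps (i, j) with i < j -> "peptide" or "proximity".
    """
    nodes = {i: aa for i, (aa, _) in enumerate(residues)}
    edges = {}
    for i in range(len(residues)):
        for j in range(i + 1, len(residues)):
            if j == i + 1:
                edges[(i, j)] = "peptide"      # adjacent in sequence
            elif dist(residues[i][1], residues[j][1]) <= cutoff:
                edges[(i, j)] = "proximity"    # close in space only
    return nodes, edges

# Four hypothetical residues on a line, 3.8 A apart, with one far away.
chain = [("HIS", (0.0, 0.0, 0.0)), ("ASP", (3.8, 0.0, 0.0)),
         ("SER", (7.6, 0.0, 0.0)), ("GLY", (30.0, 0.0, 0.0))]
nodes, edges = protein_graph(chain)
print(edges)
# {(0, 1): 'peptide', (0, 2): 'proximity', (1, 2): 'peptide', (2, 3): 'peptide'}
```

Swapping the distance test for membership in a set of DT or AD edges gives the other graph representations discussed in the talk.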
31
Graph Database Mining

Input: database of labeled undirected graphs, threshold 0 < σ ≤ 1

σ = 2/3

Output All (connected) frequent subgraphs from
the graph database.
Performance is critical
Number of patterns can grow exponentially for
large and dense graphs
Subgraph isomorphism (NP-complete)
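The support threshold can be illustrated on the smallest possible patterns, single labeled edges. The sketch below is not one of the mining algorithms from this work, which grow larger subgraphs and must handle isomorphism; it only shows how support is counted once per graph and compared with the threshold. The graph encoding (node-label dict plus edge-label dict) and the function name are assumptions.

```python
def frequent_edge_patterns(graphs, sigma):
    """Return {pattern: support} for one-edge patterns with support >= sigma.

    graphs: list of (nodes, edges); nodes maps id -> label,
            edges maps (u, v) -> edge label.
    A pattern is ((label_a, label_b), edge_label) with the node labels sorted.
    """
    counts = {}
    for nodes, edges in graphs:
        seen = set()                      # count each pattern once per graph
        for (u, v), edge_label in edges.items():
            seen.add((tuple(sorted((nodes[u], nodes[v]))), edge_label))
        for pattern in seen:
            counts[pattern] = counts.get(pattern, 0) + 1
    n = len(graphs)
    return {p: c / n for p, c in counts.items() if c / n >= sigma}

g1 = ({0: "HIS", 1: "ASP"}, {(0, 1): "proximity"})
g2 = ({0: "ASP", 1: "HIS", 2: "SER"}, {(0, 1): "proximity", (1, 2): "peptide"})
g3 = ({0: "GLY", 1: "SER"}, {(0, 1): "peptide"})
print(frequent_edge_patterns([g1, g2, g3], sigma=2 / 3))
```

With a threshold of 2/3, only the ASP-HIS proximity edge survives, because it occurs in two of the three graphs.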

32
Subgraph mining algorithms developed in our group

Frequent Subgraph Mining (ICDM03)
Canonical Adjacency Matrix (CAM) tree
Induced Subgraph Mining (RECOMB04)
Induced subgraphs are geometrically more rigid, superimposable
Misses many useful motifs embedded in a dense graph
Maximal frequent subgraph mining (SIGKDD04)
Mines only maximal frequent subgraphs (no supergraph is frequent)
Uses a spanning tree comparison algorithm
CliqueHashing and CliqueHashing (ISMB05 demo)
Finding frequent cliques in linear time

33
Three Graph Representations
E(DT) ⊆ E(AD) ⊆ E(CD)
34
Family Specific Fingerprints

Frequent: occur in >80% of family proteins
Family-specific: occur in <5% of background proteins

Figure labels: subgraphs G1 and G2; residues TRP141, GLY196, GLY197, CYS42, ALA55, HIS57, CYS58
Subgraph G1: not sequence conserved. Useful for the annotation of structural orphans.
Subgraph G2: sequence-conserved motif C-x(12)-A-x-H-C. Useful for the annotation of both structural orphans and sequences.
Human Kallikrein 6 (1LO6), Serine Protease family
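The two frequency criteria on this slide amount to a simple filter once each candidate subgraph has been counted in the family and in the background. The sketch below assumes those counts are already available; the pattern names and counts are invented for the example.

```python
def family_specific(patterns, family_size, background_size,
                    min_family=0.80, max_background=0.05):
    """Select fingerprints: frequent in the family, rare in the background.

    patterns: dict name -> (family_hits, background_hits), where each hit
              count is the number of proteins containing the subgraph.
    """
    return sorted(
        name for name, (fam, bg) in patterns.items()
        if fam / family_size > min_family and bg / background_size < max_background
    )

# Hypothetical counts for a 59-member family and a 6500-protein background.
candidates = {
    "G1": (55, 100),    # 93% of family, 1.5% of background: kept
    "G2": (40, 10),     # only 68% of family: rejected
    "G3": (59, 1000),   # 15% of background: rejected
}
print(family_specific(candidates, family_size=59, background_size=6500))  # ['G1']
```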
35
Largest Serine Protease Fingerprint
1LO6
Blue: His57-Asp102-Ser195 catalytic triad; Grey: others
36
Cyclin Dependent Protein Kinases (structure of PDB 1B6C)

6 residue motif is highlighted in Red
ASP(333) is part of the active site
Conserved in 18 out of 29 PK proteins.

37
Applications of Family-Specific Fingerprints

Functional family inference for Structural
Genomics
Functional family inference for predicted
structures
Functional neighbors and remote structural
similarity
Deriving sequence patterns from fingerprints

38
Motivation

Hypothetical proteins from Structural Genomics: structure known, function unknown
Function has to be inferred from structure
Overall fold similarity to a structure with known function
Local structure similarity to a structure with known function
Overall fold similarity is not necessary, and is sometimes misleading
Existing local structure methods
Search for known functional sites
Derive templates by clique detection

39
Related work: function inference from local structure

Detecting similarity to known functional sites
SiteEngine (Shulman-Peleg et al., 2003)
SURFACE (Ferre et al., 2004)
eF-site (Kinoshita and Nakamura, 2004)
PINTS-weekly (Stark, Shkumatov and Russell, 2004)
Detecting functional sites derived from protein families
FoldMiner (Shapiro and Brutlag, 2003)
Phunctioner (Pazos and Sternberg, 2004)
DRESPAT (Wangikar et al., 2003)
Common structural cliques (Milik et al., 2003)

geom. hashing
surface patches
40
Method for functional inference

Pick families from SCOP, EC or other
classifications
Model protein structures by labeled graphs, with
almost-Delaunay edges defining proximity
Enumerate all frequent subgraphs within the
family using a subgraph mining algorithm
Pick frequent subgraphs infrequent in background
as family-specific fingerprints
Search for fingerprints in structure to be
annotated
use an index of graph similarity to speed up Ullmann's algorithm
Assign significance of family membership based on
the fingerprints found.
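The fingerprint search step is a labeled subgraph isomorphism problem. The sketch below is a plain backtracking search with label and degree pruning, written for illustration. It is not Ullmann's algorithm with its refinement step and does not include the index mentioned above; the graph encoding and the function name are assumptions.

```python
def find_embedding(pattern, target):
    """Find one embedding of `pattern` in `target`, or return None.

    Each graph is (labels, adj): labels maps node -> label and
    adj maps node -> set of neighboring nodes (undirected).
    Pattern edges must map to target edges; extra target edges are allowed.
    """
    p_labels, p_adj = pattern
    t_labels, t_adj = target
    # Match high-degree pattern nodes first to fail early.
    order = sorted(p_labels, key=lambda n: -len(p_adj[n]))

    def extend(mapping):
        if len(mapping) == len(order):
            return dict(mapping)
        u = order[len(mapping)]
        for v in t_labels:
            if v in mapping.values() or t_labels[v] != p_labels[u]:
                continue
            if len(t_adj[v]) < len(p_adj[u]):
                continue
            # Every already-mapped neighbor of u must be adjacent to v.
            if all(mapping[w] in t_adj[v] for w in p_adj[u] if w in mapping):
                mapping[u] = v
                found = extend(mapping)
                if found is not None:
                    return found
                del mapping[u]
        return None

    return extend({})

fingerprint = ({"a": "HIS", "b": "ASP", "c": "SER"},
               {"a": {"b", "c"}, "b": {"a"}, "c": {"a"}})
structure = ({0: "ALA", 1: "HIS", 2: "ASP", 3: "SER", 4: "HIS"},
             {0: {1}, 1: {0, 2, 3}, 2: {1}, 3: {1, 4}, 4: {3}})
print(find_embedding(fingerprint, structure))  # {'a': 1, 'b': 2, 'c': 3}
```

The worst case is exponential, which is why an index that rules out most candidate proteins before the search matters for a large background set.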

41
Fast Graph Search Using Local Neighborhood Index
Hard case: search for 11 subgraphs in the 6500-protein background dataset (hydrophobic, average 60 occurrences per protein)
Figure labels: nodes 1-5; residues ASP 102, SER 214, ALA 196, ALA 55, HIS 57
Intractable w/o index
42
Function Inference Using Fingerprints

Given query structure q
Given fingerprints X1 ... Xm for prospective family Fi
Say Xq1 ... Xqn occur in q; is q in Fi?
Simple: approximation based on fingerprints; P-value based on the number of BG (background) proteins with more fingerprints
Accurate: Bayesian formula applied to family and background probabilities of Xq1 ... Xqn
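Both scoring ideas can be sketched in a few lines of Python. The exact formulas are not given on the slide, so the following is an assumption-laden illustration: the P-value is taken as the fraction of background proteins with at least as many fingerprints as the query, and the Bayesian score treats fingerprints as independent (a naive Bayes simplification). All names and numbers are hypothetical.

```python
def empirical_p_value(n_query, background_counts):
    """Fraction of background proteins with at least n_query fingerprints."""
    at_least = sum(1 for n in background_counts if n >= n_query)
    return at_least / len(background_counts)

def posterior_family(found, p_family, p_background, prior):
    """P(q in family | fingerprints found), assuming independent fingerprints.

    found: fingerprint names found in the query
    p_family / p_background: dicts name -> occurrence probability
    prior: prior probability that a protein belongs to the family
    """
    like_family = like_background = 1.0
    for x in found:
        like_family *= p_family[x]
        like_background *= p_background[x]
    evidence = prior * like_family + (1.0 - prior) * like_background
    return prior * like_family / evidence

print(empirical_p_value(30, [0, 0, 1, 2, 30]))   # 0.2
print(posterior_family(["X1", "X2"],
                       {"X1": 0.9, "X2": 0.9},
                       {"X1": 0.01, "X2": 0.01},
                       prior=0.01))
```

Even with a 1% prior, two fingerprints that are common in the family and rare in the background push the posterior close to 1, which is the consensus effect of multiple fingerprints.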

43
Advantages of our method

Sequence similarity not sensitive enough
Global fold similarity misleading
Functional site similarity
Different functional families sometimes share
functional sites
Exact matching may not be robust
(distortion/mutation)
Clique methods sacrifice generality of patterns
Subgraph fingerprints
Family-specific, few false positives by
definition
Multiple fingerprints consensus
Confidence of family membership

44
Cross-validation of fingerprints

Four-fold CV
Splitting family members into training and test
sets
Mining fingerprints from training sets
Report fraction of training set FPs found in test set:
Eukaryotic Serine Proteases, 59 members, 824
Triosephosphate Isomerase, 12 members, 6016
Metallodependent Hydrolases, 17 members, 4912
Report false positives and false negatives

large, homogeneous families
small, diverse families
small, diverse families
45
Discriminating the TIM barrels

Validation of method all TIM barrel families are
structurally very similar.

46
Annotations missed by SCOP 1.65
New serine protease annotations, based on the number of fingerprints found out of 79 Serine Protease fingerprints: 1op0A (73/79), 1os8A (73/79), 1p57B (73/79), 1s83 (73/79), 1ssx (46/79), 1md8 (45/79).
New Triosephosphate Isomerase (TIM): 1r2r, 1885/1920 fingerprints.
Verified in PDB file headers, literature
All the above except 1op0 have been classified in SCOP 1.67, Feb 2005
47
Structural Genomics Function inference I
Metallo-dependent hydrolase: 8-stranded βα (TIM) barrel fold, 17 members, 49 FP
Unknown function: 7-stranded barrel fold, 30 FP found
48
Residues hit by fingerprints
Figures made in VMD
Metallo-dependent hydrolase: 8-stranded βα (TIM) barrel fold, 17 members, 49 FP
Unknown function: 7-stranded barrel fold, 30 FP found
Legend: Acidic, Basic, Polar, Hydrophobic
49
Function inference for predictions

Check predicted structures against family of
template
SNAPP (Fischer et al., 2004) and SPREK (Taylor and Jonassen, 2004) are not family-specific:
well-packed predictions with the wrong fold may score high
Fingerprints infer the correct functional family, even if the template chosen is incorrect.
E.g. CASP5 target T0147, PDB 1m65
rare (βα)8 fold, putative metallo-dependent hydrolase (MDH)
107 predictions ranked 1
50 predictions had 50 or more of 49 MDH FP
51 other families had 4 preds with 50 FP

50
Functional Neighbors

Finding families that share some fingerprints
Search for family fingerprints in the background
Cluster hits for significant enrichment in SCOP,
GO hierarchy
Eg. Find local similarity between remotely
related SCOP families

Figure labels: 1lvl, 1kew
SCOP: NAD(P)-binding Rossmann fold
SCOP: FAD/NAD-linked reductase
The DALI Z-score of the two structures is 4.5, which suggests that they are dissimilar at the fold level. The pair-wise sequence identity is 16%, and there is no local sequence similarity at the region of the motif.
51
Sequence Patterns from Tertiary Packing

Frequent DT quadruplets or subgraph motifs that are conserved in sequence order, mapped back to sequence → Sparse Sequence Signatures
Evaluate precision/recall by querying SwissProt
Overlap with / comparable to PROSITE patterns
Joint work with Ruchir Shah

Sequence motif: {aa1, aa2, aa3, aa4, d12, d23, d34} = {D, S, G, P, 2, 3, 7}
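Such a signature can be turned into a pattern for querying sequences. The sketch below assumes that each d value is the difference in sequence position between consecutive motif residues, so d = 2 leaves one arbitrary residue in between. That reading is an assumption, as are the function name and the test sequence.

```python
import re

def signature_to_regex(residues, separations):
    """Convert (aa1..aa4, d12, d23, d34) into a regular expression.

    residues: one-letter codes in sequence order, e.g. ("D", "S", "G", "P")
    separations: position differences between consecutive residues
    """
    pattern = residues[0]
    for aa, d in zip(residues[1:], separations):
        if d > 1:
            pattern += ".{%d}" % (d - 1)   # d - 1 arbitrary residues in between
        pattern += aa
    return pattern

regex = signature_to_regex(("D", "S", "G", "P"), (2, 3, 7))
print(regex)                                        # D.{1}S.{2}G.{6}P
match = re.search(regex, "MKDASAAGAAAAAAPQ")        # made-up sequence
print(match.start())                                # 2
```

In PROSITE-style notation the same signature would read D-x-S-x(2)-G-x(6)-P, which makes the comparison with PROSITE patterns direct.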
52
Future Work

Biological validation of function inference
Future applications in bioinformatics
Hierarchical family fingerprints to infer
function for novel folds, with no putative family
information
Tool for template verification in homology
modeling/fold recognition
Augment domain classifications (SCOP) with
motif-based functions
Augment structure neighbor searches (VAST) with
functional neighbors
Robust neighbor relation to accelerate MD, QM
simulations
Improve docking (graph matching, MD)
Local similarity search
Other geometric computations (Voronoi volumes/domains, alpha-shapes, ...)

53
Thanks to

Thesis advisor Dr. Jack Snoeyink (UNC CS)
Collaborators in this work
Dr. Alexander Tropsha (UNC Pharmacy)
Jun (Luke) Huan, Dr. Wei Wang, Dr. Jan Prins (UNC
CS)
Ruchir Shah (UNC Biomolecular Informatics)
Dr. Bala Krishnamoorthy (Washington State U.
Pullman, Math)
Dr. Charlie Carter (UNC Biochemistry)
Mother nature, for her wonderful imprecision and
complexity, that is an endless source of problems

54
References

References to my publications
Bandyopadhyay, D. and J. Snoeyink (2004). Almost-Delaunay simplices: Nearest neighbor relations for imprecise points. ACM-SIAM Symposium On Discrete Algorithms (SODA04). http://www.cs.unc.edu/debug/papers/AlmDel
Bandyopadhyay, D. and J. Snoeyink (2004). Almost-Delaunay simplices: Robust nearest neighbor relations for imprecise points in CGAL. Second CGAL User Workshop, 2004. Software: http://www.cs.unc.edu/debug/software
Jun Huan, Wei Wang, Deepak Bandyopadhyay, Jack Snoeyink, Jan Prins, Alexander Tropsha (2004). Finding Protein Family-specific residue packing patterns in Protein Structure Graphs. RECOMB 2004. Invited to Journal of Computational Biology, 2005, in press.
Bandyopadhyay, Deepak, Alexander Tropsha and Jack Snoeyink. A Robust Score for Protein Packing using Almost-Delaunay Tetrahedra. 2005, in submission.
Bandyopadhyay, Deepak, Jun Huan, Jinze Liu, Jan Prins, Jack Snoeyink, Wei Wang, and Alexander Tropsha. Protein Functional Family Identification by Fast Subgraph Isomorphism Using Structure-Based Fingerprints Mined from SCOP and EC families. 2005, in submission. Poster presented at Triangle Biophysics Symposium, 2004.
Bandyopadhyay, Deepak, Jack Snoeyink, Alexander Tropsha and Charlie Carter. Analysis of Protein Conformational Change Using Almost-Delaunay Tetrahedra. Manuscript in preparation. Poster presented at Pacific Symposium on Biocomputing (PSB), Jan. 2005, Big Island of Hawaii.
Bandyopadhyay, Deepak, Alexander Tropsha and Jack Snoeyink. Analyzing Protein Structure using Almost-Delaunay Tetrahedra. UNC-CS Technical Report TR03-043, 2003. Poster presented at RECOMB 2004, March 2004, San Diego, CA.

55
References

Computational geometry methods applied to protein structure analysis
Gerstein, M., J. Tsai, and M. Levitt (1995). The volume of atoms on the protein surface: Calculated from simulation, using Voronoi polyhedra. Journal of Molecular Biology 249(5), 955-966.
Tsai, J., R. Taylor, C. Chothia, and M. Gerstein (1999). The packing density in proteins: Standard radii and volumes. Journal of Molecular Biology 290(1), 253-266.
Angelov, B., J. Sadoc, R. Jullien, A. Soyer, J. Mornon, and J. Chomilier (2002). Nonatomic solvent-driven Voronoi tessellation of proteins: an open tool to analyze protein folds. Proteins 49(4), 446-456.
J. Pontius, J. Richelle and S.J. Wodak (1996). Deviations from Standard Atomic Volumes as a Quality Measure for Protein Crystal Structures. Journal of Molecular Biology 264(1), 121-136.
H. Edelsbrunner and P. Koehl. The weighted-volume derivative of a space-filling diagram. PNAS, Mar 2003; 100: 2203-2208.
Liang, J. and K. A. Dill (2001). Are proteins well-packed? Biophys. J. 81(2), 751-766.
J. Liang, H. Edelsbrunner, P. Fu, P. Sudhakar, and S. Subramaniam. Analytical shape computing of macromolecules II: identification and computation of inaccessible cavities inside proteins. Proteins, 33:18-29, 1998.
H.L. Cheng. Algorithms for Smooth and Deformable Surfaces in 3D. Ph.D. Dissertation, University of Illinois at Urbana-Champaign, 2002.
Y.-E. Ban, H. Edelsbrunner and J. Rudolph. Interface surfaces for protein-protein complexes. Proc. RECOMB 2004.
Wernisch, L., M. Hunting, and S. Wodak (1999). Identification of structural domains in proteins by a graph heuristic. Proteins 35(3), 338-352.
Wako, H. and T. Yamato (1998). Novel method to detect a motif of local structures in different protein conformations. Protein Engineering 11, 981-990.

56
References

SNAPP
C. W. Carter, B. C. LeFebvre, S. Cammer, A. Tropsha, and M. H. Edgell (2001). Four-body potentials reveal protein-specific correlations to stability changes caused by hydrophobic core mutations. Journal of Molecular Biology, 311(4):625-638.
B. Krishnamoorthy and A. Tropsha (2003). Development of a four-body statistical pseudo-potential to discriminate native from non-native protein conformations. Bioinformatics, 19(12).
Tropsha, A., Carter, C., Cammer, S. and Vaisman, I. (2003). Simplicial neighborhood analysis of protein packing (SNAPP): a computational geometry approach to studying proteins. Meth. Enzymol., 374, 509-544.
Hinges
Krebs WG, Alexandrov V, Wilson CA, Echols N, Yu H, Gerstein M. (2002). Normal mode analysis of macromolecular motions in a database framework: developing mode concentration as a useful classifying statistic. Proteins. 2002 Sep 1;48(4):682-95.
Jacobs DJ, Rader AJ, Kuhn LA, Thorpe MF (2001). Protein Flexibility Predictions using Graph Theory. Proteins 44, 150-165.
M.F. Thorpe, Ming Lei, A.J. Rader, Donald J. Jacobs, and Leslie A. Kuhn (2001). Protein Flexibility and Dynamics using Constraint Theory. J. Molecular Graphics and Modelling 19, 60-69.
Secondary structure
Kabsch, W. and C. Sander (1983). Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features. Biopolymers 22(12), 2577-2637.
Family-specific motifs
Cammer, S. A. and A. Tropsha (2000). Identification of sequence specific tertiary packing motifs in protein structures using Delaunay tessellation. Lecture Notes in Computational Science and Engineering. Springer Verlag, New York.
J. Huan, W. Wang, and J. Prins (2003). Efficient Mining of Frequent Subgraphs in the Presence of Isomorphism. International Conference on Data Mining 03.
Jun (Luke) Huan, Wei Wang, Anglinia Washington, Jan Prins, Ruchir Shah, Alexander Tropsha (2004). Accurate Classification of Protein Structural Families Based on Coherent Subgraph Mining. PSB 2004.
Huan, J., Wang, W., Prins, J. and Yang, J. (2004b). SPIN: Mining maximal frequent subgraphs from graph databases. SIGKDD 2004.

57
Canonical Adjacency Matrix

The Canonical Adjacency Matrix (CAM) of a graph G
is the maximal adjacency matrix for G under a
total ordering defined on adjacency matrices.
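The definition can be illustrated with a brute-force Python sketch that tries every node ordering and keeps the largest code. It assumes one common convention: node labels sit on the diagonal, edge labels off the diagonal, "0" marks a missing edge, and a matrix is compared by reading its lower triangle row by row. The function name and the example graphs are hypothetical, and the mining algorithms avoid this factorial enumeration.

```python
from itertools import permutations

def canonical_code(labels, edges):
    """Maximal lower-triangle code over all node orderings of a labeled graph.

    labels: list of node labels (strings)
    edges: dict {(i, j): edge_label} for an undirected graph
    """
    n = len(labels)

    def entry(i, j):
        if i == j:
            return labels[i]                       # node label on the diagonal
        return edges.get((i, j)) or edges.get((j, i)) or "0"

    best = None
    for perm in permutations(range(n)):
        code = tuple(entry(perm[r], perm[c])
                     for r in range(n) for c in range(r + 1))
        if best is None or code > best:
            best = code
    return best

# Two drawings of the same graph: node a joined to two b nodes by x and y.
g1 = canonical_code(["a", "b", "b"], {(0, 1): "x", (0, 2): "y"})
g2 = canonical_code(["b", "a", "b"], {(1, 0): "y", (1, 2): "x"})
# A different graph: the path a -x- b -y- b.
g3 = canonical_code(["a", "b", "b"], {(0, 1): "x", (1, 2): "y"})
print(g1 == g2, g1 == g3)  # True False
```

Because isomorphic graphs share one canonical code, the code can serve as a key for detecting duplicate patterns during mining.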

58
CAM Tree
[Figure: CAM tree; matrix entries show node labels a, b, c, d and edge labels x, y, with 0 for no edge]
59
Chemical Datasets

Predictive Toxicology Evaluation Competition
Dataset: 337 compounds
Two class labels: positive (180) and negative (157)
Each chemical graph contains 27 nodes and 27 edges on average
NIH DTP Anti-Viral Screen Test
Chemicals are classified as Confirmed Active (CA), Confirmed Moderate Active (CM) and Confirmed Inactive (CI) in the NIH DTP Anti-Viral Screen Test.
Dataset contains 423 CA and 1083 CM compounds
Each chemical graph contains 25 nodes and 27 edges on average