1
Deepak Bandyopadhyay
A Geometric Framework for Robust Nearest Neighbor
Analysis of Protein Structure and Function
  • Department of Computer Science,
  • University of North Carolina at Chapel Hill

2
Outline
  • Motivation: Use geometric proximity (Voronoi /
    Delaunay) to analyze protein structure and gain
    insight into protein function
  • Methods: Geometric proximity structures have
    problems with imprecise points. But we can fix
    this!
  • Applications: Let's modify existing neighbor
    analyses of protein structure to make them
    robust, and design new ones!
  • Briefly: SNAPP, packing differences, secondary
    structure, hinges
  • In detail: structural fingerprints for function
    inference
3
Nearest Neighbors
4
Geometric structures on point sets
  • Voronoi Diagram

Delaunay triangulation / tessellation (DT)
  • Input Points
  • Output Neighbors

5
Delaunay tessellation of proteins
  • Represent each amino acid by a point
  • Cα, side-chain centroid, Cβ, ...
  • Delaunay tetrahedra correspond to nearest
    neighbor quadruplets
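
The correspondence between Delaunay tetrahedra and nearest neighbor quadruplets can be illustrated with a brute-force sketch: a quadruplet is kept when the sphere through its four points contains no other point. This is only an illustration of the definition for tiny inputs, not the method used in the talk; the function names and the toy coordinates are assumptions, and NumPy is assumed to be available.

```python
from itertools import combinations

import numpy as np


def circumsphere(pts):
    """Center and radius of the sphere through four 3D points (rows of pts)."""
    a = pts[0]
    # |x - p_i|^2 = |x - a|^2  <=>  2 (p_i - a) . x = |p_i|^2 - |a|^2
    lhs = 2.0 * (pts[1:] - a)
    rhs = np.sum(pts[1:] ** 2, axis=1) - np.sum(a ** 2)
    center = np.linalg.solve(lhs, rhs)
    return center, np.linalg.norm(center - a)


def delaunay_quadruplets(points, tol=1e-9):
    """Brute-force list of index quadruplets whose circumsphere is empty."""
    points = np.asarray(points, dtype=float)
    n = len(points)
    found = []
    for quad in combinations(range(n), 4):
        try:
            center, radius = circumsphere(points[list(quad)])
        except np.linalg.LinAlgError:
            continue  # coplanar points define no sphere
        others = [i for i in range(n) if i not in quad]
        dists = np.linalg.norm(points[others] - center, axis=1)
        if np.all(dists > radius - tol):
            found.append(quad)
    return found


# Toy "residue" points (hypothetical coordinates, one point per residue).
toy_points = [(0, 0, 0), (1, 0, 0), (0, 1, 0), (0, 0, 1), (2, 2, 2)]
toy_quads = delaunay_quadruplets(toy_points)
print(toy_quads)
```

The brute-force loop is quartic in the number of points, so a real protein would need a proper triangulation library instead.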

6
Delaunay Tessellation Applications
  • SNAPP, four-body statistical potential for
    hydrophobic core stability (Carter et al., 2001)
  • Decoy discrimination (Krishnamoorthy and
    Tropsha, 2003)
  • Scoring ligand-receptor binding affinity (Zhang
    et al., 2004)
  • Mining frequent substructures in protein families
    (Huan et al., 2004, 2005)
  • Structure-based function inference
    (Bandyopadhyay et al., 2005)
7
Outline
  • Motivation
  • Methods: Geometric proximity structures have
    problems with imprecise points. But we can fix
    this!
  • Applications
8
Effect of Imprecision on Delaunay
  • If point coordinates are imprecise...
  • What happens to the Delaunay neighbors?
  • Think of 4 nearly co-circular points in 2D:
    Delaunay edges may flip, so neighbors change.
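
The 2D case can be made concrete with the standard in-circle determinant. The sketch below uses four points that are almost on the unit circle and shows that moving one of them by 0.002 changes which diagonal is the Delaunay edge. The function names and coordinates are illustrative assumptions.

```python
def in_circle(a, b, c, d):
    """Standard in-circle determinant for counter-clockwise a, b, c.

    Positive when d is inside the circle through a, b, c; negative when outside.
    """
    rows = []
    for p in (a, b, c):
        dx, dy = p[0] - d[0], p[1] - d[1]
        rows.append((dx, dy, dx * dx + dy * dy))
    (a1, a2, a3), (b1, b2, b3), (c1, c2, c3) = rows
    return (a1 * (b2 * c3 - b3 * c2)
            - a2 * (b1 * c3 - b3 * c1)
            + a3 * (b1 * c2 - b2 * c1))


def delaunay_diagonal(a, b, c, d):
    """Which diagonal of the convex quadrilateral a, b, c, d is a Delaunay edge."""
    det = in_circle(a, b, c, d)
    if det > 0:
        return "bd"   # d is inside circle(a, b, c), so edge ac is not Delaunay
    if det < 0:
        return "ac"   # circle(a, b, c) is empty, so triangle abc is Delaunay
    return "either"   # exactly co-circular: degenerate


a, b, c = (1.0, 0.0), (0.0, 1.0), (-1.0, 0.0)
print(delaunay_diagonal(a, b, c, (0.0, -1.001)))  # fourth point just outside
print(delaunay_diagonal(a, b, c, (0.0, -0.999)))  # fourth point just inside
```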

9
Which applications are affected by instability
of Delaunay?
  • Frequent subgraphs: qualitative, discretized.
    Worse affected
  • Voronoi volumes: quantitative, continuous. Less
    affected
  • When people use Delaunay in analysis of protein
    structure, they assume it is robust to
    perturbations!

10
Method 1: Almost-Delaunay (AD) tetrahedra
  • A 4-tuple of points is in AD(ε) if, by
    perturbing all points in the set by at most ε,
    its circumscribing sphere can become empty.
  • The minimum perturbation required, ε, is the AD
    threshold.

Each vertex can move within a sphere of radius ε
Green: Delaunay, in AD(0). Red: in AD(ε)
11
AD tetrahedra for protein 2ACY, 98 residues,
Cαs (colored by threshold; DT not shown, for
clarity)
AD tetrahedra may overlap; they do not tile space
12
Computing AD thresholds (Bandyopadhyay and
Snoeyink, 2004)
  • Find the spherical shell of minimum width, using
    a result from computational metrology
    (Garcia-Lopez et al., 1998)
  • Given a set of points P, a simplex t is in AD(ε)
    iff its points are contained between 2 concentric
    spheres such that
  • the difference in radii is 2ε, the minimum over
    all such concentric spheres
  • the inner sphere contains no points of P
2D Example
Code to compute AD edges, triangles, tetrahedra
for 3D points, in C++/CGAL (with MATLAB
interface and utilities), is available
from http://www.cs.unc.edu/debug/software
13
Method 2: Delaunay Probability
  • AD(ε) captures worst-case deviation in
    coordinates
  • Uncertainty in actual coordinates → probabilistic
    model
  • Assume each point has a Gaussian p.d.f.
  • $\Pr(\text{sphere empty of } p_i) = 1 - \int_{\text{sphere}} f_{p_i}$,
    where $f_{p_i}$ is the p.d.f. of $p_i$
  • Probability that tetrahedron abcd is Delaunay:
  • integrate over all possible spheres defined by
    a, b, c, d:
    $\Pr(\text{sphere}) \prod_{p \notin \{a,b,c,d\}} \Pr(\text{sphere empty of } p)$
  • AD algorithm makes Delaunay Probability
    computation feasible
  • Delaunay Probability significant only for
    tetrahedra with low ε!
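
As a rough cross-check of the idea, the Delaunay probability of a tetrahedron can be estimated by sampling instead of integrating. The sketch below perturbs every point with isotropic Gaussian noise and counts how often the circumsphere of the chosen quadruplet stays empty. This Monte Carlo estimate is a substitute for the integration described on the slide, not the talk's algorithm; the names, `sigma`, and the toy coordinates are assumptions, and NumPy is assumed to be available.

```python
import numpy as np


def circumsphere(pts):
    """Center and radius of the sphere through four 3D points (rows of pts)."""
    a = pts[0]
    lhs = 2.0 * (pts[1:] - a)
    rhs = np.sum(pts[1:] ** 2, axis=1) - np.sum(a ** 2)
    center = np.linalg.solve(lhs, rhs)
    return center, np.linalg.norm(center - a)


def delaunay_probability(points, quad, sigma, trials=500, seed=0):
    """Fraction of Gaussian perturbations in which `quad` has an empty circumsphere."""
    rng = np.random.default_rng(seed)
    points = np.asarray(points, dtype=float)
    quad = list(quad)
    others = [i for i in range(len(points)) if i not in quad]
    hits = 0
    for _ in range(trials):
        noisy = points + rng.normal(scale=sigma, size=points.shape)
        try:
            center, radius = circumsphere(noisy[quad])
        except np.linalg.LinAlgError:
            continue  # degenerate sample: counts as not Delaunay
        dists = np.linalg.norm(noisy[others] - center, axis=1)
        hits += bool(np.all(dists >= radius))
    return hits / trials


corners = [(0, 0, 0), (1, 0, 0), (0, 1, 0), (0, 0, 1)]
p_far = delaunay_probability(corners + [(5, 5, 5)], (0, 1, 2, 3), sigma=0.01)
p_inside = delaunay_probability(corners + [(0.5, 0.5, 0.5)], (0, 1, 2, 3), sigma=0.01)
print(p_far, p_inside)
```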

14
Summary of contributions
  • Algorithmic
  • Theory and Algorithm for the general framework
  • Fast and robust implementation for 3D points
  • Application domain
  • Nearest neighbor analysis with imprecision
  • Applications explored
  • Scoring protein packing with a statistical 4-body
    potential (SNAPP)
  • Quantifying packing differences between proteins
    and other structures
  • Assigning secondary structure from Cαs
  • Analyzing conformational changes and finding
    hinge residues
  • Finding local packing motifs specific to protein
    families, applied to structure classification,
    and functional inference for structural genomics

15
Outline
  • Motivation
  • Methods
  • Applications: Let's modify existing neighbor
    analyses of protein structure to make them
    robust, and design new ones!
  • Briefly: SNAPP, packing differences, secondary
    structure, hinges
  • In detail: structural fingerprints for function
    inference
16
Application 1: SNAPP
  • Simplicial Neighborhood Analysis of Protein
    Packing
  • Carter et al, JMB99
  • Residues represented by side-chain centroids
  • Protein structure represented as an aggregate of
    space filling, irregular tetrahedra
  • Unique and objective recognition of nearest
    neighbor residues in sets of four (Quadruplets)

17
Likelihood Scores for 8724 Compositions
Tropsha A, Singh R, Vaisman I, Zheng W. Pac
Symp Biocomput. 614-23 (1996).
Dunbrack, R. Culled PDB:
http://www.fccc.edu/research/labs/dunbrack/culledpdb.html
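
A likelihood score per quadruplet composition can be sketched as a log ratio of observed to expected frequency. The formula below, log10(f / p) with p equal to a multiplicity factor times the product of the four residue frequencies, is an assumption recalled from the SNAPP literature rather than something stated on the slide, and the two-letter alphabet and counts are toy data.

```python
import math
from collections import Counter


def multiplicity(composition):
    """Number of distinct orderings of a quadruplet: 4! / prod(t_n!)."""
    denom = 1
    for count in Counter(composition).values():
        denom *= math.factorial(count)
    return math.factorial(len(composition)) // denom


def log_likelihoods(quadruplets, residue_freq):
    """log10(observed / expected) for every composition seen in `quadruplets`."""
    comps = Counter(tuple(sorted(q)) for q in quadruplets)
    total = sum(comps.values())
    scores = {}
    for comp, n in comps.items():
        observed = n / total
        expected = multiplicity(comp) * math.prod(residue_freq[r] for r in comp)
        scores[comp] = math.log10(observed / expected)
    return scores


# Toy data: eight Delaunay quadruplets over a two-letter alphabet.
toy_quads = [("A", "L", "A", "L")] * 4 + [("A", "A", "A", "A")] * 4
toy_freq = {"A": 0.5, "L": 0.5}
toy_scores = log_likelihoods(toy_quads, toy_freq)
print(toy_scores)
```

A positive score means the composition is seen more often than chance would suggest; a negative score means it is depleted.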
18
Likelihood Mapped to hydrophobic core
19
Applications
  • Decoy discrimination (Krishnamoorthy and
    Tropsha, 2003)
  • Weighting scheme based on tetrahedron sequence
    topology
  • Conformation change on ligand binding (Sherman
    et al., 2003)
  • Study of folding simulations (Krishnamoorthy and
    Tropsha, 2003)
  • Ligand-receptor binding affinity (Zhang,
    Golbraikh and Tropsha, 2004)
  • Contribution of almost-Delaunay
  • How stable is the SNAPP score computed using
    Delaunay?
  • Compute variants of it using AD and Delaunay
    Probability
20
Results: scoring decoys
Decoy sets shown: 1. 4state_reduced, 2. lattice_ssfit, 3. semfold
  • SNAPP with Delaunay probabilities distinguishes
    decoys from the native state as well as (even
    better than?) Delaunay-based SNAPP.
  • Hence, the original Delaunay-based score is
    stable

21
Results: scoring CASP5 predictions
  • SNAPP with Delaunay probabilities discriminates
    native structures from predictions as well as
    Delaunay-based SNAPP (usually even better).
  • Hence, the original Delaunay-based score is
    stable

Z-score (Rank)
22
Application 2: Packing Differences
  • How does DT change as points are perturbed, for
    different point sets?

sidechain centroids
(2cro 4state_reduced)
23
Stability of the DT in Proteins
Right: Number of Delaunay and AD(0.3) tetrahedra
for a sample of predictions submitted to CASP5.
Notice that the native structures, colored green,
have fewer AD tetrahedra for the same number of
Delaunay tetrahedra.
Left: The average number of AD tetrahedra at low
ε (< 0.5 Å) grows faster for random points than
for proteins, as seen in this cumulative
histogram. This suggests that the DT is stable
for small perturbations in proteins.
24
Application 3: Secondary structure from Cα
  • The AD threshold histogram of the α-helix has a
    unique signature that enables helix assignment
    from Cαs

25
AD secondary structure assignment
  • strong α-helix signal, weaker β-sheet and β-turn
    signals
  • Better accuracy than previous work (Wako and
    Yamato, 1998)
  • More tolerant to structural and H-bond
    imperfections than DSSP
  • 1bg5, irregular helix on right
  • Applications
  • consensus assignment
  • structure prediction

Figure panels: 1bg5 (AD), 1bg5 (DSSP)
Above: Visual comparison of α-helix, β-sheet
and β-turn assignments in 1BG5, showing an
irregular α-helix detected by AD and not DSSP.
26
Application 4: Conformational Change and Hinges
  • Analysis of conformational change and detection
    of hinges from a few unaligned conformations
    using AD tetrahedra

27
Neighbor Changes on Motion
  • Motion: major rearrangements at a few key
    residues, the hinges
  • Model as neighbor changes, rather than large
    dihedral angle changes
  • DT contains no conformational change signal; AD
    tetrahedra do
  • In the neighborhood of a hinge region, neighbor
    relationships change drastically (quantify by
    changes in AD tetrahedra thresholds)
  • Ovotransferrin, threshold color: 0, 0.01-0.1,
    0.1-0.5, 0.5-1, 1-2
  • Hinge residues from hinge tetrahedra

1TFA SC apo (open) form
1IEJ SC holo (closed) form
28
Comparison with literature
Ovotransferrin hinges from 3 conformations
TrpRS 8 chains, preTS, MD sim
sidechain centroids
  • ? hinge region
  • ? isolated hinge residue
  • Labeled residues are known from literature

29
Application 5: Family-Specific Fingerprints
  • Find residue packing patterns specific to protein
    families, using graph representations with DT/AD
    edges.
  • Use for family classification and functional
    annotation

30
Graph Representation
Figure panels: Proteins; Small Molecules
Edge types: peptide edge, proximity edge
Node label: amino acid type, chemical properties, ...
Edge label: sequence adjacency or structure
proximity, determined by distance
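
A labeled graph of this kind can be sketched with plain dictionaries. The version below uses a simple distance cutoff for proximity edges (a contact-distance graph); the 8.5 Å cutoff, the function name, and the coordinates are assumptions, and a single chain with consecutive residue indices is assumed for the peptide edges.

```python
import math


def build_protein_graph(residues, coords, cutoff=8.5):
    """Return (nodes, edges) for one chain.

    nodes: index -> amino acid type (the node label)
    edges: (i, j) with i < j -> "peptide" or "proximity" (the edge label)
    """
    nodes = {i: aa for i, aa in enumerate(residues)}
    edges = {}
    for i in range(len(residues)):
        for j in range(i + 1, len(residues)):
            if j == i + 1:
                edges[(i, j)] = "peptide"      # adjacent in sequence
            elif math.dist(coords[i], coords[j]) <= cutoff:
                edges[(i, j)] = "proximity"    # close in space only
    return nodes, edges


# Hypothetical four-residue chain, one representative point per residue.
toy_residues = ["GLY", "ALA", "SER", "HIS"]
toy_coords = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (30.0, 0.0, 0.0)]
toy_nodes, toy_edges = build_protein_graph(toy_residues, toy_coords)
print(toy_edges)
```

Replacing the cutoff test with membership in a set of Delaunay or almost-Delaunay edges would give the other two proximity definitions used in the talk.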
31
Graph Database Mining
  • Input: a database of labeled undirected graphs
    and a support threshold 0 < σ ≤ 1
  • Example: σ = 2/3
  • Output: all (connected) frequent subgraphs from
    the graph database.
  • Performance is critical
  • Number of patterns can grow exponentially for
    large and dense graphs
  • Subgraph isomorphism (NP-complete)
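
The role of the support threshold can be shown on the smallest possible patterns, single labeled edges. The sketch below counts in how many graphs each edge pattern occurs and keeps those with support at least σ. It deliberately stops at one-edge patterns and is not one of the mining algorithms from the talk; the graph encoding and names are assumptions.

```python
def frequent_edge_patterns(graphs, sigma):
    """Support of every single-edge pattern that occurs in >= sigma of the graphs.

    Each graph is (nodes, edges): nodes maps index -> label,
    edges maps (i, j) -> edge label.
    """
    counts = {}
    for nodes, edges in graphs:
        seen = set()
        for (i, j), edge_label in edges.items():
            a, b = sorted((nodes[i], nodes[j]))
            seen.add((a, edge_label, b))
        for pattern in seen:  # count each pattern once per graph
            counts[pattern] = counts.get(pattern, 0) + 1
    n = len(graphs)
    return {p: c / n for p, c in counts.items() if c / n >= sigma}


# Toy database of three small labeled graphs.
g1 = ({0: "a", 1: "b", 2: "c"}, {(0, 1): "x", (1, 2): "y"})
g2 = ({0: "b", 1: "a"}, {(0, 1): "x"})
g3 = ({0: "b", 1: "c", 2: "d"}, {(0, 1): "y", (1, 2): "x"})
toy_frequent = frequent_edge_patterns([g1, g2, g3], sigma=2 / 3)
print(toy_frequent)
```

Larger frequent subgraphs can only be built from frequent smaller ones, which is why the number of candidates grows so quickly on dense graphs.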

32
Subgraph mining algorithms developed in our group
  • Frequent Subgraph Mining ICDM03
  • Canonical Adjacency Matrix (CAM) tree
  • Induced Subgraph Mining RECOMB04
  • Induced subgraphs geometrically more rigid,
    superimposable
  • Miss many useful motifs embedded in a dense
    graph.
  • Maximal frequent subgraph mining SIGKDD04
  • Mines only maximal frequent subgraphs (those
    with no frequent supergraph)
  • Uses a spanning tree comparison algorithm
  • CliqueHashing and CliqueHashing ISMB05 demo
  • Finding frequent cliques in linear time

33
Three Graph Representations
CD
E(DT) ⊆ E(AD) ⊆ E(CD)
34
Family Specific Fingerprints
  • Frequent: occur in > 80% of family proteins
  • Family-specific: occur in < 5% of background
    proteins
Figure: subgraphs G1 and G2 in Human Kallikrein 6
(1LO6), Serine Protease family. Residue labels
shown: TRP141, GLY196, GLY197, CYS42, ALA55,
HIS57, CYS58
Subgraph G1: Not sequence conserved. Useful for
the annotation of structural orphans.
Subgraph G2: Sequence-conserved motif
C-x(12)-A-x-H-C. Useful for the annotation of both
structural orphans and sequences.
35
Largest Serine Protease Fingerprint
1LO6
Blue: His57-Asp102-Ser195 catalytic triad. Grey:
others
36
Cyclin Dependent Protein Kinases (structure of
PDB 1B6C)
  • 6 residue motif is highlighted in Red
  • ASP(333) is part of the active site
  • Conserved in 18 out of 29 PK proteins.

37
Applications of Family-Specific Fingerprints
  • Functional family inference for Structural
    Genomics
  • Functional family inference for predicted
    structures
  • Functional neighbors and remote structural
    similarity
  • Deriving sequence patterns from fingerprints

38
Motivation
  • Hypothetical proteins from Structural Genomics
  • structure known, function unknown
  • Function has to be inferred from structure
  • Overall fold similarity to a structure with
    known function
  • Local structure similarity to a structure with
    known function
  • Overall fold similarity not necessary, sometimes
    misleading
  • Existing local structure methods
  • Search for known functional sites
  • Derive templates by clique detection

39
Related work: function inference from local
structure
  • Detecting similarity to known functional sites
  • SiteEngine (Shulman-Peleg et al., 2003)
  • SURFACE (Ferre et al., 2004)
  • eF-site (Kinoshita and Nakamura, 2004)
  • PINTS-weekly (Stark, Shkumatov and Russell, 2004)
  • Detecting functional sites derived from protein
    families
  • FoldMiner (Shapiro and Brutlag, 2003)
  • Phunctioner (Pazos and Sternberg, 2004)
  • DRESPAT (Wangikar et al., 2003)
  • Common structural cliques (Milik et al., 2003)

Slide annotations: geom. hashing, surface patches
40
Method for functional inference
  • Pick families from SCOP, EC or other
    classifications
  • Model protein structures by labeled graphs, with
    almost-Delaunay edges defining proximity
  • Enumerate all frequent subgraphs within the
    family using a subgraph mining algorithm
  • Pick frequent subgraphs infrequent in background
    as family-specific fingerprints
  • Search for fingerprints in structure to be
    annotated
  • use an index of graph similarity to speed up
    Ullmann's algorithm
  • Assign significance of family membership based on
    the fingerprints found.

41
Fast Graph Search Using Local Neighborhood Index
Hard case: search for 11 subgraphs in the
6500-protein background dataset (hydrophobic,
average 60 occurrences per protein). Intractable
without the index.
Figure: example subgraph with five numbered nodes;
residue labels ASP 102, SER 214, ALA 196, ALA 55,
HIS 57
42
Function Inference Using Fingerprints
  • Given query structure q
  • Given fingerprints X1 ... Xm for prospective
    family Fi
  • Say Xq1 ... Xqn are found in q: is q in Fi?
  • Simple: approximation based on fingerprints
  • P-value based on number of BG proteins with more
    fingerprints
  • Accurate: Bayesian formula applied to family and
    background probabilities of Xq1 ... Xqn
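
The simple variant can be sketched as an empirical P-value: the fraction of background (BG) proteins that contain more of the family's fingerprints than the query does. The function name and the counts below are assumptions for illustration; the Bayesian variant is not sketched because its exact formula is not given on the slide.

```python
def fingerprint_p_value(query_count, background_counts):
    """Fraction of background proteins with more fingerprints than the query."""
    more = sum(1 for count in background_counts if count > query_count)
    return more / len(background_counts)


# Hypothetical numbers: the query contains 30 of a family's fingerprints,
# while most background proteins contain very few.
toy_background_counts = [0, 0, 1, 2, 0, 3, 35, 1, 0, 4]
toy_p = fingerprint_p_value(30, toy_background_counts)
print(toy_p)
```

Counting proteins with at least as many fingerprints, rather than strictly more, would give a slightly more conservative value.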

43
Advantages of our method
  • Sequence similarity not sensitive enough
  • Global fold similarity misleading
  • Functional site similarity
  • Different functional families sometimes share
    functional sites
  • Exact matching may not be robust
    (distortion/mutation)
  • Clique methods sacrifice generality of patterns
  • Subgraph fingerprints
  • Family-specific, few false positives by
    definition
  • Multiple fingerprints consensus
  • Confidence of family membership

44
Cross-validation of fingerprints
  • Four-fold CV
  • Splitting family members into training and test
    sets
  • Mining fingerprints from training sets
  • Report fraction of training set FPs found in test
    set:
  • Eukaryotic Serine Proteases, 59 members: 82 ± 4%
  • Triosephosphate Isomerase, 12 members: 60 ± 16%
  • Metallodependent Hydrolases, 17 members: 49 ± 12%
  • Report false positives and false negatives

Slide annotations: large, homogeneous families;
small, diverse families
45
Discriminating the TIM barrels
  • Validation of method: all TIM barrel families
    are structurally very similar.

46
Annotations missed by SCOP 1.65
New serine protease annotations, based on the
number of fingerprints found out of 79 Serine
Protease fingerprints: 1op0A (73/79),
1os8A (73/79), 1p57B (73/79), 1s83 (73/79),
1ssx (46/79), 1md8 (45/79).
New Triosephosphate Isomerase (TIM): 1r2r,
1885/1920 fingerprints.
Verified in PDB file headers and literature. All
the above except 1op0 have been classified in
SCOP 1.67, Feb 2005.
47
Structural Genomics: Function inference I
Metallo-dependent hydrolase: 8-stranded βα (TIM)
barrel fold, 17 members, 49 FP
Unknown function: 7-stranded barrel fold, 30 FP
found
48
Residues hit by fingerprints
Figures made in VMD
Metallo-dependent hydrolase: 8-stranded βα (TIM)
barrel fold, 17 members, 49 FP
Unknown function: 7-stranded barrel fold, 30 FP
found
Residue color key: Acidic, Basic, Polar,
Hydrophobic
49
Function inference for predictions
  • Check predicted structures against family of
    template
  • SNAPP (Fischer et al., 2004) and SPREK (Taylor
    and Jonassen, 2004) are not family-specific
  • well-packed predictions with wrong fold may
    score high
  • Fingerprints infer the correct functional family,
    even if the template chosen is incorrect.
  • E.g. CASP5 target T0147, PDB 1m65
  • rare (βα)8 fold, putative metallo-dependent
    hydrolase (MDH)
  • 107 predictions ranked 1
  • 50 predictions had 50% or more of 49 MDH FP
  • 51 other families had 4 preds with 50 FP

50
Functional Neighbors
  • Finding families that share some fingerprints
  • Search for family fingerprints in the background
  • Cluster hits for significant enrichment in SCOP,
    GO hierarchy
  • E.g., find local similarity between remotely
    related SCOP families

Figure panels: 1lvl, 1lvl, 1kew
SCOP: NAD(P)-binding Rossmann fold
SCOP: FAD/NAD-linked reductase
The DALI Z-score of the two structures is 4.5,
which suggests that they are dissimilar at the
fold level. The pairwise sequence identity is
16% and there is no local sequence similarity in
the region of the motif.
51
Sequence Patterns from Tertiary Packing
  • Frequent DT quadruplets or subgraph motifs that
    are conserved in sequence order, mapped back to
    sequence → Sparse Sequence Signatures
  • Evaluate precision/recall by querying SwissProt
  • Overlap with / comparable to PROSITE patterns
  • Joint work with Ruchir Shah

Sequence motif: {aa1, aa2, aa3, aa4, d12, d23, d34},
e.g. {D, S, G, P, 2, 3, 7}
52
Future Work
  • Biological validation of function inference
  • Future applications in bioinformatics
  • Hierarchical family fingerprints to infer
    function for novel folds, with no putative family
    information
  • Tool for template verification in homology
    modeling/fold recognition
  • Augment domain classifications (SCOP) with
    motif-based functions
  • Augment structure neighbor searches (VAST) with
    functional neighbors
  • Robust neighbor relation to accelerate MD, QM
    simulations
  • Improve docking (graph matching, MD)
  • Local similarity search
  • Other geometric computations (Voronoi
    volumes/domains, alpha-shapes,)

53
Thanks to
  • Thesis advisor Dr. Jack Snoeyink (UNC CS)
  • Collaborators in this work
  • Dr. Alexander Tropsha (UNC Pharmacy)
  • Jun (Luke) Huan, Dr. Wei Wang, Dr. Jan Prins (UNC
    CS)
  • Ruchir Shah (UNC Biomolecular Informatics)
  • Dr. Bala Krishnamoorthy (Washington State U.
    Pullman, Math)
  • Dr. Charlie Carter (UNC Biochemistry)
  • Mother nature, for her wonderful imprecision and
    complexity, that is an endless source of problems

54
References
  • References to my publications
  • Bandyopadhyay, D. and J. Snoeyink (2004).
    Almost-Delaunay simplices: Nearest neighbor
    relations for imprecise points. ACM-SIAM
    Symposium On Discrete Algorithms (SODA04).
    http://www.cs.unc.edu/debug/papers/AlmDel
  • Bandyopadhyay, D. and J. Snoeyink (2004).
    Almost-Delaunay simplices: Robust nearest
    neighbor relations for imprecise points in CGAL.
    Second CGAL User Workshop, 2004. Software:
    http://www.cs.unc.edu/debug/software
  • Jun Huan, Wei Wang, Deepak Bandyopadhyay, Jack
    Snoeyink, Jan Prins, Alexander Tropsha (2004).
    Finding Protein Family-specific residue packing
    patterns in Protein Structure Graphs. RECOMB
    2004. Invited to Journal of Computational
    Biology, 2005, in press.
  • Bandyopadhyay, Deepak, Alexander Tropsha and Jack
    Snoeyink. A Robust Score for Protein Packing
    using Almost-Delaunay Tetrahedra. 2005, in
    submission.
  • Bandyopadhyay, Deepak, Jun Huan, Jinze Liu, Jan
    Prins, Jack Snoeyink, Wei Wang, and Alexander
    Tropsha. Protein Functional Family Identification
    by Fast Subgraph Isomorphism Using
    Structure-Based Fingerprints Mined from SCOP and
    EC families. 2005, in submission. Poster
    presented at Triangle Biophysics Symposium, 2004.
  • Bandyopadhyay, Deepak, Jack Snoeyink, Alexander
    Tropsha and Charlie Carter. Analysis of Protein
    Conformational Change Using Almost-Delaunay
    Tetrahedra. Manuscript in preparation. Poster
    presented at Pacific Symposium on Biocomputing
    (PSB), Jan. 2005, Big Island of Hawaii.
  • Bandyopadhyay, Deepak, Alexander Tropsha and Jack
    Snoeyink. Analyzing Protein Structure using
    Almost-Delaunay Tetrahedra. UNC-CS Technical
    Report TR03-043, 2003. Poster presented at
    RECOMB 2004, March 2004, San Diego, CA.

55
References
  • Computational geometry methods applied to protein
    structure analysis
  • Gerstein, M., J. Tsai, and M. Levitt (1995). The
    volume of atoms on the protein surface:
    Calculated from simulation, using Voronoi
    polyhedra. Journal of Molecular Biology 249(5),
    955-966.
  • Tsai, J., R. Taylor, C. Chothia, and M. Gerstein
    (1999). The packing density in proteins: Standard
    radii and volumes. Journal of Molecular Biology
    290(1), 253-266.
  • Angelov, B., J. Sadoc, R. Jullien, A. Soyer, J.
    Mornon, and J. Chomilier (2002). Nonatomic
    solvent-driven Voronoi tessellation of proteins:
    an open tool to analyze protein folds. Proteins
    49(4), 446-456.
  • J. Pontius, J. Richelle and S.J. Wodak (1996).
    Deviations from Standard Atomic Volumes as a
    Quality Measure for Protein Crystal Structures.
    Journal of Molecular Biology 264(1), 121-136.
  • H. Edelsbrunner and P. Koehl. The
    weighted-volume derivative of a space-filling
    diagram. PNAS 100, 2203-2208, Mar 2003.
  • Liang, J. and K. A. Dill (2001). Are proteins
    well-packed? Biophys. J. 81(2), 751-766.
  • J. Liang, H. Edelsbrunner, P. Fu, P. Sudhakar,
    and S. Subramaniam. Analytical shape computing
    of macromolecules II: identification and
    computation of inaccessible cavities inside
    proteins. Proteins 33, 18-29, 1998.
  • H.L. Cheng. Algorithms for Smooth and Deformable
    Surfaces in 3D. Ph.D. Dissertation, University of
    Illinois at Urbana-Champaign, 2002.
  • Y.-E. Ban, H. Edelsbrunner and J. Rudolph.
    Interface surfaces for protein-protein complexes.
    Proc. RECOMB 2004.
  • Wernisch, L., M. Hunting, and S. Wodak (1999).
    Identification of structural domains in proteins
    by a graph heuristic. Proteins 35(3), 338-352.
  • Wako, H. and T. Yamato (1998). Novel method to
    detect a motif of local structures in different
    protein conformations. Protein Engineering 11,
    981-990.

56
References
  • SNAPP
  • C. W. Carter, B. C. LeFebvre, S. Cammer, A.
    Tropsha, and M. H. Edgell (2001). Four-body
    potentials reveal protein-specific correlations
    to stability changes caused by hydrophobic core
    mutations. Journal of Molecular Biology
    311(4), 625-638.
  • B. Krishnamoorthy and A. Tropsha (2003).
    Development of a four-body statistical
    pseudo-potential to discriminate native from
    non-native protein conformations. Bioinformatics
    19(12).
  • Tropsha, A., Carter, C., Cammer, S., Vaisman, I.
    (2003). Simplicial neighborhood analysis of
    protein packing (SNAPP): a computational
    geometry approach to studying proteins. Meth.
    Enzymol. 374, 509-544.
  • Hinges
  • Krebs WG, Alexandrov V, Wilson CA, Echols N, Yu
    H, Gerstein M. (2002). Normal mode analysis of
    macromolecular motions in a database framework:
    developing mode concentration as a useful
    classifying statistic. Proteins 48(4), 682-95,
    Sep 1, 2002.
  • Jacobs DJ, Rader AJ, Kuhn LA, Thorpe MF (2001).
    Protein Flexibility Predictions using Graph
    Theory. Proteins 44, 150-165.
  • M.F. Thorpe, Ming Lei, A.J. Rader, Donald J.
    Jacobs, and Leslie A. Kuhn (2001). Protein
    Flexibility and Dynamics using Constraint Theory.
    J. Molecular Graphics and Modelling 19, 60-69.
  • Secondary structure
  • Kabsch, W. and C. Sander (1983). Dictionary of
    protein secondary structure: pattern recognition
    of hydrogen-bonded and geometrical features.
    Biopolymers 22(12), 2577-2637.
  • Family-specific motifs
  • Cammer, S. A. and A. Tropsha (2000).
    Identification of sequence specific tertiary
    packing motifs in protein structures using
    Delaunay tessellation. Lecture Notes in
    Computational Science and Engineering. Springer
    Verlag, New York.
  • J. Huan, W. Wang, and J. Prins (2003). Efficient
    Mining of Frequent Subgraphs in the Presence of
    Isomorphism. International Conference on Data
    Mining 03.
  • Jun (Luke) Huan, Wei Wang, Anglinia Washington,
    Jan Prins, Ruchir Shah, Alexander Tropsha (2004).
    Accurate Classification of Protein Structural
    Families Based on Coherent Subgraph Mining. PSB
    2004.
  • Huan, J., Wang, W., Prins, J., Yang, J. (2004b).
    SPIN: Mining maximal frequent subgraphs from
    graph databases. SIGKDD 2004.

57
Canonical Adjacency Matrix
  • The Canonical Adjacency Matrix (CAM) of a graph G
    is the maximal adjacency matrix for G under a
    total ordering defined on adjacency matrices.
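
The definition can be illustrated by brute force on very small graphs: write each adjacency matrix as a code (lower triangle read row by row, with node labels on the diagonal), try every vertex ordering, and keep the largest code. The concrete choices here, Python string comparison as the ordering on labels and "0" for a missing edge, are assumptions made for the sketch, and enumerating all permutations is only feasible for a handful of vertices.

```python
from itertools import permutations


def canonical_code(node_labels, edge_labels):
    """Maximal lower-triangle code over all vertex orderings (brute force).

    node_labels: list of node labels, indexed by vertex
    edge_labels: dict mapping (u, v) -> edge label for an undirected graph
    """
    n = len(node_labels)

    def entry(u, v):
        if u == v:
            return node_labels[u]                       # diagonal: node label
        return edge_labels.get((u, v)) or edge_labels.get((v, u)) or "0"

    best = None
    for order in permutations(range(n)):
        code = tuple(entry(order[r], order[c])
                     for r in range(n) for c in range(r + 1))
        if best is None or code > best:
            best = code
    return best


# Two drawings of the same graph: node "a" joined to one "b" by an x edge
# and to another "b" by a y edge.
code_1 = canonical_code(["a", "b", "b"], {(0, 1): "x", (0, 2): "y"})
code_2 = canonical_code(["b", "a", "b"], {(1, 0): "y", (1, 2): "x"})
# A different graph: both edges labeled x.
code_3 = canonical_code(["a", "b", "b"], {(0, 1): "x", (0, 2): "x"})
print(code_1 == code_2, code_1 == code_3)
```

Because isomorphic graphs share the same maximal code, comparing codes replaces a pairwise isomorphism test.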

58
CAM Tree
Figure: CAM tree (adjacency matrices with node
labels a, b, c, d and edge labels x, y)
59
Chemical Datasets
  • Predictive Toxicology Evaluation Competition
  • Dataset: 337 compounds
  • Two class labels: positive (180) and negative
    (157)
  • Each chemical graph contains 27 nodes and 27
    edges on average
  • NIH DTP Anti-Viral Screen Test
  • Chemicals are classified as Confirmed Active
    (CA), Confirmed Moderate Active (CM) or
    Confirmed Inactive (CI) in the NIH DTP Anti-Viral
    Screen Test.
  • Dataset contains 423 CA and 1083 CM compounds
  • Each chemical graph contains 25 nodes and 27
    edges on average

60
Performance (Chemical Datasets)
Figure panels: PTE; DTP CA/CM
FFSM and gSpan are currently the most efficient
available frequent subgraph mining algorithms