Title: Molecular Similarity Searching Using Atom Environments and Surface Point Environments
1Molecular Similarity Searching Using Atom
Environments and Surface Point Environments
- Andreas Bender
- Unilever Centre for Molecular Informatics
Cambridge University, UK
2Outline
- Objective: more efficient searching of chemical
databases
- New methods developed to detect molecules with
similar biology: one is based on connectivity
(2D), the other on surface points (3D)
- Details of the algorithms are presented here,
starting with the 2D type
- Results: lead discovery (finding new drugs,
finding new chemotypes)
3Descriptor Choice
42D Environment around an atom
- Assign Sybyl mol2 atom types
- Find connections
- Find connections to connections
- Create a tree down to n levels
- Bin the atom types for each level
- Create a fingerprint for this atom
[Figure: example atom environment with columns Level 0, Level 1 and Level 2; atom types shown include N2, Car and H, each with a count per level]
These features are created for every atom in the
molecule
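The steps on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the original implementation: the molecule is given as a list of Sybyl-style atom type strings plus a list of bonds, and the names `atom_environment` and `molecule_fingerprints` are hypothetical. It also assumes that each atom is counted only once, at its shortest bond distance from the central atom, which the slide does not state.

```python
from collections import Counter


def atom_environment(atom_types, bonds, center, depth=2):
    """Count atom types at each level (bond distance) around `center`.

    Assumption: an atom is assigned to the level of its shortest
    bond distance from the central atom.
    """
    neighbours = {i: set() for i in range(len(atom_types))}
    for a, b in bonds:
        neighbours[a].add(b)
        neighbours[b].add(a)

    visited = {center}
    frontier = [center]
    levels = []
    for _ in range(depth + 1):
        # Bin the atom types found at this level.
        levels.append(Counter(atom_types[i] for i in frontier))
        next_frontier = []
        for i in frontier:
            for j in sorted(neighbours[i]):
                if j not in visited:
                    visited.add(j)
                    next_frontier.append(j)
        frontier = next_frontier
    return levels


def molecule_fingerprints(atom_types, bonds, depth=2):
    """Create one hashable environment feature for every atom."""
    features = []
    for center in range(len(atom_types)):
        env = atom_environment(atom_types, bonds, center, depth)
        features.append(tuple(tuple(sorted(level.items())) for level in env))
    return features


# Hypothetical toy fragment: N.3 bonded to a C.3 that carries three H atoms.
toy_types = ["N.3", "C.3", "H", "H", "H"]
toy_bonds = [(0, 1), (1, 2), (1, 3), (1, 4)]
for level, counts in enumerate(atom_environment(toy_types, toy_bonds, center=0)):
    print("Level", level, dict(counts))
```

Each feature is a tuple of per-level type counts, so identical environments in different molecules compare equal and can be used directly as set members or dictionary keys.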
5Feature Selection
- E.g. comparing faces first requires the
identification of key features.
- How do we identify these?
- The same applies to molecules.
6B) Information-Gain Feature Selection
- We wish to select the important features.
- To do this we calculate the entropy of the data
as a whole and for each class.
- This is used to select those features with the
highest discrimination, e.g. between active and
inactive molecules.
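As an illustration of the selection criterion, the sketch below computes the standard information gain of a binary feature (present or absent in a molecule) with respect to class labels. The function names `entropy` and `information_gain` are hypothetical, and the exact formulation used in the original work may differ in detail.

```python
import math


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    h = 0.0
    for cls in set(labels):
        p = labels.count(cls) / n
        h -= p * math.log2(p)
    return h


def information_gain(has_feature, labels):
    """Entropy of the whole set minus the weighted entropy of the two
    subsets obtained by splitting on presence of the feature."""
    n = len(labels)
    with_f = [l for f, l in zip(has_feature, labels) if f]
    without_f = [l for f, l in zip(has_feature, labels) if not f]
    return (entropy(labels)
            - len(with_f) / n * entropy(with_f)
            - len(without_f) / n * entropy(without_f))


# Hypothetical toy data: four molecules, two active and two inactive.
labels = ["active", "active", "inactive", "inactive"]
print(information_gain([True, True, False, False], labels))   # separates the classes
print(information_gain([True, False, True, False], labels))   # carries no information
```

Features would then be ranked by this value and the highest-scoring ones kept for classification.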
7Classification
- The next step is to identify which molecules
belong to which class.
- To do this we use a Naïve Bayesian Classifier
using the features (atom environments) we have
identified as being important.
8C) Naïve Bayesian Classifier (classification by
presumptive evidence)
- Include all selected features f_i in the
calculation of the ratio of the probabilities of
the two classes given F
- Ratio > 1: class membership 1
- Ratio < 1: class membership 2
- F: feature vector
- f_i: feature elements
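The decision rule can be sketched as follows. This is a minimal, assumption-laden illustration rather than the original implementation: molecules are represented as sets of features, class-conditional frequencies are estimated with add-one (Laplace) smoothing, only features present in the query molecule contribute to the ratio, and the ratio is evaluated in log space so that "ratio > 1" becomes "log ratio > 0". The names `train`, `log_ratio` and `classify` are hypothetical.

```python
import math


def train(feature_sets, labels, selected):
    """Estimate class priors and P(f_i | class) for classes 1 and 2.

    Assumption: add-one smoothing avoids zero probabilities.
    """
    model = {}
    for cls in (1, 2):
        members = [fs for fs, l in zip(feature_sets, labels) if l == cls]
        n = len(members)
        model[cls] = {
            "prior": n / len(labels),
            "p": {f: (sum(f in fs for fs in members) + 1) / (n + 2)
                  for f in selected},
        }
    return model


def log_ratio(model, features, selected):
    """Log of the class 1 to class 2 probability ratio for one molecule."""
    score = math.log(model[1]["prior"] / model[2]["prior"])
    for f in selected:
        if f in features:  # assumption: only present features contribute
            score += math.log(model[1]["p"][f] / model[2]["p"][f])
    return score


def classify(model, features, selected):
    """Ratio > 1 (log ratio > 0) gives class 1, otherwise class 2."""
    return 1 if log_ratio(model, features, selected) > 0 else 2


# Hypothetical toy training set: feature "A" marks class 1, "B" marks class 2.
training = [{"A"}, {"A", "C"}, {"B"}, {"B", "C"}]
classes = [1, 1, 2, 2]
selected = ["A", "B"]
model = train(training, classes, selected)
print(classify(model, {"A", "C"}, selected))
```

Working in log space keeps the product over many features numerically stable.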
9Application lead discovery
- Database: MDL Drug Data Report (MDDR)
- 957 ligands selected from MDDR
- 49 5HT3 receptor antagonists, 40
angiotensin-converting enzyme (ACE) inhibitors, 111
HMG-CoA reductase inhibitors (HMG), 134 PAF
antagonists and 49 thromboxane A2 antagonists
(TXA2) (Briem and Lessel, Perspect. Drug Discov.
Des. 2000, 20, 245-264)
- A) Hit rate among the ten nearest neighbours of
each molecule
- B) 20-fold cross-validation, 5 molecules for
query generation
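Evaluation A can be illustrated with a short sketch. The slide does not say which similarity measure ranks the neighbours, so the Tanimoto coefficient on feature sets is used here purely as an assumption; `tanimoto` and `hit_rate` are hypothetical names.

```python
def tanimoto(a, b):
    """Tanimoto coefficient of two feature sets (assumed similarity measure)."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def hit_rate(query_index, fingerprints, classes, k=10):
    """Fraction of the k nearest neighbours sharing the query's class."""
    query = fingerprints[query_index]
    others = [i for i in range(len(fingerprints)) if i != query_index]
    others.sort(key=lambda i: tanimoto(query, fingerprints[i]), reverse=True)
    top = others[:k]
    return sum(classes[i] == classes[query_index] for i in top) / len(top)


# Hypothetical toy library of five molecules in two activity classes.
fps = [{1, 2, 3}, {1, 2, 4}, {1, 2}, {7, 8}, {8, 9}]
cls = ["ACE", "ACE", "ACE", "PAF", "PAF"]
print(hit_rate(0, fps, cls, k=2))
```

Averaging `hit_rate` over every molecule of a class gives one number per activity class, which is how single-query methods are usually compared on such a data set.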
10Comparison
Using single molecule query
- Briem and Lessel, Perspectives in Drug Discovery
and Design 2000, 20, 245-264.
11Combining Information in Molecules
- In this method, we can extend the approach by
extracting from a set of molecules those features
having the best information gain
- This can describe patterns in molecules much
better than individual cases
- The following example shows cross-validated
database searches using combinations of features
from five molecules at a time
- Inactives were used in a 50/50 split; no molecule
is in the training and the test set at the same
time
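One way to set up such a validation is sketched below. It is only an interpretation of the slide: five actives are drawn at random as query molecules, the remaining actives go to the test set, the inactives are split in half, and the procedure is repeated 20 times with fresh random draws. The names `make_split` and `repeated_splits` and the fixed seed are hypothetical.

```python
import random


def make_split(actives, inactives, n_query=5, rng=None):
    """Split molecules so that none is in both training and test set."""
    rng = rng or random.Random()
    actives = list(actives)
    inactives = list(inactives)
    rng.shuffle(actives)
    rng.shuffle(inactives)
    half = len(inactives) // 2
    train = {"actives": actives[:n_query], "inactives": inactives[:half]}
    test = {"actives": actives[n_query:], "inactives": inactives[half:]}
    return train, test


def repeated_splits(actives, inactives, n_repeats=20, n_query=5, seed=0):
    """Generate `n_repeats` independent random splits (assumed protocol)."""
    rng = random.Random(seed)
    return [make_split(actives, inactives, n_query, rng)
            for _ in range(n_repeats)]


# Hypothetical molecule identifiers: 40 actives and 100 inactives.
splits = repeated_splits(range(40), range(100, 200))
train, test = splits[0]
print(len(train["actives"]), len(test["actives"]),
      len(train["inactives"]), len(test["inactives"]))
```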
12MDDR ACE cumulative recall plot
[Figure: cumulative recall plot with curves for optimal selection and random selection]
We found about 80% of the active molecules among
the first 10% of the library
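A cumulative recall curve of this kind can be computed from a ranked library as in the following sketch. The input is a list of booleans, one per library molecule in ranked order, marking whether the molecule is active; the function names are hypothetical and at least one active is assumed to be present.

```python
def cumulative_recall(ranked_is_active):
    """Fraction of all actives recovered after each position in the ranking."""
    total = sum(ranked_is_active)  # assumes at least one active
    found = 0
    curve = []
    for is_active in ranked_is_active:
        found += is_active
        curve.append(found / total)
    return curve


def recall_at_fraction(ranked_is_active, fraction):
    """Recall after screening the given fraction of the ranked library."""
    n = max(1, round(len(ranked_is_active) * fraction))
    return cumulative_recall(ranked_is_active)[n - 1]


# Hypothetical ranking of ten molecules, three of which are active.
ranking = [True, True, False, True, False, False, False, False, False, False]
print(recall_at_fraction(ranking, 0.2))
```

For random selection the curve follows the diagonal (recall equals the fraction screened), while for optimal selection all actives are ranked first.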
13Using Multiple Query Molecules
14Transformation to 3D
- Idea: to develop an analogous translationally and
rotationally invariant (TRI) descriptor based on
surface points
- Advantage: switching from element atom types to
interaction energies gives a more general model ->
scaffold hopping?
- Two parts: interaction fingerprint and shape
description. Here, results using only interaction
fingerprints are shown; the shape description is
under development
- Again, information-gain feature selection and the
Naïve Bayesian Classifier are used for
classification
153D Environment around a surface point (solvent
accessible surface)
Central Point (Layer 0)
Points in Layer 1
Etc.
16Algorithm
Interaction energies at surface points, one probe
at a time
Binning scheme (cutoffs): -1.0, -0.45, -0.4, -0.3, -0.1, 0.0, 0.2
Example energy: -0.35
Surface point environment:
00010000 01100010 - 011101100
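The encoding step can be sketched as follows, using the seven cutoffs shown on the slide to define eight bins per layer. Two details are assumptions: a bit is set whenever at least one point of the layer falls into the corresponding bin, and an energy equal to a cutoff is assigned to the higher bin. The function names and the second layer's energies are hypothetical.

```python
from bisect import bisect_right

# Cutoffs from the slide; seven cutoffs define eight bins.
CUTOFFS = [-1.0, -0.45, -0.4, -0.3, -0.1, 0.0, 0.2]


def encode_layer(energies, cutoffs=CUTOFFS):
    """Return a bit string with one bit per energy bin for a single layer."""
    bits = ["0"] * (len(cutoffs) + 1)
    for energy in energies:
        # Assumption: presence of any point in a bin sets that bit.
        bits[bisect_right(cutoffs, energy)] = "1"
    return "".join(bits)


def encode_environment(layers, cutoffs=CUTOFFS):
    """Concatenate the per-layer bit strings of one surface point."""
    return " ".join(encode_layer(layer, cutoffs) for layer in layers)


# Layer 0 holds the central point (energy -0.35, as on the slide);
# the layer 1 energies are made up for illustration.
print(encode_environment([[-0.35], [-0.5, -0.42, 0.1]]))
```

With these cutoffs an energy of -0.35 falls between -0.4 and -0.3, i.e. into the fourth bin, which matches the first bit string on the slide.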
17Relation to other algorithms
- Surface autocorrelation: averaging of interaction
energies. Here, a favourable and an unfavourable
interaction in a given layer will both remain in
the fingerprint
- GRIND: continuous variables from GRID; entire
field of interaction energies simplified; only
maximum product enters descriptor
- MaP: categorical variables, counts are kept;
size description
- (In addition, the feature selection and scoring
are handled differently)
18Algorithm Flow
19Standard Parameters
- MSMS: probe radius 1.5 Å, density 0.5 points/Å²,
double van der Waals radii for atoms, giving
effectively the solvent accessible surface
- GRID: DRY, C3, N1, N2, O, O- probes; otherwise
standard parameters
- Binning: using a variable number of layers, 8 bits;
cutoffs were set so that equal frequencies are
observed
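Equal-frequency cutoffs can be derived from a sample of interaction energies as in the sketch below. It is one simple way to do this (taking order statistics of the sorted sample), not necessarily the procedure used in the original work; `equal_frequency_cutoffs` and `bin_counts` are hypothetical names.

```python
from bisect import bisect_right


def equal_frequency_cutoffs(values, n_bins=8):
    """Pick n_bins - 1 cutoffs so that each bin receives about the same
    number of the observed values."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[(i * n) // n_bins] for i in range(1, n_bins)]


def bin_counts(values, cutoffs):
    """Count how many values fall into each bin defined by the cutoffs."""
    counts = [0] * (len(cutoffs) + 1)
    for value in values:
        counts[bisect_right(cutoffs, value)] += 1
    return counts


# Hypothetical sample of 16 interaction energies.
sample = [-0.9, -0.7, -0.6, -0.5, -0.45, -0.4, -0.35, -0.3,
          -0.2, -0.1, -0.05, 0.0, 0.1, 0.2, 0.3, 0.5]
cutoffs = equal_frequency_cutoffs(sample)
print(cutoffs)
print(bin_counts(sample, cutoffs))
```

With eight bins per layer, each bit of the fingerprint is then set about equally often across the data used to derive the cutoffs.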
20[Figure: legend entries Layer 0, L0-1, L0-2, L0-3, L0-4, L0-5]
21Enrichment Curves Briem (4 Layers, Standard
Settings)
22Comparison
23ACE Binding Site
- Snake venom peptide analog with putative binding
motif to angiotensin used in early compound
design (Cushman et al., Biochemistry (1977), 16,
5484-5491.)
24ACE - Query
25Hits using 2D descriptors Hits 1 to 5
26Hits Using 2D Descriptors Hits 6 to 10
27ACE Selected Features
28Hits using 3D descriptors (10 Hits among top 20,
enrichment 20)
29New scaffolds
30Jacobsson data set
- 110 ERα toxins (ERαt)
- 36 ERα mimics (ERαm)
- 60 Matrix Metalloprotease 3 (MMP3)
- 129 Factor Xa (fXa)
- 54 Acetylcholine esterase (AChE)
- 999 Diverse Compounds from MDDR
- 2/3 for training, 1/3 for testing
- Performance measure: classification
- Actives and Inactives also used in other methods
- Jacobsson et al, J Med Chem 2003, 46, 5781-5789
31Jacobsson et al., Methods
- Docking with 7 scoring functions (2 implemented
in ICM, 5 in Tripos CScore) used (GOLD, ICM,
Glide similar)
- Fusion by classical consensus scoring (CScore),
Partial Least Squares Discriminant Analysis
(PLS-DA), Bayesian classification and rule-based
methods
- With the exception of ERα, large number of close
analogues
32Results
- With the exception of MMP3, superior performance
to other methods (better precision and accuracy at
the same recall)
- AChE: much better than other methods (docking is
difficult: large pocket, water, multiple binding
sites)
- Given that docking takes much time, at
least in some cases (AChE) it seems not to be the
method of choice
33Factor Xa
(Accuracy is the fraction of all predictions that
are correct; precision is the fraction of positive
predictions that are correct)
34AChE
- (Accuracy is the fraction of all predictions that
are correct; precision is the fraction of positive
predictions that are correct)
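The performance measures named on these slides can be computed from predicted and actual class labels as in this short sketch, where `True` stands for "active". The function name `classification_metrics` is hypothetical.

```python
def classification_metrics(predicted, actual):
    """Accuracy, precision and recall for boolean active/inactive labels."""
    pairs = list(zip(predicted, actual))
    tp = sum(p and a for p, a in pairs)              # true positives
    tn = sum((not p) and (not a) for p, a in pairs)  # true negatives
    fp = sum(p and (not a) for p, a in pairs)        # false positives
    fn = sum((not p) and a for p, a in pairs)        # false negatives
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


# Hypothetical predictions for five molecules.
predicted = [True, True, False, False, True]
actual = [True, False, False, True, True]
print(classification_metrics(predicted, actual))
```

Comparing methods "at the same recall" then means choosing the score threshold of each method so that the recall values match and reading off precision and accuracy there.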
35Summary
- 2D method: performs about as well as other 2D
methods for single-molecule searches, outperforms
them by a large margin when combining molecules
(published in J. Chem. Inf. Comput. Sci. (2004)
44, 170-178)
- 3D method: combines high enrichment factors with
scaffold hopping (discovery of new chemotypes)
- Performance (at least in part) due to the Bayesian
classifier, which is able to take multiple
structures and active and inactive information
into account
36Acknowledgements
- Robert C Glen (Unilever Centre, Cambridge, UK)
- Hamse Y Mussa (Unilever Centre, Cambridge, UK)
- Stephan Reiling (Aventis, Bridgewater, USA)
- David Patterson (Tripos)
- Funding
- The Gates Cambridge Trust, Unilever, Tripos
- Chemical Computing Group / ACS Comp Division