Similarity and Diversity

Alexandre Varnek, University of Strasbourg, France

What is similar?

Different spaces, classified by

16 diverse aldehydes...

...sorted by common scaffold

...sorted by functional groups

The Similarity Principle

Structurally similar molecules are assumed to have similar biological properties.

Compounds active at opioid receptors

Structural Spectrum of Thrombin Inhibitors

Figure: thrombin inhibitors ordered by structural similarity to the reference compounds, with structural similarity fading away; the similarity values shown range from 0.84 down to 0.39.

Key features in similarity/diversity calculations

- Properties to describe elements (descriptors, fingerprints)
- Distance measure (metrics)

N-Dimensional Descriptor Space

- Each chosen descriptor adds a dimension to the reference space
- Calculation of n descriptor values produces an n-dimensional coordinate vector in descriptor space that determines the position of a molecule

Figure: coordinate axes labelled descriptor 1, descriptor 2, descriptor 3, ..., descriptor n.

Chemical Reference Space

- Distance in chemical space is used as a measure of molecular similarity and dissimilarity
- Molecular similarity covers not only chemical similarity but also property similarity, including biological activity

Figure: two molecules A and B in chemical space, separated by the distance $D_{AB}$.

Distance Metrics in n-D Space

- If two molecules have comparable values in all the n descriptors in the space, they are located close to each other in the n-D space.
- How to define closeness in space as a measure of molecular similarity?
- Distance metrics

Descriptor-based Similarity

- When two molecules A and B are projected into an n-D space, two vectors, A and B, represent their descriptor values, respectively.
- $A = (a_1, a_2, \ldots, a_n)$
- $B = (b_1, b_2, \ldots, b_n)$
- The similarity between A and B, $S_{AB}$, is negatively correlated with the distance $D_{AB}$
- shorter distance: more similar molecules
- in the case of a normalized distance (within the value range [0, 1]), similarity = 1 - distance

Figure: three molecules A, B and C in descriptor space with the distances $D_{AB}$ and $D_{BC}$.

$D_{AB} > D_{BC} \Rightarrow S_{AB} < S_{BC}$

Metrics Properties

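- The distance values: $d_{AB} \ge 0$; $d_{AA} = d_{BB} = 0$
- Symmetry properties: $d_{AB} = d_{BA}$
- Triangle inequality: $d_{AB} \le d_{AC} + d_{BC}$

Figure: triangle formed by molecules A, B and C with the distances $D_{AB}$ and $D_{BC}$.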

Euclidean Distance in n-D Space

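- Given two n-dimensional vectors, A and B
- $A = (a_1, a_2, \ldots, a_n)$
- $B = (b_1, b_2, \ldots, b_n)$
- Euclidean distance $D_{AB}$ is defined as $D_{AB} = \sqrt{\sum_{i=1}^{n} (a_i - b_i)^2}$
- Example
- $A = (3, 0, 1)$, $B = (5, 2, 0)$
- $D_{AB} = \sqrt{(3-5)^2 + (0-2)^2 + (1-0)^2} = \sqrt{9} = 3$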

Manhattan Distance in n-D Space

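- Given two n-dimensional vectors, A and B
- $A = (a_1, a_2, \ldots, a_n)$
- $B = (b_1, b_2, \ldots, b_n)$
- Manhattan distance $D_{AB}$ is defined as $D_{AB} = \sum_{i=1}^{n} |a_i - b_i|$
- Example
- $A = (3, 0, 1)$, $B = (5, 2, 0)$
- $D_{AB} = |3-5| + |0-2| + |1-0| = 5$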

Distance Measures (Metrics)

Euclidean distance: $[(x_{11} - x_{21})^2 + (x_{12} - x_{22})^2]^{1/2} = (4^2 + 2^2)^{1/2} = 4.472$

Manhattan (Hamming) distance: $|x_{11} - x_{21}| + |x_{12} - x_{22}| = 4 + 2 = 6$

Sup distance: $\max(|x_{11} - x_{21}|, |x_{12} - x_{22}|) = \max(4, 2) = 4$
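The three distance measures above can be sketched in a few lines of Python. This is a minimal illustration rather than code from the slides; the function names are chosen here for the example, and descriptor vectors are assumed to be plain tuples of numbers of equal length.

```python
import math


def euclidean(a, b):
    """Square root of the summed squared coordinate differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a, b):
    """Sum of the absolute coordinate differences (city-block distance)."""
    return sum(abs(x - y) for x, y in zip(a, b))


def sup_distance(a, b):
    """Largest absolute coordinate difference."""
    return max(abs(x - y) for x, y in zip(a, b))


# Vectors from the Euclidean and Manhattan examples
A, B = (3, 0, 1), (5, 2, 0)
print(euclidean(A, B))     # 3.0
print(manhattan(A, B))     # 5
print(sup_distance(A, B))  # 2
```

For a two-dimensional pair whose coordinates differ by 4 and 2, the same functions give about 4.472, 6 and 4, matching the worked values above.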

Binary Fingerprint

Popular Similarity/Distance Coefficients

- Similarity metrics
- Tanimoto coefficient
- Dice coefficient
- Cosine coefficient
- Distance metrics
- Euclidean distance
- Hamming distance
- Soergel distance

Tanimoto Coefficient (Tc)

- Definition
- value range [0, 1]
- Tc is also known as the Jaccard coefficient
- Tc is the most popular similarity coefficient

Example Tc Calculation
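The worked example from the slide is not part of this transcript, so the following is a small stand-in. It assumes the usual binary-fingerprint form of the coefficient, Tc = c / (a + b - c), where a and b are the numbers of bits set in each fingerprint and c is the number of bits set in both. The fingerprints are hypothetical and are stored as Python sets of on-bit positions.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto (Jaccard) coefficient of two fingerprints given as sets of on-bits."""
    a, b = len(fp_a), len(fp_b)
    c = len(fp_a & fp_b)          # bits set in both fingerprints
    if a + b - c == 0:
        return 1.0                # assumed convention for two empty fingerprints
    return c / (a + b - c)


# Hypothetical fingerprints: a = 4, b = 5, c = 3 (bits 0, 3 and 7 are shared)
fp_a = {0, 2, 3, 7}
fp_b = {0, 3, 5, 7, 9}
print(tanimoto(fp_a, fp_b))  # 0.5
```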

Dice Coefficient

- Definition
- value range [0, 1]
- monotonic with the Tanimoto coefficient

Cosine Coefficient

- Definition
- Properties
- value range [0, 1]
- correlated with the Tanimoto coefficient but not strictly monotonic with it
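As a hedged sketch of the Dice and Cosine coefficients for binary fingerprints, the code below assumes the standard forms 2c / (a + b) and c / sqrt(a * b), with a, b and c the on-bit counts of the two fingerprints and of their intersection. It also checks numerically the relation Dice = 2 Tc / (1 + Tc), which is why the two give the same ranking. Fingerprints are hypothetical, non-empty sets of on-bit positions.

```python
import math


def tanimoto(fp_a, fp_b):
    c = len(fp_a & fp_b)
    return c / (len(fp_a) + len(fp_b) - c)


def dice(fp_a, fp_b):
    """Dice coefficient: shared bits relative to the mean number of on-bits."""
    return 2 * len(fp_a & fp_b) / (len(fp_a) + len(fp_b))


def cosine(fp_a, fp_b):
    """Cosine coefficient: shared bits relative to the geometric mean of on-bits."""
    return len(fp_a & fp_b) / math.sqrt(len(fp_a) * len(fp_b))


fp_a = {0, 2, 3, 7}
fp_b = {0, 3, 5, 7, 9}
tc = tanimoto(fp_a, fp_b)
# Dice is a monotonic function of Tanimoto: Dice = 2 Tc / (1 + Tc)
print(math.isclose(dice(fp_a, fp_b), 2 * tc / (1 + tc)))
print(cosine(fp_a, fp_b))
```

No comparable one-to-one formula links Cosine to Tc, because the Cosine value depends on a and b separately and not only on a + b.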

Hamming Distance

- Definition
- value range [0, N] (N: length of the fingerprint)
- also called Manhattan/City Block distance

Soergel Distance

- Definition
- Properties
- value range [0, 1]
- equivalent to (1 - Tc) for binary fingerprints
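The two distances can be illustrated for binary fingerprints as follows. This sketch assumes the Hamming distance is the number of bit positions that differ and the Soergel distance is that number divided by the number of positions set in at least one fingerprint. Fingerprints are hypothetical sets of on-bit positions, as in a sparse representation.

```python
def hamming(fp_a, fp_b):
    """Number of bit positions in which the two fingerprints differ."""
    return len(fp_a ^ fp_b)


def soergel(fp_a, fp_b):
    """Differing bits divided by bits set in either fingerprint."""
    union = len(fp_a | fp_b)
    return len(fp_a ^ fp_b) / union if union else 0.0


def tanimoto(fp_a, fp_b):
    union = len(fp_a | fp_b)
    return len(fp_a & fp_b) / union if union else 1.0


fp_a = {0, 2, 3, 7}
fp_b = {0, 3, 5, 7, 9}
print(hamming(fp_a, fp_b))   # 3
print(soergel(fp_a, fp_b))   # 0.5
print(soergel(fp_a, fp_b) == 1 - tanimoto(fp_a, fp_b))  # True
```

The Hamming value grows with the number of bits involved, whereas the Soergel value is normalized by the size of the union.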

Similarity coefficients

Properties of Similarity and Distance Coefficients

Metric Properties

- (1) The distance values: $d_{AB} \ge 0$; $d_{AA} = d_{BB} = 0$
- (2) Symmetry properties: $d_{AB} = d_{BA}$
- (3) Triangle inequality: $d_{AB} \le d_{AC} + d_{BC}$

The Euclidean and Hamming distances and the complement of the Tanimoto coefficient (dichotomous variables) obey all three properties. The complements of the Tanimoto (continuous variables), Dice and Cosine coefficients do not obey the triangle inequality (3).

Two coefficients are monotonic with each other if they produce the same similarity ranking.

Similarity search

Using bit strings to encode molecular size. A biphenyl query is compared to a series of analogues of increasing size. The Tanimoto coefficient, which is shown next to the corresponding structure, decreases with increasing size, until a limiting value is reached.

D.R. Flower, J. Chem. Inf. Comput. Sci., 1998, 38 (3), 379-386

Similarity search

Molecular similarity at a range of Tanimoto coefficient values

D.R. Flower, J. Chem. Inf. Comput. Sci., 1998, 38 (3), 379-386

Similarity search

The distribution of Tanimoto coefficient values found in database searches with a range of query molecules of increasing size and complexity

D.R. Flower, J. Chem. Inf. Comput. Sci., 1998, 38 (3), 379-386

Molecular Similarity

A comparison of the Soergel and Hamming distance values for two pairs of structures to illustrate the effect of molecular size

A. R. Leach and V. J. Gillet, "An Introduction to Chemoinformatics", Kluwer Academic Publishers, 2003

Molecular Similarity

The maximum common subgraph (MCS) between the two molecules is in bold

Similarity = Nbonds(MCS) / Nbonds(query)

A. R. Leach and V. J. Gillet, "An Introduction to Chemoinformatics", Kluwer Academic Publishers, 2003

Activity landscape

How important is the choice of descriptors?

Inhibitors of acyl-CoA:cholesterol acyltransferase represented with MACCS (a), TGT (b), and Molprint2D (c) fingerprints.

continuous SARs

- gradual changes in structure result in moderate changes in activity
- "rolling hills" (G. Maggiora)

discontinuous SARs

- small changes in structure have dramatic effects on activity
- "cliffs" in activity landscapes

Structure-Activity Landscape Index: $\mathrm{SALI}_{ij} = \Delta A_{ij} / \Delta S_{ij}$

$\Delta A_{ij}$ ($\Delta S_{ij}$) is the difference between activities (similarities) of molecules i and j.

R. Guha et al., J. Chem. Inf. Model., 2008, 48, 646
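A minimal sketch of the index for one pair of compounds follows. It assumes the similarity term in the denominator is taken as 1 - sim(i, j), so that nearly identical structures with very different activities give large values; the activity values and the similarity used below are hypothetical.

```python
def sali(act_i, act_j, sim_ij):
    """Structure-Activity Landscape Index for one pair of molecules.

    act_i, act_j: activities (e.g. pIC50 values); sim_ij: similarity in [0, 1].
    Assumes the denominator is the structural dissimilarity 1 - sim_ij.
    """
    if sim_ij >= 1.0:
        return float("inf")  # identical fingerprints: the index is unbounded
    return abs(act_i - act_j) / (1.0 - sim_ij)


# Two similar compounds (sim = 0.9) whose activities differ by 3 log units
print(sali(8.0, 5.0, 0.9))   # about 30: a candidate activity cliff
# Two dissimilar compounds with the same activity difference
print(sali(8.0, 5.0, 0.4))   # about 5
```

Pairs with a high index are the "cliffs" of a discontinuous SAR; ranking all pairs by this value is one simple way to locate them.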

discontinuous SARs

VEGFR-2 tyrosine kinase inhibitors

- small changes in structure have dramatic effects on activity
- "cliffs" in activity landscapes
- lead optimization, QSAR

bad news for molecular similarity analysis...

Example of a Classical Discontinuous SAR

Any similarity method must recognize these compounds as being similar ... (MACCS Tanimoto similarity)

Adenosine deaminase inhibitors

Libraries design

Goal: to select a representative subset from a large database

Chemical Space

Overlapping similarity radii → redundancy. Void regions → lack of information.

Chemical Space

Void regions → lack of information

Chemical Space

No redundancy, no voids → optimally diverse compound library

Subset selection from the libraries

- Clustering
- Dissimilarity-based methods
- Cell-based methods
- Optimisation techniques

Clustering in chemistry

What is clustering?

- Clustering is the separation of a set of objects into groups such that items in one group are more like each other than items in a different group
- A technique to understand, simplify and interpret large amounts of multidimensional data
- Classification without labels (unsupervised learning)

Where is clustering used?

- General
- data mining, statistical data analysis, data compression, image segmentation, document classification (information retrieval)
- Chemical
- representative sample,
- subset selection,
- classification of new compounds

Overall strategy

- Select descriptors
- Generate descriptors for all items
- Scale descriptors
- Define similarity measure (metrics)
- Apply an appropriate clustering method to group the items on the basis of the chosen descriptors and similarity measure
- Analyse results

Data Presentation

Library contains n molecules, each molecule is described by p descriptors.

- Pattern matrix: n molecules by p descriptors
- Proximity matrix: n molecules by n molecules, with $d_{ii} = 0$ and $d_{ij} = d_{ji}$
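The step from a pattern matrix to a proximity matrix can be sketched as below. The example assumes Euclidean distance on descriptors that are already scaled, and the three-molecule pattern matrix is hypothetical. `math.dist` requires Python 3.8 or later.

```python
import math


def proximity_matrix(pattern):
    """Build the symmetric n x n distance matrix from an n x p pattern matrix."""
    n = len(pattern)
    d = [[0.0] * n for _ in range(n)]       # diagonal stays 0: d_ii = 0
    for i in range(n):
        for j in range(i + 1, n):
            dist = math.dist(pattern[i], pattern[j])
            d[i][j] = d[j][i] = dist        # symmetry: d_ij = d_ji
    return d


# Hypothetical pattern matrix: 3 molecules, 2 descriptors each
pattern = [[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]]
for row in proximity_matrix(pattern):
    print(row)
# [0.0, 5.0, 10.0]
# [5.0, 0.0, 5.0]
# [10.0, 5.0, 0.0]
```

Only the upper triangle is computed, which is why storing or evaluating n(n - 1)/2 values is enough for a proximity matrix.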

Clustering methods

Hierarchical

- Single Link
- Complete Link
- Group Average
- Weighted Group Average
- Centroid
- Median
- Ward
- Monothetic
- Polythetic

Non-hierarchical

- Single Pass
- Nearest Neighbour (Jarvis-Patrick)
- Mixture Model
- Relocation
- Topographic
- Others

Hierarchical Clustering

divisive

agglomerative

A dendrogram representing a hierarchical clustering of 7 compounds

Sequential Agglomerative Hierarchical Non-overlapping (SAHN) methods

Figure panels: Single link, Complete link, Group average.

In the Single Link method, the intercluster distance is equal to the minimum distance between any two compounds, one from each cluster.

In the Complete Link method, the intercluster distance is equal to the furthest distance between any two compounds, one from each cluster.

The Group Average method measures intercluster distance as the average of the distances between all pairs of compounds, one from each cluster.

Hierarchical Clustering: Johnson's method

- The algorithm is an agglomerative scheme that erases rows and columns in the proximity matrix as old clusters are merged into new ones.

Step 1. Group the closest pair of objects into a cluster

Step 2. Update the proximity matrix
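The agglomerative scheme and the three linkage rules can be illustrated together. This is a simplified sketch, not Johnson's matrix-erasing implementation: it recomputes each intercluster distance from the original proximity matrix at every step, which gives the same merges for these three linkages but is slower. Function names and the four-point data set are chosen here for illustration.

```python
def linkage_distance(c1, c2, dist, method="single"):
    """Intercluster distance from all cross-cluster pairwise distances."""
    pairs = [dist[i][j] for i in c1 for j in c2]
    if method == "single":
        return min(pairs)               # closest pair
    if method == "complete":
        return max(pairs)               # furthest pair
    if method == "average":
        return sum(pairs) / len(pairs)  # group average
    raise ValueError(f"unknown linkage: {method}")


def agglomerate(dist, method="single"):
    """Merge the two closest clusters until one remains; return the merge list."""
    clusters = [(i,) for i in range(len(dist))]
    merges = []
    while len(clusters) > 1:
        best = None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                d = linkage_distance(clusters[x], clusters[y], dist, method)
                if best is None or d < best[0]:
                    best = (d, x, y)
        d, x, y = best
        merges.append((clusters[x], clusters[y], d))
        merged = clusters[x] + clusters[y]
        clusters = [c for k, c in enumerate(clusters) if k not in (x, y)]
        clusters.append(merged)
    return merges


# Four hypothetical compounds placed on a line at 0, 1, 3 and 7
pts = [0, 1, 3, 7]
dist = [[abs(p - q) for q in pts] for p in pts]
print([m[2] for m in agglomerate(dist, "single")])    # [1, 2, 4]
print([m[2] for m in agglomerate(dist, "complete")])  # [1, 3, 7]
```

The merge heights are the levels at which branches join in the dendrogram; complete link produces larger heights than single link for the same data.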

Hierarchical Clustering: single link

Hierarchical Clustering: complete link

Hierarchical Clustering: single vs complete link

Non-Hierarchical Clustering: the Jarvis-Patrick method

- At the first step, all nearest neighbours of each compound are found by calculating all pairwise similarities and sorting according to this similarity.
- Then, two compounds are placed into the same cluster if
- they are in each other's list of m nearest neighbours;
- they have p (where p < m) nearest neighbours in common. Typical values: m = 14, p = 8.
- Problem: too many singletons.
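A compact sketch of the two-condition rule above is given below. It works from a distance matrix (smaller means more similar), reads "p neighbours in common" as "at least p", and joins qualifying pairs transitively with a small union-find. These choices, the function name and the toy data are assumptions made for illustration.

```python
def jarvis_patrick(dist, m, p):
    """Cluster indices 0..n-1 with the Jarvis-Patrick shared-neighbour rule."""
    n = len(dist)
    # m nearest neighbours of each compound, excluding the compound itself
    nn = [
        set(sorted((j for j in range(n) if j != i), key=lambda j: dist[i][j])[:m])
        for i in range(n)
    ]
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            mutual = j in nn[i] and i in nn[j]
            if mutual and len(nn[i] & nn[j]) >= p:
                parent[find(i)] = find(j)   # same cluster

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


# Two tight groups and one outlier on a line (hypothetical data)
pts = [0, 1, 2, 3, 20, 21, 22, 23, 50]
dist = [[abs(a - b) for b in pts] for a in pts]
print(jarvis_patrick(dist, m=3, p=2))
# [[0, 1, 2, 3], [4, 5, 6, 7], [8]]
```

The outlier ends up alone because it appears in nobody's neighbour list, which is the singleton problem noted above.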

Non-Hierarchical Clustering: the relocation methods

- Relocation algorithms involve an initial assignment of compounds to clusters, which is then iteratively refined by moving (relocating) certain compounds from one cluster to another.
- Example: the K-means method
- Random choice of c seed compounds. Other compounds are assigned to the nearest seed, resulting in an initial set of c clusters.
- The centroids of the clusters are calculated. The objects are re-assigned to the nearest cluster centroid.
- Problem: the method is dependent upon the initial set of cluster centroids.
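The K-means steps can be sketched in plain Python as follows. The sketch assumes squared Euclidean distance on descriptor tuples, repeats the assign/recompute cycle until assignments stop changing, and uses a hypothetical `seed` argument to make the random choice of seed compounds reproducible.

```python
import random


def kmeans(points, c, n_iter=100, seed=0):
    """Return (assignment, centroids) for c clusters of the given points."""
    rng = random.Random(seed)
    centroids = [list(pt) for pt in rng.sample(points, c)]  # c seed compounds
    assignment = None
    for _ in range(n_iter):
        # Assign every point to its nearest centroid
        new_assignment = [
            min(range(c),
                key=lambda k: sum((x - y) ** 2 for x, y in zip(pt, centroids[k])))
            for pt in points
        ]
        if new_assignment == assignment:
            break                                           # converged
        assignment = new_assignment
        # Recompute each centroid as the mean of its members
        for k in range(c):
            members = [pt for pt, a in zip(points, assignment) if a == k]
            if members:
                centroids[k] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centroids


# Two well-separated hypothetical groups in a 2-D descriptor space
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
assignment, centroids = kmeans(points, c=2)
print(assignment)
print(centroids)
```

Running it with different `seed` values on less well-separated data is a quick way to see the dependence on the initial centroids.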

Efficiency of Clustering Methods

N is the number of compounds and M is the number of clusters

Validity of clustering

- How many clusters are in the data?
- Does partitioning match the categories?
- Where should the dendrogram be cut?
- Which of two partitions fits the data better?

Dissimilarity-Based Compound Selection (DBCS)

- 4-step basic algorithm for DBCS
- 1. Select a compound and place it in the subset.
- 2. Calculate the dissimilarity between each compound remaining in the data set and the compounds in the subset.
- 3. Choose the next compound as that which is most dissimilar to the compounds in the subset.
- 4. If n < n0 (n0 being the desired number of compounds in the final subset), return to step 2.
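The four steps can be sketched as below. "Most dissimilar to the compounds in the subset" needs a concrete criterion; this example assumes the MaxMin rule (pick the compound whose smallest dissimilarity to any selected compound is largest), which is only one of several possible choices. The function name and the data are hypothetical.

```python
def dbcs_maxmin(dissim, n0, first=0):
    """Select n0 indices from a dissimilarity matrix using the MaxMin criterion."""
    subset = [first]                                   # step 1
    remaining = set(range(len(dissim))) - {first}
    while len(subset) < n0 and remaining:              # step 4
        # Steps 2 and 3: score each candidate by its distance to the
        # nearest selected compound, then take the highest-scoring one
        nxt = max(sorted(remaining),
                  key=lambda i: min(dissim[i][j] for j in subset))
        subset.append(nxt)
        remaining.remove(nxt)
    return subset


# Hypothetical compounds on a line; dissimilarity = absolute difference
pts = [0, 1, 2, 10, 11, 20]
dissim = [[abs(a - b) for b in pts] for a in pts]
print(dbcs_maxmin(dissim, n0=3))  # [0, 5, 3]
```

Starting from compound 0, the procedure first takes the far end (index 5) and then the compound midway between the two (index 3).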

Dissimilarity-Based Compound Selection (DBCS)

- Basic algorithm for DBCS
- 1st step: selection of the initial compound
- Select it at random
- Choose the molecule which is most representative (e.g., has the largest sum of similarities to other molecules)
- Choose the molecule which is most dissimilar (e.g., has the smallest sum of similarities to other molecules).

Dissimilarity-Based Compound Selection (DBCS)

- Basic algorithm for DBCS
- 2nd step: calculation of dissimilarity

- Dissimilarity is the opposite of similarity
- $(\text{Dissimilarity})_{i,j} = 1 - (\text{Similarity})_{i,j}$
- (where Similarity is the Tanimoto, Dice or Cosine coefficient)

Diversity

- Diversity characterises a set of molecules

Dissimilarity-Based Compound Selection (DBCS)

- Basic algorithm for DBCS
- 3rd step: selection of the most dissimilar compound

There are several methods to select a diverse subset containing m compounds

- Basic algorithm for DBCS
- 3rd step: selection of the most dissimilar compound

3) The Sphere Exclusion Algorithm

- 1. Define a threshold dissimilarity, t
- 2. Select a compound and place it in the subset.
- 3. Remove all molecules from the data set that have a dissimilarity to the selected molecule of less than t
- 4. Return to step 2 if there are molecules remaining in the data set.

- The next compound can be selected
- randomly
- using a MinMax-like method
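A minimal sketch of sphere exclusion with random selection of the next compound is shown here. The dissimilarity matrix, the threshold and the `seed` argument (added for reproducibility) are illustrative assumptions.

```python
import random


def sphere_exclusion(dissim, t, seed=0):
    """Pick compounds so that every pair in the subset is at least t apart."""
    rng = random.Random(seed)
    remaining = list(range(len(dissim)))
    subset = []
    while remaining:
        pick = rng.choice(remaining)          # select a compound
        subset.append(pick)
        # Remove the pick and everything inside its exclusion sphere
        remaining = [i for i in remaining
                     if i != pick and dissim[i][pick] >= t]
    return subset


# Hypothetical compounds on a line; dissimilarity = absolute difference
pts = [0, 1, 2, 10, 11, 20]
dissim = [[abs(a - b) for b in pts] for a in pts]
subset = sphere_exclusion(dissim, t=3)
print(sorted(subset))
```

With this threshold the result always contains three compounds, one from each group, although which member of a group is picked depends on the random seed. Unlike the basic DBCS loop, the subset size is set indirectly through t.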

DBCS Subset selection from the libraries

Cell-based methods

- Cell-based or partitioning methods operate within a predefined low-dimensional chemical space.
- If there are K axes (properties) and each is divided into $b_i$ bins, then the number of cells $N_{cells}$ in the multidimensional space is $N_{cells} = \prod_{i=1}^{K} b_i$

Cell-based methods

The construction of a 2-dimensional chemical space. LogP bins: <0, 0-3, 3-7 and >7. MW bins: 0-250, 250-500, 500-750 and >750.
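The two-axis partitioning can be sketched with the standard-library `bisect` module. The sketch assumes the LogP bin edges are 0, 3 and 7 and the MW bin edges are 250, 500 and 750 (four bins per axis, hence 4 x 4 = 16 cells), and that a value lying exactly on an edge falls into the upper bin. The compound names and property values are made up.

```python
import bisect
from collections import defaultdict

LOGP_EDGES = [0, 3, 7]        # bins: <0, 0-3, 3-7, >7
MW_EDGES = [250, 500, 750]    # bins: 0-250, 250-500, 500-750, >750


def cell_of(logp, mw):
    """Return the (LogP bin, MW bin) indices of a compound."""
    return (bisect.bisect_right(LOGP_EDGES, logp),
            bisect.bisect_right(MW_EDGES, mw))


def occupancy(compounds):
    """Map each occupied cell to the names of the compounds it contains."""
    cells = defaultdict(list)
    for name, logp, mw in compounds:
        cells[cell_of(logp, mw)].append(name)
    return cells


n_cells = (len(LOGP_EDGES) + 1) * (len(MW_EDGES) + 1)   # 16

# Hypothetical compounds: (name, LogP, MW)
library = [("cpd1", -1.0, 100.0), ("cpd2", 2.5, 300.0),
           ("cpd3", 2.9, 320.0), ("cpd4", 8.0, 800.0)]
cells = occupancy(library)
print(dict(cells))
print(n_cells - len(cells), "empty cells")   # 13 empty cells
```

No pairwise distance is computed at any point: each compound is placed independently, so voids show up simply as cells missing from the dictionary.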

Cell-based methods

- A key feature of cell-based methods is that they do not require the calculation of pairwise distances $D_{i,j}$ between compounds; the chemical space is defined independently of the molecules that are positioned within it.

- Advantages of cell-based methods
- Empty cells (voids) or cells with low occupancy can be easily identified.
- The diversity of different subsets can be easily compared by examining the overlap in the cells occupied by each subset.

Main problem: cell-based methods are restricted to relatively low-dimensional space

Optimisation techniques

- DBCS methods prepare a diverse subset by iteratively selecting ONE molecule at a time.
- Optimisation techniques provide an efficient way of sampling large search spaces and selecting diverse subsets

Optimisation techniques

- Example: Monte-Carlo search
- 1. Random selection of an initial subset and calculation of its diversity $D_i$.
- 2. A new subset is generated from the first by replacing some of its compounds with others selected at random.
- 3. The diversity of the new subset, $D_{i+1}$, is compared with $D_i$
- if $\Delta D = D_{i+1} - D_i > 0$, the new set is accepted
- if $\Delta D < 0$, the probability of acceptance depends on the Metropolis condition, $\exp(-|\Delta D| / kT)$.
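The Monte-Carlo search can be sketched as follows. Several details are assumptions made for the example: diversity is taken as the sum of pairwise dissimilarities within the subset, exactly one compound is swapped per step, a less diverse subset is accepted with probability exp(-|ΔD| / kT), and the best subset seen so far is remembered.

```python
import math
import random


def diversity(subset, dissim):
    """Assumed diversity function: sum of pairwise dissimilarities."""
    return sum(dissim[i][j] for k, i in enumerate(subset) for j in subset[k + 1:])


def monte_carlo_select(dissim, size, n_steps=2000, kT=0.1, seed=0):
    rng = random.Random(seed)
    n = len(dissim)
    current = rng.sample(range(n), size)          # random initial subset
    d_cur = diversity(current, dissim)
    best, d_best = list(current), d_cur
    for _ in range(n_steps):
        outside = [i for i in range(n) if i not in current]
        if not outside:
            break
        candidate = list(current)
        candidate[rng.randrange(size)] = rng.choice(outside)  # swap one compound
        d_new = diversity(candidate, dissim)
        delta = d_new - d_cur
        # Metropolis condition: always accept improvements; accept a less
        # diverse subset with probability exp(-|delta| / kT)
        if delta >= 0 or rng.random() < math.exp(delta / kT):
            current, d_cur = candidate, d_new
            if d_cur > d_best:
                best, d_best = list(current), d_cur
    return best, d_best


pts = [0, 1, 2, 10, 11, 20]
dissim = [[abs(a - b) for b in pts] for a in pts]
best, d_best = monte_carlo_select(dissim, size=3)
print(sorted(best), d_best)
```

A larger kT accepts more downhill moves and explores more widely; a very small kT makes the search behave almost like a greedy hill climb.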

Scaffolds and Frameworks

Frameworks

Bemis, G.W.; Murcko, M.A. J. Med. Chem. 1996, 39, 2887-2893

Frameworks

Dissection of a molecule according to Bemis and Murcko. Diazepam contains three sidechains and one framework with two ring systems and a zero-atom linker.

G. Schneider, P. Schneider, S. Renner, QSAR Comb. Sci. 2006, 25 (12), 1162-1171

Graph Frameworks for Compounds in the CMC Database (Numbers Indicate Frequency of Occurrence)

Bemis, G.W.; Murcko, M.A. J. Med. Chem. 1996, 39, 2887-2893

Scaffolds and Frameworks

- The Bemis and Murcko algorithm for framework generation
- hydrogens are removed,
- atoms with only one bond are removed successively,
- the scaffold is obtained,
- all atom types are set to C and all bond types are set to single bonds, which yields the framework.

Bemis, G.W.; Murcko, M.A. J. Med. Chem. 1996, 39, 2887-2893
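The pruning steps listed above can be illustrated on a bare molecular graph without a cheminformatics toolkit. In this sketch a molecule is a dictionary of atom index to element symbol plus a set of two-atom bonds; bond orders are not stored, so every bond is implicitly single. The data structure, function name and the anisole example are assumptions made for illustration.

```python
def murcko_framework(atoms, bonds):
    """Return (framework atoms, framework bonds) of a molecular graph.

    atoms: dict mapping atom index to element symbol
    bonds: set of frozenset({i, j}) pairs
    """
    # Remove hydrogens
    keep = {i for i, element in atoms.items() if element != "H"}
    edges = {b for b in bonds if b <= keep}
    # Successively remove atoms that have at most one remaining bond
    while True:
        degree = {i: 0 for i in keep}
        for b in edges:
            for i in b:
                degree[i] += 1
        terminal = {i for i, d in degree.items() if d <= 1}
        if not terminal:
            break
        keep -= terminal
        edges = {b for b in edges if b <= keep}
    # What is left is the scaffold; relabel every atom as carbon
    return {i: "C" for i in keep}, edges


# Anisole: benzene ring (atoms 0-5), O (6), methyl C (7) and one explicit H (8)
atoms = {0: "C", 1: "C", 2: "C", 3: "C", 4: "C", 5: "C", 6: "O", 7: "C", 8: "H"}
ring = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 0)]
bonds = {frozenset(b) for b in ring + [(0, 6), (6, 7), (7, 8)]}
fw_atoms, fw_bonds = murcko_framework(atoms, bonds)
print(sorted(fw_atoms))   # [0, 1, 2, 3, 4, 5]
print(len(fw_bonds))      # 6
```

Because only ring atoms and the linkers between rings survive the pruning, an acyclic molecule yields an empty framework in this sketch.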

In contrast to the Bemis and Murcko method, A. Monge proposed distinguishing aromatic and non-aromatic bonds (PhD thesis, Univ. Orléans, 2007)

Scaffold-Hopping: How Far Can You Jump? G. Schneider, P. Schneider, S. Renner, QSAR Comb. Sci. 2006, 25 (12), 1162-1171

The Scaffold Tree - Visualization of the Scaffold Universe by Hierarchical Scaffold Classification. A. Schuffenhauer, P. Ertl, S. Roggo, S. Wetzel, M. A. Koch, and H. Waldmann, J. Chem. Inf. Model., 2007, 47 (1), 47-58

Scaffold tree for the results of pyruvate kinase assay. Color intensity represents the ratio of active and inactive molecules with these scaffolds.

A. Schuffenhauer, P. Ertl, S. Roggo, S. Wetzel, M. A. Koch, and H. Waldmann, J. Chem. Inf. Model., 2007, 47 (1), 47-58