Similarity and Diversity Alexandre Varnek, University of Strasbourg, France - PowerPoint PPT Presentation

Slides: 81
Provided by: infochimU
Transcript and Presenter's Notes

Title: Similarity and Diversity Alexandre Varnek, University of Strasbourg, France


1
Similarity and Diversity
Alexandre Varnek, University of Strasbourg, France
2
What is similar?
3
Different spaces, classified by
4
16 diverse aldehydes...
5
...sorted by common scaffold
6
...sorted by functional groups
7
The Similarity Principle Structurally
similar molecules are assumed to have similar
biological properties
Compounds active at opioid receptors
8
Structural Spectrum of Thrombin Inhibitors
structural similarity fading away
reference compounds
[Figure: inhibitor structures labelled with similarity values 0.56, 0.72, 0.53, 0.84, 0.67, 0.52, 0.82, 0.64, 0.39]
9
Key features in similarity/diversity calculations
  • Properties to describe elements (descriptors,
    fingerprints)
  • Distance measure (metrics)

10
N-Dimensional Descriptor Space
  • Each chosen descriptor adds a dimension to the
    reference space
  • Calculation of n descriptor values produces an
    n-dimensional coordinate vector in descriptor
    space that determines the position of a molecule

[Figure: coordinate axes descriptor 1, descriptor 2, descriptor 3, ..., descriptor n]
11
Chemical Reference Space
  • Distance in chemical space is used as a measure
    of molecular similarity and dissimilarity
  • Molecular similarity covers not only chemical
    similarity but also property similarity, including
    biological activity

[Figure: molecules A and B separated by the distance $D_{AB}$]
12
Distance Metrics in n-D Space
  • If two molecules have comparable values in all
    the n descriptors in the space, they are located
    close to each other in the n-D space.
  • How can closeness in space be defined as a measure
    of molecular similarity?
  • By means of distance metrics

13
Descriptor-based Similarity
  • When two molecules A and B are projected into an
    n-D space, two vectors, A and B, represent their
    descriptor values, respectively.
  • $A = (a_1, a_2, \ldots, a_n)$
  • $B = (b_1, b_2, \ldots, b_n)$
  • The similarity between A and B, $S_{AB}$, is
    negatively correlated with the distance $D_{AB}$
  • shorter distance = more similar molecules
  • in the case of a normalized distance (within the value
    range [0, 1]), similarity = 1 − distance

[Figure: molecules A, B and C with distances $D_{AB}$ and $D_{BC}$]
$D_{AB} > D_{BC} \Rightarrow S_{AB} < S_{BC}$
14
Metrics Properties
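  • Distance values are non-negative: $d_{AB} \ge 0$; $d_{AA} = d_{BB} = 0$
  • Symmetry: $d_{AB} = d_{BA}$
  • Triangle inequality: $d_{AB} \le d_{AC} + d_{BC}$

[Figure: molecules A, B and C with distances $D_{AB}$ and $D_{BC}$]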
15
Euclidean Distance in n-D Space
  • Given two n-dimensional vectors, A and B
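  • $A = (a_1, a_2, \ldots, a_n)$
  • $B = (b_1, b_2, \ldots, b_n)$
  • Euclidean distance $D_{AB}$ is defined as
    $D_{AB} = \sqrt{\sum_{i=1}^{n} (a_i - b_i)^2}$
  • Example
  • $A = (3, 0, 1)$, $B = (5, 2, 0)$
  • $D_{AB} = \sqrt{(3-5)^2 + (0-2)^2 + (1-0)^2} = 3$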

16
Manhattan Distance in n-D Space
  • Given two n-dimensional vectors, A and B
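  • $A = (a_1, a_2, \ldots, a_n)$
  • $B = (b_1, b_2, \ldots, b_n)$
  • Manhattan distance $D_{AB}$ is defined as
    $D_{AB} = \sum_{i=1}^{n} |a_i - b_i|$
  • Example
  • $A = (3, 0, 1)$, $B = (5, 2, 0)$
  • $D_{AB} = |3-5| + |0-2| + |1-0| = 5$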

17
Distance Measures (Metrics)
Euclidean distance: $[(x_{11} - x_{21})^2 + (x_{12} - x_{22})^2]^{1/2} = (4^2 + 2^2)^{1/2} = 4.472$
Manhattan (Hamming) distance: $|x_{11} - x_{21}| + |x_{12} - x_{22}| = 4 + 2 = 6$
Sup distance: $\max(|x_{11} - x_{21}|, |x_{12} - x_{22}|) = \max(4, 2) = 4$
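
The three distance measures above can be checked numerically. The following is a minimal sketch in plain Python; the function names are illustrative and not taken from the slides. It reuses the vectors A = (3, 0, 1) and B = (5, 2, 0) from the earlier examples.

```python
import math

def euclidean(a, b):
    """Square root of the sum of squared coordinate differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    """Sum of absolute coordinate differences (city-block distance)."""
    return sum(abs(x - y) for x, y in zip(a, b))

def sup_distance(a, b):
    """Largest absolute coordinate difference."""
    return max(abs(x - y) for x, y in zip(a, b))

A = (3, 0, 1)
B = (5, 2, 0)
print(euclidean(A, B))     # 3.0
print(manhattan(A, B))     # 5
print(sup_distance(A, B))  # 2
```

With coordinate differences of 4 and 2, the same functions give 4.472, 6 and 4, as on the slide.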
18
Binary Fingerprint
19
Popular Similarity/Distance Coefficients
  • Similarity metrics
      • Tanimoto coefficient
      • Dice coefficient
      • Cosine coefficient
  • Distance metrics
      • Euclidean distance
      • Hamming distance
      • Soergel distance

20
Tanimoto Coefficient (Tc)
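  • Definition: $T_c = c / (a + b - c)$, where a and b are the
    numbers of bits set in fingerprints A and B, and c is
    the number of bits set in both
  • value range [0, 1]
  • Tc is also known as the Jaccard coefficient
  • Tc is the most popular similarity coefficient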

21
Example Tc Calculation
22
Dice Coefficient
  • Definition: $2c / (a + b)$ (a, b: bits set in A, B;
    c: bits set in both)
  • value range [0, 1]
  • monotonic with the Tanimoto coefficient

23
Cosine Coefficient
  • Definition: $c / \sqrt{ab}$ (a, b: bits set in A, B;
    c: bits set in both)
  • Properties
  • value range [0, 1]
  • correlated with the Tanimoto coefficient but not
    strictly monotonic with it

24
Hamming Distance
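  • Definition: $a + b - 2c$, the number of bit positions in
    which A and B differ (a, b: bits set in A, B; c: bits
    set in both)
  • value range [0, N] (N, length of the
    fingerprint)
  • also called Manhattan/City Block distance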

25
Soergel Distance
  • Definition: $(a + b - 2c) / (a + b - c)$ (a, b: bits set
    in A, B; c: bits set in both)
  • Properties
  • value range [0, 1]
  • equivalent to (1 − Tc) for binary fingerprints
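
The five coefficients for binary fingerprints can be sketched as follows. This assumes the usual bit-count formulation, with a and b the numbers of bits set in A and B and c the number of bits set in both. The names and the two 8-bit fingerprints are illustrative, not taken from the slides, and each fingerprint is assumed to have at least one bit set.

```python
import math

def bit_counts(fp_a, fp_b):
    """Return (a, b, c): bits set in A, bits set in B, bits set in both."""
    a = sum(fp_a)
    b = sum(fp_b)
    c = sum(x & y for x, y in zip(fp_a, fp_b))
    return a, b, c

def tanimoto(fp_a, fp_b):
    a, b, c = bit_counts(fp_a, fp_b)
    return c / (a + b - c)

def dice(fp_a, fp_b):
    a, b, c = bit_counts(fp_a, fp_b)
    return 2 * c / (a + b)

def cosine(fp_a, fp_b):
    a, b, c = bit_counts(fp_a, fp_b)
    return c / math.sqrt(a * b)

def hamming(fp_a, fp_b):
    a, b, c = bit_counts(fp_a, fp_b)
    return a + b - 2 * c  # number of positions where the bits differ

def soergel(fp_a, fp_b):
    a, b, c = bit_counts(fp_a, fp_b)
    return (a + b - 2 * c) / (a + b - c)

# Illustrative fingerprints: a = 4, b = 4, c = 3
fp_a = [1, 0, 1, 1, 0, 0, 1, 0]
fp_b = [1, 1, 1, 0, 0, 0, 1, 0]
print(tanimoto(fp_a, fp_b), dice(fp_a, fp_b), cosine(fp_a, fp_b))
print(hamming(fp_a, fp_b), soergel(fp_a, fp_b))
```

For this pair the Soergel distance equals 1 − Tc, in line with the property stated above.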

28
Similarity coefficients
29
Properties of Similarity and Distance
Coefficients
Metric Properties
  • 1. Distance values are non-negative: $d_{AB} \ge 0$; $d_{AA} = d_{BB} = 0$
  • 2. Symmetry: $d_{AB} = d_{BA}$
  • 3. Triangle inequality: $d_{AB} \le d_{AC} + d_{BC}$

The Euclidean and Hamming distances and the
Tanimoto coefficient (for dichotomous variables,
taken as the distance 1 − Tc) obey all three
properties. Otherwise the Tanimoto, Dice and
Cosine coefficients do not obey the triangle
inequality (3).
Coefficients are monotonic if they produce the
same similarity ranking.
30
Similarity search
Using bit strings to encode molecular size. A
biphenyl query is compared to a series of
analogues of increasing size. The Tanimoto
coefficient, which is shown next to the
corresponding structure, decreases with
increasing size, until a limiting value
is reached.
D.R. Flower, J. Chem. Inf. Comput. Sci., Vol. 38,
No. 3, 1998, pp. 379-386
31
Similarity search
Molecular similarity at a range of Tanimoto
coefficient values
D.R. Flower, J. Chem. Inf. Comput. Sci., Vol. 38,
No. 3, 1998, pp. 379-386
32
Similarity search
The distribution of Tanimoto coefficient values
found in database searches with a range of query
molecules of increasing size and complexity
D.R. Flower, J. Chem. Inf. Comput. Sci., Vol. 38,
No. 3, 1998, pp. 379-386
33
Molecular Similarity
A comparison of the Soergel and Hamming distance
values for two pairs of structures to illustrate
the effect of molecular size
A. R. Leach and V. J. Gillet, "An Introduction to
Chemoinformatics", Kluwer Academic Publishers,
2003
34
Molecular Similarity
The maximum common subgraph (MCS) between the two
molecules is in bold
Similarity = $N_{bonds}(MCS) / N_{bonds}(query)$
A. R. Leach and V. J. Gillet, "An Introduction to
Chemoinformatics", Kluwer Academic Publishers,
2003
35
Activity landscape
36
How important is the choice of descriptors?
Inhibitors of acyl-CoA:cholesterol
acyltransferase represented with MACCS (a), TGT
(b), and Molprint2D (c) fingerprints.
37
continuous SARs
  • gradual changes in structure result in moderate
    changes in activity
  • rolling hills (G. Maggiora)

discontinuous SARs
  • small changes in structure have dramatic effects
    on activity
  • cliffs in activity landscapes

Structure-Activity Landscape Index: $SALI_{ij} = \Delta A_{ij} / \Delta S_{ij}$
$\Delta A_{ij}$ ($\Delta S_{ij}$) is the difference between
activities (similarities) of molecules i and j
R. Guha et al., J. Chem. Inf. Model., 2008, 48, 646
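
A minimal sketch of the index, assuming the form |A_i − A_j| / (1 − sim(i, j)) that the cited paper uses, so that the similarity term is read as the distance from identity. The function name and the numbers are illustrative only.

```python
def sali(activity_i, activity_j, similarity_ij):
    """Structure-Activity Landscape Index for one pair of molecules.

    Assumes similarity_ij lies in [0, 1]. Pairs with similarity 1 give a
    zero denominator and are reported as infinity here (a choice made for
    this sketch, not something stated on the slide).
    """
    if similarity_ij >= 1.0:
        return float("inf")
    return abs(activity_i - activity_j) / (1.0 - similarity_ij)

# Similar structures with very different activities: large SALI (a cliff)
print(sali(8.0, 5.0, 0.9))
# Dissimilar structures with close activities: small SALI
print(sali(8.0, 7.5, 0.5))
```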
38
discontinuous SARs
VEGFR-2 tyrosine kinase inhibitors
  • small changes in structure have dramatic effects
    on activity
  • cliffs in activity landscapes
  • lead optimization, QSAR

bad news for molecular similarity analysis...
39
Example of a Classical Discontinuous SAR
Any similarity method must recognize
these compounds as being similar ...
(MACCS Tanimoto similarity)
Adenosine deaminase inhibitors

40
Library design
Goal: to select a representative subset from a
large database
41
Chemical Space
Overlapping similarity radii → Redundancy
Void regions → Lack of information
42
Chemical Space
Void regions → Lack of information
43
Chemical Space
No redundancy, no voids → Optimally diverse
compound library
44
Subset selection from the libraries
  • Clustering
  • Dissimilarity-based methods
  • Cell-based methods
  • Optimisation techniques

45
Clustering in chemistry
46
What is clustering?
  • Clustering is the separation of a set of objects
    into groups such that items in one group are more
    like each other than items in a different group
  • A technique to understand, simplify and
    interpret large amounts of multidimensional data
  • Classification without labels (unsupervised
    learning)

47
Where is clustering used?
  • General
      • data mining, statistical data analysis, data
        compression, image segmentation, document
        classification (information retrieval)
  • Chemical
      • representative sample,
      • subset selection,
      • classification of new compounds

48
Overall strategy
  • Select descriptors
  • Generate descriptors for all items
  • Scale descriptors
  • Define the similarity measure (metric)
  • Apply an appropriate clustering method to group
    the items on the basis of the chosen descriptors and
    similarity measure
  • Analyse results

49
Data Presentation
Library contains n molecules, each molecule is
described by p descriptors
Pattern matrix: molecules × descriptors (n × p)
Proximity matrix: molecules × molecules (n × n), with $d_{ii} = 0$ and $d_{ij} = d_{ji}$
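
The step from pattern matrix to proximity matrix can be sketched as below. Two choices here are assumptions rather than anything fixed by the slides: descriptors are autoscaled (zero mean, unit variance per column) and the proximity measure is the Euclidean distance. The descriptor values are made up for illustration.

```python
import math

def autoscale(pattern):
    """Scale each descriptor column to zero mean and unit variance."""
    n = len(pattern)
    columns = list(zip(*pattern))
    means = [sum(col) / n for col in columns]
    # A constant column has zero spread; fall back to 1.0 to avoid dividing by 0
    stds = [math.sqrt(sum((v - m) ** 2 for v in col) / n) or 1.0
            for col, m in zip(columns, means)]
    return [[(v - m) / s for v, m, s in zip(row, means, stds)]
            for row in pattern]

def proximity_matrix(pattern):
    """n x n Euclidean distance matrix with d[i][i] = 0 and d[i][j] = d[j][i]."""
    n = len(pattern)
    d = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            dist = math.sqrt(sum((x - y) ** 2
                                 for x, y in zip(pattern[i], pattern[j])))
            d[i][j] = d[j][i] = dist
    return d

# 3 molecules (rows) x 2 descriptors (columns), hypothetical values
pattern = [[1.0, 200.0], [2.0, 300.0], [3.0, 400.0]]
d = proximity_matrix(autoscale(pattern))
for row in d:
    print([round(v, 3) for v in row])
```

Without scaling, the second descriptor would dominate the distances simply because its numerical range is larger.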
50
Clustering methods
Hierarchical
  • Single Link
  • Complete Link
  • Group Average
  • Weighted Group Average
  • Centroid
  • Median
  • Ward
  • Monothetic
  • Polythetic
Non-hierarchical
  • Single Pass
  • Nearest Neighbour (Jarvis-Patrick)
  • Mixture Model
  • Relocation
  • Topographic
  • Others
51
Hierarchical Clustering
Figure labels: divisive, agglomerative
A dendrogram representing a hierarchical
clustering of 7 compounds
52
Sequential Agglomerative Hierarchical
Non-overlapping (SAHN) methods
[Figure panels: single link, complete link, group average]
In the Single Link method, the intercluster
distance is equal to the minimum distance between
any two compounds, one from each cluster.
In the Complete Link method, the intercluster
distance is equal to the largest distance
between any two compounds, one from each cluster.
The Group Average method measures intercluster
distance as the average of the distances between
all pairs of compounds, one from each of the two clusters.
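
The three intercluster distances can be written directly from these definitions. This is a small sketch; the 4 × 4 distance matrix and the function names are illustrative.

```python
def single_link(d, c1, c2):
    """Minimum distance between any two compounds, one from each cluster."""
    return min(d[i][j] for i in c1 for j in c2)

def complete_link(d, c1, c2):
    """Largest distance between any two compounds, one from each cluster."""
    return max(d[i][j] for i in c1 for j in c2)

def group_average(d, c1, c2):
    """Average distance over all pairs, one compound from each cluster."""
    return sum(d[i][j] for i in c1 for j in c2) / (len(c1) * len(c2))

# Hypothetical symmetric distance matrix for 4 compounds
d = [[0, 1, 4, 5],
     [1, 0, 3, 6],
     [4, 3, 0, 2],
     [5, 6, 2, 0]]
c1, c2 = [0, 1], [2, 3]
print(single_link(d, c1, c2))    # 3
print(complete_link(d, c1, c2))  # 6
print(group_average(d, c1, c2))  # 4.5
```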
53
Hierarchical Clustering: Johnson's method
  • The algorithm is an agglomerative scheme that
    erases rows and columns in the proximity matrix
    as old clusters are merged into new ones.

Step 1. Group the closest pair of objects into a cluster
Step 2. Update the proximity matrix
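
The two steps can be sketched as a loop over a shrinking proximity matrix. This is an illustrative implementation, not the author's code. The matrix is stored as a dictionary keyed by cluster-label pairs, and the `linkage` argument is an assumption of this sketch: `min` gives single link and `max` gives complete link.

```python
def johnson_clustering(d, linkage=min):
    """Agglomerative clustering by repeated merging of the closest pair.

    d is a symmetric distance matrix (list of lists). Returns a list of
    (members, level) tuples, one per merge, in merge order.
    """
    n = len(d)
    clusters = {i: (i,) for i in range(n)}            # label -> member tuple
    dist = {(i, j): d[i][j] for i in range(n) for j in range(i + 1, n)}
    merges = []
    next_label = n
    while len(clusters) > 1:
        # Step 1: group the closest pair of clusters
        (i, j), level = min(dist.items(), key=lambda kv: kv[1])
        members = clusters[i] + clusters[j]
        # Step 2: update the proximity matrix (erase old rows/columns, add one)
        new_row = {}
        for k in clusters:
            if k not in (i, j):
                d_ik = dist[(min(i, k), max(i, k))]
                d_jk = dist[(min(j, k), max(j, k))]
                new_row[k] = linkage(d_ik, d_jk)
        dist = {pair: v for pair, v in dist.items()
                if i not in pair and j not in pair}
        del clusters[i], clusters[j]
        for k, v in new_row.items():
            dist[(k, next_label)] = v                 # k < next_label always
        clusters[next_label] = members
        merges.append((members, level))
        next_label += 1
    return merges

d = [[0, 1, 4, 5],
     [1, 0, 3, 6],
     [4, 3, 0, 2],
     [5, 6, 2, 0]]
print(johnson_clustering(d, linkage=min))  # single link
print(johnson_clustering(d, linkage=max))  # complete link
```

The merge levels are the heights at which the corresponding branches join in the dendrogram.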
54
Hierarchical Clustering: single link
55
Hierarchical Clustering: complete link
56
Hierarchical Clustering: single vs complete
link
57
Non-Hierarchical Clustering: the Jarvis-Patrick
method
  • In the first step, the nearest neighbours of each
    compound are found by calculating all pairwise
    similarities and sorting according to this
    similarity.
  • Then, two compounds are placed into the same
    cluster if
      • they are in each other's list of m nearest
        neighbours, and
      • they have p (where p < m) nearest neighbours in
        common. Typical values: m = 14, p = 8.
  • Problem: too many singletons.
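
A compact sketch of the Jarvis-Patrick rule, assuming both conditions must hold and working from a distance matrix (smaller means more similar). Linked compounds are gathered with a simple union-find structure. The one-dimensional toy data and the small values of m and p are illustrative only.

```python
def jarvis_patrick(d, m, p):
    """Cluster compounds that are mutual m-nearest neighbours sharing >= p neighbours."""
    n = len(d)
    neighbours = []
    for i in range(n):
        order = sorted((j for j in range(n) if j != i), key=lambda j: d[i][j])
        neighbours.append(set(order[:m]))

    parent = list(range(n))

    def find(x):
        # Follow parent links to the cluster representative
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i in range(n):
        for j in range(i + 1, n):
            mutual = j in neighbours[i] and i in neighbours[j]
            shared = len(neighbours[i] & neighbours[j])
            if mutual and shared >= p:
                parent[find(i)] = find(j)

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())

# Toy data: positions on a line, distance = absolute difference
coords = [0, 1, 2, 10, 11, 12, 50]
d = [[abs(x - y) for y in coords] for x in coords]
print(jarvis_patrick(d, m=2, p=1))  # [[0, 1, 2], [3, 4, 5], [6]]
```

The isolated point at 50 ends up as a singleton, which illustrates the problem noted above.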

58
Non-Hierarchical Clustering: the relocation
methods
  • Relocation algorithms involve an initial
    assignment of compounds to clusters, which is
    then iteratively refined by moving (relocating)
    certain compounds from one cluster to another.
  • Example: the K-means method
      • Random choice of c "seed" compounds. Other
        compounds are assigned to the nearest seed,
        resulting in an initial set of c clusters.
      • The centroids of the clusters are calculated. The
        objects are re-assigned to the nearest cluster
        centroid.
  • Problem: the method is dependent upon the initial
    set of cluster centroids.
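
The K-means steps can be sketched as follows. The Euclidean distance, the stopping rule (no change in assignments) and the `seed` and `max_iter` parameters are assumptions of this sketch. The two-dimensional points are made up.

```python
import math
import random

def distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def k_means(points, c, seed=0, max_iter=100):
    """Return (assignment, centroids) for c clusters."""
    rng = random.Random(seed)
    # Random choice of c seed compounds as initial centroids
    centroids = [list(p) for p in rng.sample(points, c)]
    assignment = None
    for _ in range(max_iter):
        # Assign every object to the nearest centroid
        new_assignment = [min(range(c), key=lambda k: distance(p, centroids[k]))
                          for p in points]
        if new_assignment == assignment:
            break  # no compound was relocated
        assignment = new_assignment
        # Recalculate the centroid of each non-empty cluster
        for k in range(c):
            members = [p for p, a in zip(points, assignment) if a == k]
            if members:
                centroids[k] = [sum(v) / len(members) for v in zip(*members)]
    return assignment, centroids

points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, centres = k_means(points, c=2)
print(labels)
print(centres)
```

Running the function with different `seed` values is a simple way to see the dependence on the initial centroids for less clearly separated data.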

59
Efficiency of Clustering Methods
N is the number of compounds and M is the number
of clusters
60
Validity of clustering
  • How many clusters are in the data ?
  • Does partitioning match the categories ?
  • Where should the dendrogram be cut?
  • Which of two partitions fits the data better?

61
Dissimilarity-Based Compound Selection (DBCS)
  • 4-step basic algorithm for DBCS
  • 1. Select a compound and place it in the subset.
  • 2. Calculate the dissimilarity between each compound
    remaining in the data set and the compounds in
    the subset.
  • 3. Choose the next compound as that which is most
    dissimilar to the compounds in the subset.
  • 4. If n < n0 (n being the current size of the subset
    and n0 the desired number of compounds in the
    final subset), return to step 2.
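
A sketch of the four steps. The slide leaves open how "most dissimilar to the compounds in the subset" is scored. This example assumes the MaxMin rule, in which the next compound is the one whose smallest dissimilarity to any selected compound is largest. The function name, the `first` parameter and the toy data are illustrative.

```python
def dbcs_maxmin(dissim, n0, first=0):
    """Select n0 compounds from a dissimilarity matrix with the MaxMin rule."""
    subset = [first]                                   # step 1
    remaining = set(range(len(dissim))) - {first}
    while len(subset) < n0 and remaining:              # step 4
        # Steps 2 and 3: score each remaining compound by its smallest
        # dissimilarity to the subset and take the largest score
        nxt = max(sorted(remaining),
                  key=lambda i: min(dissim[i][s] for s in subset))
        subset.append(nxt)
        remaining.remove(nxt)
    return subset

# Toy data: positions on a line, dissimilarity = absolute difference
coords = [0, 1, 2, 10, 11, 50]
dissim = [[abs(x - y) for y in coords] for x in coords]
print(dbcs_maxmin(dissim, n0=3))  # [0, 5, 4]
```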

62
Dissimilarity-Based Compound Selection (DBCS)
  • Basic algorithm for DBCS
  • 1st step: selection of the initial compound
      • Select it at random
      • Choose the molecule which is "most
        representative" (e.g., has the largest sum of
        similarities to other molecules)
      • Choose the molecule which is "most dissimilar"
        (e.g., has the smallest sum of similarities to
        other molecules).

63
Dissimilarity-Based Compound Selection (DBCS)
  • Basic algorithm for DBCS
  • 2nd step: calculation of dissimilarity
      • Dissimilarity is the opposite of similarity
      • $\text{Dissimilarity}_{i,j} = 1 - \text{Similarity}_{i,j}$
      • (where "Similarity" is the Tanimoto, Dice or
        Cosine coefficient)

64
Diversity
  • Diversity characterises a set of molecules

65
Dissimilarity-Based Compound Selection (DBCS)
  • Basic algorithm for DBCS
  • 3rd step: selection of the most dissimilar compound

There are several methods to select a diverse
subset containing m compounds
66
  • Basic algorithm for DBCS
  • 3rd step: selection of the most dissimilar compound

3) The Sphere Exclusion Algorithm
  • 1. Define a threshold dissimilarity, t
  • 2. Select a compound and place it in the subset.
  • 3. Remove all molecules from the data set that have
    a dissimilarity to the selected molecule of less
    than t
  • 4. Return to step 2 if there are molecules remaining
    in the data set.
  • The next compound can be selected
      • randomly
      • using a MinMax-like method
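
A sketch of sphere exclusion. For simplicity the next compound is taken in a fixed order supplied by the caller (index order by default), which is an assumption of this example. A shuffled order would give the random variant.

```python
def sphere_exclusion(dissim, t, order=None):
    """Select compounds so that no two selected ones are closer than t."""
    remaining = list(range(len(dissim))) if order is None else list(order)
    subset = []
    while remaining:
        chosen = remaining.pop(0)          # select a compound
        subset.append(chosen)
        # Remove everything inside the exclusion sphere of radius t
        remaining = [i for i in remaining if dissim[i][chosen] >= t]
    return subset

# Toy data: positions on a line, dissimilarity = absolute difference
coords = [0, 1, 2, 10, 11, 50]
dissim = [[abs(x - y) for y in coords] for x in coords]
print(sphere_exclusion(dissim, t=5))   # [0, 3, 5]
```

Unlike the basic DBCS loop, the size of the resulting subset is not fixed in advance; it follows from the threshold t.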

67
DBCS: subset selection from the libraries
68
Cell-based methods
  • Cell-based or partitioning methods operate
    within a predefined low-dimensional chemical
    space.
  • If there are K axes (properties) and each is
    divided into $b_i$ bins, then the number of cells
    $N_{cells}$ in the multidimensional space is
    $N_{cells} = \prod_{i=1}^{K} b_i$

69
Cell-based methods
The construction of a 2-dimensional chemical space.
LogP bins: <0, 0-3, 3-7 and >7. MW bins:
0-250, 250-500, 500-750, >750.
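
The binning can be sketched with the standard `bisect` module. The compound names and property values are hypothetical, the LogP bins are assumed to be <0, 0-3, 3-7 and >7, and a value lying exactly on a bin edge is assigned to the upper bin (a choice made for this sketch).

```python
from bisect import bisect_right
from collections import defaultdict

LOGP_EDGES = [0, 3, 7]        # bins: <0, 0-3, 3-7, >7
MW_EDGES = [250, 500, 750]    # bins: 0-250, 250-500, 500-750, >750

def number_of_cells(bins_per_axis):
    """Product of the numbers of bins over all axes."""
    n = 1
    for b in bins_per_axis:
        n *= b
    return n

def cell_index(logp, mw):
    """Return the (LogP bin, MW bin) pair that identifies a cell."""
    return bisect_right(LOGP_EDGES, logp), bisect_right(MW_EDGES, mw)

def occupancy(compounds):
    """Map each occupied cell to the names of the compounds it contains."""
    cells = defaultdict(list)
    for name, logp, mw in compounds:
        cells[cell_index(logp, mw)].append(name)
    return dict(cells)

# Hypothetical compounds: (name, LogP, MW)
compounds = [("cpd1", -0.5, 180.0), ("cpd2", 2.5, 300.0),
             ("cpd3", 2.9, 310.0), ("cpd4", 8.1, 800.0)]
cells = occupancy(compounds)
total = number_of_cells([len(LOGP_EDGES) + 1, len(MW_EDGES) + 1])
print(cells)
print("empty cells:", total - len(cells))
```

No pairwise distances are computed at any point: each compound is placed in its cell independently of the others.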
70
Cell-based methods
  • A key feature of cell-based methods is that they
    do not require the calculation of pairwise
    distances $D_{i,j}$ between compounds: the chemical
    space is defined independently of the molecules
    that are positioned within it.
  • Advantages of cell-based methods
      • Empty cells (voids) or cells with low occupancy
        can be easily identified.
      • The diversity of different subsets can be easily
        compared by examining the overlap in the cells
        occupied by each subset.

Main problem: cell-based methods are restricted to
relatively low-dimensional spaces
71
Optimisation techniques
  • DBCS methods prepare a diverse subset by selecting
    iteratively ONE molecule at a time.
  • Optimisation techniques provide an efficient way
    of sampling large search spaces and selecting
    diverse subsets

72
Optimisation techniques
  • Example: Monte-Carlo search
  • 1. Random selection of an initial subset and
    calculation of its diversity D.
  • 2. A new subset is generated from the first
    by replacing some of its compounds with others
    selected at random.
  • 3. The diversity of the new subset, $D_{i+1}$, is
    compared with $D_i$:
      • if $\Delta D = D_{i+1} - D_i > 0$, the new set is
        accepted
      • if $\Delta D < 0$, the probability of acceptance
        depends on the Metropolis condition,
        $\exp(-|\Delta D| / kT)$.
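
A sketch of the Monte-Carlo search under stated assumptions: diversity is taken as the sum of pairwise dissimilarities within the subset, one compound is swapped per step, and a worse subset is accepted with probability exp(−|ΔD| / kT). The parameter values and toy data are illustrative, and the subset is assumed to be smaller than the data set.

```python
import math
import random

def diversity(subset, dissim):
    """Sum of pairwise dissimilarities within the subset (an assumed measure)."""
    s = sorted(subset)
    return sum(dissim[i][j] for a, i in enumerate(s) for j in s[a + 1:])

def monte_carlo_select(dissim, size, steps=1000, kT=0.1, seed=0):
    """Return (best subset, its diversity) found by a Metropolis random walk."""
    rng = random.Random(seed)
    n = len(dissim)
    current = set(rng.sample(range(n), size))          # step 1
    d_cur = diversity(current, dissim)
    best, d_best = set(current), d_cur
    for _ in range(steps):
        # Step 2: replace one compound with a randomly chosen outsider
        out = rng.choice(sorted(current))
        inn = rng.choice(sorted(set(range(n)) - current))
        trial = (current - {out}) | {inn}
        d_new = diversity(trial, dissim)
        delta = d_new - d_cur                          # step 3
        if delta > 0 or rng.random() < math.exp(-abs(delta) / kT):
            current, d_cur = trial, d_new
            if d_cur > d_best:
                best, d_best = set(current), d_cur
    return sorted(best), d_best

# Toy data: positions on a line, dissimilarity = absolute difference
coords = [0, 1, 2, 10, 11, 50]
dissim = [[abs(x - y) for y in coords] for x in coords]
print(monte_carlo_select(dissim, size=2))
```

A larger kT makes the walk accept more diversity-lowering moves, which helps it leave local optima at the cost of slower convergence.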

73
Scaffolds and Frameworks
74
Frameworks
Bemis, G.W.; Murcko, M.A. J. Med. Chem. 1996, 39,
2887-2893
75
Frameworks
Dissection of a molecule according to Bemis and
Murcko. Diazepam contains three sidechains and
one framework with two ring systems and a
zero-atom linker.
G. Schneider, P. Schneider, S. Renner, QSAR
Comb. Sci. 25, 2006, No. 12, 1162-1171
76
Graph Frameworks for Compounds in the CMC
Database (Numbers Indicate Frequency of
Occurrence)
Bemis, G.W.; Murcko, M.A. J. Med. Chem. 1996, 39,
2887-2893
77
Scaffolds and Frameworks
  • The Bemis and Murcko algorithm for framework
    generation:
      • hydrogens are removed,
      • atoms with only one bond are removed
        successively,
      • the scaffold is obtained,
      • all atom types are set to C and all bond types
        are set to single bonds, which yields the
        framework.

Bemis, G.W.; Murcko, M.A. J. Med. Chem. 1996, 39,
2887-2893
Unlike the Bemis and Murcko method, A.
Monge proposed distinguishing aromatic and
non-aromatic bonds (doctoral thesis,
Univ. Orléans, 2007)
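
The pruning steps can be sketched on a plain molecular graph without any cheminformatics toolkit. The input format (a dictionary of atom symbols and a list of bonds with orders) and the function name are assumptions of this example. Terminal atoms are removed in rounds, which gives the same result as removing them one by one.

```python
def bemis_murcko(atoms, bonds):
    """Return (scaffold, framework), each as (atom dict, bond dict).

    atoms: {index: element symbol}; bonds: list of (i, j, bond order).
    """
    # Hydrogens are removed
    keep = {i for i, element in atoms.items() if element != "H"}
    edges = {frozenset((i, j)): order
             for i, j, order in bonds if i in keep and j in keep}
    # Atoms with at most one bond are removed successively
    while True:
        degree = {i: 0 for i in keep}
        for edge in edges:
            for i in edge:
                degree[i] += 1
        terminal = {i for i, deg in degree.items() if deg <= 1}
        if not terminal:
            break
        keep -= terminal
        edges = {e: o for e, o in edges.items() if e <= keep}
    scaffold = ({i: atoms[i] for i in keep}, dict(edges))
    # All atoms become C and all bonds become single bonds
    framework = ({i: "C" for i in keep}, {e: 1 for e in edges})
    return scaffold, framework

# Pyridine ring (atoms 0-5) carrying an aldehyde group (atoms 6-8)
atoms = {0: "N", 1: "C", 2: "C", 3: "C", 4: "C", 5: "C",
         6: "C", 7: "O", 8: "H"}
bonds = [(0, 1, 2), (1, 2, 1), (2, 3, 2), (3, 4, 1), (4, 5, 2), (5, 0, 1),
         (2, 6, 1), (6, 7, 2), (6, 8, 1)]
scaffold, framework = bemis_murcko(atoms, bonds)
print(scaffold[0])   # ring atoms, nitrogen kept
print(framework[0])  # same atoms, all typed as C
```

An acyclic molecule is pruned away completely, so it has no framework under this procedure.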
78
Scaffold-Hopping: How Far Can You Jump?
G. Schneider, P. Schneider, S. Renner, QSAR
Comb. Sci. 25, 2006, No. 12, 1162-1171
79
The Scaffold Tree - Visualization of the Scaffold
Universe by Hierarchical Scaffold
Classification
A. Schuffenhauer, P. Ertl, S. Roggo, S. Wetzel,
M. A. Koch, and H. Waldmann, J. Chem. Inf. Model.,
2007, 47 (1), 47-58
80
Scaffold tree for the results of pyruvate kinase
assay. Color intensity represents the ratio of
active and inactive molecules with these
scaffolds.
A. Schuffenhauer, P. Ertl, S. Roggo, S. Wetzel,
M. A. Koch, and H. Waldmann, J. Chem. Inf. Model.,
2007, 47 (1), 47-58