Analysis of Multiple QSAR Models - A Basis for Experimental Design

Transcript and Presenter's Notes


1
Analysis of Multiple QSAR Models - A Basis for
Experimental Design
Eric Jamois, Ph.D. and Amit Kulkarni, Ph.D.
2
QSAR Problems

Find and report unsuspected trends between
biological activity and molecular descriptors
Linear or non-linear behavior
Variable interactions

and Solutions

Look at large ensembles of solutions
Consider variables in combinations
Efficient variable selection via GA
Provide multiple models
Critical analysis
Experimental design

3
Single QSAR Model
Training Set
MLR, PLS, PCR, Stepwise Linear, etc.
QSAR Model
Candidate Molecules
Prioritized Candidates
4
Simplistic Approach

Large error in biological activities
Largely underdetermined situation
Consider 2 models: CV-R2 0.81 vs 0.79

Multiple Models

Consider many possible models
All equally probable
Use quantitative and qualitative analysis
Statistics
Intuition

5
Multiple QSAR Models
Training Set
GFA, G/PLS
Model 1
Model 2
Model 3
...
QSAR Models
Prediction safe: Consensus Prediction
or
Prediction sensitive: Experimental Design
Candidate Molecules
6
GA Variable Selection
[Diagram: Eq. 1 and Eq. 2 of Generation 1, each built from terms (Term 1 to Term 4), exchange terms at a crossover point to give Eq. 1 and Eq. 2 of Generation 2]
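The crossover step on this slide can be illustrated with a small sketch. It assumes each equation is just an ordered list of terms and that one cut point is chosen in each parent; the descriptor names and the `crossover` helper are hypothetical and are not taken from GFA or Cerius2.

```python
import random

def crossover(parent_a, parent_b, rng):
    """Single-point crossover between two equations given as lists of terms."""
    # Pick a cut inside each parent so both children inherit from both parents.
    cut_a = rng.randint(1, len(parent_a) - 1)
    cut_b = rng.randint(1, len(parent_b) - 1)
    child_1 = parent_a[:cut_a] + parent_b[cut_b:]
    child_2 = parent_b[:cut_b] + parent_a[cut_a:]
    # Drop duplicated terms while keeping their order.
    return list(dict.fromkeys(child_1)), list(dict.fromkeys(child_2))

rng = random.Random(0)
eq_1 = ["logP", "dipole", "MW"]        # hypothetical descriptor names
eq_2 = ["HOMO", "area", "volume"]      # hypothetical descriptor names
child_1, child_2 = crossover(eq_1, eq_2, rng)
print(child_1, child_2)
```

A real GA run would follow this with scoring (for example by LOF) and replacement of weaker equations in the population; that part is omitted here.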
7
GA Variable Usage
8
GA Model Selection
9
Analysis of Models from GFA

Relationship between models (which are
similar to or different from each other?)
Are the best models (based on LOF) similar to
each other or different?
How to sample the ensemble of models?
How many to select?
Which ones?
Can we find an average/consensus model?

10
Dataset
D. L. Selwood et al., Structure-Activity
Relationships of Antifilarial Antimycin
Analogues: A Multivariate Pattern Recognition
Study, J. Med. Chem., 33, 136-142, 1990.
Conditions for GFA

Run GFA with
Population size 200 models
Initial equation length 3 and d = 2
Number of iterations 20,000

11
Analysis of GFA Models

Objectives
Identify each model on a graphical
representation
Similar models (e.g., giving similar predicted
values) are close to each other. Significantly
different models are distant.
Provide capabilities for analyzing models: do
the best models correspond to a single prediction
scheme (cluster of equivalent models) or several
schemes (dispersed models)?

12
Analysis of GFA Models
Usual Method
1. Select the top 1/2 (100) from the 200 models
2. Generate predicted values columns for these 100 models
3. Generate residual value columns (Pred. - Actual) for these 100 models
4. Perform PCA on residual columns
5. Inspect descriptor plot
Problem: Visualization only
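The usual method can be sketched with NumPy as follows. The data are synthetic stand-ins, the array sizes are illustrative, and PCA is done through an SVD of the centred residual matrix rather than through the Cerius2 tools used in the presentation.

```python
import numpy as np

rng = np.random.default_rng(0)
n_compounds, n_models = 31, 100            # illustrative sizes
actual = rng.normal(size=n_compounds)      # synthetic measured activities
predicted = actual[:, None] + rng.normal(scale=0.3, size=(n_compounds, n_models))

# One residual column (Pred. - Actual) per model.
residuals = predicted - actual[:, None]

# PCA on the residual columns via SVD of the column-centred matrix.
centred = residuals - residuals.mean(axis=0)
u, s, vt = np.linalg.svd(centred, full_matrices=False)
explained = s**2 / np.sum(s**2)            # fraction of variance per component

# Loadings on the first two components: one 2-D point per model,
# which is what a descriptor (variable) plot displays.
model_coords = (vt[:2] * s[:2, None]).T
print(model_coords.shape, explained[:3])
```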
13
Graphical Representation of Models
PCA Descriptor Plot
14
Analysis of GFA Models
Proposed Method
1. Select the top 1/2 (100) from the 200 models
2. Generate predicted values columns for these 100 models
3. Generate residual value columns (Pred. - Actual) for these 100 models
4. Generate a 100x100 correlation matrix between the residual columns
5. Analyze matrix by MDS and generate MDS coordinates for N = 3
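A minimal sketch of steps 4 and 5, on synthetic residuals. Two things here are assumptions rather than details from the slides: the correlation r is converted to a dissimilarity as sqrt(2(1 - r)), and classical (Torgerson) MDS is used as the MDS variant.

```python
import numpy as np

def classical_mds(dissimilarity, n_dims=3):
    """Classical (Torgerson) MDS from a symmetric dissimilarity matrix."""
    n = dissimilarity.shape[0]
    centring = np.eye(n) - np.ones((n, n)) / n
    gram = -0.5 * centring @ (dissimilarity ** 2) @ centring
    eigvals, eigvecs = np.linalg.eigh(gram)
    order = np.argsort(eigvals)[::-1][:n_dims]
    # Clip small negative eigenvalues that can arise from rounding.
    return eigvecs[:, order] * np.sqrt(np.clip(eigvals[order], 0.0, None))

rng = np.random.default_rng(1)
residuals = rng.normal(size=(31, 100))          # synthetic: compounds x models

corr = np.corrcoef(residuals, rowvar=False)     # 100 x 100 model-model correlations
# Assumed conversion from correlation to distance.
dissimilarity = np.sqrt(np.clip(2.0 * (1.0 - corr), 0.0, None))
np.fill_diagonal(dissimilarity, 0.0)

coords = classical_mds(dissimilarity, n_dims=3)  # one 3-D point per model
print(coords.shape)
```

Models with highly correlated residuals land close together in the resulting 3-D coordinates, which is the property the samples plot relies on.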
15
Graphical Representation of Models
PCA Descriptor Plot
MDS Samples Plot
16
Analysis of GFA Models
Proposed Method: MDS on Correlation Matrix
17
Graphical Analysis of Models

Models can be colored by LOF (a) or Adj. R2 (b)
Other possible coloring schemes
F-test values
Nvars
R
LSE

[Figure panels: (a) models colored by LOF, (b) models colored by Adj. R2]
18
Clustering/Selection of Models
Clustering of models
Selection of 10 diverse models
19
Selection of Basis Models
D. Rogers, Evolutionary Statistics: Using a
Genetic Algorithm and Model Reduction to Isolate
Alternate Statistical Hypotheses of Experimental
Data, Proceedings 7th International Conference
on Genetic Algorithms, East Lansing, MI (1997)

Procedure
Perform PCA on residual columns (last component
should retain at least 1/N of original variance)
For each retained component, select the model whose
residuals correlate most strongly with that component
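The procedure can be sketched as below, assuming N is the number of compounds (rows) and that "most correlates" means the largest absolute Pearson correlation between a model's residual column and the component scores. The data are synthetic and `select_basis_models` is a hypothetical helper, not a Cerius2 function.

```python
import numpy as np

def select_basis_models(residuals):
    """Return (component, variance fraction, model index, |correlation|) tuples."""
    n_compounds, n_models = residuals.shape
    centred = residuals - residuals.mean(axis=0)
    u, s, _ = np.linalg.svd(centred, full_matrices=False)
    explained = s**2 / np.sum(s**2)
    scores = u * s                                # component scores per compound
    basis = []
    for k, fraction in enumerate(explained):
        if fraction < 1.0 / n_compounds:          # the 1/N retention rule
            break
        corr = [abs(np.corrcoef(scores[:, k], residuals[:, j])[0, 1])
                for j in range(n_models)]
        best = int(np.argmax(corr))
        basis.append((k + 1, float(fraction), best, float(corr[best])))
    return basis

# Synthetic residuals driven by two latent error patterns plus noise.
rng = np.random.default_rng(2)
latent = rng.normal(size=(31, 2))
loadings = rng.normal(size=(2, 20))
residuals = latent @ loadings + 0.1 * rng.normal(size=(31, 20))

basis = select_basis_models(residuals)
for component, fraction, model, corr in basis:
    print(f"PC{component}: variance {fraction:.3f}, model {model}, |r| {corr:.3f}")
```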

20
Selection of Basis Models
Component  Variance (%)  Base Model  Correlation
PC1        79.05         M36         0.972
PC2        7.12          M98         0.772
PC3        3.5           M72         0.488
Alternate Selection
Component  Variance (%)  Base Model  Correlation
PC1        79.05         M23         0.970
PC2        7.12          M58         0.720
PC3        3.5           M72         0.488
21
Representation of Basis Models
Initial Selection: M36, M98, M72
Alternate Selection: M23, M58, M72
22
Identification of Consensus Model

Objective
Select a central model with minimal
contradictions vs other possible models

Consensus Model: M86
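The slides do not state how the central model was computed. One possible reading, shown here purely as an assumption, is to pick the model whose residuals have the highest average correlation with those of all other models. Model names and residual values below are made up for illustration.

```python
def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

def consensus_model(residuals_by_model):
    """Name of the model whose residuals agree best, on average, with the others."""
    def mean_agreement(name):
        others = [m for m in residuals_by_model if m != name]
        total = sum(pearson(residuals_by_model[name], residuals_by_model[m])
                    for m in others)
        return total / len(others)
    return max(residuals_by_model, key=mean_agreement)

# Hypothetical residual columns for three models over five compounds.
residuals_by_model = {
    "M1": [0.2, -0.1, 0.4, -0.3, 0.0],
    "M2": [0.3, -0.2, 0.3, -0.2, 0.1],
    "M3": [-0.4, 0.3, 0.1, -0.1, 0.2],
}
central = consensus_model(residuals_by_model)
print(central)
```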
23
Use in Experimental Design

Objective
First: identify orthogonal models (basis models)
corresponding to radically different prediction
schemes
Then: identify compounds which are prediction
safe (predicted active by selected models)
or
identify compounds which are prediction
sensitive (predicted differently across selected
models)
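A small sketch of the safe versus sensitive split. It assumes each candidate has one predicted activity per basis model and that "active" means at or above a chosen cutoff. The cutoff, compound names and values are hypothetical.

```python
def classify_candidates(predictions, active_cutoff):
    """Split candidates by whether the selected basis models agree on activity."""
    safe, sensitive = [], []
    for compound, values in predictions.items():
        calls = [v >= active_cutoff for v in values]
        if all(calls):
            safe.append(compound)        # predicted active by every model
        elif any(calls):
            sensitive.append(compound)   # models disagree on this compound
    return safe, sensitive

# Hypothetical predicted activities from three basis models.
predictions = {
    "cand_A": [7.2, 6.9, 7.5],
    "cand_B": [7.1, 4.8, 6.2],
    "cand_C": [4.0, 4.3, 3.9],
}
safe, sensitive = classify_candidates(predictions, active_cutoff=6.0)
print(safe, sensitive)
```

Prediction-safe compounds are natural picks for consensus prediction, while prediction-sensitive ones are the informative ones to synthesize and test, since the measured result discriminates between competing models.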

24
Application of Multiple QSAR Analysis Methodology
on a set of Dopamine beta-Hydroxylase Inhibitors
25
Training Set and Test Set Selection

Chemically and biologically diverse
Divided into three classes based on biological
activity
Selected a set of diverse compounds from each
group using a distance-based MaxMin diversity
metric
37 molecules in training set and 10 molecules in
test set (1/37 = 2.7%)
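Distance-based MaxMin selection can be sketched as a greedy loop: start from one compound, then repeatedly add the compound whose nearest already-selected neighbour is farthest away. The 2-D coordinates, the Euclidean distance and the choice of starting compound are illustrative assumptions; the slides do not say which descriptors or distance were used.

```python
import math

def maxmin_select(points, n_select, start=0):
    """Greedy MaxMin: repeatedly add the point farthest from those already chosen."""
    chosen = [start]
    # Distance from every point to its nearest chosen point.
    nearest = [math.dist(p, points[start]) for p in points]
    while len(chosen) < n_select:
        candidate = max(range(len(points)), key=lambda i: nearest[i])
        chosen.append(candidate)
        nearest = [min(d, math.dist(p, points[candidate]))
                   for d, p in zip(nearest, points)]
    return chosen

# Hypothetical 2-D descriptor coordinates for six compounds.
descriptors = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0), (0.0, 5.0), (2.5, 2.5)]
picked = maxmin_select(descriptors, n_select=3)
print(picked)
```

Applying this separately within each activity class, as the slide describes, keeps the training set diverse in both structure and activity.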

26
Selection of Basis Models
4 components needed so that the last component
explains at least 2.7% of variance
Component  Variance (%)  Base Model  Correlation
PC1        70.3          M5          0.92
PC2        14.5          M12         0.80
PC3        7.3           M49         0.74
PC4        3.6           M52         0.45
27
Prediction of Test Set Using Basis Models
28
Experimental Design in DBH Set
29
Conclusions

We can provide new capabilities for analyzing
multiple models from GFA using standard tools in
Cerius2.
Differentiate between original and redundant
models
Identify clustered or dispersed nature of best
models
Provide reasonable sampling of models
Critical analysis and causality remain paramount.
