Title: Analysis of Multiple QSAR Models - A Basis for Experimental Design
Slide 1: Analysis of Multiple QSAR Models - A Basis for Experimental Design
Eric Jamois, Ph.D. and Amit Kulkarni, Ph.D.
Slide 2: QSAR Problems
- Find and report unsuspected trends between biological activity and molecular descriptors
- Linear or non-linear behavior
- Variable interactions
and Solutions
- Look at large ensembles of solutions
- Consider variables in combinations
- Efficient variable selection via GA
- Provide multiple models
- Critical analysis
- Experimental design
Slide 3: Single QSAR Model
Training Set
MLR, PLS, PCR, Stepwise Linear, etc.
QSAR Model
Candidate Molecules
Prioritized Candidates
Slide 4: Simplistic Approach
- Large error in biological activities
- Largely underdetermined situation
- Consider 2 models: CV-R2 0.81 vs 0.79
Multiple Models
- Consider many possible models
- All equally probable
- Use quantitative and qualitative analysis
- Statistics
- Intuition
Slide 5: Multiple QSAR Models
Training Set
GFA, G/PLS
Model 1
Model 2
Model 3
...
QSAR Models
- Prediction safe: Consensus Prediction
or
- Prediction sensitive: Experimental Design
Candidate Molecules
Slide 6: GA Variable Selection
Generation 1
Term 1
Term 2
Term 3
Term 4
Eq. 1
Eq. 2
Crossover point
Generation 2
Eq. 1
Eq. 2
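The crossover step in the diagram above can be illustrated with a minimal Python sketch. It assumes each equation is simply a list of term names and that one cut point is chosen inside each parent. The function name `crossover`, the placeholder term names, and the duplicate-removal step are illustrative assumptions, not the GFA implementation in Cerius2.

```python
import random

def crossover(parent_a, parent_b, rng):
    """Single-point crossover between two equations given as lists of terms.

    Assumption: each parent has at least two terms, so a cut point can be
    placed strictly inside it.
    """
    cut_a = rng.randint(1, len(parent_a) - 1)
    cut_b = rng.randint(1, len(parent_b) - 1)
    # Each child takes the head of one parent and the tail of the other.
    child_1 = parent_a[:cut_a] + parent_b[cut_b:]
    child_2 = parent_b[:cut_b] + parent_a[cut_a:]
    # Remove repeated terms while keeping their order
    # (dict keys preserve insertion order).
    return list(dict.fromkeys(child_1)), list(dict.fromkeys(child_2))

# Hypothetical generation-1 equations; the term names are placeholders.
eq_1 = ["Term 1", "Term 2", "Term 3"]
eq_2 = ["Term 4", "Term 5", "Term 6"]
child_1, child_2 = crossover(eq_1, eq_2, random.Random(0))
print(child_1, child_2)
```

In a full GA the children would then be scored (for example by LOF) and compete for a place in the next generation; that part is omitted here.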
Slide 7: GA Variable Usage
Slide 8: GA Model Selection
Slide 9: Analysis of Models from GFA
- Relationship between models (which are similar to or different from each other?)
- Are the best models (based on LOF) similar to each other or different?
- How to sample the ensemble of models?
- How many to select?
- Which ones?
- Can we find an average/consensus model?
Slide 10: Dataset
D. L. Selwood et al., Structure-Activity Relationships of Antifilarial Antimycin Analogues: A Multivariate Pattern Recognition Study, J. Med. Chem., 33, 136-142, 1990.
Conditions for GFA
- Run GFA with
- Population size = 200 models
- Initial equation length = 3 and d = 2
- Number of iterations = 20,000
Slide 11: Analysis of GFA Models
- Objectives
- Identify each model on a graphical representation
- Similar models (e.g., giving similar predicted values) are close to each other. Significantly different models are distant.
- Provide capabilities for analyzing models: do the best models correspond to a single prediction scheme (cluster of equivalent models) or several schemes (dispersed models)?
Slide 12: Analysis of GFA Models
Usual Method
1. Select the top 1/2 (100) from the 200 models
2. Generate predicted value columns for these 100 models
3. Generate residual value columns (Pred. - Actual) for these 100 models
4. Perform PCA on residual columns
5. Inspect descriptor plot
Problem: Visualization only
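The PCA step of the usual method can be sketched as follows. This is a minimal illustration with NumPy, assuming the residuals are held in a table with compounds as rows and models as columns and that PCA is run on the centered, unscaled columns. The function name `pca_model_loadings` and the synthetic data are assumptions, not the Cerius2 procedure.

```python
import numpy as np

def pca_model_loadings(residuals, n_components=2):
    """PCA of a residual table with compounds as rows and models as columns.

    Returns the loading of every model on the leading components and the
    fraction of variance each of those components explains.
    """
    centered = residuals - residuals.mean(axis=0)
    # Rows of vt are the principal axes expressed in model (column) space.
    _, singular, vt = np.linalg.svd(centered, full_matrices=False)
    explained = singular**2 / np.sum(singular**2)
    return vt[:n_components].T, explained[:n_components]

# Synthetic stand-in for real residuals: 30 compounds, 6 models.
rng = np.random.default_rng(0)
residuals = rng.normal(size=(30, 6))
loadings, explained = pca_model_loadings(residuals)
print(loadings.shape)  # (6, 2): one point per model for a descriptor plot
```

Plotting the two loading columns against each other gives one point per model, which is the kind of picture the descriptor plot provides.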
Slide 13: Graphical Representation of Models
PCA Descriptor Plot
Slide 14: Analysis of GFA Models
Proposed Method
1. Select the top 1/2 (100) from the 200 models
2. Generate predicted value columns for these 100 models
3. Generate residual value columns (Pred. - Actual) for these 100 models
4. Generate a 100x100 correlation matrix between the residual columns
5. Analyze matrix by MDS and generate MDS coordinates for N = 3
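Steps 4 and 5 of the proposed method can be sketched with NumPy as below. Two assumptions are made here that the slides do not state: correlations r are converted to distances with sqrt(2(1 - r)), and classical (Torgerson) MDS is used as the MDS variant. The function name and the synthetic residuals are illustrative only.

```python
import numpy as np

def classical_mds(distance, n_dims=3):
    """Classical (Torgerson) MDS: embed points from a distance matrix."""
    n = distance.shape[0]
    centering = np.eye(n) - np.ones((n, n)) / n
    # Double-center the squared distances to obtain a Gram matrix.
    gram = -0.5 * centering @ (distance**2) @ centering
    eigvals, eigvecs = np.linalg.eigh(gram)
    order = np.argsort(eigvals)[::-1][:n_dims]
    # Clip tiny negative eigenvalues caused by rounding.
    top_vals = np.clip(eigvals[order], 0.0, None)
    return eigvecs[:, order] * np.sqrt(top_vals)

# Synthetic residuals: 30 compounds (rows), 8 models (columns).
rng = np.random.default_rng(1)
residuals = rng.normal(size=(30, 8))
residuals[:, 1] = residuals[:, 0]  # model 1 duplicates model 0 (redundant)

# Step 4: model-by-model correlation matrix of the residual columns.
corr = np.corrcoef(residuals, rowvar=False)
# Assumed conversion from correlation to distance.
distance = np.sqrt(np.clip(2.0 * (1.0 - corr), 0.0, None))
# Step 5: MDS coordinates in N = 3 dimensions, one row per model.
coords = classical_mds(distance, n_dims=3)
print(coords.shape)  # (8, 3)
```

In this toy example the redundant pair (models 0 and 1) lands on the same point, which is the behavior wanted from the plot: similar models close together, different models far apart.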
Slide 15: Graphical Representation of Models
PCA Descriptor Plot
MDS Samples Plot
Slide 16: Analysis of GFA Models
Proposed Method: MDS on Correlation Matrix
Slide 17: Graphical Analysis of Models
- Models can be colored by LOF (a) or Adj. R2 (b)
- Other possible coloring schemes
- F-test values
- Nvars
- R
- LSE
[Figure panels (a) and (b)]
Slide 18: Clustering/Selection of Models
Clustering of models
Selection of 10 diverse models
Slide 19: Selection of Basis Models
D. Rogers, Evolutionary Statistics: Using a Genetic Algorithm and Model Reduction to Isolate Alternate Statistical Hypotheses of Experimental Data, Proceedings of the 7th International Conference on Genetic Algorithms, East Lansing, MI (1997)
- Procedure
- Perform PCA on residual columns (the last retained component should retain at least 1/N of the original variance)
- For each retained component, select the model whose residual most correlates with that component
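The two-step procedure above can be sketched as follows. This is a minimal illustration, assuming N is the number of training-set compounds, that PCA is run on centered, unscaled residual columns, and that "most correlates" means largest absolute correlation (the sign of a principal component is arbitrary). The function name `select_basis_models` and the synthetic data are assumptions.

```python
import numpy as np

def select_basis_models(residuals):
    """Pick one basis model per retained principal component.

    residuals: array of shape (n_compounds, n_models).
    Returns the chosen model indices and the variance fractions of the
    retained components.
    """
    n_compounds, n_models = residuals.shape
    centered = residuals - residuals.mean(axis=0)
    u, singular, _ = np.linalg.svd(centered, full_matrices=False)
    explained = singular**2 / np.sum(singular**2)
    # Components are sorted by variance, so those above 1/N form a prefix.
    n_keep = int(np.sum(explained >= 1.0 / n_compounds))
    scores = u[:, :n_keep] * singular[:n_keep]  # component scores per compound
    basis = []
    for k in range(n_keep):
        corr = [abs(np.corrcoef(scores[:, k], residuals[:, j])[0, 1])
                for j in range(n_models)]
        basis.append(int(np.argmax(corr)))
    return basis, explained[:n_keep]

# Synthetic residuals: a shared error pattern plus model-specific noise.
rng = np.random.default_rng(2)
n_compounds, n_models = 30, 10
shared = rng.normal(size=(n_compounds, 1))
residuals = shared + 0.5 * rng.normal(size=(n_compounds, n_models))
basis, explained = select_basis_models(residuals)
print(basis, np.round(100 * explained, 1))
```

Nothing in this sketch prevents the same model from being chosen for two components; a real workflow might want to exclude models that have already been selected.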
Slide 20: Selection of Basis Models
Component   Variance (%)   Base Model   Correlation
PC1         79.05          M36          0.972
PC2         7.12           M98          0.772
PC3         3.5            M72          0.488
Alternate Selection
Component   Variance (%)   Base Model   Correlation
PC1         79.05          M23          0.970
PC2         7.12           M58          0.720
PC3         3.5            M72          0.488
Slide 21: Representation of Basis Models
Initial Selection: M36, M98, M72
Alternate Selection: M23, M58, M72
Slide 22: Identification of Consensus Model
- Objective
- Select a central model with minimal contradictions vs other possible models
Consensus Model: M86
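One possible way to make "central model" concrete is a medoid: the model whose residuals are, on average, most correlated with those of every other model. The slides do not say how M86 was identified, so the criterion below (smallest summed distance 1 - r) and the function name `consensus_model` are assumptions for illustration.

```python
import numpy as np

def consensus_model(residuals):
    """Return the index of the medoid model under the distance 1 - r.

    residuals: array of shape (n_compounds, n_models).
    """
    corr = np.corrcoef(residuals, rowvar=False)
    distance = 1.0 - corr
    # The medoid minimizes the total distance to all other models.
    return int(np.argmin(distance.sum(axis=1)))

# Toy example: two uncorrelated error patterns and a model that mixes both.
x = np.array([1.0, -1.0, 1.0, -1.0])
y = np.array([1.0, 1.0, -1.0, -1.0])
residuals = np.column_stack([x, y, x + y])
print(consensus_model(residuals))  # 2: the mixed model sits between the others
```

The same idea could be applied to distances in the MDS plot instead of correlations; either way the result is a single model that contradicts the others as little as possible.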
Slide 23: Use in Experimental Design
- Objective
- First: Identify orthogonal models (basis models) corresponding to radically different prediction schemes
- Then: Identify compounds which are prediction safe (predicted active by selected models)
or
- Identify compounds which are prediction sensitive (predicted differently across selected models)
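The safe/sensitive distinction above can be expressed as a small sketch. It assumes predictions from the basis models are stored with candidates as rows and models as columns, that higher values mean more active, and that two user-chosen cutoffs define "active" and "predicted differently". The function name, both cutoffs, and the example numbers are hypothetical.

```python
import numpy as np

def classify_candidates(predictions, active_cutoff, spread_cutoff):
    """Flag candidates as prediction safe and/or prediction sensitive.

    predictions: array of shape (n_candidates, n_basis_models).
    Safe: every basis model predicts activity at or above active_cutoff.
    Sensitive: max minus min prediction is at or above spread_cutoff.
    """
    safe = np.all(predictions >= active_cutoff, axis=1)
    spread = predictions.max(axis=1) - predictions.min(axis=1)
    sensitive = spread >= spread_cutoff
    return safe, sensitive

# Hypothetical predicted activities for 3 candidates from 3 basis models.
predictions = np.array([
    [7.1, 7.3, 6.9],   # all models agree: active
    [7.5, 4.2, 6.0],   # models disagree strongly
    [4.0, 4.1, 3.9],   # all models agree: inactive
])
safe, sensitive = classify_candidates(predictions, active_cutoff=6.5,
                                      spread_cutoff=1.0)
print(safe.tolist())       # [True, False, False]
print(sensitive.tolist())  # [False, True, False]
```

Safe candidates are natural picks for consensus prediction, while sensitive candidates are the ones whose measurement would best discriminate between the competing models.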
Slide 24: Application of Multiple QSAR Analysis Methodology on a Set of Dopamine beta-Hydroxylase (DBH) Inhibitors
Slide 25: Training Set and Test Set Selection
- Chemically and biologically diverse
- Divided into three classes based on biological activity
- Selected a set of diverse compounds from each group using the distance-based MaxMin diversity metric
- 37 molecules in training set and 10 molecules in test set (1/37 = 2.7%)
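A distance-based MaxMin selection, as mentioned above, can be sketched as follows. The sketch assumes Euclidean distance between descriptor vectors and an arbitrary starting compound; the function name `maxmin_select` and the one-dimensional toy data are illustrative, and the actual descriptors and distance used for the DBH set are not given in the slides.

```python
import numpy as np

def maxmin_select(points, n_select, first=0):
    """Greedy MaxMin diversity selection.

    points: array of shape (n_compounds, n_descriptors).
    At each step, pick the compound whose distance to its nearest
    already-selected compound is largest.
    """
    diff = points[:, None, :] - points[None, :, :]
    distance = np.linalg.norm(diff, axis=-1)
    selected = [first]
    min_dist = distance[first].copy()  # distance to nearest selected compound
    while len(selected) < n_select:
        nxt = int(np.argmax(min_dist))
        selected.append(nxt)
        min_dist = np.minimum(min_dist, distance[nxt])
    return selected

# Toy descriptor space: five compounds on a line.
points = np.array([[0.0], [1.0], [2.0], [10.0], [11.0]])
print(maxmin_select(points, n_select=3))  # [0, 4, 2]
```

To follow the slide, the function would be called once per activity class so that each class contributes its own diverse subset to the training set.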
Slide 26: Selection of Basis Models
4 components needed so that the last component explains at least 2.7% of variance
Component   Variance (%)   Base Model   Correlation
PC1         70.3           M5           0.92
PC2         14.5           M12          0.80
PC3         7.3            M49          0.74
PC4         3.6            M52          0.45
Slide 27: Prediction of Test Set Using Basis Models
Slide 28: Experimental Design in DBH Set
Slide 29: Conclusions
- We can provide new capabilities for analyzing multiple models from GFA using standard tools in Cerius2.
- Differentiate between original and redundant models
- Identify clustered or dispersed nature of best models
- Provide reasonable sampling of models
- Critical analysis and causality remain paramount.