Analysis of Multiple QSAR Models - A Basis for Experimental Design

Transcript and Presenter's Notes

1
Analysis of Multiple QSAR Models - A Basis for Experimental Design
Eric Jamois, Ph.D. and Amit Kulkarni, Ph.D.
2
QSAR Problems
  • Find and report unsuspected trends between
    biological activity and molecular descriptors
  • Linear or non-linear behavior
  • Variable interactions

and Solutions
  • Look at large ensembles of solutions
  • Consider variables in combinations
  • Efficient variable selection via GA
  • Provide multiple models
  • Critical analysis
  • Experimental design

3
Single QSAR Model
Training Set
MLR, PLS, PCR, Stepwise Linear, etc.
QSAR Model
Candidate Molecules
Prioritized Candidates
4
Simplistic Approach
  • Large error in biological activities
  • Largely underdetermined situation
  • Consider 2 models: CV-R2 0.81 vs 0.79

Multiple Models
  • Consider many possible models
  • All equally probable
  • Use quantitative and qualitative analysis
  • Statistics
  • Intuition
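For reference, a cross-validated R2 such as the 0.81 and 0.79 compared above is usually defined as below. The slides do not say which cross-validation scheme was used; leave-one-out is assumed here, with M compounds, observed activities y_i, their mean, and the prediction for compound i from a model fitted without that compound.

```latex
R^2_{\mathrm{CV}} = 1 - \frac{\mathrm{PRESS}}{\sum_{i=1}^{M} (y_i - \bar{y})^2},
\qquad
\mathrm{PRESS} = \sum_{i=1}^{M} \left( y_i - \hat{y}_{(i)} \right)^2
```

With large errors in the measured activities, a gap of 0.02 between two such values is a weak reason to prefer one model over the other.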

5
Multiple QSAR Models
Training Set
GFA, G/PLS
Model 1
Model 2
Model 3
...
QSAR Models
Prediction safe: Consensus Prediction
or
Prediction sensitive: Experimental Design
Candidate Molecules
6
GA Variable Selection
[Figure: Generation 1 shows Eq. 1 and Eq. 2, built from Term 1 to Term 4, with a crossover point; Generation 2 shows the resulting Eq. 1 and Eq. 2.]
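The crossover step pictured on this slide can be sketched as follows. This is a minimal illustration, not the GFA implementation: it assumes single-point crossover at the same position in both parents, and the descriptor names are hypothetical. A full GA would also apply mutation and score each equation before selection.

```python
import random

def crossover(parent_a, parent_b, rng):
    """Single-point crossover between two equations given as lists of terms."""
    # Pick a cut strictly inside the shorter parent (needs at least 2 terms).
    point = rng.randint(1, min(len(parent_a), len(parent_b)) - 1)
    child_a = parent_a[:point] + parent_b[point:]
    child_b = parent_b[:point] + parent_a[point:]
    return point, child_a, child_b

rng = random.Random(0)
# Hypothetical descriptor terms, four per equation.
eq1 = ["LogP", "MW", "Dipole", "HOMO"]
eq2 = ["Area", "Volume", "LUMO", "Charge"]

point, new_eq1, new_eq2 = crossover(eq1, eq2, rng)
print(point, new_eq1, new_eq2)
```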
7
GA Variable Usage
8
GA Model Selection
9
Analysis of Models from GFA
  • Relationship between models (which are
    similar/different to each other?)
  • Are the best models (based on LOF) similar to
    each other or different?
  • How to sample the ensemble of models?
  • How many to select?
  • Which ones?
  • Can we find an average/consensus model?

10
Dataset
D. L. Selwood et al., Structure-Activity
Relationships of Antifilarial Antimycin
Analogues: A Multivariate Pattern Recognition
Study, J. Med. Chem., 33, 136-142, 1990.
Conditions for GFA
  • Run GFA with
  • Population size: 200 models
  • Initial equation length: 3, and d = 2
  • Number of iterations: 20,000
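For context, the LOF score used to rank GFA models is usually quoted as Friedman's lack-of-fit measure, shown below. The formula is not given on the slides. If the d in the conditions above is the smoothing parameter, it enters here: larger d penalizes equations with more terms. LSE is the least-squares error, c the number of basis functions, p the total number of features in all basis functions, and M the number of training samples.

```latex
\mathrm{LOF} = \frac{\mathrm{LSE}}{\left( 1 - \dfrac{c + d\,p}{M} \right)^{2}}
```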

11
Analysis of GFA Models
  • Objectives
  • Identify each model on a graphical
    representation
  • Similar models (e.g., giving similar predicted
    values) are close to each other. Significantly
    different models are distant.
  • Provide capabilities for analyzing models: do
    the best models correspond to a single prediction
    scheme (cluster of equivalent models) or several
    schemes (dispersed models)?

12
Analysis of GFA Models
Usual Method
  1. Select the top 1/2 (100) from the 200 models
  2. Generate predicted value columns for these 100 models
  3. Generate residual value columns (Pred. - Actual) for these 100 models
  4. Perform PCA on residual columns
  5. Inspect descriptor plot
Problem: Visualization only
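Steps 2 to 5 can be sketched in plain Python as below. This is an illustration under assumptions, not the Cerius2 procedure: the activities and predictions are made-up toy values for three models, PCA is done on the covariance matrix of the residual columns, and the eigenvectors are found by simple power iteration. The per-model loadings stand in for the points of the descriptor plot.

```python
import math

def residual_columns(predicted, actual):
    """One residual column (Pred. - Actual) per model."""
    return [[p - a for p, a in zip(column, actual)] for column in predicted]

def covariance_matrix(columns):
    """Covariance between columns (models); observations are compounds."""
    n = len(columns[0])
    centered = []
    for column in columns:
        mean = sum(column) / n
        centered.append([v - mean for v in column])
    return [[sum(x * y for x, y in zip(ci, cj)) / (n - 1) for cj in centered]
            for ci in centered]

def top_eigenpairs(matrix, k, iters=1000):
    """Largest-k eigenpairs of a symmetric PSD matrix (power iteration + deflation)."""
    size = len(matrix)
    work = [row[:] for row in matrix]
    pairs = []
    for _ in range(k):
        vec = [float(i + 1) for i in range(size)]
        norm = math.sqrt(sum(x * x for x in vec))
        vec = [x / norm for x in vec]
        for _step in range(iters):
            nxt = [sum(w * v for w, v in zip(row, vec)) for row in work]
            norm = math.sqrt(sum(x * x for x in nxt))
            if norm < 1e-12:
                break
            vec = [x / norm for x in nxt]
        image = [sum(w * v for w, v in zip(row, vec)) for row in work]
        value = sum(v * m for v, m in zip(vec, image))
        pairs.append((value, vec))
        # Remove the component just found so the next pass finds the next one.
        work = [[work[i][j] - value * vec[i] * vec[j] for j in range(size)]
                for i in range(size)]
    return pairs

# Toy data (hypothetical): 5 compounds, 3 models; models 0 and 1 err similarly.
actual = [5.0, 6.0, 7.0, 8.0, 9.0]
predicted = [
    [5.2, 5.9, 7.3, 7.8, 9.1],
    [5.3, 5.8, 7.4, 7.7, 9.2],
    [4.8, 6.3, 6.9, 8.2, 8.7],
]
residuals = residual_columns(predicted, actual)
cov = covariance_matrix(residuals)
pairs = top_eigenpairs(cov, k=2)
total_variance = sum(cov[i][i] for i in range(len(cov)))
# One (PC1, PC2) loading pair per model: the coordinates to plot.
loadings = [[vec[m] for _, vec in pairs] for m in range(len(predicted))]
print(loadings)
```

Models whose residuals move together get similar loadings, so they land close to each other on the plot.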
13
Graphical Representation of Models
PCA Descriptor Plot
14
Analysis of GFA Models
Proposed Method
  1. Select the top 1/2 (100) from the 200 models
  2. Generate predicted value columns for these 100 models
  3. Generate residual value columns (Pred. - Actual) for these 100 models
  4. Generate a 100x100 correlation matrix between the residual columns
  5. Analyze matrix by MDS and generate MDS coordinates for N = 3
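Steps 4 and 5 can be sketched as below. The slides do not say how the correlation matrix is turned into distances or which MDS variant is used; this sketch assumes classical (Torgerson) MDS with squared distance 2(1 - r), which is the squared Euclidean distance between standardized residual vectors. The residual values are made-up toy data for three models, and the demo uses two dimensions for brevity (`dims` can be raised).

```python
import math

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def correlation_matrix(columns):
    return [[pearson(ci, cj) for cj in columns] for ci in columns]

def top_eigenpairs(matrix, k, iters=1000):
    """Largest-k eigenpairs of a symmetric PSD matrix (power iteration + deflation)."""
    size = len(matrix)
    work = [row[:] for row in matrix]
    pairs = []
    for _ in range(k):
        vec = [float(i + 1) for i in range(size)]
        norm = math.sqrt(sum(x * x for x in vec))
        vec = [x / norm for x in vec]
        for _step in range(iters):
            nxt = [sum(w * v for w, v in zip(row, vec)) for row in work]
            norm = math.sqrt(sum(x * x for x in nxt))
            if norm < 1e-12:
                break
            vec = [x / norm for x in nxt]
        image = [sum(w * v for w, v in zip(row, vec)) for row in work]
        value = sum(v * m for v, m in zip(vec, image))
        pairs.append((value, vec))
        work = [[work[i][j] - value * vec[i] * vec[j] for j in range(size)]
                for i in range(size)]
    return pairs

def classical_mds(corr, dims):
    """Classical MDS on a correlation matrix; assumes d^2 = 2 * (1 - r)."""
    n = len(corr)
    sq = [[max(0.0, 2.0 * (1.0 - r)) for r in row] for row in corr]
    row_mean = [sum(row) / n for row in sq]
    grand = sum(row_mean) / n
    # Double centering of the squared distances.
    b = [[-0.5 * (sq[i][j] - row_mean[i] - row_mean[j] + grand)
          for j in range(n)] for i in range(n)]
    coords = [[] for _ in range(n)]
    for value, vec in top_eigenpairs(b, dims):
        scale = math.sqrt(max(value, 0.0))
        for i in range(n):
            coords[i].append(scale * vec[i])
    return coords

# Toy residual columns (hypothetical): models 0 and 1 err alike, model 2 differs.
residuals = [
    [0.2, -0.1, 0.3, -0.2, 0.1],
    [0.3, -0.2, 0.4, -0.3, 0.2],
    [-0.2, 0.3, -0.1, 0.2, -0.3],
]
coords = classical_mds(correlation_matrix(residuals), dims=2)
print(coords)
```

Unlike the PCA plot, the MDS coordinates are ordinary numeric columns, so they can be reused for clustering or diverse selection of models.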
15
Graphical Representation of Models
PCA Descriptor Plot
MDS Samples Plot
16
Analysis of GFA Models
Proposed Method: MDS on Correlation Matrix
17
Graphical Analysis of Models
  • Models can be colored by LOF (a) or Adj. R2 (b)
  • Other possible coloring schemes
  • F-test values
  • Nvars
  • R
  • LSE

[Figure: panels (a) and (b)]
18
Clustering/Selection of Models
Clustering of models
Selection of 10 diverse models
19
Selection of Basis Models
D. Rogers, Evolutionary Statistics: Using a
Genetic Algorithm and Model Reduction to Isolate
Alternate Statistical Hypotheses of Experimental
Data, Proceedings of the 7th International Conference
on Genetic Algorithms, East Lansing, MI (1997)
  • Procedure
  • Perform PCA on residual columns (last component
    should retain at least 1/N of original variance)
  • For each retained component, select the model whose
    residual correlates most with that component
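The two rules of this procedure can be sketched as below, taking the PCA output as given. Assumptions: N is read as the number of training compounds, "most correlates" is taken as the largest absolute Pearson correlation (the sign of a component is arbitrary), and all model names, residuals and component scores are made-up toy values. The variance fractions in the usage lines are the ones reported later in the slides for the DBH set, with a hypothetical fifth value appended.

```python
import math

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def retained_components(variance_fractions, n_samples):
    """Count leading components that each explain at least 1/N of the variance."""
    threshold = 1.0 / n_samples
    kept = 0
    for fraction in variance_fractions:
        if fraction < threshold:
            break
        kept += 1
    return kept

def select_basis_models(residuals, component_scores):
    """For each component, pick the model whose residual column correlates
    most strongly (absolute value) with the component scores."""
    chosen = []
    for scores in component_scores:
        best = max(residuals, key=lambda name: abs(pearson(residuals[name], scores)))
        chosen.append((best, pearson(residuals[best], scores)))
    return chosen

# Variance fractions: first four from the DBH slide, fifth is hypothetical.
n_kept = retained_components([0.703, 0.145, 0.073, 0.036, 0.020], n_samples=37)

# Toy residual columns and component scores (all hypothetical).
residuals = {
    "M1": [0.2, -0.1, 0.3, -0.2, 0.1],
    "M2": [0.3, -0.2, 0.4, -0.3, 0.2],
    "M3": [-0.2, 0.3, -0.1, 0.2, -0.3],
}
component_scores = [
    [1.0, -0.5, 1.5, -1.0, 0.5],
    [-2.0, 3.0, -1.0, 2.0, -3.0],
]
chosen = select_basis_models(residuals, component_scores)
print(n_kept, chosen)
```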

20
Selection of Basis Models
Component  Variance (%)  Base Model  Correlation
PC1        79.05         M36         0.972
PC2        7.12          M98         0.772
PC3        3.5           M72         0.488

Alternate Selection
Component  Variance (%)  Base Model  Correlation
PC1        79.05         M23         0.970
PC2        7.12          M58         0.720
PC3        3.5           M72         0.488
21
Representation of Basis Models
Initial Selection: M36, M98, M72
Alternate Selection: M23, M58, M72
22
Identification of Consensus Model
  • Objective
  • Select a central model with minimal
    contradictions vs other possible models

Consensus Model: M86
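The slides do not define how "minimal contradictions" is measured. One possible reading, sketched below as an assumption, is to pick the model whose residual column has the highest mean correlation with the residual columns of all other models. Model names and residual values are made up.

```python
import math

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def consensus_model(residuals):
    """Return the model with the highest mean residual correlation to the others.
    Assumption: high correlation of residuals means few contradictions."""
    names = list(residuals)

    def mean_agreement(name):
        others = [other for other in names if other != name]
        return sum(pearson(residuals[name], residuals[o]) for o in others) / len(others)

    return max(names, key=mean_agreement)

# Toy residuals (hypothetical): M2 shares error patterns with both M1 and M3,
# while M1 and M3 are uncorrelated with each other.
residuals = {
    "M1": [0.2, -0.2, 0.0, 0.0],
    "M2": [0.2, -0.2, 0.2, -0.2],
    "M3": [0.0, 0.0, 0.2, -0.2],
}
central = consensus_model(residuals)
print(central)
```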
23
Use in Experimental Design
  • Objective
  • First: identify orthogonal models (basis models)
    corresponding to radically different prediction
    schemes
  • Then: identify compounds which are prediction
    safe (predicted active by selected models)
  • or
  • Identify compounds which are prediction
    sensitive (predicted differently across selected
    models)
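The safe/sensitive split can be sketched as below. The slides give no numeric criteria, so both cutoffs are hypothetical: a compound is called prediction safe when every basis model predicts it at or above an activity cutoff, and prediction sensitive when the predictions differ by at least a chosen spread. Compound names and predicted values are made up.

```python
def classify_candidates(predictions, active_cutoff, spread_cutoff):
    """predictions[compound] holds one predicted activity per basis model."""
    safe, sensitive = [], []
    for compound, values in predictions.items():
        if min(values) >= active_cutoff:
            # Every basis model agrees the compound is active.
            safe.append(compound)
        elif max(values) - min(values) >= spread_cutoff:
            # Basis models disagree: testing it helps discriminate between models.
            sensitive.append(compound)
    return safe, sensitive

# Toy predictions from three basis models (hypothetical values).
predictions = {
    "C1": [7.1, 7.4, 7.0],
    "C2": [7.5, 4.9, 6.2],
    "C3": [4.0, 4.2, 4.1],
}
safe, sensitive = classify_candidates(predictions, active_cutoff=6.5, spread_cutoff=1.5)
print(safe, sensitive)
```

Compounds that are consistently predicted inactive, like C3 here, fall into neither group.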

24
Application of Multiple QSAR Analysis Methodology
on a set of Dopamine beta-Hydroxylase Inhibitors
25
Training Set and Test Set Selection
  • Chemically and Biologically Diverse
  • Divided into three classes based on Biological
    Activity
  • Selected a set of diverse compounds from each
    group using a distance-based MaxMin diversity
    metric
  • 37 molecules in training set and 10 molecules in
    test set (1/37 = 2.7%)
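A greedy MaxMin selection of the kind named above can be sketched as follows. The slides do not give the descriptor space, the distance function, or the starting compound, so Euclidean distance on made-up 2D points and a fixed start index are assumptions. In the slides the selection is applied within each activity class.

```python
import math

def maxmin_select(points, k, start=0):
    """Greedy MaxMin: repeatedly add the point farthest from the selected set."""
    selected = [start]
    # Distance from every point to its nearest selected point.
    min_dist = [math.dist(p, points[start]) for p in points]
    while len(selected) < k:
        nxt = max(range(len(points)), key=lambda i: min_dist[i])
        selected.append(nxt)
        min_dist = [min(d, math.dist(p, points[nxt]))
                    for d, p in zip(min_dist, points)]
    return selected

# Toy descriptor coordinates (hypothetical): two tight pairs and one outlier.
points = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.0, 5.1), (10.0, 0.0)]
picked = maxmin_select(points, k=3)
print(picked)
```

The selection picks one compound from each region before revisiting a region, which is the behavior wanted when splitting a small set into training and test compounds.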

26
Selection of Basis Models
4 components needed so that the last component
explains at least 2.7% of variance

Component  Variance (%)  Base Model  Correlation
PC1        70.3          M5          0.92
PC2        14.5          M12         0.80
PC3        7.3           M49         0.74
PC4        3.6           M52         0.45
27
Prediction of Test Set Using Basis Models
28
Experimental Design in DBH Set
29
Conclusions
  • We can provide new capabilities for analyzing
    multiple models from GFA using standard tools in
    Cerius2.
  • Differentiate between original and redundant
    models
  • Identify clustered or dispersed nature of best
    models
  • Provide reasonable sampling of models
  • Critical analysis and causality remain paramount.