Prediction of Molecular Bioactivity for Drug Design: Experiences from the KDD Cup 2001 competition

1
Prediction of Molecular Bioactivity for Drug Design:
Experiences from the KDD Cup 2001 competition
  • Sunita Sarawagi, IITB
  • http://www.it.iitb.ac.in/sunita
  • Joint work with
  • B. Anuradha, IITB
  • Anand Janakiraman, IITB
  • Jayant Haritsa, IISc

2
The dataset
  • Dataset provided by DuPont Pharmaceuticals
  • Activity of compounds binding to thrombin
  • Library of compounds included
  • 1909 known molecules (42 actively binding
    thrombin)
  • 139,351 binary features describe the 3-D
    structure of each compound
  • 636 new compounds with unknown capacity to bind
    to thrombin

3
Sample data
  • 0,1,0,0,0,0, ..., 0,0,0,0,0,0,I
  • 0,0,0,0,0,0, ..., 0,0,0,0,0,1,I
  • 0,0,0,0,0,0, ..., 0,0,0,0,0,0,I
  • 0,0,0,0,0,0, ..., 0,0,0,0,0,0,I
  • 0,1,0,0,0,1, ..., 0,1,0,0,0,1,A
  • 0,1,0,0,0,1, ..., 0,1,0,0,0,1,A
  • 0,1,0,0,0,1, ..., 0,1,0,0,1,1,?
  • 0,1,1,0,0,1, ..., 0,1,1,0,0,1,?

4
Challenges
  • Large number of binary features, significantly
    fewer training instances
  • 140,000 vs 2000!
  • Highly skewed
  • 1867 In-actives, 42 Actives.
  • Varying degrees of correlation among features
  • Differences in the training and test distributions

5
Steps
  • Familiarization with data
  • data has noise: four identical records (all 0s)
    with different labels
  • Lots more 0s than 1s
  • Number of 1s significantly higher for As than Is
  • Feature selection
  • Build classifiers
  • Combine classifiers
  • Incorporate unlabeled test instances

6
First step: feature selection
  • Most commercial classifiers cannot handle 140,000
    features even with 1 GB memory.
  • Entropy-based individual feature selection (see
    the sketch below)
  • Does not handle redundant attributes.
  • Step-wise feature selection
  • Too brittle
  • Top entropy attribute with a 1 in each active
    compound
  • Exploiting small counts of Actives

Want all important groups of redundant attributes
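A minimal sketch of entropy-based individual feature scoring, assuming a NumPy 0/1 matrix X (compounds x features) and a label vector y of "A"/"I" strings; these names are hypothetical, not the competition code:

import numpy as np

def entropy(labels):
    # Shannon entropy of a label vector
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def information_gain(col, labels):
    # Reduction in label entropy from splitting on one binary feature
    gain = entropy(labels)
    for v in (0, 1):
        mask = (col == v)
        if mask.any():
            gain -= mask.mean() * entropy(labels[mask])
    return gain

def top_k_by_entropy(X, y, k):
    # Score every column independently and keep the k highest gains.
    # Redundant (correlated) columns are all kept, which is exactly
    # the weakness noted above.
    gains = np.array([information_gain(X[:, j], y) for j in range(X.shape[1])])
    return np.argsort(gains)[::-1][:k]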
7
Building classifiers
  • Partition training data using stratified sampling
    (sketch below)
  • Two-thirds training data
  • One-third validation data
  • Classification methods attempted
  • Decision tree classifiers
  • Naïve-Bayes
  • SVMs
  • Hand-crafted clustering/nearest neighbor hybrid
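A sketch of the stratified two-thirds/one-third split using scikit-learn as an assumed stand-in (the slides do not name the tool), reusing the hypothetical X and y arrays from the earlier sketch:

from sklearn.model_selection import train_test_split

# Keep the Active/Inactive ratio the same in both parts, so the
# 42 actives are not accidentally concentrated in one of them.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=1/3, stratify=y, random_state=0)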

8
Decision Tree
  • C4.5

[Slide figure: C4.5 tree testing features f88235, f137567, f80106, f26913,
f135832 and f25144 (branching on "= 1"), with several small Active leaves and
one large Inactive leaf covering 338 compounds (6 errors)]

Confusion matrix (rows = actual, columns = predicted):
        A     I
  A     3     7
  I     1   459
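C4.5 is a standalone tool; a rough modern equivalent, assuming the hypothetical X_train/X_val split sketched earlier, is an entropy-criterion tree in scikit-learn:

from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import confusion_matrix

tree = DecisionTreeClassifier(criterion="entropy")   # C4.5-like splits
tree.fit(X_train, y_train)
print(confusion_matrix(y_val, tree.predict(X_val), labels=["A", "I"]))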
9
Naïve Bayes
  • Data characteristics very similar to text
  • lots of features, sparse data, few ones
  • Naïve Bayes found very effective for text
    classification
  • Accuracy: all actives misclassified!

Confusion matrix (rows = actual, columns = predicted):
        A     I
  A     0    10
  I     1   459
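A sketch with scikit-learn's Bernoulli Naive Bayes as an assumed stand-in for the implementation used; it fits the sparse 0/1 representation directly, and the 42-vs-1867 class skew lets the Inactive prior dominate, which is consistent with every active being misclassified:

from sklearn.naive_bayes import BernoulliNB

nb = BernoulliNB()          # one Bernoulli per binary structural feature
nb.fit(X_train, y_train)
pred = nb.predict(X_val)
print((pred[y_val == "A"] == "A").sum(), "actives recovered")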
10
Support vector machines
  • Has received lots of attention recently
  • Requires tuning: which kernel, what parameters?
  • Several freely available packages: SVMTorch
  • Accuracy slightly worse than decision trees

[Slide figure: separating boundary sketched in a two-feature (fi, fj) space]
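The deck used SVMTorch; a hedged sketch of the kernel/parameter tuning question, using scikit-learn's SVC instead and class_weight="balanced" to partly compensate for the class skew:

from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

grid = GridSearchCV(
    SVC(class_weight="balanced"),
    param_grid={"kernel": ["linear", "rbf"], "C": [0.1, 1, 10]},
    scoring="balanced_accuracy", cv=3)
grid.fit(X_train, y_train)
print(grid.best_params_, grid.score(X_val, y_val))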
11
Hand-crafted hybrid
  • Find features such that actives cluster together
    under an appropriate distance measure

12
Incremental Feature Selection
  • Pick features ONE by ONE
  • that result in maximum clustering of the actives.
  • And maximum separation from the inactives.
  • Objective function: maximum separation between
    centroids of the Actives and In-actives
  • Distance function: matching ones
  • Careful selection of training Actives.
  • Accuracy: 100, with 493 features (see the greedy
    sketch below)
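A greedy sketch of the incremental selection, with two assumptions of mine flagged: separation is measured as the L1 gap between the class centroids over the chosen features (the slide only says "matching ones"), and brute-force re-evaluation at every step is acceptable for illustration:

import numpy as np

def centroid_gap(X, y, feats):
    # Separation between Active and Inactive centroids on the chosen features
    sub = X[:, feats]
    return float(np.abs(sub[y == "A"].mean(axis=0)
                        - sub[y == "I"].mean(axis=0)).sum())

def greedy_select(X, y, n_features):
    # Add features one at a time, each time keeping the single feature
    # that most increases the centroid separation.
    chosen, remaining = [], list(range(X.shape[1]))
    for _ in range(n_features):
        best = max(remaining, key=lambda j: centroid_gap(X, y, chosen + [j]))
        chosen.append(best)
        remaining.remove(best)
    return chosen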

13
Final approach
  • Test data significantly denser
  • Methods like SVM, NB, clustering-based will not
    generalize
  • Preferred a distribution-independent method
  • Ensemble of Decision Trees
  • On disjoint attributes --- unconventional
  • Semi-supervised training
  • Introduce feedback from the test data in multiple
    rounds

14
Building tree ensembles
  • Initially picked 20000 features based on
    entropy.
  • More than one tree to take care of large feature
    space.
  • Repeat as long as accuracy on validation data does
    not drop (see the sketch below)
  • All groups of redundant features exploited.
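A minimal sketch of the disjoint-attribute loop, reusing the hypothetical X_train/X_val arrays from earlier; the stopping rule and the way already-used features are retired are my reading of the slide, not a confirmed implementation:

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def disjoint_tree_ensemble(X_train, y_train, X_val, y_val):
    used, trees, prev_acc = set(), [], 0.0
    while True:
        allowed = [j for j in range(X_train.shape[1]) if j not in used]
        if not allowed:
            break
        tree = DecisionTreeClassifier(criterion="entropy")
        tree.fit(X_train[:, allowed], y_train)
        acc = (tree.predict(X_val[:, allowed]) == y_val).mean()
        if trees and acc < prev_acc:
            break                      # validation accuracy dropped: stop
        trees.append((tree, allowed))
        # Retire the features this tree actually split on, so the next tree
        # must fall back on a different (possibly redundant) attribute group.
        new_used = {allowed[j] for j in np.flatnonzero(tree.feature_importances_)}
        if not new_used:
            break                      # degenerate tree: nothing left to learn
        used |= new_used
        prev_acc = acc
    return trees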

15
Incorporating unlabeled instances
  • Augment training data with sure test instances.
  • Re-train another ensemble of trees using the same
    method
  • Include more unlabelled instances with sure
    predictions
  • Repeat a few more times...
  • How to capture drift?

16
Capturing drift
  • Solution: validate with independent data
  • Be sure to include only correctly labeled data
  • First approach: same prediction by all trees
  • On validation data, found errors in this scheme
  • Pruning not a solution
  • Weighted prediction by each tree (sketch below)
  • Weight: fraction of Actives
  • Pick the right threshold using validation data.
  • Stop when no more unlabelled data can be added
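A sketch of one feedback round, with two assumptions flagged: each tree's weight is taken as the fraction of validation Actives it recovers (the slide only says "fraction of Actives"), and the score threshold is whatever was tuned on the validation data. The trees argument matches the (tree, feature list) pairs produced by the ensemble sketch above:

import numpy as np

def sure_predictions(trees, X_unlabeled, X_val, y_val, threshold):
    # Weight each tree by the validation Actives it gets right (assumption),
    # then score unlabeled compounds by the weighted fraction of Active votes.
    weights, votes = [], []
    for tree, feats in trees:
        w = (tree.predict(X_val[:, feats])[y_val == "A"] == "A").mean()
        weights.append(w)
        votes.append(tree.predict(X_unlabeled[:, feats]) == "A")
    weights, votes = np.array(weights), np.array(votes, dtype=float)
    score = (weights[:, None] * votes).sum(axis=0) / weights.sum()
    sure_active = np.flatnonzero(score >= threshold)
    sure_inactive = np.flatnonzero(score <= 1.0 - threshold)
    return sure_active, sure_inactive   # candidates to add to the training set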

17
Final state
  • Three rounds, each with about 6 trees
  • Unlabelled data included: 126 actives, 311
    inactives
  • Remaining: 200 in confusion
  • Use a meta-learner on validation data to pick the
    final criterion
  • Sum of scores times number of trees claiming
    Actives
  • Several other last-minute hacks.

18
Outcome
Home Team
Winning Entry: Weighted: 68.4, Accuracy: 70.03
19
Winner's method
  • Pre-processing: feature subset selection using
    mutual information (200 of 139,351 features)
  • Learning: Bayesian network models of different
    complexity (2 to 12 features)
  • Choosing a model (ROC area, model complexity)

20
Postmortem: Was all this necessary?
  • Without semi-supervised learning
  • Single decision tree: 49
  • 6-tree ensemble on training data alone
  • Majority: 57
  • Confidence weighted: 63
  • With unlabelled data: 64.3

21
Lessons learnt
  • Products
  • Need tools that scale in the number of features
  • Research problems
  • Classifiers that are not tied to distribution
    similarity with the training data
  • A more principled way of including unlabelled
    instances.