Prediction of Molecular Bioactivity for Drug Design: Experiences from the KDD Cup 2001 competition

1
Prediction of Molecular Bioactivity for Drug Design:
Experiences from the KDD Cup 2001 competition
  • Sunita Sarawagi, IITB
  • http://www.it.iitb.ac.in/sunita
  • Joint work with
  • B. Anuradha, IITB
  • Anand Janakiraman, IITB
  • Jayant Haritsa, IISc

2
The dataset
  • Dataset provided by DuPont Pharmaceuticals
  • Activity of compounds binding to thrombin
  • Library of compounds included
  • 1909 known molecules (42 actively binding
    thrombin)
  • 139,351 binary features describe the 3-D
    structure of each compound
  • 636 new compounds with unknown capacity to bind
    to thrombin

3
Sample data
  • 0,1,0,0,0,0, ..., 0,0,0,0,0,0,I
  • 0,0,0,0,0,0, ..., 0,0,0,0,0,1,I
  • 0,0,0,0,0,0, ..., 0,0,0,0,0,0,I
  • 0,0,0,0,0,0, ..., 0,0,0,0,0,0,I
  • 0,1,0,0,0,1, ..., 0,1,0,0,0,1,A
  • 0,1,0,0,0,1, ..., 0,1,0,0,0,1,A
  • 0,1,0,0,0,1, ..., 0,1,0,0,1,1,?
  • 0,1,1,0,0,1, ..., 0,1,1,0,0,1,?

4
Challenges
  • Large number of binary features, significantly
    fewer training instances
  • 140,000 vs 2000!
  • Highly skewed
  • 1867 In-actives, 42 Actives.
  • Varying degrees of correlation among features
  • Differences in the training and test distributions

5
Steps
  • Familiarization with data
  • data has noise: four identical records (all 0s)
    with different labels
  • Lots more 0s than 1s
  • Number of 1s significantly higher for As than Is
  • Feature selection
  • Build classifiers
  • Combine classifiers
  • Incorporate unlabeled test instances

6
First step: feature selection
  • Most commercial classifiers cannot handle 140,000
    features even with 1 GB memory.
  • Entropy-based individual feature selection (see
    the sketch below)
  • Does not handle redundant attributes.
  • Step-wise feature selection
  • Too brittle
  • Top entropy attribute with a 1 in each active
    compound
  • Exploiting small counts of Actives

Want all important groups of redundant attributes
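A minimal sketch of entropy-based individual feature scoring, assuming a NumPy 0/1 matrix X (compounds x features) and a label vector y of "A"/"I" strings; these names are hypothetical, not the competition code:

import numpy as np

def entropy(labels):
    # Shannon entropy of a label vector
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def information_gain(col, labels):
    # Reduction in label entropy from splitting on one binary feature
    gain = entropy(labels)
    for v in (0, 1):
        mask = (col == v)
        if mask.any():
            gain -= mask.mean() * entropy(labels[mask])
    return gain

def top_k_by_entropy(X, y, k):
    # Score every column independently and keep the k highest gains.
    # Redundant (correlated) columns are all kept, which is exactly
    # the weakness noted above.
    gains = np.array([information_gain(X[:, j], y) for j in range(X.shape[1])])
    return np.argsort(gains)[::-1][:k]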
7
Building classifiers
  • Partition training data using stratified sampling
    (sketch below)
  • Two-thirds training data
  • One-third validation data
  • Classification methods attempted
  • Decision tree classifiers
  • Naïve-Bayes
  • SVMs
  • Hand-crafted clustering/nearest neighbor hybrid
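A sketch of the stratified two-thirds/one-third split using scikit-learn as an assumed stand-in (the slides do not name the tool), reusing the hypothetical X and y arrays from the earlier sketch:

from sklearn.model_selection import train_test_split

# Keep the Active/Inactive ratio the same in both parts, so the
# 42 actives are not accidentally concentrated in one of them.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=1/3, stratify=y, random_state=0)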

8
Decision Tree
  • C4.5

[Slide figure: C4.5 tree testing features f88235, f137567, f80106, f26913,
f135832 and f25144 (branching on "= 1"), with several small Active leaves and
one large Inactive leaf covering 338 compounds (6 errors)]

Confusion matrix (rows = actual, columns = predicted):
        A     I
  A     3     7
  I     1   459
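C4.5 is a standalone tool; a rough modern equivalent, assuming the hypothetical X_train/X_val split sketched earlier, is an entropy-criterion tree in scikit-learn:

from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import confusion_matrix

tree = DecisionTreeClassifier(criterion="entropy")   # C4.5-like splits
tree.fit(X_train, y_train)
print(confusion_matrix(y_val, tree.predict(X_val), labels=["A", "I"]))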
9
Naïve Bayes
  • Data characteristics very similar to text
  • lots of features, sparse data, few ones
  • Naïve Bayes found very effective for text
    classification
  • Accuracy: all actives misclassified!

Confusion matrix (rows = actual, columns = predicted):
        A     I
  A     0    10
  I     1   459
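A sketch with scikit-learn's Bernoulli Naive Bayes as an assumed stand-in for the implementation used; it fits the sparse 0/1 representation directly, and the 42-vs-1867 class skew lets the Inactive prior dominate, which is consistent with every active being misclassified:

from sklearn.naive_bayes import BernoulliNB

nb = BernoulliNB()          # one Bernoulli per binary structural feature
nb.fit(X_train, y_train)
pred = nb.predict(X_val)
print((pred[y_val == "A"] == "A").sum(), "actives recovered")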
10
Support vector machines
  • Has received lots of attention recently
  • Requires tuning: which kernel, what parameters?
  • Several freely available packages: SVMTorch
  • Accuracy slightly worse than decision trees

[Slide figure: separating boundary sketched in a two-feature (fi, fj) space]
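The deck used SVMTorch; a hedged sketch of the kernel/parameter tuning question, using scikit-learn's SVC instead and class_weight="balanced" to partly compensate for the class skew:

from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

grid = GridSearchCV(
    SVC(class_weight="balanced"),
    param_grid={"kernel": ["linear", "rbf"], "C": [0.1, 1, 10]},
    scoring="balanced_accuracy", cv=3)
grid.fit(X_train, y_train)
print(grid.best_params_, grid.score(X_val, y_val))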
11
Hand-crafted hybrid
  • Find features such that actives cluster together
    under an appropriate distance measure

12
Incremental Feature Selection
  • Pick features ONE by ONE
  • that result in maximum clustering of the actives.
  • And maximum separation from the inactives.
  • Objective function: maximum separation between
    centroids of the Actives and In-actives
  • Distance function: matching ones
  • Careful selection of training Actives.
  • Accuracy: 100, with 493 features (see the greedy
    sketch below)
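A greedy sketch of the incremental selection, with two assumptions of mine flagged: separation is measured as the L1 gap between the class centroids over the chosen features (the slide only says "matching ones"), and brute-force re-evaluation at every step is acceptable for illustration:

import numpy as np

def centroid_gap(X, y, feats):
    # Separation between Active and Inactive centroids on the chosen features
    sub = X[:, feats]
    return float(np.abs(sub[y == "A"].mean(axis=0)
                        - sub[y == "I"].mean(axis=0)).sum())

def greedy_select(X, y, n_features):
    # Add features one at a time, each time keeping the single feature
    # that most increases the centroid separation.
    chosen, remaining = [], list(range(X.shape[1]))
    for _ in range(n_features):
        best = max(remaining, key=lambda j: centroid_gap(X, y, chosen + [j]))
        chosen.append(best)
        remaining.remove(best)
    return chosen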

13
Final approach
  • Test data significantly denser
  • Methods like SVM, NB, clustering-based will not
    generalize
  • Preferred a distribution-independent method
  • Ensemble of Decision Trees
  • On disjoint attributes --- unconventional
  • Semi-supervised training
  • Introduce feedback from the test data in multiple
    rounds

14
Building tree ensembles
  • Initially picked 20000 features based on
    entropy.
  • More than one tree to take care of large feature
    space.
  • Repeat as long as accuracy on validation data does
    not drop (see the sketch below)
  • All groups of redundant features exploited.
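A minimal sketch of the disjoint-attribute loop, reusing the hypothetical X_train/X_val arrays from earlier; the stopping rule and the way already-used features are retired are my reading of the slide, not a confirmed implementation:

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def disjoint_tree_ensemble(X_train, y_train, X_val, y_val):
    used, trees, prev_acc = set(), [], 0.0
    while True:
        allowed = [j for j in range(X_train.shape[1]) if j not in used]
        if not allowed:
            break
        tree = DecisionTreeClassifier(criterion="entropy")
        tree.fit(X_train[:, allowed], y_train)
        acc = (tree.predict(X_val[:, allowed]) == y_val).mean()
        if trees and acc < prev_acc:
            break                      # validation accuracy dropped: stop
        trees.append((tree, allowed))
        # Retire the features this tree actually split on, so the next tree
        # must fall back on a different (possibly redundant) attribute group.
        new_used = {allowed[j] for j in np.flatnonzero(tree.feature_importances_)}
        if not new_used:
            break                      # degenerate tree: nothing left to learn
        used |= new_used
        prev_acc = acc
    return trees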

15
Incorporating unlabeled instances
  • Augment training data with sure test instances.
  • Re-train another ensemble of trees using the same
    method
  • Include more unlabelled instances with sure
    predictions
  • Repeat a few more times...
  • How to capture drift?

16
Capturing drift
  • Solution: validate with independent data
  • Be sure to include only correctly labeled data
  • First approach: same prediction by all trees
  • On validation data, found errors in this scheme
  • Pruning not a solution
  • Weighted prediction by each tree (sketch below)
  • Weight: fraction of Actives
  • Pick the right threshold using validation data.
  • Stop when no more unlabelled data can be added
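A sketch of one feedback round, with two assumptions flagged: each tree's weight is taken as the fraction of validation Actives it recovers (the slide only says "fraction of Actives"), and the score threshold is whatever was tuned on the validation data. The trees argument matches the (tree, feature list) pairs produced by the ensemble sketch above:

import numpy as np

def sure_predictions(trees, X_unlabeled, X_val, y_val, threshold):
    # Weight each tree by the validation Actives it gets right (assumption),
    # then score unlabeled compounds by the weighted fraction of Active votes.
    weights, votes = [], []
    for tree, feats in trees:
        w = (tree.predict(X_val[:, feats])[y_val == "A"] == "A").mean()
        weights.append(w)
        votes.append(tree.predict(X_unlabeled[:, feats]) == "A")
    weights, votes = np.array(weights), np.array(votes, dtype=float)
    score = (weights[:, None] * votes).sum(axis=0) / weights.sum()
    sure_active = np.flatnonzero(score >= threshold)
    sure_inactive = np.flatnonzero(score <= 1.0 - threshold)
    return sure_active, sure_inactive   # candidates to add to the training set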

17
Final state
  • Three rounds, each with about 6 trees
  • Unlabelled data included: 126 actives, 311
    inactives
  • Remaining: 200 in confusion
  • Use a meta-learner on validation data to pick the
    final criterion
  • Sum of scores times number of trees claiming
    Actives
  • Several other last-minute hacks.

18
Outcome
Home Team
Winning Entry: Weighted: 68.4, Accuracy: 70.03
19
Winner's method
  • Pre-processing: feature subset selection using
    mutual information (200 of 139,351 features)
  • Learning: Bayesian network models of different
    complexity (2 to 12 features)
  • Choosing a model (ROC area, model complexity)

20
Postmortem: Was all this necessary?
  • Without semi-supervised learning
  • Single decision tree: 49
  • 6-tree ensemble on training data alone
  • Majority: 57
  • Confidence weighted: 63
  • With unlabelled data: 64.3

21
Lessons learnt
  • Products
  • Need tools that scale in the number of features
  • Research problems
  • Classifiers that are not tied to distribution
    similarity with the training data
  • A more principled way of including unlabelled
    instances.