Learning Hierarchical Multilabel Classification Trees for Functional Genomics - PowerPoint PPT Presentation

Transcript and Presenter's Notes
1
Learning Hierarchical Multilabel Classification
Trees for Functional Genomics
  • Hendrik Blockeel (K.U.Leuven)
  • In collaboration with
  • Saso Dzeroski (Jozef Stefan Institute, Ljubljana)
  • Amanda Clare (U. of Wales, Aberystwyth)
  • Jan Struyf (U. of Wisconsin, Madison)
  • Leander Schietgat (K.U.Leuven)

2
What's surprising about our results: the household equivalent
  • What you would NOT expect is
  • that the combo is smaller than some of the
    individual machines
  • that the coffee is even better

yet that is what we found when learning
combined models for different tasks
3
Overview
  • Hierarchical multilabel classification (HMC)
  • Decision trees for HMC
  • Motivated by problems in functional genomics
  • Some experimental results
  • Conclusions

4
Classification settings
  • Normally, in classification, we assign one class
    label ci from a set C = {c1, …, ck} to each
    example
  • In multilabel classification, we have to assign a
    subset S ⊆ C to each example
  • i.e., one example can belong to multiple classes
  • Some applications
  • Text classification: assign subjects (newsgroups)
    to texts
  • Functional genomics: assign functions to genes
  • In hierarchical multilabel classification (HMC),
    the classes C form a hierarchy (C, ≤)
  • The partial order ≤ expresses "is a superclass of"

5
Hierarchical multilabel classification
  • Hierarchy constraint
  • ci ≤ cj ⇒ coverage(cj) ⊆ coverage(ci)
  • Elements of a class must be elements of its
    superclasses
  • Should hold for the given data as well as for
    predictions
  • Three possible ways of learning HMC models
  • Learn k binary classifiers, one for each class
  • If learned independently, it is difficult to
    guarantee the hierarchy constraint
  • Can also be learned hierarchically
  • Learn one classifier that predicts a vector of
    classes
  • E.g., a neural net can have multiple outputs
  • We will show it can also be done for decision
    trees

6
1) Learn k classifiers independently
  • C = {c1, c2, …, ck}
  • Let S(x) = set of class labels of x
  • Learning
  • For each i = 1..k, learn fi such that fi(x) = 1
    if ci ∈ S(x)
  • Prediction
  • S(x) := ∅
  • For each i = 1..k, include ci in S(x) if fi(x) = 1
  • Predict S(x)
  • Problem: the hierarchy constraint may not hold
    for S(x)
  • A classifier for a subclass might predict 1 when
    the classifier for its superclass predicts 0
  • Can be trivially fixed at prediction time, but it
    shows that the fi have not been learned optimally
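
Below is a minimal sketch of this independent scheme, assuming a numeric
feature matrix X and a list of per-example label sets; scikit-learn decision
trees stand in for the actual learner (the experiments in this talk use the
Clus system, not scikit-learn), and all names are illustrative:

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def learn_independent(X, label_sets, k):
    """Learn k binary classifiers f_i with f_i(x) = 1 if c_i is in S(x)."""
    classifiers = []
    for i in range(k):
        # Binary target for class c_i over the full data set
        y_i = np.array([1 if i in S else 0 for S in label_sets])
        classifiers.append(DecisionTreeClassifier().fit(X, y_i))
    return classifiers

def predict_independent(classifiers, x):
    """S(x) = {c_i : f_i(x) = 1}; may violate the hierarchy constraint."""
    x = np.asarray(x).reshape(1, -1)
    return {i for i, f in enumerate(classifiers) if f.predict(x)[0] == 1}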

7
2) Learn k classifiers hierarchically
  • D = the data set; Di = {x ∈ D | ci ∈ S(x)}
  • Di = the data set restricted to class-ci examples
  • parent(ci) = the immediate superclass of ci
  • Learning
  • For each i = 1..k, learn fi from Dparent(ci) such
    that fi(x) = 1 if ci ∈ S(x)
  • Prediction (typically top-down in the hierarchy)
  • S(x) := ∅
  • For each i = 1..k, include ci in S(x) if
    parent(ci) ∈ S(x) and fi(x) = 1
  • Predict S(x)
  • Advantages
  • Hierarchy constraint solved
  • More balanced distributions to learn from
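
A matching sketch of this hierarchical scheme, under the same assumptions as
before; `parent` is an assumed map from each class index to its immediate
superclass (None for top-level classes):

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def learn_hierarchical(X, label_sets, k, parent):
    """Learn each f_i from D_parent(c_i) only."""
    classifiers = {}
    for i in range(k):
        p = parent[i]
        # Restrict to examples that belong to the immediate superclass
        mask = np.array([p is None or p in S for S in label_sets])
        y_i = np.array([1 if i in S else 0 for S in label_sets])[mask]
        classifiers[i] = DecisionTreeClassifier().fit(X[mask], y_i)
    return classifiers

def predict_topdown(classifiers, x, k, parent):
    """Top-down prediction: the hierarchy constraint holds by construction."""
    x = np.asarray(x).reshape(1, -1)
    S = set()
    for i in range(k):  # assumes class indices list parents before children
        if (parent[i] is None or parent[i] in S) \
                and classifiers[i].predict(x)[0] == 1:
            S.add(i)
    return S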

(Figure: example class hierarchy with a root and classes c1 through c7)
8
3) Learn one classifier
  • Learning
  • Learn f such that f(x) = S(x)
  • Prediction
  • Predict f(x)
  • Need to have a learner that can learn models with
    >1 output
  • E.g., neural nets can output a k-D vector (c1, …,
    ck)
  • In this work, we extend decision trees (normally
    1-D output) to HMC trees
  • Trees = interpretable theories
  • 1 tree is more interpretable than k trees
  • Risks
  • Perhaps that one tree will be much larger
  • Perhaps the tree will be much less accurate
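
A sketch of this single-classifier route, under the same assumptions as the
earlier snippets; scikit-learn decision trees happen to accept an (n × k)
0/1 label matrix and average the impurity reduction over the k outputs,
which is close in spirit to the MPT idea explained below (though without
hierarchy weights):

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def learn_single_tree(X, label_sets, k):
    """One tree that predicts the whole 0/1 class vector at once."""
    Y = np.zeros((len(label_sets), k), dtype=int)
    for n, S in enumerate(label_sets):
        Y[n, list(S)] = 1   # 0/1 class vector for example n
    return DecisionTreeClassifier().fit(X, Y)

def predict_single_tree(tree, x):
    y = tree.predict(np.asarray(x).reshape(1, -1))[0]  # k-dim 0/1 vector
    return {i for i, v in enumerate(y) if v == 1}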

9
Data mining with decision trees
  • Data mining = learning a general model from
    specific observations
  • Decision trees are a popular format for models
    because
  • They are fast to build and fast to use
  • They make accurate predictions
  • They are easy to interpret

Name   Age   Salary   Children   Loan?
Ann    25    29920    1          no
Bob    32    40000    2          yes
Carl   19    0        0          no
Dirk   44    45200    3          yes
...
10
Functional genomics
  • Task: given a data set with descriptions of
    genes and the functions they have, learn a model
    that can predict for a new gene which functions
    it performs
  • A gene can have multiple functions (out of 250
    possible functions, in our case)
  • Could be done with decision trees, with all the
    advantages that brings. But:
  • Decision trees predict only one class, not a set
    of classes
  • Should we learn a separate tree for each
    function?
  • 250 functions = 250 trees: not so fast and
    interpretable anymore!

(Table: each gene G1, G2, G3, … is described by attributes A1 … An and
marked with an "x" for each of the functions 1 … 250 that it performs)
11
Multiple prediction trees
  • A multiple prediction tree (MPT) makes multiple
    predictions at once
  • Basic idea (Blockeel, De Raedt, Ramon, 1998)
  • A decision tree learner prefers tests that yield
    much information on the class attribute
    (measured using information gain (C4.5) or
    variance reduction (CART))
  • An MPT learner prefers tests that reduce the
    variance of all target variables together
  • Variance = mean squared distance of the class
    vectors to the mean vector, in k-D space
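
A minimal sketch of this split criterion, with V an (n × k) matrix of class
vectors and `mask` a boolean array encoding a candidate test (both names are
assumptions):

import numpy as np

def variance(V):
    """Mean squared Euclidean distance of the rows of V to their mean."""
    return np.mean(np.sum((V - V.mean(axis=0)) ** 2, axis=1))

def variance_reduction(V, mask):
    """Score of a test splitting the examples according to `mask`
    (assumed to put at least one example on each side)."""
    n = len(V)
    return variance(V) - (mask.sum() / n) * variance(V[mask]) \
                       - ((~mask).sum() / n) * variance(V[~mask])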

12
HMC tree learning
  • A special case of MPT learning
  • Main characteristics
  • Errors higher up in the hierarchy are more
    important
  • Use a weighted Euclidean distance (higher weight
    for higher classes)
  • Need to ensure the hierarchy constraint
  • Normally, a leaf predicts ci iff the proportion of
    ci examples in the leaf is above some threshold ti
    (often 0.5)
  • To ensure compliance with the hierarchy
    constraint
  • ci ≤ cj ⇒ ti ≤ tj
  • Automatically fulfilled if all ti are equal
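
A minimal sketch of the leaf-prediction rule just described (names are
illustrative): a leaf stores, per class, the proportion of its training
examples in that class; equal thresholds keep the prediction
hierarchy-consistent because a superclass proportion is never smaller than
that of its subclasses.

def leaf_prediction(proportions, thresholds):
    """Predict c_i iff the leaf's proportion of c_i examples reaches t_i."""
    return {i for i, p in enumerate(proportions) if p >= thresholds[i]}

# With equal thresholds the hierarchy constraint holds automatically, e.g.
# leaf_prediction([0.8, 0.6, 0.1], [0.5, 0.5, 0.5]) -> {0, 1}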

13
Example
(Figure: the class hierarchy from before, with top-level classes c1, c2,
c3 given weight 1 and their subclasses c4, c5, c6, c7 given weight 0.5;
the label sets of x1, x2, x3 are marked in three copies of the hierarchy)

x1 = {c1, c3, c5} → (1,0,1,0,1,0,0)
x2 = {c1, c3, c7} → (1,0,1,0,0,0,1)
x3 = {c1, c2, c5} → (1,1,0,0,1,0,0)

d²(x1, x2) = 0.25 + 0.25 = 0.5
d²(x1, x3) = 1 + 1 = 2
x1 is more similar to x2 than to x3; the decision
tree tries to create leaves with similar examples
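
The numbers in this example can be checked with a few lines of code (the
weights are applied to the componentwise differences before squaring; all
values are taken from the slide above):

import numpy as np

w  = np.array([1, 1, 1, 0.5, 0.5, 0.5, 0.5])  # weights for c1..c7
x1 = np.array([1, 0, 1, 0, 1, 0, 0])          # {c1, c3, c5}
x2 = np.array([1, 0, 1, 0, 0, 0, 1])          # {c1, c3, c7}
x3 = np.array([1, 1, 0, 0, 1, 0, 0])          # {c1, c2, c5}

def d2(a, b):
    """Squared weighted Euclidean distance between two class vectors."""
    return float(np.sum((w * (a - b)) ** 2))

print(d2(x1, x2))  # 0.5: x1 is closer to x2 ...
print(d2(x1, x3))  # 2.0: ... than to x3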
14
The Clus system
  • Created by Jan Struyf
  • Propositional DT learner, implemented in Java
  • Implements ideas from
  • C4.5 (Quinlan, 93) and CART (Breiman et al.,
    84)
  • predictive clustering trees (Blockeel et al.,
    98)
  • includes multiple prediction trees and
    hierarchical multilabel classification trees
  • Reads data in ARFF format (Weka)
  • We used two versions for our experiments
  • Clus-HMC: the HMC version as explained
  • Clus-SC: the single-classification version (± CART)

15
The datasets
  • 12 datasets from functional genomics
  • Each with a different description of the genes
  • Sequence statistics (1)
  • Phenotype (2)
  • Predicted secondary structure (3)
  • Homology (4)
  • Micro-array data (5-12)
  • Each with the same class hierarchy
  • 250 classes distributed over 4 levels
  • Number of examples: 1592 to 3932
  • Number of attributes: 52 to 47034

16
Our expectations
  • How does HMC tree learning compare to the
    straightforward approach of learning 250 trees?
  • We expect
  • Faster learning: learning 1 HMCT is slower than
    learning 1 SPT (single prediction tree), but
    faster than learning 250 SPTs
  • Much faster prediction: using 1 HMCT for
    prediction is as fast as using 1 SPT, and hence
    250 times faster than using 250 SPTs
  • Larger trees: the HMCT is larger than the average
    tree for 1 class, but smaller than the set of 250
    trees
  • Less accurate: the HMCT is less accurate than the
    set of 250 SPTs (but hopefully not much less
    accurate)
  • So how much faster / simpler / less accurate are
    our HMC trees?

17
The (surprising) results
  • The HMCT is on average less complex than one
    single SPT
  • The HMCT has 24 nodes, the SPTs on average 33
    nodes
  • but you'd need 250 of the latter to do the same
    job
  • The HMCT is on average slightly more accurate
    than a single SPT
  • (see graphs)
  • Surprising, as each SPT is tuned for one specific
    prediction task
  • Expectations w.r.t. efficiency are confirmed
  • Learning: min. speedup factor 4.5x, max 65x,
    average 37x
  • Prediction: >250 times faster (since the tree is
    not larger)
  • Faster to learn, much faster to apply

18
Precision recall curves
Precision = proportion of predictions that are
correct = P(X | predicted X)
Recall = proportion of class memberships correctly
identified = P(predicted X | X)
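
For a single class c, these definitions amount to the following sketch
(names are illustrative assumptions; the curves are obtained by varying the
prediction threshold and recomputing the predicted sets):

def precision_recall(true_sets, pred_sets, c):
    """Precision and recall for class c over paired true/predicted sets."""
    tp = sum(1 for S, P in zip(true_sets, pred_sets) if c in S and c in P)
    n_pred = sum(1 for P in pred_sets if c in P)
    n_true = sum(1 for S in true_sets if c in S)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_true if n_true else 0.0
    return precision, recall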
19
An example rule
  • High interpretability: IF-THEN rules extracted
    from the HMCT are quite simple

IF Nitrogen_Depletion_8_h < -2.74 AND
Nitrogen_Depletion_2_h > -1.94 AND
1point5_mM_diamide_5_min > -0.03 AND
1M_sorbitol___45_min_ > -0.36 AND
37C_to_25C_shock___60_min > 1.28
THEN {40, 40/3, 5, 5/1}
For class 40/3: recall = 0.15, precision = 0.97
(the rule covers 15% of all class 40/3 cases, and
97% of the cases fulfilling these conditions are
indeed 40/3)
20
The effect of merging
(Figure: 250 separate trees, each optimized for one class c1, c2, …,
c250, merged into a single tree optimized for all of c1, c2, …, c250)
  • Smaller than the average individual tree
  • More accurate than the average individual tree
21
Any explanation for these results?
  • Almost too good to be true: how is it possible?
  • Answer: the classes are not independent
  • Different trees for different classes actually
    share structure
  • This explains some of the complexity reduction
    achieved by the HMCT
  • One class carries information on other classes
  • This increases the signal-to-noise ratio
  • It provides better guidance when learning the
    tree (explaining the good accuracy)
  • It avoids overfitting (explaining the further
    reduction in tree size)
  • This was confirmed empirically

22
Overfitting
  • To check our overfitting hypothesis, we
  • compared the area under the PR curve on the
    training set (Atr) and on the test set (Ate)
  • For SPC: Atr − Ate = 0.219
  • For HMCT: Atr − Ate = 0.024
  • (to verify, we tried Weka's M5 too: 0.387)
  • So the HMCT clearly overfits much less

23
Conclusions
  • Surprising discovery: a single tree can be found
    that
  • predicts 250 different functions with, on
    average, equal or better accuracy than
    special-purpose trees for each function
  • is not more complex than a single special-purpose
    tree (hence, 250 times simpler than the whole
    set)
  • is (much) more efficient to learn and to apply
  • The reason for this is to be found in the
    dependencies between the gene functions
  • They provide better guidance when learning the
    tree
  • They help to avoid overfitting