1
Diversity in Ensemble Feature Selection
  • Alexey Tsymbal, Department of Computer Science,
    Trinity College Dublin (a paper submitted to
    Information Fusion with Pádraig Cunningham and
    Nick Pechenizkiy)

2
Contents
  • Introduction: ensemble learning
  • Accuracy and diversity in regression and
    classification ensembles
  • Ensembles of classifiers and feature selection:
    ensemble feature selection
  • Search strategies in EFS: HC, EFSS, EBSS, and
    GEFS
  • Diversity measures
  • Experimental results
  • Conclusions and future work

3
What is ensemble learning?
Ensemble learning refers to a collection of
methods that learn a target function by training
a number of individual learners and combining
their predictions.
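As a minimal illustration of this definition (not from the slides; the data set and base-learner choices are placeholders), the following Python sketch trains a few learners on the same data and combines their predictions by majority vote:

    # Minimal ensemble-learning sketch: several learners, combined by majority vote.
    # Data set and base-learner choices are illustrative placeholders.
    import numpy as np
    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.naive_bayes import GaussianNB
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_iris(return_X_y=True)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

    base_learners = [DecisionTreeClassifier(), GaussianNB(), KNeighborsClassifier(3)]
    for clf in base_learners:
        clf.fit(X_tr, y_tr)

    # Majority vote over the individual predictions (one column per test instance)
    votes = np.array([clf.predict(X_te) for clf in base_learners])
    ensemble_pred = np.apply_along_axis(lambda v: np.bincount(v).argmax(), 0, votes)
    print("ensemble accuracy:", (ensemble_pred == y_te).mean())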
4
Ensemble learning
5
Ensembles: scientific communities
  • machine learning (Machine Learning Research:
    Four Current Directions, Dietterich 1997)
  • knowledge discovery and data mining
    (Han & Kamber 2000, Data Mining: Concepts and
    Techniques; current trends)
  • artificial intelligence (multi-agent systems,
    Russell & Norvig 2002, AI: A Modern Approach,
    2nd ed.)
  • neural networks
  • statistics
  • computational learning theory
  • pattern recognition
  • what else?

6
Ensembles: different names
  • multiple models
  • multiple classifier systems
  • combining classifiers (regressors, etc.)
  • integration of classifiers
  • mixture of experts
  • decision committee
  • committee of experts
  • classifier fusion
  • multimodel learning
  • consensus theory
  • what else?
  Names for the ensemble members:
  • base classifiers
  • component classifiers
  • individual classifiers
  • members (of a decision committee)
  • level-0 experts
  • what else?

7
Why ensemble learning?
  • Accuracy: a more reliable mapping can be
    obtained by combining the output of multiple
    experts
  • Efficiency: a complex problem can be decomposed
    into multiple sub-problems that are easier to
    understand and solve (divide-and-conquer
    approach); mixture of experts, ensemble feature
    selection
  • There is no single model that works for all
    pattern recognition problems (no free lunch
    theorem). "To solve really hard problems, we'll
    have to use several different representations.
    It is time to stop arguing over which type of
    pattern-classification technique is best.
    Instead we should work at a higher level of
    organization and discover how to build managerial
    systems to exploit the different virtues and
    evade the different limitations of each of these
    ways of comparing things." (Minsky, 1991)

8
When ensemble learning?
  • When you can build base classifiers that are
    more accurate than chance, and, more importantly,
  • that are as independent from each other as
    possible
  • (Accuracy and Diversity!)

9
How to make an effective ensemble?
  • Two basic decisions when designing ensembles:
  • How to generate the base classifiers?
  • How to integrate them?

10
Methods for generating the base classifiers
  • Subsampling the training examples: multiple
    hypotheses are generated by training individual
    classifiers on different datasets obtained by
    resampling a common training set (Bagging,
    Boosting)
  • Manipulating the input features: multiple
    hypotheses are generated by training individual
    classifiers on different representations, or
    different subsets, of a common feature vector
    (see the sketch after this list)
  • Manipulating the output targets: the output
    targets for C classes are encoded with an l-bit
    codeword, and an individual classifier is built
    to predict each one of the bits in the codeword;
    additional auxiliary targets may be used to
    differentiate classifiers
  • Modifying the learning parameters of the
    classifier: a number of classifiers are built
    with different learning parameters, such as the
    number of neighbours in a kNN rule, initial
    weights in an MLP, etc.
  • Using heterogeneous models (not often used)
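A sketch of the second option (manipulating the input features) via the random subspace method; the learner, subset size and number of members below are illustrative assumptions, not values from the slides:

    # Random subspace method (RSM) sketch: each base classifier is trained on a
    # randomly drawn subset of the features. Subset size and learner are placeholders.
    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    def train_rsm_ensemble(X, y, n_classifiers=10, subset_size=5, seed=0):
        rng = np.random.default_rng(seed)
        ensemble = []
        for _ in range(n_classifiers):
            feats = rng.choice(X.shape[1], size=subset_size, replace=False)
            clf = DecisionTreeClassifier().fit(X[:, feats], y)
            ensemble.append((feats, clf))  # remember which features this member uses
        return ensemble

    def rsm_predict(ensemble, X):
        votes = np.array([clf.predict(X[:, feats]) for feats, clf in ensemble])
        return np.apply_along_axis(lambda v: np.bincount(v).argmax(), 0, votes)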

11
Ensembles: the need for disagreement
  • The overall error depends on the average error
    of the ensemble members
  • Increasing ambiguity decreases the overall error,
  • provided it does not result in an increase in
    the average error
  • (Krogh and Vedelsby, 1995)

12
Measuring ensemble diversity
A is the ensemble ambiguity, measured as the
weighted average of the squared differences between
the predictions of the base networks and the
ensemble (regression case).
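In LaTeX, a standard statement of the Krogh–Vedelsby result behind these two slides (the notation here is added, not taken from the slide's formula image):

    E = \bar{E} - \bar{A}, \quad
    \bar{E} = \sum_i w_i E_i, \quad
    \bar{A}(x) = \sum_i w_i \big(f_i(x) - \bar{f}(x)\big)^2, \quad
    \bar{f}(x) = \sum_i w_i f_i(x)

Since the ambiguity term is non-negative, the ensemble error never exceeds the average member error, and it decreases as the ambiguity (diversity) grows, provided the average error does not grow with it.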
Kuncheva, 2003: Yule's Q statistic (1900)
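Yule's Q for a pair of classifiers i and j, in the standard form used by Kuncheva and Whitaker (added here because the slide's formula image is not in the transcript), with N^{11} the number of instances both classify correctly, N^{00} both wrongly, and N^{10}, N^{01} exactly one correctly:

    Q_{i,j} = \frac{N^{11} N^{00} - N^{01} N^{10}}{N^{11} N^{00} + N^{01} N^{10}}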
13
Diversity metrics
  • Pairwise (see the sketch after this list):
  • plain disagreement
  • fail/non-fail disagreement
  • Q statistic
  • kappa statistic
  • correlation coefficient
  • Non-pairwise:
  • entropy
  • variance
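The sketch referred to above: two of the pairwise measures computed from predictions on a common evaluation set (function names and conventions are illustrative):

    # Pairwise diversity sketch: plain disagreement and Yule's Q statistic for a
    # pair of classifiers, from their predictions on a common evaluation set.
    import numpy as np

    def plain_disagreement(pred_i, pred_j):
        # Fraction of instances on which the two classifiers predict different classes
        return np.mean(np.asarray(pred_i) != np.asarray(pred_j))

    def q_statistic(pred_i, pred_j, y_true):
        # Yule's Q from the 2x2 table of joint correctness (undefined if denominator is 0)
        ci = np.asarray(pred_i) == np.asarray(y_true)
        cj = np.asarray(pred_j) == np.asarray(y_true)
        n11 = np.sum(ci & cj)    # both correct
        n00 = np.sum(~ci & ~cj)  # both wrong
        n10 = np.sum(ci & ~cj)   # only i correct
        n01 = np.sum(~ci & cj)   # only j correct
        return (n11 * n00 - n01 * n10) / (n11 * n00 + n01 * n10)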

14
Integration of classifiers
(Diagram: taxonomy of integration approaches)
  • Integration splits into Selection and Combination,
    each of which can be Static or Dynamic
  • Selection: Static Selection (CVM), Dynamic
    Selection (DS)
  • Combination: Weighted Voting (WV, static),
    Dynamic Voting (DV), Dynamic Voting with
    Selection (DVS)
Motivation for dynamic integration: the main
assumption is that each classifier is best in
some sub-areas of the whole data set, where its
local error is comparatively lower than the
corresponding errors of the other classifiers.
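A sketch of the dynamic-selection (DS) idea under simplifying assumptions: local errors are estimated here from plain training-set correctness (cross-validated estimates would be more faithful), and the neighbourhood size k is an arbitrary placeholder:

    # Dynamic Selection (DS) sketch: per test instance, use the base classifier with
    # the highest estimated accuracy among the k nearest training instances.
    import numpy as np
    from sklearn.neighbors import NearestNeighbors

    def dynamic_selection_predict(classifiers, X_train, y_train, X_test, k=7):
        nn = NearestNeighbors(n_neighbors=k).fit(X_train)
        # Correctness of each member on the reference instances
        correct = np.array([clf.predict(X_train) == y_train for clf in classifiers])
        preds = []
        for x in X_test:
            _, idx = nn.kneighbors(x.reshape(1, -1))
            local_acc = correct[:, idx[0]].mean(axis=1)  # local accuracy of each member
            best = int(np.argmax(local_acc))             # pick the locally best member
            preds.append(classifiers[best].predict(x.reshape(1, -1))[0])
        return np.array(preds)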
15
Problems of global feature selection
  • Most feature selection methods ignore the fact
    that some features may be relevant only in
    context (i.e. in some regions of the instance
    space) and cover the whole instance space with a
    single set of features
  • They may discard features that are highly
    relevant in a restricted region of the instance
    space because this relevance is swamped by their
    irrelevance everywhere else
  • They may retain features that are relevant in
    most of the space, but still unnecessarily
    confuse the classifier in some regions
  • Global feature selection can lead to poor
    performance in minority-class prediction, which
    is often the case of interest (e.g. many
    negative/no-disease instances in medical
    diagnostics) (Cardie and Howe 1997)

16
Feature-space heterogeneity
  • There exist many complicated data mining
    problems where the relevant features are
    different in different regions of the feature
    space.
  • Types of feature heterogeneity:
  • Class heterogeneity
  • Feature-value heterogeneity
  • Instance-space heterogeneity

17
Ensemble Feature Selection
  • Goal of traditional feature selection:
  • find and remove features that are unhelpful or
    destructive to learning, producing one feature
    subset for a single classifier
  • Goal of ensemble feature selection:
  • find and remove features that are unhelpful or
    destructive to learning, producing different
    feature subsets for a number of classifiers
  • find feature subsets that promote
    disagreement between the classifiers

18
Search in EFS
  • Search space:
    2^NumOfFeatures × NumOfClassifiers =
    2^18 × 25 = 6 553 600
  • 4 search strategies to heuristically explore the
    search space:
  • Hill-Climbing (HC)
  • Ensemble Forward Sequential Selection (EFSS)
  • Ensemble Backward Sequential Selection (EBSS)
  • Genetic Ensemble Feature Selection (GEFS)

19
Hill-Climbing (HC) strategy (CBMS 2002)
  • Generation of initial feature subsets using the
    random subspace method (RSM)
  • A number of refining passes over each feature
    set while there is improvement in fitness (a
    sketch follows below)
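The sketch referred to above, under stated assumptions: `fitness(mask)` is a placeholder for the paper's accuracy/diversity-based evaluation, and the initial inclusion probability is arbitrary:

    # Hill-Climbing (HC) sketch: random-subspace initialisation, then refining passes
    # that flip each feature in/out of a subset and keep the flip only if the fitness
    # improves. `fitness` is a placeholder evaluation function.
    import numpy as np

    def hill_climb(n_features, fitness, n_classifiers=10, seed=0):
        rng = np.random.default_rng(seed)
        subsets = [rng.random(n_features) < 0.5 for _ in range(n_classifiers)]  # RSM init
        for mask in subsets:
            best = fitness(mask)
            improved = True
            while improved:                    # refining passes while fitness improves
                improved = False
                for f in range(n_features):
                    mask[f] = not mask[f]      # try flipping feature f in/out
                    score = fitness(mask)
                    if score > best:
                        best, improved = score, True
                    else:
                        mask[f] = not mask[f]  # undo the flip
        return subsets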

20
Ensemble Forward Sequential Selection (EFSS)
(Diagram: forward selection)
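The slide itself is a diagram; a rough sketch of the forward step per base classifier, with the stopping rule simplified to a fixed number of added features and `fitness` again standing in for the paper's evaluation criterion:

    # Ensemble Forward Sequential Selection (EFSS) sketch: each base classifier starts
    # from an empty feature subset and greedily adds the feature that most improves
    # the fitness, for n_add steps (a simplified stopping rule).
    import numpy as np

    def efss(n_features, fitness, n_classifiers=10, n_add=3):
        subsets = []
        for _ in range(n_classifiers):
            mask = np.zeros(n_features, dtype=bool)
            for _ in range(n_add):
                candidates = np.flatnonzero(~mask)
                scores = []
                for f in candidates:
                    mask[f] = True
                    scores.append(fitness(mask))
                    mask[f] = False
                mask[candidates[int(np.argmax(scores))]] = True  # keep the best addition
            subsets.append(mask)
        return subsets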
21
Ensemble Backward Sequential Selection (EBSS)
(Diagram: backward elimination)
22
Genetic Ensemble Feature Selection (GEFS)
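The slide content is a diagram; below is a compact sketch of a genetic search over feature-subset bit strings. The population size, mutation rate and selection scheme are illustrative assumptions, and `fitness` again stands in for the accuracy-plus-diversity criterion:

    # Genetic Ensemble Feature Selection (GEFS) sketch: individuals are bit strings
    # over the features; the population evolves by one-point crossover and bit-flip
    # mutation, scored by a placeholder fitness.
    import numpy as np

    def gefs(n_features, fitness, pop_size=25, n_generations=20, p_mut=0.05, seed=0):
        rng = np.random.default_rng(seed)
        pop = rng.random((pop_size, n_features)) < 0.5
        for _ in range(n_generations):
            scores = np.array([fitness(ind) for ind in pop])
            order = np.argsort(scores)[::-1]
            parents = pop[order[: pop_size // 2]]          # keep the fitter half
            children = []
            while len(parents) + len(children) < pop_size:
                a, b = parents[rng.integers(len(parents), size=2)]
                cut = int(rng.integers(1, n_features))     # one-point crossover
                child = np.concatenate([a[:cut], b[cut:]])
                child ^= rng.random(n_features) < p_mut    # bit-flip mutation
                children.append(child)
            pop = np.vstack([parents, *children])
        return pop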
23
Computational complexity
  • EFSS and EBSS: roughly S × N × N′ fitness
    evaluations, where S is the number of base
    classifiers, N is the total number of features,
    and N′ is the number of features included or
    deleted on average in an FSS or BSS search.
    Example (EFSS): 25 × 18 × 3 = 1350 (and not
    6 553 600!)
  • HC: roughly S × N × Npasses evaluations, where
    Npasses is the average number of passes through
    the feature subsets in HC while there is still
    improvement.
  • GEFS: roughly S × Ngen fitness evaluations,
    where S is the number of individuals (feature
    subsets) in one generation and Ngen is the
    number of generations.
24
An example: EFSS on AAP III, alfa 4
(Diagram: feature subsets selected for the ten base
classifiers C1-C10. Features appearing in the
subsets include f1 sex, f2 age, f3 progress of
pain, f4 duration of pain, f6 severity of pain,
f7 location of pain at present, f9 previous similar
complaints, f11 distended abdomen, f12 tenderness,
f13 severity of tenderness, f14 movement of
abdominal wall, f15 rigidity, f16 rectal tenderness
and f18 leukocytes.)
25
Experimental results on AAP data sets
26
Search strategies on UCI data sets
27
Overfitting in EFS
28
The measures of total diversity
Table 3. Spearman's rank correlation coefficient
(RCC) between the total ensemble diversity and the
difference between the ensemble accuracy and the
average base classifier accuracy (average,
maximal and minimal values)
29
Diversity as a component of the fitness function
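The slide's formula image is not in the transcript; in GEFS-style search the fitness of a member is typically a weighted sum of its accuracy and its diversity, with a weighting coefficient alpha (the "alfa" referred to on the next slide). A hedged sketch of that form, with symbols added here:

    fitness_i = acc_i + \alpha \cdot div_i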
30
For more results and charts
(optimal alfa, best integration methods, the
neighborhood for dynamic integration, overfitting
in GA generations) see the journal paper:
http://www.cs.tcd.ie/publications/tech-reports/reports.03/TCD-CS-2003-44.pdf
31
Conclusions
  • 4 new search strategies proposed and analyzed;
    GA is very promising
  • 7 diversity measures compared
  • the best search strategy and diversity measure
    depend on the context of their use (domain, data
    set characteristics, etc.)
  • the best diversity measures on average:
    disagreement (both variants), kappa, entropy
    and variance

32
Future work
  • regression? not many practical studies so far
  • other diversity measures (DF, double fault) and
    other search strategies (SA, simulated annealing)
  • theoretical dependencies for EFS in crisp
    classification
  • automated prediction of the best search strategy
    and best diversity measure for a data set
  • huge data sets: speech recognition (UC Berkeley),
    text classification
  • closer investigation of GA as the best strategy
    on average
  • data streams and tracking concept drift

33
Thank you
  • Alexey.Tsymbal@cs.tcd.ie
  • http://www.cs.tcd.ie/publications/tech-reports/reports.03/TCD-CS-2003-44.pdf