1
Feature/Model Selection by Linear Programming SVM,
Combined with State-of-Art Classifiers: What Can We
Learn About the Data?
  • Erinija Pranckeviciene, Ray Somorjai
  • Institute for Biodiagnostics, NRC Canada


2
Outline of the presentation
  • Description of the algorithm
  • Results on Agnostic Learning vs. Prior Knowledge
    (AL vs. PK) challenge datasets
  • Conclusions

3
Motivation to enter the Challenge
  • For small sample size / high dimensional
    datasets, the feature selection procedure will
    adapt to the peculiarities of the training
    dataset (sample bias).
  • An ideal model selection procedure would produce
    stable estimates of the classification error rate
    and the identities of discovered features would
    not vary much across the different random splits.
  • Our experiments with Linear Programming SVM
    (LP-SVM) on biomedical datasets produced results
    more robust to the sample bias and demonstrated
    the property stated above.
  • We decided to check LP-SVM's robustness property
    in a controlled experiment, on an independent
    platform: the AL vs. PK challenge.

4
Classification with LP-SVM
  • The formulation of LP-SVM (known as Liknon,
    Bhattacharyya et al.) is very similar to the
    conventional linear SVM, except that the objective
    function is linear, due to the L1-norm
    regularization term.
  • The solution of the LP-SVM is a linear
    discriminant, in which the weight magnitudes
    identify those original features that are
    important for class discrimination.
  • Different values of the regularization parameter
    C in the optimization problem produce different
    discriminants (the standard formulation is
    sketched below).
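For reference, a minimal sketch of the LP-SVM (Liknon) primal in its standard form; the split of w into non-negative parts, which makes the L1 norm linear, is the usual construction and is assumed here rather than quoted from the slides.

```latex
\min_{\mathbf{w},\,b,\,\boldsymbol{\xi}}\;\;
\|\mathbf{w}\|_{1} \;+\; C\sum_{i=1}^{N}\xi_{i}
\qquad \text{s.t.}\quad
y_{i}\left(\mathbf{w}^{\top}\mathbf{x}_{i}+b\right)\ \ge\ 1-\xi_{i},
\qquad \xi_{i}\ \ge\ 0,\;\; i=1,\dots,N .
```

Writing w = u - v with u, v >= 0, so that the L1 norm becomes the sum of u_j + v_j, turns this into a linear program; the non-zero components of the solution w mark the features used for class discrimination.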

5
Outline of the algorithm
  • 1) The available training data are processed by
    10-fold stratified cross-validation (preserving
    the existing class proportions): 9/10 of the data
    are used for training, 1/10 for independent
    testing.
  • 2) The training portion is split randomly into a
    balanced training set and an unbalanced monitoring
    set.
  • 3) We perform 31 such random splits (a sketch of
    this splitting scheme follows below).
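A minimal sketch of this splitting scheme, assuming scikit-learn is available; the balanced_split helper and the toy arrays X, y are illustrative stand-ins, not the authors' code.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

def balanced_split(X, y, rng):
    """Split a fold's training portion into a balanced training set
    (equal samples per class) and an unbalanced monitoring set (the rest)."""
    classes, counts = np.unique(y, return_counts=True)
    n_per_class = counts.min()
    train_idx = np.concatenate([
        rng.choice(np.where(y == c)[0], n_per_class, replace=False)
        for c in classes
    ])
    monitor_idx = np.setdiff1d(np.arange(len(y)), train_idx)
    return train_idx, monitor_idx

# Toy data standing in for one of the challenge datasets.
X, y = np.random.randn(200, 50), np.repeat([0, 1], [60, 140])

rng = np.random.default_rng(0)
skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
for cv_train, cv_test in skf.split(X, y):              # 1) 10 stratified folds
    for _ in range(31):                                 # 3) 31 random splits per fold
        tr, mon = balanced_split(X[cv_train], y[cv_train], rng)  # 2) balanced / monitoring
        # ... train LP-SVM models on the balanced set, monitor BER, test on cv_test
```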

6
Evolution of the models
  • The training set is used to find several LP-SVM
    discriminants, determined by the sequence of
    values of the regularization parameter C, in
    every split. Increasing C increases the number of
    features.
  • A balanced error rate (BER) for every
    discriminant is estimated on the monitoring set.
  • The discriminant / model with the smallest
    monitoring BER is retained (see the per-split
    selection sketch below).
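A sketch of this per-split selection, assuming a Python environment with scikit-learn; LinearSVC with an L1 penalty is used here only as a stand-in for the LP-SVM (Liknon) solver, and the BER helper is illustrative.

```python
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.metrics import balanced_accuracy_score

def ber(y_true, y_pred):
    """Balanced error rate: 1 minus the mean of the per-class accuracies."""
    return 1.0 - balanced_accuracy_score(y_true, y_pred)

def best_model_for_split(X_tr, y_tr, X_mon, y_mon, C_values):
    """Fit one discriminant per C value and keep the one with the
    smallest BER on the monitoring set."""
    best = None
    for C in C_values:            # larger C admits more non-zero weights / features
        clf = LinearSVC(penalty='l1', loss='squared_hinge', dual=False, C=C)
        clf.fit(X_tr, y_tr)
        score = ber(y_mon, clf.predict(X_mon))
        if best is None or score < best[0]:
            best = (score, C, clf)
    return best                   # (monitoring BER, C, retained discriminant)
```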

7
Example of the evolution of the models on synthetic
data
8
Feature profiles and other classifiers
  • In a single fold, a feature profile is derived by
    counting the frequency of inclusion of the
    features in the best-BER discriminants (we have 31
    best-BER discriminants, one per split).
  • As a result, we have an ensemble of linear
    discriminants operating on the selected features,
    and the feature profile is then tested with other
    classifiers.
  • Several thresholds on the frequency of inclusion
    were examined for the different datasets, to test
    state-of-art classification rules such as KNN,
    Fisher's discriminant, etc. (a profile sketch
    follows below).
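A sketch of the profile construction; the "non-zero weight means the feature is included" rule and the percentage threshold are assumptions consistent with, but not quoted from, the slides.

```python
import numpy as np

def feature_profile(discriminant_weights, threshold_pct):
    """discriminant_weights: array of shape (31, n_features), one weight
    vector per best-BER discriminant of the fold. Returns the inclusion
    frequency (%) of each feature and the indices passing the threshold."""
    included = np.abs(np.asarray(discriminant_weights)) > 0
    freq_pct = 100.0 * included.mean(axis=0)
    selected = np.where(freq_pct >= threshold_pct)[0]
    # The selected features are then fed to k-NN, Fisher's discriminant, etc.
    return freq_pct, selected
```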

9
Final model selection
  • The performance of all competing models derived
    in a single fold is estimated by BER on the
    independent test set; thus we have 10 estimates.
  • The final model is selected out of the 10
    estimated models (a minimal sketch follows).
  • The identities of the features occurring in all
    profiles can also be examined separately.
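A minimal sketch of this last step, under the assumption that each fold contributes one candidate model evaluated by its independent-test-fold BER.

```python
def select_final_model(fold_results):
    """fold_results: list of (test_ber, model) pairs, one per CV fold
    (10 in total). Returns the pair with the smallest test BER."""
    return min(fold_results, key=lambda fm: fm[0])
```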

10
Experimental setup: algorithmic parameters for the AL
vs. PK datasets
  • T1/T2 - size of the training set
  • M1/M2 - size of the monitoring set
  • V1/V2 - size of the validation set
  • Dim - dimensionality of the data
  • Models - number of models tested
  • Th - threshold on the frequency of inclusion
    of a feature in the feature profile.

11
ADA results
  • Identities of the features occurring in all
    profiles:
  • 100 - features 2, 8, 9, 18, 20, 24, 30
  • (Th = 55)
12
GINA results
  • Identities of the features occurring in all
    profiles:
  • More than 85 - features 367, 815, 510, 648, 424
  • Last 3: knn1 0.060, knn3 0.058, ens 0.153
  • (Th = 50)
13
HIVA results
Best entry (former): 0.2939
  • Identities of the features occurring in all
    profiles: 90
  • (Th = 20)
14
NOVA results
Best entry (former): 0.0725
  • Identities of the features occurring in all
    profiles: 100
  • (Th = 80)
15
SYLVA results
  • Identities of the features occurring in all
    profiles:
  • 100 - features 202, 55
  • (Th = 20)
16
Determination of C values
  • Given N1 and N2 measurements x of an individual
    feature k in the two classes, a per-feature C
    value is computed (formula given on the slide).
  • Sort the C values corresponding to the d features
    in ascending order and solve a model for each.
  • The idea behind this comes from the analysis of
    the dual problem.
  • With many features, many models would have to be
    solved; this is computationally infeasible, so the
    C values have to be condensed.

17
Different ways of condensing C
  • The challenge submissions differed in how the C
    values were chosen.
  • Initially a histogram was used; based on the
    final ranking, this method worked better for HIVA
    and NOVA (a condensing sketch follows this list).
  • In the last submissions, the rate of change of a
    slightly modified objective function of the primal
    was used. This worked better for ADA, GINA and
    SYLVA.
  • We are still looking for a less heuristic and
    more precise method.
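A sketch of the histogram-based condensing mentioned above; the per-feature candidate C values are taken as given (their formula appears on the previous slide), and the bin count is an arbitrary assumption.

```python
import numpy as np

def condense_C(per_feature_C, n_bins=20):
    """Histogram the per-feature candidate C values and keep one
    representative (bin centre) per occupied bin, so far fewer LP-SVM
    models need to be solved."""
    C = np.sort(np.asarray(per_feature_C, dtype=float))
    counts, edges = np.histogram(C, bins=n_bins)
    centres = 0.5 * (edges[:-1] + edges[1:])
    return centres[counts > 0]
```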

18
Conclusions
  • The main advantages of our method are simplicity
    and the interpretability of the results. The
    disadvantage is high computational burden.
  • Ensembles tend to perform better than individual
    rules, except for GINA.
  • The same feature identities were consistently
    discovered in all splits and folds.
  • The derived feature identities have to be
    compared with the ground truth in the Prior
    Knowledge track.
  • Some arbitrariness, unavoidable in this
    experiment, will be dealt with in future work:
    the threshold on the feature profile, the number
    of samples for training and monitoring, the number
    of splits, and the number of models.

19
Many thanks
  • To Muoi Tran for discussions and support,
  • For your attention!