Title: Feature/Model Selection by Linear Programming SVM, Combined with State-of-Art Classifiers: What Can We Learn About the Data
1. Feature/Model Selection by Linear Programming SVM, Combined with State-of-Art Classifiers: What Can We Learn About the Data
- Erinija Pranckeviciene, Ray Somorjai
- Institute for Biodiagnostics, NRC Canada
2. Outline of the presentation
- Description of the algorithm
- Results on the Agnostic Learning vs. Prior Knowledge (AL vs. PK) challenge datasets
- Conclusions
3. Motivation to enter the Challenge
- For small-sample-size / high-dimensional datasets, the feature selection procedure will adapt to the peculiarities of the training dataset (sample bias).
- An ideal model selection procedure would produce stable estimates of the classification error rate, and the identities of the discovered features would not vary much across different random splits.
- Our experiments with Linear Programming SVM (LP-SVM) on biomedical datasets produced results more robust to sample bias and demonstrated the property stated above.
- We decided to check LP-SVM's robustness in a controlled experiment: the independent platform of the AL vs. PK challenge.
4. Classification with LP-SVM
- The formulation of LP-SVM (known as Liknon, Bhattacharyya et al.) is very similar to the conventional linear SVM, except for the objective function, which is linear due to the L1 norm of the regularization term.
- The solution of the LP-SVM is a linear discriminant, in which the weight magnitudes identify those original features that are important for class discrimination.
- Different values of the regularization parameter C in the optimization problem produce different discriminants.
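The L1-regularized SVM described above can be posed as a linear program. A minimal sketch (not the authors' Liknon code), assuming the standard L1-SVM primal min ||w||_1 + C Σ ξ_i subject to y_i(w·x_i + b) ≥ 1 − ξ_i, ξ_i ≥ 0, solved with SciPy's `linprog` by splitting w and b into nonnegative parts:

```python
import numpy as np
from scipy.optimize import linprog

def lp_svm(X, y, C=1.0):
    """L1-norm SVM as a linear program.

    min ||w||_1 + C * sum(xi)  s.t.  y_i (w.x_i + b) >= 1 - xi_i, xi >= 0.
    Writing w = u - v and b = b_pos - b_neg with all variables >= 0 makes
    the objective and constraints linear.
    """
    n, d = X.shape
    # variable vector z = [u (d), v (d), b_pos, b_neg, xi (n)]
    c = np.concatenate([np.ones(2 * d), [0.0, 0.0], C * np.ones(n)])
    Yx = y[:, None] * X
    # margin constraints rewritten as A_ub @ z <= b_ub:
    #   -y_i ((u - v).x_i + b_pos - b_neg) - xi_i <= -1
    A_ub = np.hstack([-Yx, Yx, -y[:, None], y[:, None], -np.eye(n)])
    b_ub = -np.ones(n)
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=(0, None), method="highs")
    w = res.x[:d] - res.x[d:2 * d]          # recover the sparse weight vector
    b = res.x[2 * d] - res.x[2 * d + 1]
    return w, b
```

Because the L1 norm drives most weights to exactly zero at an LP vertex, the surviving nonzero weights play the feature-selection role described on the slide.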
5. Outline of the algorithm
- 1) The available training data are processed in 10-fold stratified cross-validation (folds preserve the existing class proportions): 9/10 of the data are used for training, 1/10 for independent testing.
- 2) The training portion is split randomly into a balanced training set and an unbalanced monitoring set.
- 3) We perform 31 random splits.
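Steps 1)-3) can be sketched in plain NumPy. The helper names `stratified_folds` and `balanced_split` are ours, not from the authors' code, and the balanced set size (half of the minority class per split) is an assumption:

```python
import numpy as np

def stratified_folds(y, n_folds=10, seed=0):
    """Index sets for stratified k-fold CV: each fold keeps the class proportions."""
    rng = np.random.default_rng(seed)
    folds = [[] for _ in range(n_folds)]
    for cls in np.unique(y):
        idx = rng.permutation(np.flatnonzero(y == cls))
        for i, j in enumerate(idx):          # deal samples round-robin into folds
            folds[i % n_folds].append(j)
    return [np.array(sorted(f)) for f in folds]

def balanced_split(y, train_idx, n_splits=31, seed=0):
    """Random splits of the training part into a balanced training set
    (equal samples per class) and an unbalanced monitoring set (the rest)."""
    rng = np.random.default_rng(seed)
    classes = np.unique(y[train_idx])
    per_class = min(np.sum(y[train_idx] == c) for c in classes) // 2
    splits = []
    for _ in range(n_splits):
        train, monitor = [], []
        for c in classes:
            idx = rng.permutation(train_idx[y[train_idx] == c])
            train.extend(idx[:per_class])
            monitor.extend(idx[per_class:])
        splits.append((np.array(train), np.array(monitor)))
    return splits
```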
6. Evolution of the models
- In every split, the training set is used to find several LP-SVM discriminants, determined by a sequence of values of the regularization parameter C. Increasing C increases the number of features.
- A balanced error rate (BER) for every discriminant is estimated on the monitoring set.
- The discriminant / model with the smallest monitoring BER is retained.
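The monitoring criterion is simple to state in code: BER averages the per-class error rates, so it is insensitive to the class imbalance of the monitoring set. A minimal sketch (the function name is ours):

```python
import numpy as np

def balanced_error_rate(y_true, y_pred):
    """BER: the mean of the per-class error rates."""
    per_class_err = [np.mean(y_pred[y_true == c] != c) for c in np.unique(y_true)]
    return float(np.mean(per_class_err))
```

The best model in a split is then simply the discriminant whose monitoring-set BER is smallest.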
7. Example of the evolution of the models on synthetic data
8. Feature profiles and other classifiers
- In a single fold, a feature profile is derived by counting the frequency of inclusion of each feature in the best-BER discriminants (we have 31 best-BER discriminants, one per split).
- As a result, we have an ensemble of linear discriminants operating on the selected features; the feature profile is also tested with other classifiers.
- (Several thresholds of the frequency of inclusion were examined for the different datasets, to test state-of-art classification rules such as KNN, Fisher's linear discriminant, etc.)
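A minimal sketch of how such a feature profile could be computed from the 31 best-BER weight vectors; the helper name and the zero-weight tolerance are our assumptions:

```python
import numpy as np

def feature_profile(discriminants, threshold):
    """Frequency (in %) with which each feature gets a nonzero weight across
    the best-BER discriminants; features at or above `threshold` are kept."""
    W = np.asarray(discriminants)                    # shape (n_splits, n_features)
    freq = 100.0 * np.mean(np.abs(W) > 1e-8, axis=0)
    selected = np.flatnonzero(freq >= threshold)
    return freq, selected
```

Varying `threshold` yields the different feature subsets on which the state-of-art classifiers of the slide are evaluated.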
9. Final model selection
- The performance of all competing models derived in a single fold is estimated by BER on the independent test set; thus we have 10 estimates.
- The final model is selected out of the 10 estimated models.
- The identities of the features occurring in all profiles can also be examined separately.
10. Experimental setup: algorithmic parameters for the AL vs. PK datasets
- T1/T2: size of the training set
- M1/M2: size of the monitoring set
- V1/V2: size of the validation set
- Dim: dimensionality of the data
- Models: number of the models tested
- Th: threshold of the frequency of inclusion of a feature in the feature profile
11. ADA results
- The identities of the features occurring in all profiles (frequency 100): 2, 8, 9, 18, 20, 24, 30
- (Th 55)
12. GINA results
- The identities of the features occurring in all profiles (frequency more than 85): 367, 815, 510, 648, 424
- Last 3: knn1 0.060, knn3 0.058, ens 0.153
- (Th 50)
13. HIVA results
- Best entry (former): 0.2939
- The identities of the features occurring in all profiles: 90
- (Th 20)
14. NOVA results
- Best entry (former): 0.0725
- The identities of the features occurring in all profiles: 100
- (Th 80)
15. SYLVA results
- The identities of the features occurring in all profiles (frequency 100): 202, 55
- (Th 20)
16. Determination of C values
- Given N1 and N2 measurements x of an individual feature k in the two classes, a critical C value is computed for that feature.
- Sort the C values corresponding to the d features in ascending order and solve a model for each.
- The idea behind this comes from the analysis of the dual problem.
- If there are many features, many models have to be solved. This is computationally infeasible, so the C values have to be condensed.
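The formula itself did not survive extraction. A plausible reconstruction from the LP-SVM dual, stated as our assumption rather than the authors' exact expression: the dual constraints have the form

```latex
\left|\sum_{i} \alpha_i \, y_i \, x_{ik}\right| \le 1,
\qquad 0 \le \alpha_i \le C ,
```

and pushing every $\alpha_i$ to its upper bound $C$ gives, for feature $k$ with $N_1$ and $N_2$ class samples,

```latex
C_k \;=\; \frac{1}{\left|\sum_{i=1}^{N_1} x_{ik} \;-\; \sum_{j=1}^{N_2} x_{jk}\right|} ,
```

the critical value of the regularization parameter at which feature $k$ can first enter the discriminant. Sorting the $C_k$ in ascending order then yields the sequence of models described on the slide.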
17. Different ways of condensing C
- The challenge submissions differed in how the C values were chosen.
- Initially a histogram was used; based on the final ranking, this method worked better for HIVA and NOVA.
- In the last submissions, the rate of change of a slightly modified objective function of the primal was used; this worked better for ADA, GINA and SYLVA.
- We are still looking for a less heuristic and more precise method.
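The histogram-based condensing could look like the sketch below; the exact binning the authors used is not stated, so this is only an illustration of the idea (bin the per-feature C values and keep one representative per non-empty bin):

```python
import numpy as np

def condense_c_values(c_values, n_bins=20):
    """Condense d per-feature C values into at most n_bins representatives:
    histogram them and keep the centers of the non-empty bins."""
    counts, edges = np.histogram(c_values, bins=n_bins)
    centers = 0.5 * (edges[:-1] + edges[1:])
    return centers[counts > 0]
```

This replaces solving d linear programs with solving at most `n_bins` of them.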
18. Conclusions
- The main advantages of our method are simplicity and the interpretability of the results; the disadvantage is a high computational burden.
- Ensembles tend to perform better than individual rules, except for GINA.
- The same feature identities were consistently discovered in all splits and folds.
- The derived feature identities have to be compared with the ground truth in the Prior Knowledge track.
- Some arbitrariness, unavoidable in this experiment, will be dealt with in future work: the threshold in the feature profile, the numbers of samples for training and monitoring, the number of splits, and the number of models.
19. Many thanks
- To Muoi Tran for discussions and support,
- And to you, for your attention!