Title: Feature/Model Selection by Linear Programming SVM, Combined with State-of-Art Classifiers: What Can We Learn About the Data
1. Feature/Model Selection by Linear Programming SVM, Combined with State-of-Art Classifiers: What Can We Learn About the Data
- Erinija Pranckeviciene, Ray Somorjai
- Institute for Biodiagnostics, NRC Canada
2. Outline of the presentation
- Description of the algorithm
- Results on the Agnostic Learning vs. Prior Knowledge (AL vs. PK) challenge datasets
- Conclusions
3. Motivation to enter the Challenge
- For small-sample-size / high-dimensional datasets, the feature selection procedure will adapt to the peculiarities of the training dataset (sample bias).
- An ideal model selection procedure would produce stable estimates of the classification error rate, and the identities of the discovered features would not vary much across different random splits.
- Our experiments with Linear Programming SVM (LP-SVM) on biomedical datasets produced results more robust to sample bias and demonstrated the property stated above.
- We decided to check LP-SVM's robustness in a controlled experiment: the independent platform of the AL vs. PK challenge.
4. Classification with LP-SVM
- The formulation of LP-SVM (known as Liknon, Bhattacharyya et al.) is very similar to the conventional linear SVM, except for the objective function, which is linear due to the L1 norm of the regularization term.
- The solution of the LP-SVM is a linear discriminant, in which the weight magnitudes identify those original features that are important for class discrimination.
- Different values of the regularization parameter C in the optimization problem produce different discriminants.
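The L1-regularized SVM described above can be posed as a linear program. A minimal sketch (not the authors' Liknon code), assuming the standard L1-SVM primal min ||w||_1 + C Σ ξ_i subject to y_i(w·x_i + b) ≥ 1 − ξ_i, ξ_i ≥ 0, solved with SciPy's `linprog` by splitting w and b into nonnegative parts:

```python
import numpy as np
from scipy.optimize import linprog

def lp_svm(X, y, C=1.0):
    """L1-norm SVM as a linear program.

    min ||w||_1 + C * sum(xi)  s.t.  y_i (w.x_i + b) >= 1 - xi_i, xi >= 0.
    Writing w = u - v and b = b_pos - b_neg with all variables >= 0 makes
    the objective and constraints linear.
    """
    n, d = X.shape
    # variable vector z = [u (d), v (d), b_pos, b_neg, xi (n)]
    c = np.concatenate([np.ones(2 * d), [0.0, 0.0], C * np.ones(n)])
    Yx = y[:, None] * X
    # margin constraints rewritten as A_ub @ z <= b_ub:
    #   -y_i ((u - v).x_i + b_pos - b_neg) - xi_i <= -1
    A_ub = np.hstack([-Yx, Yx, -y[:, None], y[:, None], -np.eye(n)])
    b_ub = -np.ones(n)
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=(0, None), method="highs")
    w = res.x[:d] - res.x[d:2 * d]          # recover the sparse weight vector
    b = res.x[2 * d] - res.x[2 * d + 1]
    return w, b
```

Because the L1 norm drives most weights to exactly zero at an LP vertex, the surviving nonzero weights play the feature-selection role described on the slide.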
5. Outline of the algorithm
- 1) The available training data are processed in 10-fold stratified cross-validation (folds preserve the existing class proportions): 9/10 of the data are used for training, 1/10 for independent testing.
- 2) The training portion is split randomly into a balanced training set and an unbalanced monitoring set.
- 3) We perform 31 random splits.
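Steps 1)-3) can be sketched in plain NumPy. The helper names `stratified_folds` and `balanced_split` are ours, not from the authors' code, and the balanced set size (half of the minority class per split) is an assumption:

```python
import numpy as np

def stratified_folds(y, n_folds=10, seed=0):
    """Index sets for stratified k-fold CV: each fold keeps the class proportions."""
    rng = np.random.default_rng(seed)
    folds = [[] for _ in range(n_folds)]
    for cls in np.unique(y):
        idx = rng.permutation(np.flatnonzero(y == cls))
        for i, j in enumerate(idx):          # deal samples round-robin into folds
            folds[i % n_folds].append(j)
    return [np.array(sorted(f)) for f in folds]

def balanced_split(y, train_idx, n_splits=31, seed=0):
    """Random splits of the training part into a balanced training set
    (equal samples per class) and an unbalanced monitoring set (the rest)."""
    rng = np.random.default_rng(seed)
    classes = np.unique(y[train_idx])
    per_class = min(np.sum(y[train_idx] == c) for c in classes) // 2
    splits = []
    for _ in range(n_splits):
        train, monitor = [], []
        for c in classes:
            idx = rng.permutation(train_idx[y[train_idx] == c])
            train.extend(idx[:per_class])
            monitor.extend(idx[per_class:])
        splits.append((np.array(train), np.array(monitor)))
    return splits
```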
6. Evolution of the models
- In every split, the training set is used to find several LP-SVM discriminants, determined by a sequence of values of the regularization parameter C. Increasing C increases the number of features.
- A balanced error rate (BER) for every discriminant is estimated on the monitoring set.
- The discriminant / model with the smallest monitoring BER is retained.
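The monitoring criterion is simple to state in code: BER averages the per-class error rates, so it is insensitive to the class imbalance of the monitoring set. A minimal sketch (the function name is ours):

```python
import numpy as np

def balanced_error_rate(y_true, y_pred):
    """BER: the mean of the per-class error rates."""
    per_class_err = [np.mean(y_pred[y_true == c] != c) for c in np.unique(y_true)]
    return float(np.mean(per_class_err))
```

The best model in a split is then simply the discriminant whose monitoring-set BER is smallest.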
7. Example of the evolution of the models on synthetic data
8. Feature profiles and other classifiers
- In a single fold, a feature profile is derived by counting the frequency of inclusion of each feature in the best-BER discriminants (we have 31 best-BER discriminants, one per split).
- As a result, we have an ensemble of linear discriminants operating on the selected features; the feature profile is also tested with other classifiers.
- (Several thresholds of the frequency of inclusion were examined for the different datasets, to test state-of-art classification rules such as KNN, Fisher's linear discriminant, etc.)
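A minimal sketch of how such a feature profile could be computed from the 31 best-BER weight vectors; the helper name and the zero-weight tolerance are our assumptions:

```python
import numpy as np

def feature_profile(discriminants, threshold):
    """Frequency (in %) with which each feature gets a nonzero weight across
    the best-BER discriminants; features at or above `threshold` are kept."""
    W = np.asarray(discriminants)                    # shape (n_splits, n_features)
    freq = 100.0 * np.mean(np.abs(W) > 1e-8, axis=0)
    selected = np.flatnonzero(freq >= threshold)
    return freq, selected
```

Varying `threshold` yields the different feature subsets on which the state-of-art classifiers of the slide are evaluated.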
9. Final model selection
- The performance of all competing models derived in a single fold is estimated by BER on the independent test set; thus we have 10 estimates.
- The final model is selected out of the 10 estimated models.
- The identities of the features occurring in all profiles can also be examined separately.
10. Experimental setup: algorithmic parameters for the AL vs. PK datasets
- T1/T2: size of the training set
- M1/M2: size of the monitoring set
- V1/V2: size of the validation set
- Dim: dimensionality of the data
- Models: number of the models tested
- Th: threshold of the frequency of inclusion of a feature in the feature profile
11. ADA results
- The identities of the features occurring in all profiles (frequency 100): 2, 8, 9, 18, 20, 24, 30
- (Th 55)
12. GINA results
- The identities of the features occurring in all profiles (frequency more than 85): 367, 815, 510, 648, 424
- Last 3: knn1 0.060, knn3 0.058, ens 0.153
- (Th 50)
13. HIVA results
- Best entry (former): 0.2939
- The identities of the features occurring in all profiles: 90
- (Th 20)
14. NOVA results
- Best entry (former): 0.0725
- The identities of the features occurring in all profiles: 100
- (Th 80)
15. SYLVA results
- The identities of the features occurring in all profiles (frequency 100): 202, 55
- (Th 20)
16. Determination of C values
- Given N1 and N2 measurements x of an individual feature k in the two classes, a critical C value is computed for that feature.
- Sort the C values corresponding to the d features in ascending order and solve a model for each.
- The idea behind this comes from the analysis of the dual problem.
- If there are many features, many models have to be solved. This is computationally infeasible, so the C values have to be condensed.
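The formula itself did not survive extraction. A plausible reconstruction from the LP-SVM dual, stated as our assumption rather than the authors' exact expression: the dual constraints have the form

```latex
\left|\sum_{i} \alpha_i \, y_i \, x_{ik}\right| \le 1,
\qquad 0 \le \alpha_i \le C ,
```

and pushing every $\alpha_i$ to its upper bound $C$ gives, for feature $k$ with $N_1$ and $N_2$ class samples,

```latex
C_k \;=\; \frac{1}{\left|\sum_{i=1}^{N_1} x_{ik} \;-\; \sum_{j=1}^{N_2} x_{jk}\right|} ,
```

the critical value of the regularization parameter at which feature $k$ can first enter the discriminant. Sorting the $C_k$ in ascending order then yields the sequence of models described on the slide.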
17. Different ways of condensing C
- The challenge submissions differed in how the C values were chosen.
- Initially a histogram was used; based on the final ranking, this method worked better for HIVA and NOVA.
- In the last submissions, the rate of change of a slightly modified objective function of the primal was used; this worked better for ADA, GINA and SYLVA.
- We are still looking for a less heuristic and more precise method.
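The histogram-based condensing could look like the sketch below; the exact binning the authors used is not stated, so this is only an illustration of the idea (bin the per-feature C values and keep one representative per non-empty bin):

```python
import numpy as np

def condense_c_values(c_values, n_bins=20):
    """Condense d per-feature C values into at most n_bins representatives:
    histogram them and keep the centers of the non-empty bins."""
    counts, edges = np.histogram(c_values, bins=n_bins)
    centers = 0.5 * (edges[:-1] + edges[1:])
    return centers[counts > 0]
```

This replaces solving d linear programs with solving at most `n_bins` of them.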
18. Conclusions
- The main advantages of our method are simplicity and the interpretability of the results; the disadvantage is a high computational burden.
- Ensembles tend to perform better than individual rules, except for GINA.
- The same feature identities were consistently discovered in all splits and folds.
- The derived feature identities have to be compared with the ground truth in the Prior Knowledge track.
- Some arbitrariness, unavoidable in this experiment, will be dealt with in future work: the threshold in the feature profile, the numbers of samples for training and monitoring, the number of splits, and the number of models.
19. Many thanks
- To Muoi Tran for discussions and support,
- And to you, for your attention!