Transcript and Presenter's Notes

Title: Guest lecture: Feature Selection


1
Guest lecture: Feature Selection
  • Alan Qi
  • yuanqi@mit.edu
  • Dec 2, 2004

2
Outline
  • Problems
  • Overview of feature selection (FS)
  • Filtering: correlation, information criteria
  • Wrapper approach: greedy FS, regularization
  • Classical Bayesian feature selection
  • New Bayesian approach: predictive-ARD

3
Feature selection
  • Gene expression: thousands of gene measurements
  • Documents: bag-of-words model with more than
    10,000 words
  • Images: histograms, colors, wavelet coefficients,
    etc.
  • Task: find a small subset of features for
    prediction

4
Gene Expression Classification
Task: classify gene expression datasets into
different categories, e.g., normal vs. cancer.
Challenge: thousands of genes are measured in the
microarray data, but only a small subset of those
genes is likely to be correlated with the
classification task.
5
Filtering approach
  • Feature ranking based on sensible criteria
  • Correlation between features and labels
  • Mutual information between features and labels
    (see the ranking sketch after this list)
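
A minimal sketch of filter-style ranking by mutual information. It assumes
scikit-learn-style inputs (X as a sample-by-feature array, y as labels); the
function name rank_features and the cutoff k are illustrative, not from the
slides.

    import numpy as np
    from sklearn.feature_selection import mutual_info_classif

    def rank_features(X, y, k=100):
        # Score each feature by its mutual information with the labels,
        # then keep the indices of the k highest-scoring features.
        scores = mutual_info_classif(X, y)
        return np.argsort(scores)[::-1][:k]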

6
Wrapper Approach
  • Assess subsets of variables according to their
    usefulness to a given predictor. A combinatorial
    problem: 2^K subsets given K features.
  • Sequentially adding/removing features: Sequential
    Forward Selection (SFS), Sequential Backward
    Selection (SBS). (A greedy forward-selection
    sketch follows this list.)
  • Recursively adding/removing features: Sequential
    Forward Floating Selection (SFFS). (When to stop?
    Overfitting?)
  • Regularization: use a sparse prior to enhance the
    sparsity of the trained predictor (classifier).
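
A minimal sketch of greedy Sequential Forward Selection. It assumes a
scikit-learn-style estimator and uses cross-validated accuracy as the
usefulness score; the function name, the cv=5 choice, and the stopping rule
are illustrative assumptions.

    from sklearn.model_selection import cross_val_score

    def sequential_forward_selection(model, X, y, max_features=20):
        # Greedily add the single feature that most improves
        # cross-validated accuracy; stop when nothing helps.
        selected, remaining = [], list(range(X.shape[1]))
        best_score = -float("inf")
        while remaining and len(selected) < max_features:
            scores = {j: cross_val_score(model, X[:, selected + [j]], y, cv=5).mean()
                      for j in remaining}
            j_best = max(scores, key=scores.get)
            if scores[j_best] <= best_score:
                break
            best_score = scores[j_best]
            selected.append(j_best)
            remaining.remove(j_best)
        return selected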

7
Regularization
Labels: t = {t_1, ..., t_N}; inputs: X = {x_1, ..., x_N};
parameters: w. Likelihood for the data set
(for classification):

  p(t | X, w) = ∏_{i=1}^N p(t_i | x_i, w)

Regularization: combine the fit to the data with a
penalty for complexity by minimizing

  E(w) = -log p(t | X, w) + λ · penalty(w),

where a sparsity-inducing penalty (e.g., the L1 norm
||w||_1, the counterpart of a sparse prior) drives many
weights to exactly zero.
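
A concrete illustration of the penalized fit, using L1-regularized logistic
regression as a stand-in for a sparsity-inducing prior. The synthetic data,
the solver, and the value C=0.1 are assumptions for the sketch, not choices
made in the lecture.

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression

    # Synthetic data: 500 samples, 100 features, only 5 of them informative.
    X, y = make_classification(n_samples=500, n_features=100,
                               n_informative=5, random_state=0)

    # The L1 penalty drives many weights to exactly zero, i.e. it selects features.
    clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.1).fit(X, y)
    kept = np.flatnonzero(clf.coef_.ravel())
    print("kept", kept.size, "of", X.shape[1], "features")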
8
Bayesian feature selection
  • Background
  • Bayesian classification model
  • Automatic relevance determination (ARD)
  • Risk of Overfitting by optimizing hyperparameters
  • Predictive ARD by expectation propagation (EP)
  • Approximate prediction error
  • EP approximation
  • Experiments
  • Conclusions

9
Motivation
  • Task 1: classify high-dimensional datasets with
    many irrelevant features, e.g., normal vs. cancer
    microarray data.
  • Task 2: build sparse Bayesian kernel classifiers
    for fast test performance.

10
Bayesian Classification Model
Labels t, inputs X, parameters w. Likelihood for the
data set:

  p(t | X, w) = ∏_{i=1}^N Ψ(t_i w^T x_i)

Prior on the classifier weights w:

  p(w | α) = ∏_j N(w_j ; 0, 1/α_j)

where Ψ(·) is the cumulative distribution function of
a standard Gaussian.
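
A small numerical sketch of this likelihood, with Ψ implemented via scipy's
standard-normal CDF; labels are assumed to be coded as t_i in {-1, +1} and
the variable names are illustrative.

    import numpy as np
    from scipy.stats import norm

    def log_likelihood(w, X, t):
        # log p(t | X, w) = sum_i log Psi(t_i * w^T x_i), probit link.
        return np.sum(norm.logcdf(t * (X @ w)))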
11
Evidence and Predictive Distribution
The evidence, i.e., the marginal likelihood of the
hyperparameters, and the predictive posterior
distribution of the label for a new input:
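
In the notation of the previous slide, these quantities take the standard
forms (written out here as a reconstruction in that notation):

  p(t | X, α) = ∫ p(t | X, w) p(w | α) dw

  p(t_new | x_new, t, X, α) = ∫ p(t_new | x_new, w) p(w | t, X, α) dw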
12
Automatic Relevance Determination (ARD)
  • Give the classifier weights independent Gaussian
    priors whose variances, 1/α_j, control how far
    away from zero each weight is allowed to go.
  • Maximize p(t | X, α), the marginal likelihood of
    the model, with respect to α.
  • Outcome: many elements of α go to infinity, which
    naturally prunes irrelevant features in the data.

13
Two Types of Overfitting
  • Classical maximum likelihood:
  • Optimizing the classifier weights w can directly
    fit noise in the data, resulting in a complicated
    model.
  • Type II maximum likelihood (ARD):
  • Optimizing the hyperparameters corresponds to
    choosing which variables are irrelevant. Choosing
    one out of exponentially many models can also
    overfit if we maximize the model marginal
    likelihood.

14
Risk of Optimizing Hyperparameters
  • (Figure: X = Class 1 vs. O = Class 2)

15
Outline
  • Background
  • Bayesian classification model
  • Automatic relevance determination (ARD)
  • Risk of Overfitting by optimizing hyperparameters
  • Predictive ARD by expectation propagation (EP)
  • Approximate prediction error
  • EP approximation
  • Experiments
  • Conclusions

16
Predictive-ARD
  • Choosing the model with the best estimated
    predictive performance instead of the most
    probable model.
  • Expectation propagation (EP) estimates the
    leave-one-out predictive performance without
    performing any expensive cross-validation.

17
Estimate Predictive Performance
  • Predictive posterior given a test data point.
  • EP can estimate the predictive leave-one-out error
    probability, where q(w | t\i) is the approximate
    posterior obtained by leaving out the ith label.
  • EP can also estimate the predictive leave-one-out
    error count. (Both estimates are sketched below.)
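
In the notation above, the leave-one-out estimates take approximately the
following form (a sketch assuming the probit model of the earlier slides):

  p(t_i | x_i, t\i) ≈ ∫ Ψ(t_i w^T x_i) q(w | t\i) dw

  LOO error probability ≈ (1/N) Σ_{i=1}^N [ 1 - p(t_i | x_i, t\i) ]

  LOO error count ≈ Σ_{i=1}^N 1{ p(t_i | x_i, t\i) < 1/2 }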

18
Expectation Propagation in a Nutshell
  • Approximate a probability distribution by
    simpler parametric terms
  • Each approximation term lives in an exponential
    family (e.g. Gaussian), as sketched below
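
Sketched in formulas (assuming the Gaussian-family choice mentioned above),
EP replaces each exact likelihood term by a simpler approximate term:

  p(w | t) ∝ p(w) ∏_{i=1}^N p(t_i | w)   ≈   q(w) ∝ p(w) ∏_{i=1}^N t̃_i(w)

where each t̃_i(w) is an exponential-family (e.g., Gaussian) factor.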

19
EP in a Nutshell
  • Three key steps (sketched in formulas below):
  • Deletion step: approximate the leave-one-out
    predictive posterior for the ith point.
  • Moment matching: minimize a KL divergence by
    matching moments within the chosen exponential
    family.
  • Inclusion step: fold the refined term back into
    the approximate posterior.

The key observation: we can use the approximate
predictive posterior, obtained in the deletion
step, for model selection. No extra computation!
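
A compact sketch of the three steps in the notation above, assuming Gaussian
approximating factors t̃_i(w) as on the previous slide:

  Deletion:      q^{\i}(w) ∝ q(w) / t̃_i(w)

  Moment match:  q_new(w) = argmin over Gaussians q' of
                 KL( p(t_i | w) q^{\i}(w)  ||  q'(w) )

  Inclusion:     t̃_i(w) ∝ q_new(w) / q^{\i}(w)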
20
Comparison of different model selection criteria
for ARD training
The estimated leave-one-out error probabilities
and counts correlate better with the test error
than the evidence and the sparsity level do.
  • 1st row: test error
  • 2nd row: estimated leave-one-out error probability
  • 3rd row: estimated leave-one-out error counts
  • 4th row: evidence (model marginal likelihood)
  • 5th row: fraction of selected features

21
Gene Expression Classification
  • Task: classify gene expression datasets into
    different categories, e.g., normal vs. cancer.
  • Challenge: thousands of genes are measured in the
    microarray data, but only a small subset of genes
    is likely to be correlated with the classification
    task.

22
Classifying Leukemia Data
  • The task: distinguish acute myeloid leukemia
    (AML) from acute lymphoblastic leukemia (ALL).
  • The dataset: 47 and 25 samples of type ALL and
    AML, respectively, with 7129 features per sample.
  • The dataset was randomly split 100 times into 36
    training and 36 testing samples.

23
Classifying Colon Cancer Data
  • The task: distinguish normal from cancer samples.
  • The dataset: 22 normal and 40 cancer samples with
    2000 features per sample.
  • The dataset was randomly split 100 times into 50
    training and 12 testing samples.
  • SVM results are from Li et al. (2002).

24
Bayesian Sparse Kernel Classifiers
  • Using feature/kernel expansions defined on the
    training data points (see the expansion below).
  • Predictive-ARD-EP trains a classifier that
    depends on only a small subset of the training set.
  • Fast test performance.
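
Sketched in formulas, each training point contributes one basis function, so
the classifier takes a kernel-expansion form (the kernel k and the absence of
a bias term are assumptions of this sketch):

  f(x) = Σ_{i=1}^N w_i k(x, x_i)

Applying ARD to the weights w_i then prunes training points rather than input
features, which is what makes test-time evaluation fast.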

25
Test error rates and numbers of relevance vectors or
support vectors on the breast cancer dataset.
  • 50 partitionings of the data were used. All
    these methods use the same Gaussian kernel with
    kernel width 5. The trade-off parameter C in
    SVM is chosen via 10-fold cross-validation for
    each partition.

26
Test error rates on diabetes data
  • 100 partitionings of the data were used.
    Evidence and Predictive ARD-EPs use the Gaussian
    kernel with kernel width 5.

27
Summary
  • Two kinds of feature selection methods:
    filtering and wrapper methods.
  • Classical Bayesian feature selection:
  • An excellent classical approach: tuning the prior
    to prune features.
  • However, maximizing the marginal likelihood can
    lead to overfitting in the model space if there
    are a lot of features.
  • New Bayesian approach: Predictive-ARD, which
    focuses on prediction performance, for
  • feature selection
  • sparse kernel learning