Greedy Feature Grouping for Optimal Discriminant Subspaces

Description:

Discriminant information may well lie in a small subspace ... AML / ALL Leukaemia data. ALL. AML. 72 Patients. 200 Genes. 6 groups; random seeds. Conclusions ...


Transcript and Presenter's Notes



1
Greedy Feature Grouping for Optimal Discriminant
Subspaces
  • Mahesan Niranjan
  • Department of Computer Science
  • The University of Sheffield
  • European Bioinformatics Institute

2
Overview
  • Motivation
  • Feature Selection
  • Feature Grouping Algorithm
  • Simulations
  • Synthetic Data
  • Gene Expression Data
  • Conclusions and Future

3
Motivation
  • Many new high dimensional problems
  • Language processing
  • Synthetic chemical molecules
  • High throughput experiments in genomics
  • Discriminant information may well lie in a small
    subspace
  • Better classifiers
  • Better interpretation of classifier

4
Curse of dimensionality
Density estimation in high dimensions is difficult
5
Support Vector Machines
Classification, not density estimation
6
Support Vector Machines: Nonlinear Kernel Functions
7
Classifier design
  • Usually to minimize error rate
  • Error rates can be misleading
  • Large imbalance in classes
  • Cost of misclassification can change

8
[Figure: Adverse Outcome and Benign Outcome examples (x) separated by a Class Boundary, with a Threshold]
9
Area under the ROC Curve: Neat Statistical Interpretation
[Figure: ROC curve; axes: True Positive, False Positive]
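The statistical interpretation referred to here is usually stated as: the area under the ROC curve equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. A minimal sketch of that pairwise computation follows; the function name `auroc` and the convention that ties count as one half are assumptions made for illustration, not taken from the slides.

```python
import numpy as np

def auroc(scores_pos, scores_neg):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = np.asarray(scores_pos, dtype=float)[:, None]  # column vector of positive scores
    neg = np.asarray(scores_neg, dtype=float)[None, :]  # row vector of negative scores
    # Broadcasting compares every positive score with every negative score.
    return float(np.mean((pos > neg) + 0.5 * (pos == neg)))
```

Because the value depends only on the ranking of the scores, it is unaffected by the choice of threshold or by class imbalance, which is the concern raised on the classifier design slide.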
10
Convex Hull of ROC Curves
[Figure: ROC curves and their convex hull; axes: True Positive, False Positive]
Provost & Fawcett; Scott, Niranjan & Prager
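As a rough illustration of the convex hull idea, the sketch below takes ROC operating points pooled from several classifiers and keeps only those on the upper convex hull. The helper names `cross` and `roc_convex_hull` are hypothetical, and the monotone-chain hull used here is a standard geometric routine, not necessarily the procedure used in the cited work.

```python
def cross(o, a, b):
    """Z-component of the cross product (a - o) x (b - o)."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

def roc_convex_hull(points):
    """Upper convex hull of (false positive rate, true positive rate) points.

    The trivial operating points (0, 0) and (1, 1) are always included.
    """
    pts = sorted(set(points) | {(0.0, 0.0), (1.0, 1.0)})
    hull = []
    for p in pts:
        # Drop the last hull point while it lies on or below the chord to p.
        while len(hull) >= 2 and cross(hull[-2], hull[-1], p) >= 0:
            hull.pop()
        hull.append(p)
    return hull

def area_under_hull(hull):
    """Trapezoidal area under a hull given as points sorted by false positive rate."""
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(hull, hull[1:]))
```

Points that fall below the hull are dominated: some combination of the other operating points does at least as well at every false positive rate.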
11
Feature selection in classification
  • Filters
  • select subset that scores high
  • Wrappers
  • Sequential Forward Selection / Backward deletion
  • PARCEL
  • Scott, Niranjan & Prager: uses convex hulls of
    ROC curves

12
PARCEL: Feature subset selection
  • Area under Convex Hull of multiple ROCs
  • Different classifier architectures (including
    different features) at different operating
    points.
  • Has been put to good use in independent
    implementations
  • Oxford, UCL, Surrey
  • Sheffield Speech Group

13
Gene Expression Microarrays
14
Inference problems in Microarray Data
  • Clustering
  • Similar expression patterns might imply
  • similar function
  • regulated in the same way
  • e.g. activated by the same transcription
    factor, concentration maintained by the same
    mechanism, etc.
  • Classification
  • diagnostics - e.g. disease / not
  • prediction - e.g. survival

Goal: discrimination with features that do cluster
15
Subspaces of gene expressions
  • Singular Value Decomposition (SVD)
  • Robust SVD for missing values and outliers
  • Combining different datasets
  • Pseudo-inverse Projection
  • Generalized SVD

Eigenarrays and Eigengenes
Alter, Brown & Botstein, PNAS, 2000; Alter & Golub, PNAS, 2004
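A small sketch of the basic SVD step may help fix the terminology. It assumes a genes x arrays expression matrix, with left singular vectors read as eigenarrays and right singular vectors as eigengenes; the random matrix is stand-in data and the function name `expression_svd` is hypothetical.

```python
import numpy as np

def expression_svd(X):
    """SVD of a genes x arrays matrix X.

    Returns eigenarrays (columns, one value per gene), eigengenes (rows, one
    value per array), and the fraction of overall expression captured by each.
    """
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    fraction = s**2 / np.sum(s**2)
    return U, Vt, fraction

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 12))  # stand-in for 200 genes measured on 12 arrays
eigenarrays, eigengenes, fraction = expression_svd(X)
```

Projecting onto the leading few eigengenes gives a low-dimensional subspace of the expression data, though one chosen for variance rather than for discrimination.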
16
Yeast Gene Classification (switch to MATLAB here)
  • 2000 yeast genes, 79 experiments
  • Ribosome (125) / Not (1750)
  • First use of SVM: Brown et al, PNAS 1999
17
Discriminant Subspaces
18
Seemingly similar models
  • Product of Experts ( Hinton )
  • Modular Mixture Model ( Attias )
  • mixture model in subspaces
  • Combined by hidden nodes

Full feature set
None of these search for combinations of features
19
Algorithm
  • Select M (number of groups)
  • Initial assignment: one feature per group
    (at random / domain knowledge)
  • Sequential search through remaining features:
    which feature, which group
  • Maximize average AUROC / sum of Fisher ratios
  • Stopping criterion
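The steps on this slide can be sketched roughly as follows, using the sum of Fisher ratios as the objective. This is an illustrative reading of the slide, not the author's implementation: the names `fisher_ratio` and `greedy_feature_grouping`, the small `ridge` term added for numerical stability, and the `min_gain` stopping rule are all assumptions (the slide says only "Stopping criterion").

```python
import numpy as np

def fisher_ratio(X, y, features, ridge=1e-3):
    """Two-class Fisher ratio d^T Sw^{-1} d restricted to the given feature columns."""
    A = X[y == 0][:, features]
    B = X[y == 1][:, features]
    d = A.mean(axis=0) - B.mean(axis=0)                   # separation of means
    Sw = (np.atleast_2d(np.cov(A, rowvar=False))
          + np.atleast_2d(np.cov(B, rowvar=False)))       # within-class scatter
    Sw = Sw + ridge * np.eye(len(features))               # assumed regularizer
    return float(d @ np.linalg.solve(Sw, d))

def greedy_feature_grouping(X, y, n_groups, min_gain=0.5, seed=0):
    """Grow n_groups feature groups greedily; returns (groups, per-group scores)."""
    rng = np.random.default_rng(seed)
    n_features = X.shape[1]
    # Initial assignment: one randomly chosen feature per group.
    seeds = [int(f) for f in rng.choice(n_features, size=n_groups, replace=False)]
    groups = [[f] for f in seeds]
    scores = [fisher_ratio(X, y, g) for g in groups]
    remaining = set(range(n_features)) - set(seeds)
    while remaining:
        best = None  # (gain, feature, group index, new group score)
        # Sequential search: which feature, added to which group, helps most?
        for f in sorted(remaining):
            for gi, g in enumerate(groups):
                new = fisher_ratio(X, y, g + [f])
                gain = new - scores[gi]
                if best is None or gain > best[0]:
                    best = (gain, f, gi, new)
        if best[0] < min_gain:  # assumed stopping criterion
            break
        _, f, gi, new = best
        groups[gi].append(f)
        scores[gi] = new
        remaining.remove(f)
    return groups, scores
```

A call such as `greedy_feature_grouping(X, y, n_groups=3)` returns the feature indices in each group together with each group's Fisher ratio. Some threshold like `min_gain` is needed in a sketch of this kind because the in-sample Fisher ratio of a group does not decrease when a feature is added, so without it every feature would eventually be assigned.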
20
Another view
Within Class Scatter
Separation of Means
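For reference, the two quantities named on this slide are the ingredients of the standard two-class Fisher criterion. The notation below is an assumption, since the slide's own equations did not survive in the transcript: class means are written as mu_1 and mu_2, class covariances as Sigma_1 and Sigma_2, and w is a projection direction.

```latex
J(\mathbf{w}) =
  \frac{\left(\mathbf{w}^{\top}(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)\right)^{2}}
       {\mathbf{w}^{\top} S_W \mathbf{w}},
\qquad
S_W = \Sigma_1 + \Sigma_2,
\qquad
\max_{\mathbf{w}} J(\mathbf{w}) =
  (\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)^{\top} S_W^{-1} (\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)
```

The numerator measures the separation of the projected means, the denominator the within-class scatter along the same direction, and the maximum is attained for w proportional to S_W^{-1}(mu_1 - mu_2).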
21
Another view
22
Block diagonal scatter matrix
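One way to read this slide, offered as an interpretation and not as a reconstruction of the missing figure: if the features are partitioned into M groups and the within-class scatter matrix is block diagonal with respect to that partition, the overall Fisher criterion splits into a sum of per-group Fisher ratios. Here Delta mu denotes the difference of class means and the subscript g restricts a quantity to the features in group g; both symbols are notation assumed for this note.

```latex
S_W = \begin{pmatrix} S_1 & & \\ & \ddots & \\ & & S_M \end{pmatrix}
\quad\Longrightarrow\quad
\Delta\boldsymbol{\mu}^{\top} S_W^{-1} \Delta\boldsymbol{\mu}
  = \sum_{g=1}^{M} \Delta\boldsymbol{\mu}_g^{\top} S_g^{-1} \Delta\boldsymbol{\mu}_g
```

Under this reading, maximizing the sum of Fisher ratios over groups amounts to fitting a single discriminant whose scatter matrix is constrained to be block diagonal.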
23
Simulations
  • Synthetic Data
  • 50 dimensions
  • 3 groups of 4 dimensions; 38 irrelevant
  • ( random, but valid, covariance matrices )
  • 40 examples ( 20 / 20 )
  • 100 simulations
  • Results
  • 70% of the time correct features identified
  • Often incorrect group assignment
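A data set with the shape described above could be generated along the following lines. Only the dimensions come from the slide (50 dimensions, 3 groups of 4, 38 irrelevant, 20 examples per class); the way the random covariances are built, the size of the class mean shift (`shift`), and the function names are assumptions for illustration.

```python
import numpy as np

def random_covariance(dim, rng):
    """Random symmetric positive-definite matrix (a valid covariance)."""
    A = rng.standard_normal((dim, dim))
    return A @ A.T + 0.1 * np.eye(dim)

def make_synthetic(n_per_class=20, n_dims=50, n_groups=3, group_size=4,
                   shift=1.0, seed=0):
    """Return X (examples x dims), labels y, and the informative feature groups."""
    rng = np.random.default_rng(seed)
    n = 2 * n_per_class
    y = np.repeat([0, 1], n_per_class)
    # Irrelevant dimensions: identical unit Gaussians for both classes.
    X = rng.standard_normal((n, n_dims))
    groups = [list(range(g * group_size, (g + 1) * group_size))
              for g in range(n_groups)]
    for g in groups:
        cov = random_covariance(group_size, rng)
        X[:, g] = rng.multivariate_normal(np.zeros(group_size), cov, size=n)
        X[np.ix_(y == 1, g)] += shift  # assumed class difference: a mean shift
    return X, y, groups

X, y, groups = make_synthetic()
```

Placing the informative groups in the first 12 columns is purely for readability; a grouping algorithm has no access to that ordering.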

24
Simulations
  • Microarray Data: Rosenwald et al, 2002
  • cancer patients, survival after chemotherapy
  • 240 data points; 7000 genes on array
  • (filtered to 1000 genes)
  • two classes: survived to 4 years / not
  • 10 to 20 subspaces; random initial seeds; 60 runs
  • classification accuracy comparable to reported
    results
  • 10% of runs failed to group the features
  • ( discrimination based on single subspace )

25
AML / ALL Leukaemia data
26
Conclusions
  • Searching for feature groups
  • Desirable
  • Feasible
  • Achieves discriminant clustering
  • Next Step
  • Biological interpretation of groups
  • comparison of genes in clusters with
    functional annotations, where known
  • ( gene ontology )
  • Careful initialization ( known pathways )