Seeking Interpretable Models for High Dimensional Data - PowerPoint PPT Presentation


Transcript and Presenter's Notes

Title: Seeking Interpretable Models for High Dimensional Data


1
Seeking Interpretable Models for High Dimensional Data
Bin Yu
Statistics Department and EECS Department, University of California, Berkeley
http://www.stat.berkeley.edu/~binyu
2
Characteristics of Modern Data Problems
  • Goal: efficient use of data for
  • Prediction
  • Interpretation (proxy: sparsity)
  • Larger number of variables
  • Number of variables (p) in data sets is large
  • Sample sizes (n) have not increased at the same pace
  • Scientific opportunities
  • New findings in different scientific fields

3
Today's Talk
  • Understanding early visual cortex area V1 through
    fMRI
  • Occam's Razor
  • Lasso: linear regression and Gaussian graphical
    models
  • Learning about V1 through sparse models
    (pre-selection via correlation; linear, nonlinear, and
    graphical models)
  • Future work

4
Understanding visual pathway
The Gallant Lab at UC Berkeley is a leading vision lab.
Da Vinci (1452-1519)
Mapping of different visual cortex areas,
Polyak (1957)
Small left-middle grey area: V1
5
Understanding visual pathway through fMRI
One goal at the Gallant Lab: understand how
natural images relate to fMRI signals

6
Gallant Lab in Nature News
  • This article is part of Nature's premium content.
  • Published online 5 March 2008, Nature,
    doi:10.1038/news.2008.650
  • Mind-reading with a brain scan
  • Brain activity can be decoded using magnetic
    resonance imaging.
  • Kerri Smith
  • Scientists have developed a way of decoding
    someone's brain activity to determine what they
    are looking at.
  • The problem is analogous to the classic "pick
    a card, any card" magic trick, says Jack
    Gallant, a neuroscientist at the University of
    California in Berkeley, who led the study.

7
Stimuli
  • Natural image stimuli

8
Stimulus to fMRI response
  • Natural image stimuli drawn randomly from a
    database of 11,499 images
  • Experiment designed so that responses from
    different presentations are nearly independent
  • fMRI response is pre-processed and roughly
    Gaussian

9
Gabor Wavelet Pyramid
10
Features
11
Neural (fMRI) encoding for visual cortex V1
  • Predictor: p = 10,921
    features of an image
  • Response: (preprocessed)
    fMRI signal at a voxel

  • n = 1750 samples
  • Goal: understanding the human visual
    system
  • an interpretable (sparse)
    model is desired
  • good prediction is
    necessary
  • Minimization of an empirical loss (e.g. L2) alone
    leads to
  • an ill-posed computational problem, and
  • bad prediction

12
Linear Encoding Model by Gallant Lab
  • Data:
  • X: p = 10,921 dimensions (features)
  • Y: fMRI signal
  • n = 1750 training samples
  • Separate linear model for each voxel via
    e-L2 boosting (or Lasso); a sketch of the boosting
    step follows this list
  • Fitted model tested on 120 validation samples
  • Performance measured by correlation
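The lab's fits use e-L2 boosting (componentwise least-squares boosting with a small step size). The sketch below is a minimal, generic version of that idea, not the lab's implementation: it assumes roughly standardized feature columns, and the step size and iteration count are illustrative.

```python
import numpy as np

def eps_l2_boost(X, y, eps=0.01, n_steps=2000):
    """Minimal sketch of epsilon-L2 boosting (componentwise LS boosting).

    Each step picks the single feature that best fits the current
    residual and moves its coefficient a small amount (eps) toward the
    least-squares solution; early stopping controls sparsity.
    """
    n, p = X.shape
    beta = np.zeros(p)
    resid = y - y.mean()
    col_ss = (X ** 2).sum(axis=0)            # per-feature sums of squares
    for _ in range(n_steps):
        scores = X.T @ resid                  # inner products with residual
        j = np.argmax(scores ** 2 / col_ss)   # feature giving the largest RSS drop
        step = eps * scores[j] / col_ss[j]    # shrunken LS coefficient update
        beta[j] += step
        resid -= step * X[:, j]
    return beta
```

In practice the number of boosting steps plays the role of the regularization parameter and would be chosen by cross-validation, as on the following slides.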

13
Modeling history at Gallant Lab
  • Prediction on the validation set is the benchmark
  • Methods tried: neural nets, SVMs, e-L2 boosting
    (Lasso)
  • Among models with similar prediction performance,
    simpler (sparser) models from e-L2 boosting are
    preferred for interpretation
  • This practice reflects a general trend in
    statistical machine learning -- moving from
    prediction alone to simpler/sparser models for
    interpretation, faster computation,
    or data transmission.

14
Occam's Razor
14th-century English logician and Franciscan
friar, William of Ockham
Principle of Parsimony: "Entities must not be
multiplied beyond necessity."
(Wikipedia)
15
Occam's Razor via Model Selection in Linear
Regression
  • Maximum likelihood (ML) is least squares (LS) under the
    Gaussian assumption
  • There are 2^p submodels
  • ML goes for the largest submodel, with all
    predictors
  • The largest model often gives bad prediction when p is
    large

16
Model Selection Criteria
Akaike (1973, 1974) and Mallows' Cp used estimated
prediction error to choose a model; Schwarz (1978)
proposed BIC. Both are penalized least squares, with a
penalty that grows with the number of parameters.
Rissanen's Minimum Description Length (MDL)
principle gives rise to many different criteria;
the two-part code leads to BIC.
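As a concrete illustration of such criteria, the sketch below scores a candidate subset of predictors with the standard Gaussian AIC and BIC formulas (n log(RSS/n) plus a penalty); the function and variable names are illustrative, not from the talk.

```python
import numpy as np

def aic_bic(X, y, support):
    """Gaussian AIC/BIC for the least-squares fit on a subset of columns."""
    n = len(y)
    Xs = X[:, list(support)]
    beta, *_ = np.linalg.lstsq(Xs, y, rcond=None)
    rss = np.sum((y - Xs @ beta) ** 2)
    k = len(support)
    aic = n * np.log(rss / n) + 2 * k          # Akaike's criterion
    bic = n * np.log(rss / n) + k * np.log(n)  # Schwarz's criterion (BIC)
    return aic, bic

# e.g. pick the best of a few candidate subsets by BIC:
# best = min(candidate_subsets, key=lambda s: aic_bic(X, y, s)[1])
```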

17
Model Selection for image-fMRI problem
  • For the linear encoding model, the number of
    submodels is 2^10,921
  • Combinatorial search: too expensive, and
  • often not
    necessary
  • A recent alternative:
  • continuous embedding into a convex
    optimization problem through L1-penalized LS
    (Lasso) -- a third-generation computational
    method in statistics and machine learning.

18
Lasso: the L1 norm as a penalty
  • The L1 penalty on the coefficient vector b is
    ||b||_1 = sum_j |b_j|
  • Used initially with the L2 loss:
  • Signal processing: Basis Pursuit (Chen and
    Donoho, 1994)
  • Statistics: Non-Negative Garrote (Breiman, 1995)
  • Statistics: LASSO (Tibshirani, 1996)
  • Properties of the Lasso:
  • Sparsity (variable selection) and regularization
  • Convexity (convex relaxation of the L0 penalty)
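A minimal numerical illustration of these two properties using scikit-learn's Lasso on synthetic data; the dimensions, penalty level, and seed are arbitrary choices for the sketch, and in practice the penalty would be tuned along the path (next slide).

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, p, s = 200, 1000, 5                    # n samples, p features, s relevant
X = rng.standard_normal((n, p))
beta = np.zeros(p)
beta[:s] = 2.0
y = X @ beta + rng.standard_normal(n)

fit = Lasso(alpha=0.2).fit(X, y)          # L1-penalized least squares
print("nonzero coefficients:", np.flatnonzero(fit.coef_))
```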

19
Lasso computation and evaluation
  • The right tuning parameter is unknown, so a
    path of solutions is needed
    (discretized or continuous)
  • Initially: a quadratic program (QP) on a
    grid of penalty values, with one QP solved
    for each value
  • Later: path-following algorithms, such
    as
  • homotopy, by Osborne et al.
    (2000)
  • LARS, by Efron et al. (2004)
    (a path example follows this list)
  • Theoretical studies: much recent work on the
    Lasso in terms of
  • L2 prediction error
  • L2 error of the parameter estimate
  • model selection consistency
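As referenced above, the whole Lasso path can be computed with a LARS-type algorithm; a small sketch with scikit-learn's lars_path on synthetic data (sizes and seed are arbitrary):

```python
import numpy as np
from sklearn.linear_model import lars_path

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 50))
y = X[:, :5] @ np.full(5, 2.0) + rng.standard_normal(200)

# alphas: breakpoints of the piecewise-linear coefficient path
# coefs:  coefficient values at each breakpoint, shape (p, len(alphas))
alphas, active, coefs = lars_path(X, y, method="lasso")
print("order in which features enter the model:", active[:5])
```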

20
Model Selection Consistency of Lasso
  • Set-up:
  • Linear regression model Y = Xb + e
  • n observations and p predictors
  • Assume condition (A) on the design
  • Knight and Fu (2000) showed L2 estimation
    consistency under (A).

21
Model Selection Consistency of Lasso
  • p small, n large (Zhao and Yu, 2006): assume (A) and
    standard regularity conditions
  • Then, roughly, the irrepresentable condition (a 1
    by (p-s) matrix of constraints) is necessary and
    sufficient for model selection consistency
  • Some ambiguity when equality holds.
  • Related work: Tropp (06), Meinshausen and
    Buhlmann (06), Zou (06),
    Wainwright (06)
  • The population version of the condition gives
    model selection consistency
22
Irrepresentable condition (s = 2, p = 3): geometry
  • r = 0.4

23
Consistency of Lasso for Model Selection
  • Interpretation of the condition: regress the
    irrelevant predictors on the relevant predictors.
    If the L1 norm of the regression coefficients (*) is
  • larger than 1, the Lasso cannot distinguish the
    irrelevant predictor from the relevant predictors
    for some parameter values;
  • smaller than 1, the Lasso can distinguish the
    irrelevant predictor from the relevant
    predictors.
    (A numerical check of the condition is sketched
    after this list.)
  • Sufficient conditions (verifiable):
  • Constant correlation
  • Power-decay correlation
  • Bounded correlation
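For a concrete design matrix the condition can be checked numerically. The sketch below is a generic check, assuming `relevant` indexes the truly relevant predictors and `signs` holds the signs of their coefficients; the names are illustrative.

```python
import numpy as np

def irrepresentable_ok(X, relevant, signs):
    """Check the (strong) irrepresentable condition for a design matrix X.

    Regresses the irrelevant block on the relevant block and requires the
    sign-weighted L1 norms to stay strictly below 1.
    """
    n, p = X.shape
    S = np.asarray(relevant)
    Sc = np.setdiff1d(np.arange(p), S)
    C11 = X[:, S].T @ X[:, S] / n              # relevant-relevant covariance
    C21 = X[:, Sc].T @ X[:, S] / n             # irrelevant-relevant covariance
    vals = np.abs(C21 @ np.linalg.solve(C11, np.asarray(signs, float)))
    return bool(vals.max() < 1), float(vals.max())
```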

24
Model Selection Consistency of Lasso (p >> n)
  • Consistency also holds for s and p growing with
    n, assuming
  • bounds on the max and min eigenvalues
    of the design matrix
  • the smallest nonzero coefficient is
    bounded away from zero
  • the irrepresentable condition
  • Gaussian noise (Wainwright, 06)
  • Finite 2k-th moment noise (Zhao and Yu, 06)

25
Gaussian Graphical Model
  • Gaussian inverse covariance characterizes
    conditional independencies (and hence graphical
    model)

Graph
Inverse Covariance
26
L1 penalized log Gaussian Likelihood
Given n iid observations of X, maximize the
L1-penalized Gaussian log-likelihood
log det(Theta) - trace(S Theta) - lambda ||Theta||_1
over precision matrices Theta, where S is the sample
covariance (Yuan and Lin, 06).
Banerjee et al. (08) and Friedman et al. (07): fast
algorithms based on block coordinate descent.
Ravikumar, Wainwright, Raskutti, Yu (2008):
sufficient conditions for model selection
consistency (p >> n) for
sub-Gaussian and also heavy-tailed distributions
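A minimal sketch of this estimator using scikit-learn's GraphicalLassoCV (which selects the penalty by cross-validation); the synthetic data and dimensions are arbitrary.

```python
import numpy as np
from sklearn.covariance import GraphicalLassoCV

rng = np.random.default_rng(0)
# Toy data: 500 draws of a 5-dimensional Gaussian
Xg = rng.multivariate_normal(np.zeros(5), np.eye(5), size=500)

model = GraphicalLassoCV().fit(Xg)
precision = model.precision_        # estimated sparse inverse covariance
print(np.round(precision, 2))
```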

27
Success probability: dependence on n and p (Gaussian)
  • Edge covariances as shown; each point is an
    average over 100 trials.
  • Curves stack up in the second plot, so that
    (n / log p) controls model selection.

28
Success probability: dependence on model complexity K
and n
Chain graph with p = 120 nodes.
  • Curves from left to right have increasing values
    of K.
  • Models with larger K thus require more samples n
    for the same probability of success.

29
Back to the image-fMRI problem: linear sparse
encoding model on complex cells
  • The Gallant Lab's approach:
  • Separate linear model for each voxel:
    Y = Xb + e
  • Model fitting via e-L2 boosting, with stopping by CV
  • X: p = 10,921 dimensions (features, or complex
    cells)
  • n = 1750 training samples
  • Fitted model tested on 120 validation samples
    (not used in fitting)
  • Performance measured by correlation (cc)

30
Our story on the image-fMRI problem
  • Episode 1: linear
  • pre-selection by correlation
  • localization
  • Fitting details:
  • Regularization parameter selected by 5-fold cross-
    validation
  • L2 boosting applied to all 10,000 features
    -- L2 boosting is the method of choice
    in the Gallant Lab
  • Other methods applied to 500 features
    pre-selected by correlation (a sketch of the
    pre-selection step follows this list)
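The correlation pre-selection step is simple enough to sketch directly; this generic version keeps the k features with the largest absolute Pearson correlation with the response (k = 500 in the talk), and the function name is illustrative.

```python
import numpy as np

def preselect_by_correlation(X, y, k=500):
    """Return indices of the k features most correlated (in absolute value) with y."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    corr = (Xc * yc[:, None]).sum(axis=0) / (
        np.sqrt((Xc ** 2).sum(axis=0)) * np.sqrt((yc ** 2).sum()))
    keep = np.argsort(-np.abs(corr))[:k]
    return keep, corr[keep]
```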

31
Other methods
  • k = 1: semi OLS (theoretically better than OLS)
  • k = 0: ridge
  • k = -1: semi OLS (inverse)
  • k = 1, -1 are semi-supervised because of the use of
    the population covariance, which is estimated from the
    whole image database.

32
Validation correlation
33
Comparison of the feature locations
Semi methods
L2boost
34
Our Story (cont.)
  • Episode 2: nonlinear, via iV-SpAM (machine
    learning stage)
  • Additive Models (Hastie and Tibshirani, 1990)
  • Sparse
  • High-dimensional: p >>> n
  • SpAM (Sparse Additive Models)
  • by Ravikumar, Lafferty, Liu, Wasserman (2007)
  • Related work: COSSO, Lin and Zhang (2006)

Sparsity: f_j = 0 for most j
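A minimal sketch of the SpAM backfitting idea: smooth the partial residual for each coordinate, then soft-threshold the whole component function, so that most components are shrunk exactly to zero. The Gaussian kernel smoother, bandwidth, and penalty level below are illustrative stand-ins, not the implementation used in the talk.

```python
import numpy as np

def kernel_smooth(x, r, bandwidth=0.3):
    """Nadaraya-Watson smoother of residual r against one feature x."""
    d = (x[:, None] - x[None, :]) / bandwidth
    w = np.exp(-0.5 * d ** 2)
    return (w @ r) / w.sum(axis=1)

def spam_backfit(X, y, lam=0.1, n_iter=20):
    """Sparse additive model via backfitting with component-wise soft-thresholding."""
    n, p = X.shape
    f = np.zeros((n, p))                          # fitted component functions
    for _ in range(n_iter):
        for j in range(p):
            partial = y - y.mean() - f.sum(axis=1) + f[:, j]
            pj = kernel_smooth(X[:, j], partial)  # smooth the partial residual
            sj = np.sqrt(np.mean(pj ** 2))
            shrink = max(0.0, 1.0 - lam / sj) if sj > 0 else 0.0
            f[:, j] = shrink * pj                 # most components become exactly 0
            f[:, j] -= f[:, j].mean()             # center each component
    return f
```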
35
SpAM V1 encoding model (1331 voxels from
V1) (Ravikumar, Vu, Gallant Lab, BY, NIPS 08)
  • For each voxel:
  • Start with 10,921 complex cells (features)
  • Pre-selection of 100 complex cells via
    correlation
  • Fitting of SpAM to the 100 complex cells, with AIC
    stopping
  • Pooling of the SpAM-fitted complex cells according
    to location and frequency, to form pooled complex cells
  • Fitting SpAM to the 100 complex cells and the pooled
    complex cells, with AIC stopping

36
Prediction performance (R2)
inset region
Median improvement: 12.
37
Nonlinearities
Compressive effect (finite energy supply). Common
nonlinearity for each voxel?
38
Identical Nonlinearity
  • Identical V-SpAM model (iV-SpAM) (Ravikumar et
    al, 08)
  • where
    is small or sparse.

39
Identical nonlinearity vs. linearity: R2
prediction
Median improvement: 16.
40
Episode 3: nonlinearity via power transformations
(classical stage)
Power transformation of features
Plot a predictor vs. the response: the feature
is skewed, while the response is Gaussian. After
transformation (4th root), we get better
correlation!
(Scatter plots: feature vs. response, before and
after the transformation)
41
Make it Gaussian
  • Histograms of Gabor wavelets, by spatial
    frequency.

Transformations: none, log, 4th root, square root
Spatial frequencies: 1, 2, 4, 8, 16, 32
42
Gaussianization of features improves prediction
  • Gaussianized features used to train (using L2
    epsilon-boosting)
  • F_T = F^(1/2) if F is in levels 2, 3 (4 x 4 and 8 x 8)
  • F_T = F^(1/4) if F is in levels 4, 5 (16 x 16 and 32 x 32)
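These level-dependent power transforms are straightforward to apply; a minimal sketch, assuming the feature matrix F is nonnegative (Gabor magnitudes) and that `levels` gives the pyramid level of each column (both names are illustrative):

```python
import numpy as np

def gaussianize_features(F, levels):
    """Power-transform features by pyramid level: square root for levels 2-3,
    fourth root for levels 4-5, as described on this slide."""
    F = np.asarray(F, dtype=float)
    levels = np.asarray(levels)
    out = F.copy()
    lo = np.isin(levels, [2, 3])     # 4 x 4 and 8 x 8 grids
    hi = np.isin(levels, [4, 5])     # 16 x 16 and 32 x 32 grids
    out[:, lo] = np.sqrt(F[:, lo])
    out[:, hi] = F[:, hi] ** 0.25
    return out
```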

43
Episode 4: localized predictors
Correlation of predictors with voxels: multi-
resolution
Voxel 1
Voxel 2
  • Max correlation value at each location (over the 8
    orientations)
  • Purple: strong (> 0.4); red: weak.
  • Specific strong location across resolutions, for
    each voxel.

44
Localized prediction
  • Each voxel is related to an area of the image
  • Localization algorithm (a sketch follows this list):
  • For each level 2 to 5:
  • For each location:
  • Find the orientation giving the max correlation
  • This generates 4 maps, from 4 x 4 to 32 x 32
  • For each map:
  • Smooth using a Gaussian kernel, and find the peak
    correlation
  • Levels 2, 3: use only features from the peak of
    correlation (8 features each)
  • Levels 4, 5: include 4 and 12 neighboring locations
    (40 and 104 features)
  • On these features, train a linear prediction
    function using L2 epsilon-boosting.
  • 10 times faster than global prediction.
  • Similar prediction on almost all voxels.
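A minimal sketch of the per-voxel localization step described above: take, at each grid location, the best correlation over orientations, smooth the resulting map with a Gaussian kernel, and return the peak location per level. The data layout (`corr_maps` as a dict of level -> grid x grid x orientations arrays) and the smoothing bandwidth are assumptions for the sketch.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def localize_voxel(corr_maps, sigma=1.0):
    """Find the peak stimulus location for one voxel, per pyramid level.

    corr_maps: dict mapping level -> array of shape (grid, grid, n_orientations)
               holding feature/voxel correlations.
    Returns a dict mapping level -> (row, col) of the smoothed peak.
    """
    peaks = {}
    for level, cmap in corr_maps.items():
        best_orientation = np.abs(cmap).max(axis=2)   # best of the 8 orientations
        smoothed = gaussian_filter(best_orientation, sigma)
        peaks[level] = np.unravel_index(smoothed.argmax(), smoothed.shape)
    return peaks
```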

45
After localization: view the responses of a single
voxel
  • Easy to distinguish stimuli that induce strong
    vs. weak responses in the voxel when we look at the
    local area.

Location in full image
Strong
Weak
46
Localization does not work for all voxels
Generally, prediction does not deteriorate. However,
there are a few voxels for which local
prediction is not as good.
Red: global - local > 0.05; blue: local - global >
0.05
47
Who is responsible? (on-going)
  • Large regions of features are correlated with the outcome
  • The image features have correlations:
  • between close locations
  • between levels of the pyramid
  • between neighboring gradients
  • Which of the features are responsible, and
    which are just tagging along?
  • Can we distinguish between
  • correlation between features, and
  • causation of voxel response?

48
Some unlocalized voxels
  • Two neighboring voxels
  • Highly correlated
  • Either each is influenced by both areas,
  • or one influences the other

Area A
voxel1
voxel2
49
Episode 5: sparse graphical models (on-going)
  • Estimated apart, each voxel depends on both areas
  • Estimated together, each depends on only one area
  • Estimated inverse covariance entries shown
    below

Voxel 1
Voxel 2
Area B
Area A
Area B
Area A
Estimated separately
Estimated together
50
Episode 6: building a neighbor model (on-going)
  • Voxels with poor localization are probably
    influenced by neighboring voxels
  • For each voxel 0, take its 6 neighboring voxels
    (v1, ..., v6)
  • L(v) are the local (transformed) predictors for voxel
    v
  • Build a sparse linear model for voxel v based on
    L(v)
  • Add the 6 predicted values from the neighbor
    models to
  • the local (transformed) predictors for voxel
    0.
  • We call this model the neighbor model.

51
Prediction (neighbor model compared to global)
inset region
We don't lose prediction power.
52
Summary
  • L1-penalized minimization: model selection
    consistency
  • The irrepresentable condition comes from the KKT
    conditions and the L1 penalty.
  • Effective sample size n / log p? Not always;
    it depends on the tail of the relevant
    distribution.
  • Model complexity parameters matter.

53
Summary (cont.)
  • Understanding fMRI signals in V1 via sparse modeling:
  • Discovered a nonlinear compressive effect in the
    V1 fMRI response
  • Sparse Gaussian models to relate selected
    features and voxels (on-going)
  • General lessons:
  • Algorithmic EDA -- off-the-shelf sparse modeling
    techniques to reduce problem size and point to
    nonlinearity
  • Subject-matter based models -- simpler (power
    transform) and more meaningful models (local and
    neighbor sparse graphical models)

54
Future work
  • V1 voxels: the graphical model gives us a way to
    borrow strength across voxels
  • improved prediction based on graphical
    models?
  • improved decoding? (to decode the
    images seen by the
    subject using fMRI signals)
  • Higher visual area voxels: no good models yet
    from the lab
  • hope: improved encoding models (hence
    making decoding possible)
  • Next: the V4 challenge, that is, how to build
    models for V4?
  • Better image representation

55
Acknowledgements

Co-authors: P. Zhao and N. Meinshausen (Lasso); P.
Ravikumar, M. Wainwright, G. Raskutti (graphical
model); P. Ravikumar, V. Vu, Y. Benjamini (fMRI
problem); Gallant Lab: K. Kay, T. Naselaris, J.
Gallant (fMRI problem).
Funding: NSF-IGMS Grant (06-07, in the Gallant Lab),
Guggenheim Fellowship (06, in the Gallant Lab), NSF
DMS Grant, NSF VIGRE Grant, ARO Grant.
56
Shared nonlinearity vs. nonlinearity: sparsity
Recall: 100 original features pre-selected for
V-SpAM or shared V-SpAM. On average:
Linear model: 100 predictors selected from 10,000.
V-SpAM: 70 predictors (original and pooled); better
prediction than the linear model.
Shared V-SpAM: sparser (13 predictors) and better
prediction than V-SpAM.
57
Using neighboring voxels for prediction
  • Voxels with poor localization are probably
    influenced by neighboring voxels
  • Response of voxel v:
  • L(v) are the feature indexes in a localized area of
    the image
  • X_i are the (transformed) feature values
  • N(v) are the neighboring voxels (about 6 of them)
  • We will use

58
Using neighboring voxels for prediction
  • Algorithm (a sketch follows this list):
  • Phase 1:
  • For every voxel:
  • Find the best location
  • Train a linear prediction function on the localized
    features (using L2 epsilon-boosting)
  • Generate predictions
  • (Until now: similar to the localized algorithm)
  • Phase 2:
  • For every voxel:
  • Find the neighboring voxels in the brain
  • Retrain a linear prediction function on the
    localized features and the predictions for the
    neighboring voxels
  • Note:
  • The models are nested: Local ⊂ Neighbor ⊂ Global
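A minimal sketch of the two-phase neighbor model above. It uses scikit-learn's Lasso as a stand-in for the L2 epsilon-boosting fits, and the data layout (dicts keyed by voxel, a fixed penalty) is an assumption for the sketch.

```python
import numpy as np
from sklearn.linear_model import Lasso

def neighbor_model(local_feats, responses, voxel, neighbors, alpha=0.1):
    """Two-phase neighbor model sketch.

    local_feats: dict voxel -> (n x p_v) matrix of localized features
    responses:   dict voxel -> length-n response vector
    Phase 1 fits a sparse local model for the target voxel and each neighbor;
    Phase 2 refits the target voxel on its local features plus the
    neighbors' phase-1 predictions.
    """
    phase1_pred = {}
    for v in [voxel] + list(neighbors):
        fit_v = Lasso(alpha=alpha).fit(local_feats[v], responses[v])
        phase1_pred[v] = fit_v.predict(local_feats[v])
    augmented = np.column_stack(
        [local_feats[voxel]] + [phase1_pred[v][:, None] for v in neighbors])
    return Lasso(alpha=alpha).fit(augmented, responses[voxel])
```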

59
Sparse covariance estimation via an L1 penalty
Banerjee, El Ghaoui, d'Aspremont (08)
60
Model selection consistency (Ravikumar et al, 08)
Assume the irrepresentable condition holds, and either
1. X is sub-Gaussian (with a suitable effective sample
size), or
2. X has finite 4m-th moments.
Then, with high probability as n tends to infinity, the
correct model is chosen.
61
Model selection consistency
Ravikumar, Wainwright, Raskutti, Yu (08) give
sufficient conditions for model selection
consistency, stated in terms of the Hessian of the
log-likelihood and a model complexity parameter
defined from it.
62
Gallant Lab in Nature News
  • This article is part of Nature's premium content.
  • Published online 5 March 2008, Nature,
    doi:10.1038/news.2008.650
  • Mind-reading with a brain scan
  • Brain activity can be decoded using magnetic
    resonance imaging.
  • Kerri Smith
  • Scientists have developed a way of decoding
    someone's brain activity to determine what they
    are looking at.
  • The problem is analogous to the classic "pick
    a card, any card" magic trick, says Jack
    Gallant, a neuroscientist at the University of
    California in Berkeley, who led the study.

63
Proof Techniques
  • Primal-dual witness:
  • useful to prove equalities (difficult) by only
    showing inequalities (easier)
  • Fixed-point analysis:
  • useful to bound a zero of a complicated
    functional (e.g. the gradient)

64
Primal-dual witness sketch
  • Construct a candidate primal-dual pair
  • A proof technique, not a practical algorithm

65
Sparsity of the Inverse Correlation (S)
  • In a multivariate Gaussian distribution,
    S signifies conditional independence:
  • if S_ij = 0, then (i, j) are conditionally
    independent (conditioned on all other variables)

Example: a 4 x 4 grid. Each pixel is independently
(positively) correlated with its 4 neighbors. Then
all pixels are correlated, but the inverse
correlation is sparse.
Correlation
Inverse Correlation
Color keys: red < 0, purple > 0; 0 < red < green < purple
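This grid example is easy to reproduce numerically; the sketch below builds a sparse precision (inverse covariance) matrix for a 4 x 4 grid with positive partial correlations between neighbors, and checks that its inverse (the covariance) has essentially no zeros. The edge weight -0.2 is an arbitrary illustrative choice.

```python
import numpy as np

# Sparse precision matrix for a 4 x 4 grid: each pixel is conditionally
# dependent only on its 4 neighbors (negative off-diagonal entries in the
# precision correspond to positive partial correlations).
p = 4
n_nodes = p * p
prec = np.eye(n_nodes)
for i in range(p):
    for j in range(p):
        a = i * p + j
        for di, dj in [(0, 1), (1, 0)]:        # right and down neighbors
            if i + di < p and j + dj < p:
                b = (i + di) * p + (j + dj)
                prec[a, b] = prec[b, a] = -0.2
cov = np.linalg.inv(prec)                      # dense: all pixels correlated
print("zero entries in precision:", int(np.sum(np.isclose(prec, 0))))
print("zero entries in covariance:", int(np.sum(np.isclose(cov, 0))))
```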
66
Sparsity of the Inverse Correlation (cont.)
  • We want to keep only the true independent effects
  • Sparsity of structure in images is assumed.
  • Example (cont.)
  • Estimate the correlation and inverse correlation
    from 1000 samples

Inverse Sample Correlation
Sparse Inverse Correlation (penalty 0.1)
Color keys: red < 0, orange = 0, purple > 0;
red < yellow < 0 < green < purple
67
Sparsity of the Inverse Correlation (S)
  • In a multivariate Gaussian distribution,
    S signifies conditional independence:
  • if S_ij = 0, then (i, j) are conditionally
    independent (conditioned on the rest of the
    graph)

Example: a 4 x 4 grid. Each pixel is independently
(positively) correlated with its 4 neighbors. Then
all pixels are correlated, but the inverse
correlation is sparse.
Correlation
Inverse Correlation
Color keys: red < 0, purple > 0; 0 < red < green < purple
68
Sparsity of the Inverse Correlation (cont.)
  • We want to keep only the true independent effects
  • Sparsity of structure in images is assumed.
  • Example (cont.)
  • Estimate the correlation and inverse correlation
    from 1000 samples

Inverse Sample Correlation
Sparse Inverse Correlation (penalty 0.1)
Color keys: red < 0, orange = 0, purple > 0;
red < yellow < 0 < green < purple
69
Acknowledgements

Co-authors: P. Ravikumar, M. Wainwright, G.
Raskutti (graphical model); P. Ravikumar, V. Vu,
Y. Benjamini, K. Kay, T. Naselaris, J. Gallant.
Funding: NSF-IGMS Grant (06-07, in the Gallant
Lab), Guggenheim Fellowship (06), NSF DMS
Grant, ARO Grant.
70
Spatial RFs and Tuning Curves