Seeking Interpretable Models for High Dimensional Data - PowerPoint PPT Presentation


Transcript and Presenter's Notes

Title: Seeking Interpretable Models for High Dimensional Data


1
Seeking Interpretable Models for High Dimensional Data
Bin Yu
Statistics Department and EECS Department, University of California, Berkeley
http://www.stat.berkeley.edu/~binyu
2
Characteristics of Modern Data Problems
  • Goal: efficient use of data for
  • Prediction
  • Interpretation (proxy: sparsity)
  • Larger number of variables
  • Number of variables (p) in data sets is large
  • Sample sizes (n) have not increased at the same pace
  • Scientific opportunities
  • New findings in different scientific fields

3
Today's Talk
  • Understanding early visual cortex area V1 through
    fMRI
  • Occam's Razor
  • Lasso: linear regression and Gaussian graphical
    models
  • Learning about V1 through sparse models
    (pre-selection via correlation; linear, nonlinear, and
    graphical models)
  • Future work

4
Understanding visual pathway
The Gallant Lab at UC Berkeley is a leading vision lab.
Da Vinci (1452-1519)
Mapping of different visual cortex areas,
Polyak (1957)
Small left-middle grey area: V1
5
Understanding visual pathway through fMRI
One goal at the Gallant Lab: understand how
natural images relate to fMRI signals

6
Gallant Lab in Nature News
  • This article is part of Nature's premium content.
  • Published online 5 March 2008, Nature,
    doi:10.1038/news.2008.650
  • Mind-reading with a brain scan
  • Brain activity can be decoded using magnetic
    resonance imaging.
  • Kerri Smith
  • Scientists have developed a way of decoding
    someone's brain activity to determine what they
    are looking at.
  • The problem is analogous to the classic "pick
    a card, any card" magic trick, says Jack
    Gallant, a neuroscientist at the University of
    California in Berkeley, who led the study.

7
Stimuli
  • Natural image stimuli

8
Stimulus to fMRI response
  • Natural image stimuli drawn randomly from a
    database of 11,499 images
  • Experiment designed so that responses from
    different presentations are nearly independent
  • fMRI response is pre-processed and roughly
    Gaussian

9
Gabor Wavelet Pyramid
10
Features
11
Neural (fMRI) encoding for visual cortex V1
  • Predictor: p = 10,921
    features of an image
  • Response: (preprocessed)
    fMRI signal at a voxel

  • n = 1750 samples
  • Goal: understanding the human visual
    system
  • an interpretable (sparse)
    model is desired
  • good prediction is
    necessary
  • Minimization of an empirical loss (e.g. L2) alone
    leads to
  • an ill-posed computational problem, and
  • bad prediction

12
Linear Encoding Model by Gallant Lab
  • Data:
  • X: p = 10,921 dimensions (features)
  • Y: fMRI signal
  • n = 1750 training samples
  • Separate linear model for each voxel via
    e-L2 boosting (or Lasso); a sketch of the boosting
    step follows this list
  • Fitted model tested on 120 validation samples
  • Performance measured by correlation
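The lab's fits use e-L2 boosting (componentwise least-squares boosting with a small step size). The sketch below is a minimal, generic version of that idea, not the lab's implementation: it assumes roughly standardized feature columns, and the step size and iteration count are illustrative.

```python
import numpy as np

def eps_l2_boost(X, y, eps=0.01, n_steps=2000):
    """Minimal sketch of epsilon-L2 boosting (componentwise LS boosting).

    Each step picks the single feature that best fits the current
    residual and moves its coefficient a small amount (eps) toward the
    least-squares solution; early stopping controls sparsity.
    """
    n, p = X.shape
    beta = np.zeros(p)
    resid = y - y.mean()
    col_ss = (X ** 2).sum(axis=0)            # per-feature sums of squares
    for _ in range(n_steps):
        scores = X.T @ resid                  # inner products with residual
        j = np.argmax(scores ** 2 / col_ss)   # feature giving the largest RSS drop
        step = eps * scores[j] / col_ss[j]    # shrunken LS coefficient update
        beta[j] += step
        resid -= step * X[:, j]
    return beta
```

In practice the number of boosting steps plays the role of the regularization parameter and would be chosen by cross-validation, as on the following slides.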

13
Modeling history at Gallant Lab
  • Prediction on the validation set is the benchmark
  • Methods tried: neural nets, SVMs, e-L2 boosting
    (Lasso)
  • Among models with similar prediction performance,
    simpler (sparser) models from e-L2 boosting are
    preferred for interpretation
  • This practice reflects a general trend in
    statistical machine learning -- moving from
    prediction alone to simpler/sparser models for
    interpretation, faster computation,
    or data transmission.

14
Occam's Razor
14th-century English logician and Franciscan
friar, William of Ockham
Principle of Parsimony: "Entities must not be
multiplied beyond necessity."
(Wikipedia)
15
Occam's Razor via Model Selection in Linear
Regression
  • Maximum likelihood (ML) is least squares (LS) under the
    Gaussian assumption
  • There are 2^p submodels
  • ML goes for the largest submodel, with all
    predictors
  • The largest model often gives bad prediction when p is
    large

16
Model Selection Criteria
Akaike (1973, 1974) and Mallows' Cp used estimated
prediction error to choose a model; Schwarz (1978)
proposed BIC. Both are penalized least squares, with a
penalty that grows with the number of parameters.
Rissanen's Minimum Description Length (MDL)
principle gives rise to many different criteria;
the two-part code leads to BIC.
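As a concrete illustration of such criteria, the sketch below scores a candidate subset of predictors with the standard Gaussian AIC and BIC formulas (n log(RSS/n) plus a penalty); the function and variable names are illustrative, not from the talk.

```python
import numpy as np

def aic_bic(X, y, support):
    """Gaussian AIC/BIC for the least-squares fit on a subset of columns."""
    n = len(y)
    Xs = X[:, list(support)]
    beta, *_ = np.linalg.lstsq(Xs, y, rcond=None)
    rss = np.sum((y - Xs @ beta) ** 2)
    k = len(support)
    aic = n * np.log(rss / n) + 2 * k          # Akaike's criterion
    bic = n * np.log(rss / n) + k * np.log(n)  # Schwarz's criterion (BIC)
    return aic, bic

# e.g. pick the best of a few candidate subsets by BIC:
# best = min(candidate_subsets, key=lambda s: aic_bic(X, y, s)[1])
```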

17
Model Selection for image-fMRI problem
  • For the linear encoding model, the number of
    submodels is 2^10,921
  • Combinatorial search: too expensive, and
  • often not
    necessary
  • A recent alternative:
  • continuous embedding into a convex
    optimization problem through L1-penalized LS
    (Lasso) -- a third-generation computational
    method in statistics and machine learning.

18
Lasso: the L1 norm as a penalty
  • The L1 penalty on the coefficient vector b is
    ||b||_1 = sum_j |b_j|
  • Used initially with the L2 loss:
  • Signal processing: Basis Pursuit (Chen and
    Donoho, 1994)
  • Statistics: Non-Negative Garrote (Breiman, 1995)
  • Statistics: LASSO (Tibshirani, 1996)
  • Properties of the Lasso:
  • Sparsity (variable selection) and regularization
  • Convexity (convex relaxation of the L0 penalty)
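A minimal numerical illustration of these two properties using scikit-learn's Lasso on synthetic data; the dimensions, penalty level, and seed are arbitrary choices for the sketch, and in practice the penalty would be tuned along the path (next slide).

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, p, s = 200, 1000, 5                    # n samples, p features, s relevant
X = rng.standard_normal((n, p))
beta = np.zeros(p)
beta[:s] = 2.0
y = X @ beta + rng.standard_normal(n)

fit = Lasso(alpha=0.2).fit(X, y)          # L1-penalized least squares
print("nonzero coefficients:", np.flatnonzero(fit.coef_))
```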

19
Lasso computation and evaluation
  • The right tuning parameter is unknown, so a
    path of solutions is needed
    (discretized or continuous)
  • Initially: a quadratic program (QP) on a
    grid of penalty values, with one QP solved
    for each value
  • Later: path-following algorithms, such
    as
  • homotopy, by Osborne et al.
    (2000)
  • LARS, by Efron et al. (2004)
    (a path example follows this list)
  • Theoretical studies: much recent work on the
    Lasso in terms of
  • L2 prediction error
  • L2 error of the parameter estimate
  • model selection consistency
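As referenced above, the whole Lasso path can be computed with a LARS-type algorithm; a small sketch with scikit-learn's lars_path on synthetic data (sizes and seed are arbitrary):

```python
import numpy as np
from sklearn.linear_model import lars_path

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 50))
y = X[:, :5] @ np.full(5, 2.0) + rng.standard_normal(200)

# alphas: breakpoints of the piecewise-linear coefficient path
# coefs:  coefficient values at each breakpoint, shape (p, len(alphas))
alphas, active, coefs = lars_path(X, y, method="lasso")
print("order in which features enter the model:", active[:5])
```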

20
Model Selection Consistency of Lasso
  • Set-up:
  • Linear regression model Y = Xb + e
  • n observations and p predictors
  • Assume condition (A) on the design
  • Knight and Fu (2000) showed L2 estimation
    consistency under (A).

21
Model Selection Consistency of Lasso
  • p small, n large (Zhao and Yu, 2006): assume (A) and
    standard regularity conditions
  • Then, roughly, the irrepresentable condition (a 1
    by (p-s) matrix of constraints) is necessary and
    sufficient for model selection consistency
  • Some ambiguity when equality holds.
  • Related work: Tropp (06), Meinshausen and
    Buhlmann (06), Zou (06),
    Wainwright (06)
  • The population version of the condition gives
    model selection consistency
22
Irrepresentable condition (s = 2, p = 3): geometry
  • r = 0.4

23
Consistency of Lasso for Model Selection
  • Interpretation of the condition: regress the
    irrelevant predictors on the relevant predictors.
    If the L1 norm of the regression coefficients (*) is
  • larger than 1, the Lasso cannot distinguish the
    irrelevant predictor from the relevant predictors
    for some parameter values;
  • smaller than 1, the Lasso can distinguish the
    irrelevant predictor from the relevant
    predictors.
    (A numerical check of the condition is sketched
    after this list.)
  • Sufficient conditions (verifiable):
  • Constant correlation
  • Power-decay correlation
  • Bounded correlation
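For a concrete design matrix the condition can be checked numerically. The sketch below is a generic check, assuming `relevant` indexes the truly relevant predictors and `signs` holds the signs of their coefficients; the names are illustrative.

```python
import numpy as np

def irrepresentable_ok(X, relevant, signs):
    """Check the (strong) irrepresentable condition for a design matrix X.

    Regresses the irrelevant block on the relevant block and requires the
    sign-weighted L1 norms to stay strictly below 1.
    """
    n, p = X.shape
    S = np.asarray(relevant)
    Sc = np.setdiff1d(np.arange(p), S)
    C11 = X[:, S].T @ X[:, S] / n              # relevant-relevant covariance
    C21 = X[:, Sc].T @ X[:, S] / n             # irrelevant-relevant covariance
    vals = np.abs(C21 @ np.linalg.solve(C11, np.asarray(signs, float)))
    return bool(vals.max() < 1), float(vals.max())
```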

24
Model Selection Consistency of Lasso (p >> n)
  • Consistency also holds for s and p growing with
    n, assuming
  • bounds on the max and min eigenvalues
    of the design matrix
  • the smallest nonzero coefficient is
    bounded away from zero
  • the irrepresentable condition
  • Gaussian noise (Wainwright, 06)
  • Finite 2k-th moment noise (Zhao and Yu, 06)

25
Gaussian Graphical Model
  • Gaussian inverse covariance characterizes
    conditional independencies (and hence graphical
    model)

Graph
Inverse Covariance
26
L1 penalized log Gaussian Likelihood
Given n iid observations of X, maximize the
L1-penalized Gaussian log-likelihood
log det(Theta) - trace(S Theta) - lambda ||Theta||_1
over precision matrices Theta, where S is the sample
covariance (Yuan and Lin, 06).
Banerjee et al. (08) and Friedman et al. (07): fast
algorithms based on block coordinate descent.
Ravikumar, Wainwright, Raskutti, Yu (2008):
sufficient conditions for model selection
consistency (p >> n) for
sub-Gaussian and also heavy-tailed distributions
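A minimal sketch of this estimator using scikit-learn's GraphicalLassoCV (which selects the penalty by cross-validation); the synthetic data and dimensions are arbitrary.

```python
import numpy as np
from sklearn.covariance import GraphicalLassoCV

rng = np.random.default_rng(0)
# Toy data: 500 draws of a 5-dimensional Gaussian
Xg = rng.multivariate_normal(np.zeros(5), np.eye(5), size=500)

model = GraphicalLassoCV().fit(Xg)
precision = model.precision_        # estimated sparse inverse covariance
print(np.round(precision, 2))
```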

27
Success probability: dependence on n and p (Gaussian)
  • Edge covariances as shown; each point is an
    average over 100 trials.
  • Curves stack up in the second plot, so that
    (n / log p) controls model selection.

28
Success probability: dependence on model complexity K
and n
Chain graph with p = 120 nodes.
  • Curves from left to right have increasing values
    of K.
  • Models with larger K thus require more samples n
    for the same probability of success.

29
Back to the image-fMRI problem: linear sparse
encoding model on complex cells
  • The Gallant Lab's approach:
  • Separate linear model for each voxel:
    Y = Xb + e
  • Model fitting via e-L2 boosting, with stopping by CV
  • X: p = 10,921 dimensions (features, or complex
    cells)
  • n = 1750 training samples
  • Fitted model tested on 120 validation samples
    (not used in fitting)
  • Performance measured by correlation (cc)

30
Our story on the image-fMRI problem
  • Episode 1: linear
  • pre-selection by correlation
  • localization
  • Fitting details:
  • Regularization parameter selected by 5-fold cross-
    validation
  • L2 boosting applied to all 10,000 features
    -- L2 boosting is the method of choice
    in the Gallant Lab
  • Other methods applied to 500 features
    pre-selected by correlation (a sketch of the
    pre-selection step follows this list)
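The correlation pre-selection step is simple enough to sketch directly; this generic version keeps the k features with the largest absolute Pearson correlation with the response (k = 500 in the talk), and the function name is illustrative.

```python
import numpy as np

def preselect_by_correlation(X, y, k=500):
    """Return indices of the k features most correlated (in absolute value) with y."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    corr = (Xc * yc[:, None]).sum(axis=0) / (
        np.sqrt((Xc ** 2).sum(axis=0)) * np.sqrt((yc ** 2).sum()))
    keep = np.argsort(-np.abs(corr))[:k]
    return keep, corr[keep]
```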

31
Other methods
  • k = 1: semi OLS (theoretically better than OLS)
  • k = 0: ridge
  • k = -1: semi OLS (inverse)
  • k = 1, -1 are semi-supervised because of the use of
    the population covariance, which is estimated from the
    whole image database.

32
Validation correlation
33
Comparison of the feature locations
Semi methods
L2boost
34
Our Story (cont.)
  • Episode 2: nonlinear, via iV-SpAM (machine
    learning stage)
  • Additive Models (Hastie and Tibshirani, 1990)
  • Sparse
  • High-dimensional: p >>> n
  • SpAM (Sparse Additive Models)
  • by Ravikumar, Lafferty, Liu, Wasserman (2007)
  • Related work: COSSO, Lin and Zhang (2006)

Sparsity: f_j = 0 for most j
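A minimal sketch of the SpAM backfitting idea: smooth the partial residual for each coordinate, then soft-threshold the whole component function, so that most components are shrunk exactly to zero. The Gaussian kernel smoother, bandwidth, and penalty level below are illustrative stand-ins, not the implementation used in the talk.

```python
import numpy as np

def kernel_smooth(x, r, bandwidth=0.3):
    """Nadaraya-Watson smoother of residual r against one feature x."""
    d = (x[:, None] - x[None, :]) / bandwidth
    w = np.exp(-0.5 * d ** 2)
    return (w @ r) / w.sum(axis=1)

def spam_backfit(X, y, lam=0.1, n_iter=20):
    """Sparse additive model via backfitting with component-wise soft-thresholding."""
    n, p = X.shape
    f = np.zeros((n, p))                          # fitted component functions
    for _ in range(n_iter):
        for j in range(p):
            partial = y - y.mean() - f.sum(axis=1) + f[:, j]
            pj = kernel_smooth(X[:, j], partial)  # smooth the partial residual
            sj = np.sqrt(np.mean(pj ** 2))
            shrink = max(0.0, 1.0 - lam / sj) if sj > 0 else 0.0
            f[:, j] = shrink * pj                 # most components become exactly 0
            f[:, j] -= f[:, j].mean()             # center each component
    return f
```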
35
SpAM V1 encoding model (1331 voxels from
V1) (Ravikumar, Vu, Gallant Lab, BY, NIPS 08)
  • For each voxel:
  • Start with 10,921 complex cells (features)
  • Pre-selection of 100 complex cells via
    correlation
  • Fitting of SpAM to the 100 complex cells, with AIC
    stopping
  • Pooling of the SpAM-fitted complex cells according
    to location and frequency, to form pooled complex cells
  • Fitting SpAM to the 100 complex cells and the pooled
    complex cells, with AIC stopping

36
Prediction performance (R2)
inset region
Median improvement: 12.
37
Nonlinearities
Compressive effect (finite energy supply). Common
nonlinearity for each voxel?
38
Identical Nonlinearity
  • Identical V-SpAM model (iV-SpAM) (Ravikumar et
    al, 08)
  • where
    is small or sparse.

39
Identical nonlinearity vs. linearity: R2
prediction
Median improvement: 16.
40
Episode 3: nonlinearity via power transformations
(classical stage)
Power transformation of features
Plot a predictor vs. the response: the feature
is skewed, while the response is Gaussian. After
transformation (4th root), we get better
correlation!
(Scatter plots: feature vs. response, before and
after the transformation)
41
Make it Gaussian
  • Histograms of Gabor wavelets, by spatial
    frequency.

Transformations: none, log, 4th root, square root
Spatial frequencies: 1, 2, 4, 8, 16, 32
42
Gaussianization of features improves prediction
  • Gaussianized features used to train (using L2
    epsilon-boosting)
  • F_T = F^(1/2) if F is in levels 2, 3 (4 x 4 and 8 x 8)
  • F_T = F^(1/4) if F is in levels 4, 5 (16 x 16 and 32 x 32)
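These level-dependent power transforms are straightforward to apply; a minimal sketch, assuming the feature matrix F is nonnegative (Gabor magnitudes) and that `levels` gives the pyramid level of each column (both names are illustrative):

```python
import numpy as np

def gaussianize_features(F, levels):
    """Power-transform features by pyramid level: square root for levels 2-3,
    fourth root for levels 4-5, as described on this slide."""
    F = np.asarray(F, dtype=float)
    levels = np.asarray(levels)
    out = F.copy()
    lo = np.isin(levels, [2, 3])     # 4 x 4 and 8 x 8 grids
    hi = np.isin(levels, [4, 5])     # 16 x 16 and 32 x 32 grids
    out[:, lo] = np.sqrt(F[:, lo])
    out[:, hi] = F[:, hi] ** 0.25
    return out
```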

43
Episode 4: localized predictors
Correlation of predictors with voxels: multi-
resolution
Voxel 1
Voxel 2
  • Max correlation value at each location (over the 8
    orientations)
  • Purple: strong (> 0.4); red: weak.
  • Specific strong location across resolutions, for
    each voxel.

44
Localized prediction
  • Each voxel is related to an area of the image
  • Localization algorithm (a sketch follows this list):
  • For each level 2 to 5:
  • For each location:
  • Find the orientation giving the max correlation
  • This generates 4 maps, from 4 x 4 to 32 x 32
  • For each map:
  • Smooth using a Gaussian kernel, and find the peak
    correlation
  • Levels 2, 3: use only features from the peak of
    correlation (8 features each)
  • Levels 4, 5: include 4 and 12 neighboring locations
    (40 and 104 features)
  • On these features, train a linear prediction
    function using L2 epsilon-boosting.
  • 10 times faster than global prediction.
  • Similar prediction on almost all voxels.
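A minimal sketch of the per-voxel localization step described above: take, at each grid location, the best correlation over orientations, smooth the resulting map with a Gaussian kernel, and return the peak location per level. The data layout (`corr_maps` as a dict of level -> grid x grid x orientations arrays) and the smoothing bandwidth are assumptions for the sketch.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def localize_voxel(corr_maps, sigma=1.0):
    """Find the peak stimulus location for one voxel, per pyramid level.

    corr_maps: dict mapping level -> array of shape (grid, grid, n_orientations)
               holding feature/voxel correlations.
    Returns a dict mapping level -> (row, col) of the smoothed peak.
    """
    peaks = {}
    for level, cmap in corr_maps.items():
        best_orientation = np.abs(cmap).max(axis=2)   # best of the 8 orientations
        smoothed = gaussian_filter(best_orientation, sigma)
        peaks[level] = np.unravel_index(smoothed.argmax(), smoothed.shape)
    return peaks
```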

45
After localization: view the responses of a single
voxel
  • Easy to distinguish stimuli that induce strong
    vs. weak responses in the voxel when we look at the
    local area.

Location in full image
Strong
Weak
46
Localization does not work for all voxels
Generally, prediction does not deteriorate. However,
there are a few voxels for which local
prediction is not as good.
Red: global - local > 0.05; blue: local - global >
0.05
47
Who is responsible? (on-going)
  • Large regions of features are correlated with the outcome
  • The image features have correlations:
  • between close locations
  • between levels of the pyramid
  • between neighboring gradients
  • Which of the features are responsible, and
    which are just tagging along?
  • Can we distinguish between
  • correlation between features, and
  • causation of voxel response?

48
Some unlocalized voxels
  • Two neighboring voxels
  • Highly correlated
  • Either each is influenced by both areas,
  • or one influences the other

Area A
voxel1
voxel2
49
Episode 5: sparse graphical models (on-going)
  • Estimated apart, each voxel depends on both areas
  • Estimated together, each depends on only one area
  • Estimated inverse covariance entries shown
    below

Voxel 1
Voxel 2
Area B
Area A
Area B
Area A
Estimated separately
Estimated together
50
Episode 6: building a neighbor model (on-going)
  • Voxels with poor localization are probably
    influenced by neighboring voxels
  • For each voxel 0, take its 6 neighboring voxels
    (v1, ..., v6)
  • L(v) are the local (transformed) predictors for voxel
    v
  • Build a sparse linear model for voxel v based on
    L(v)
  • Add the 6 predicted values from the neighbor
    models to
  • the local (transformed) predictors for voxel
    0.
  • We call this model the neighbor model.

51
Prediction (neighbor model compared to global)
inset region
We don't lose prediction power.
52
Summary
  • L1-penalized minimization: model selection
    consistency
  • The irrepresentable condition comes from the KKT
    conditions and the L1 penalty.
  • Effective sample size n / log p? Not always;
    it depends on the tail of the relevant
    distribution.
  • Model complexity parameters matter.

53
Summary (cont.)
  • Understanding fMRI signals in V1 via sparse modeling:
  • Discovered a nonlinear compressive effect in the
    V1 fMRI response
  • Sparse Gaussian models to relate selected
    features and voxels (on-going)
  • General lessons:
  • Algorithmic EDA -- off-the-shelf sparse modeling
    techniques to reduce problem size and point to
    nonlinearity
  • Subject-matter based models -- simpler (power
    transform) and more meaningful models (local and
    neighbor sparse graphical models)

54
Future work
  • V1 voxels: the graphical model gives us a way to
    borrow strength across voxels
  • improved prediction based on graphical
    models?
  • improved decoding? (to decode the
    images seen by the
    subject using fMRI signals)
  • Higher visual area voxels: no good models yet
    from the lab
  • hope: improved encoding models (hence
    making decoding possible)
  • Next: the V4 challenge, that is, how to build
    models for V4?
  • Better image representation

55
Acknowledgements

Co-authors: P. Zhao and N. Meinshausen (Lasso); P.
Ravikumar, M. Wainwright, G. Raskutti (graphical
model); P. Ravikumar, V. Vu, Y. Benjamini (fMRI
problem); Gallant Lab: K. Kay, T. Naselaris, J.
Gallant (fMRI problem).
Funding: NSF-IGMS Grant (06-07, in the Gallant Lab),
Guggenheim Fellowship (06, in the Gallant Lab), NSF
DMS Grant, NSF VIGRE Grant, ARO Grant.
56
Shared nonlinearity vs. nonlinearity: sparsity
Recall: 100 original features pre-selected for
V-SpAM or shared V-SpAM. On average:
Linear model: 100 predictors selected from 10,000.
V-SpAM: 70 predictors (original and pooled); better
prediction than the linear model.
Shared V-SpAM: sparser (13 predictors) and better
prediction than V-SpAM.
57
Using neighboring voxels for prediction
  • Voxels with poor localization are probably
    influenced by neighboring voxels
  • Response of voxel v:
  • L(v) are the feature indexes in a localized area of
    the image
  • X_i are the (transformed) feature values
  • N(v) are the neighboring voxels (about 6 of them)
  • We will use

58
Using neighboring voxels for prediction
  • Algorithm (a sketch follows this list):
  • Phase 1:
  • For every voxel:
  • Find the best location
  • Train a linear prediction function on the localized
    features (using L2 epsilon-boosting)
  • Generate predictions
  • (Until now: similar to the localized algorithm)
  • Phase 2:
  • For every voxel:
  • Find the neighboring voxels in the brain
  • Retrain a linear prediction function on the
    localized features and the predictions for the
    neighboring voxels
  • Note:
  • The models are nested: Local ⊂ Neighbor ⊂ Global
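A minimal sketch of the two-phase neighbor model above. It uses scikit-learn's Lasso as a stand-in for the L2 epsilon-boosting fits, and the data layout (dicts keyed by voxel, a fixed penalty) is an assumption for the sketch.

```python
import numpy as np
from sklearn.linear_model import Lasso

def neighbor_model(local_feats, responses, voxel, neighbors, alpha=0.1):
    """Two-phase neighbor model sketch.

    local_feats: dict voxel -> (n x p_v) matrix of localized features
    responses:   dict voxel -> length-n response vector
    Phase 1 fits a sparse local model for the target voxel and each neighbor;
    Phase 2 refits the target voxel on its local features plus the
    neighbors' phase-1 predictions.
    """
    phase1_pred = {}
    for v in [voxel] + list(neighbors):
        fit_v = Lasso(alpha=alpha).fit(local_feats[v], responses[v])
        phase1_pred[v] = fit_v.predict(local_feats[v])
    augmented = np.column_stack(
        [local_feats[voxel]] + [phase1_pred[v][:, None] for v in neighbors])
    return Lasso(alpha=alpha).fit(augmented, responses[voxel])
```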

59
Sparse covariance estimation via an L1 penalty
Banerjee, El Ghaoui, d'Aspremont (08)
60
Model selection consistency (Ravikumar et al, 08)
Assume the irrepresentable condition holds, and either
1. X is sub-Gaussian (with a suitable effective sample
size), or
2. X has finite 4m-th moments.
Then, with high probability as n tends to infinity, the
correct model is chosen.
61
Model selection consistency
Ravikumar, Wainwright, Raskutti, Yu (08) give
sufficient conditions for model selection
consistency, stated in terms of the Hessian of the
log-likelihood and a model complexity parameter
defined from it.
62
Gallant Lab in Nature News
  • This article is part of Nature's premium content.
  • Published online 5 March 2008, Nature,
    doi:10.1038/news.2008.650
  • Mind-reading with a brain scan
  • Brain activity can be decoded using magnetic
    resonance imaging.
  • Kerri Smith
  • Scientists have developed a way of decoding
    someone's brain activity to determine what they
    are looking at.
  • The problem is analogous to the classic "pick
    a card, any card" magic trick, says Jack
    Gallant, a neuroscientist at the University of
    California in Berkeley, who led the study.

63
Proof Techniques
  • Primal-dual witness:
  • useful to prove equalities (difficult) by only
    showing inequalities (easier)
  • Fixed-point analysis:
  • useful to bound a zero of a complicated
    functional (e.g. the gradient)

64
Primal-dual witness sketch
  • Construct a candidate primal-dual pair
  • A proof technique, not a practical algorithm

65
Sparsity of the Inverse Correlation (S)
  • In a multivariate Gaussian distribution,
    S signifies conditional independence:
  • if S_ij = 0, then (i, j) are conditionally
    independent (conditioned on all other variables)

Example: a 4 x 4 grid. Each pixel is independently
(positively) correlated with its 4 neighbors. Then
all pixels are correlated, but the inverse
correlation is sparse.
Correlation
Inverse Correlation
Color keys: red < 0, purple > 0; 0 < red < green < purple
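This grid example is easy to reproduce numerically; the sketch below builds a sparse precision (inverse covariance) matrix for a 4 x 4 grid with positive partial correlations between neighbors, and checks that its inverse (the covariance) has essentially no zeros. The edge weight -0.2 is an arbitrary illustrative choice.

```python
import numpy as np

# Sparse precision matrix for a 4 x 4 grid: each pixel is conditionally
# dependent only on its 4 neighbors (negative off-diagonal entries in the
# precision correspond to positive partial correlations).
p = 4
n_nodes = p * p
prec = np.eye(n_nodes)
for i in range(p):
    for j in range(p):
        a = i * p + j
        for di, dj in [(0, 1), (1, 0)]:        # right and down neighbors
            if i + di < p and j + dj < p:
                b = (i + di) * p + (j + dj)
                prec[a, b] = prec[b, a] = -0.2
cov = np.linalg.inv(prec)                      # dense: all pixels correlated
print("zero entries in precision:", int(np.sum(np.isclose(prec, 0))))
print("zero entries in covariance:", int(np.sum(np.isclose(cov, 0))))
```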
66
Sparsity of the Inverse Correlation (cont.)
  • We want to keep only the true independent effects
  • Sparsity of structure in images is assumed.
  • Example (cont.)
  • Estimate the correlation and inverse correlation
    from 1000 samples

Inverse Sample Correlation
Sparse Inverse Correlation (penalty 0.1)
Color keys: red < 0, orange = 0, purple > 0;
red < yellow < 0 < green < purple
67
Sparsity of the Inverse Correlation (S)
  • In a multivariate Gaussian distribution,
    S signifies conditional independence:
  • if S_ij = 0, then (i, j) are conditionally
    independent (conditioned on the rest of the
    graph)

Example: a 4 x 4 grid. Each pixel is independently
(positively) correlated with its 4 neighbors. Then
all pixels are correlated, but the inverse
correlation is sparse.
Correlation
Inverse Correlation
Color keys: red < 0, purple > 0; 0 < red < green < purple
68
Sparsity of the Inverse Correlation (cont.)
  • We want to keep only the true independent effects
  • Sparsity of structure in images is assumed.
  • Example (cont.)
  • Estimate the correlation and inverse correlation
    from 1000 samples

Inverse Sample Correlation
Sparse Inverse Correlation (penalty 0.1)
Color keys: red < 0, orange = 0, purple > 0;
red < yellow < 0 < green < purple
69
Acknowledgements

Co-authors: P. Ravikumar, M. Wainwright, G.
Raskutti (graphical model); P. Ravikumar, V. Vu,
Y. Benjamini, K. Kay, T. Naselaris, J. Gallant.
Funding: NSF-IGMS Grant (06-07, in the Gallant
Lab), Guggenheim Fellowship (06), NSF DMS
Grant, ARO Grant.
70
Spatial RFs and Tuning Curves