Model Selection and Related Topics presentation

Title: Model Selection and Related Topics

1
Overview Model Selection Theory Computing Continuous Model Selection
Model Selection and Related Topics
A mostly Bayesian perspective
David Madigan Rutgers
2
Bayesian basics Bayes factors
Bayesian Basics

The Bayesian approach to statistical inference
computes probability distributions for all
unknowns (model parameters, future observables,
etc.) conditional on the observed data
Thus, denoting the unknowns by $\theta$, we compute
$p(\theta \mid \text{data}) \propto p(\text{data} \mid \theta) \times p(\theta)$
To play this game you need a prior, $p(\theta)$, and a
likelihood, $p(\text{data} \mid \theta)$.

3
Bayesian Priors (per D.M. Titterington)
4
Bayesian Basics

This presentation will focus mostly on the model, $p(\text{data} \mid \theta)$
An idealization of the probabilistic process by which mother nature generates the data
Frequently, a data analyst will entertain several models for the data: $p(\text{data} \mid \theta_1, M_1)$, $p(\text{data} \mid \theta_2, M_2)$, etc.
This gives rise to the model selection problem (including model composition)

5
Bayes Factors: comparing two models/hypotheses

A Bayes factor is the ratio of the posterior odds to the prior odds of one hypothesis against another
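
In symbols, a sketch of the definition described on this slide. The labels $M_0$, $M_1$ for the two hypotheses, $D$ for the data, and $K$ for the Bayes factor are chosen here for illustration, and which model goes in the numerator is a convention.

```latex
K \;=\; \frac{p(M_1 \mid D) \,/\, p(M_0 \mid D)}{p(M_1) \,/\, p(M_0)}
  \;=\; \frac{p(D \mid M_1)}{p(D \mid M_0)},
\qquad
p(D \mid M_k) \;=\; \int p(D \mid \theta_k, M_k)\, p(\theta_k \mid M_k)\, d\theta_k .
```

The second equality follows from Bayes' rule applied to each model, so the Bayes factor depends on the data only through the two marginal likelihoods.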

6
Interpretation of Bayes Factors

Jeffreys suggested the following scale for
interpreting the numerical value of a Bayes
Factor

(0.03)
7
Interpretation of Bayes Factors

Note that the Bayes Factor involves model probabilities, both prior, p(M), and posterior, p(M | data)
p(M) is the probability that model M generated the data
What if we don't believe that any of the models generated the data?

8
Bayes Factors Example (Draper)

Here is a density estimate for 100 univariate observations, $y_1, \ldots, y_{100}$
$M_0$: $y_i \sim N(\mu, \tau)$
$M_1$: $y_i \sim t(\mu, \tau, \nu)$

9
Bayes Factors Example (cont.)

Need to specify priors for everything

10
Bayes Factors Example (Draper)

$M_0$: $y_i \sim N(\mu, \tau)$
$M_1$: $y_i \sim t(\mu, \tau, \nu)$
K is about 0.04
Interesting tidbit:
posterior standard deviation of $\mu$ given $M_0$: 0.165
posterior standard deviation of $\mu$ given $M_1$: 0.153
(so model averaging can reduce the posterior standard deviation)

11
Bayes Factors and Improper Priors

Using improper parameter priors means the Bayes
factor contains a ratio of unknown constants
Lots of work trying to get around this: fractional Bayes factors, partial Bayes factors, intrinsic Bayes factors, etc.
Simpler solution: use proper priors

12
Bayes Factors and Model Probabilities

Note that posterior model probabilities can be derived from all pairwise Bayes factors and pairwise prior odds
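
A minimal sketch of this calculation, assuming every Bayes factor is expressed against one common reference model. The function name and the numbers are hypothetical, chosen only for illustration.

```python
def posterior_model_probs(bayes_factors, prior_probs):
    """Posterior model probabilities from Bayes factors against a reference model.

    bayes_factors[k] = p(D | M_k) / p(D | M_0), so bayes_factors[0] == 1.
    prior_probs[k]   = p(M_k); only their ratios (the prior odds) matter.
    """
    # Bayes factor times prior probability is proportional to the posterior probability.
    weights = [bf * prior for bf, prior in zip(bayes_factors, prior_probs)]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical Bayes factors for three models with equal prior probabilities.
probs = posterior_model_probs([1.0, 3.0, 0.5], [1 / 3, 1 / 3, 1 / 3])
```

Because only ratios enter, the prior probabilities need not be normalized before calling the function.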

13
Bayesian model selection Model scores Variable
selection Model averaging
Bayesian Model Selection

If we believe that one of the candidate models
generated the data, then the predictively optimal
model has highest posterior probability
This is also true for variable selection with standard linear models when $X^T X$ is diagonal, $\sigma^2$ is known, and suitable priors are used (Clyde and Parmigiani, 1996)

14
The Median Probability Model (Barbieri and
Berger, 2004)
15
The Median Probability Model (Barbieri and
Berger, 2004)
16
Deviance Information Criterion (DIC)

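Deviance, $D(y, \theta) = -2 \log p(y \mid \theta)$, is a standard measure of model fit
Can summarize in two ways: at the posterior mean or mode
$D_{\hat\theta}(y) = D(y, \hat\theta(y))$   (1)
or by averaging over the posterior
$D_{avg}(y) = E[D(y, \theta) \mid y]$   (2)
(2) will be bigger (i.e., worse) than (1)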

17
Deviance Information Criterion (DIC)

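$p_D^{(1)}$, the posterior-average deviance (2) minus the plug-in deviance (1), is a measure of model complexity.
In the normal linear model $p_D^{(1)}$ equals the number of parameters
More generally $p_D^{(1)}$ equals the number of unconstrained parameters
DIC = (2) + $p_D^{(1)}$ = 2 × (2) − (1)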
Approximately optimizes predictive log loss
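
A small sketch of how the two deviance summaries, the complexity measure, and DIC might be computed from posterior draws. The model (normal observations with known unit variance and a flat prior on the mean) and all names are assumptions made for illustration, not part of the original slides.

```python
import math
import random
import statistics


def deviance(y, theta, sigma=1.0):
    """D(theta) = -2 log p(y | theta) for y_i ~ N(theta, sigma^2)."""
    log_lik = sum(
        -0.5 * math.log(2 * math.pi * sigma**2) - (yi - theta) ** 2 / (2 * sigma**2)
        for yi in y
    )
    return -2.0 * log_lik


def dic(y, draws):
    """Return (D_hat, D_avg, p_D, DIC) from posterior draws of theta."""
    d_hat = deviance(y, statistics.fmean(draws))             # summary (1): plug-in at posterior mean
    d_avg = statistics.fmean(deviance(y, t) for t in draws)  # summary (2): posterior average
    p_d = d_avg - d_hat                                      # complexity: (2) minus (1)
    return d_hat, d_avg, p_d, d_avg + p_d


random.seed(0)
y = [random.gauss(1.0, 1.0) for _ in range(20)]
# Under a flat prior the posterior of theta is N(ybar, 1/n), so it can be sampled directly.
ybar, n = statistics.fmean(y), len(y)
draws = [random.gauss(ybar, 1 / math.sqrt(n)) for _ in range(5000)]
d_hat, d_avg, p_d, dic_value = dic(y, draws)
```

With a single free parameter the complexity estimate `p_d` should come out close to 1, in line with the statement that it counts the parameters in a normal linear model.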

18
Other Selection Criteria

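The training error rate, $\overline{\mathrm{err}} = \frac{1}{N}\sum_{i=1}^{N} L(y_i, \hat f(x_i))$, will usually be less than the true error
Typically work with error estimates of the form $\widehat{\mathrm{Err}} = \overline{\mathrm{err}} + \widehat{\mathrm{op}}$, where $\widehat{\mathrm{op}}$ is an estimate of the optimism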

19
Specific Selection Criteria
(squared error loss)
20
Selection Criteria - Theory

BIC is consistent when the true model is fixed
AIC is consistent if the dimensionality of the
true model increases with N at an appropriate
rate
For standard linear models with known variance, AIC and $C_p$ are essentially equivalent
Folklore is that AIC tends to overfit
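
A sketch of how the two criteria can disagree, assuming the common Gaussian linear-model forms AIC = n log(RSS/n) + 2d and BIC = n log(RSS/n) + d log n (each up to an additive constant). The RSS values are hypothetical.

```python
import math


def aic_bic(rss, n, d):
    """AIC and BIC, up to an additive constant, for a Gaussian linear model."""
    fit = n * math.log(rss / n)              # -2 x maximized log-likelihood, constants dropped
    return fit + 2 * d, fit + d * math.log(n)


# Hypothetical (RSS, number of parameters) pairs for three candidate models, n = 100.
n_obs = 100
candidates = {"small": (130.0, 2), "medium": (122.0, 4), "large": (120.0, 9)}
scores = {name: aic_bic(rss, n_obs, d) for name, (rss, d) in candidates.items()}
best_aic = min(scores, key=lambda m: scores[m][0])
best_bic = min(scores, key=lambda m: scores[m][1])
```

In this made-up case AIC prefers the medium model while BIC, with its heavier log n penalty, prefers the small one, which is the pattern behind the folklore about AIC.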

21
Cross-validation

Since we don't usually believe that one of the candidate models generated the data, and predictive accuracy on future data is key, many authors argue in favor of cross-validation
For example (Bernardo and Smith, 1994, 6.1.6), select the model that maximizes
$\frac{1}{k} \sum_{j=1}^{k} \log p(x_j \mid x_{n-1}(j), M)$
where $x_{n-1}(j)$ represents the data with observation $x_j$ removed and $x_1, \ldots, x_k$ is a random sample from the data

22
Cross-validation

Cross-validation gives a slightly biased estimate of future accuracy because it does not use all the data
Burman (1989) provides a bias correction

23
Variable Selection

Important special case of model selection
Which subset of $X_1, \ldots, X_d$ to use as predictors of a response variable Y?
$2^d$ possible models. For d = 30, there are $\approx 10^9$ models. For d = 50, there are $> 10^{15}$

24
Variable Selection for Linear Models
Here the Xs might be:

Raw predictor variables (continuous or coded-categorical)
Transformed predictors ($X_4 = \log X_3$)
Basis expansions ($X_4 = X_3^2$, $X_5 = X_3^3$, etc.)
Interactions ($X_4 = X_2 X_3$)

Popular choice for estimation is least squares
25
Variable Selection for Linear Models

Standard all-subsets finds the subset of size k, k = 1, ..., p, that minimizes RSS

Choice of subset size requires a tradeoff: AIC, BIC, marginal likelihood, cross-validation, etc.
"Leaps and bounds" is an efficient algorithm to do all-subsets up to about 40 variables

26
Bayesian Variable Selection

Two key challenges
- Exploring a space of $2^d$ models (more about this later)
- Choosing a $p(M) \equiv p(\gamma)$, where $\gamma$ indexes models
Many applications use $p(M) \propto 1$, but this induces a binomial distribution over model size
Denison et al. (1998) use a truncated Poisson prior distribution for model size

27
Selection Bias

Selection bias is a significant unresolved issue
Searching model space to find the best model
tends to overfit the data
This holds even when using the close-to-unbiased
estimate of predictive performance that
cross-validation provides

28
Vehtari and Lampinen example

Data

Model

29
Vehtari and Lampinen example
30
Vehtari and Lampinen's Solution

Select the simplest model that gives a predictive
distribution that is close to the BMA
predictive distribution
Not obvious how to conduct this search in
high-dimensional problems

31
Cross-validation and model complexity

One standard error rule: pick the simplest model within one standard error of the minimum
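
The rule can be sketched as follows, assuming each candidate model comes with a complexity measure, a cross-validated error, and the standard error of that estimate. The numbers are hypothetical.

```python
def one_standard_error_rule(results):
    """Pick the simplest model whose CV error is within one SE of the best.

    results: list of (complexity, cv_error, cv_standard_error) tuples.
    """
    best = min(results, key=lambda r: r[1])              # model with minimum CV error
    threshold = best[1] + best[2]                        # minimum plus its standard error
    eligible = [r for r in results if r[1] <= threshold]
    return min(eligible, key=lambda r: r[0])             # simplest eligible model


# Hypothetical cross-validation results for models of increasing complexity.
cv_results = [
    (1, 0.40, 0.03),
    (2, 0.31, 0.03),
    (3, 0.27, 0.02),
    (4, 0.26, 0.02),
    (5, 0.30, 0.02),
]
chosen = one_standard_error_rule(cv_results)
```

Here the minimum is at complexity 4, but complexity 3 is within one standard error of it and is therefore the one selected.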
32
Post-Model Selection Statistical Inference

Conducting a data-driven model search and then
proceeding as if the search never took place
leads to biased and overconfident inferences
Some non-Bayesian work on adjustment for model
selection (e.g., current issue of JASA)

34
Bayesian Model Averaging

If we believe that one of the candidate models
generated the data, then the predictively optimal
strategy is to average over all the models.
If Q is the inferential target, Bayesian Model Averaging (BMA) computes $p(Q \mid D) = \sum_k p(Q \mid M_k, D)\, p(M_k \mid D)$
Substantial empirical evidence that BMA provides
better prediction than model selection

35
Model probabilities BIC Variable selection Model
averaging
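Laplace's Method for $p(D \mid M) = \int p(D \mid \theta, M)\, p(\theta \mid M)\, d\theta$
Let $l(\theta) = \frac{1}{n} \log\{p(D \mid \theta, M)\, p(\theta \mid M)\}$ (i.e., the log of the integrand divided by n)
then $p(D \mid M) = \int e^{n\, l(\theta)}\, d\theta$
and $p(D \mid M) \approx (2\pi)^{d/2}\, |\Sigma|^{1/2}\, p(D \mid \tilde\theta, M)\, p(\tilde\theta \mid M)$
where $\Sigma = \{-n\, D^2 l(\tilde\theta)\}^{-1}$, d is the dimension of $\theta$, and $\tilde\theta$ is the posterior mode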
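
A worked sketch of Laplace's method on a case where the exact answer is known. The setting is an assumption made for illustration: s successes in a specific sequence of n Bernoulli trials with a Uniform(0, 1) prior on the success probability, so that the marginal likelihood is a Beta function. It requires 0 < s < n.

```python
import math


def log_marginal_exact(s, n):
    """Exact log p(D): integral of theta^s (1 - theta)^(n - s) over (0, 1)."""
    return math.lgamma(s + 1) + math.lgamma(n - s + 1) - math.lgamma(n + 2)


def log_marginal_laplace(s, n):
    """Laplace approximation: expand log[p(D | theta) p(theta)] around its mode."""
    mode = s / n                                            # posterior mode under the flat prior
    log_integrand = s * math.log(mode) + (n - s) * math.log(1 - mode)
    neg_hessian = s / mode**2 + (n - s) / (1 - mode) ** 2   # minus the second derivative at the mode
    sigma = 1.0 / neg_hessian                               # minus the inverse Hessian (d = 1)
    return 0.5 * math.log(2 * math.pi * sigma) + log_integrand


exact = log_marginal_exact(20, 50)
approx = log_marginal_laplace(20, 50)
```

For these values the two log marginal likelihoods differ by a small amount relative to their size, and the gap shrinks as n grows.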
36

Tierney & Kadane (1986, JASA) show the approximation is $O(n^{-1})$
Using the MLE instead of the posterior mode is also $O(n^{-1})$
Using the expected information matrix in $\Sigma$ is $O(n^{-1/2})$ but convenient since often computed by standard software
Raftery (1993) suggested approximating the posterior mode by a single Newton step starting at the MLE
Note the prior is explicit in these approximations

37
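Monte Carlo Estimates of p(D | M)
Draw iid $\theta_1, \ldots, \theta_m$ from $p(\theta)$ and use $\hat p(D \mid M) = \frac{1}{m} \sum_{i=1}^{m} p(D \mid \theta_i)$
In practice has large variance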
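
A minimal sketch of the prior-sampling estimate on a toy problem where the exact marginal likelihood is available for comparison. The Bernoulli data, the Uniform(0, 1) prior, and the function names are assumptions for illustration.

```python
import math
import random


def likelihood(theta, s, n):
    """p(D | theta) for s successes in a specific sequence of n Bernoulli trials."""
    return theta**s * (1 - theta) ** (n - s)


def marginal_likelihood_prior_sampling(s, n, m, rng):
    """Average the likelihood over m iid draws from the Uniform(0, 1) prior."""
    draws = (rng.random() for _ in range(m))
    return sum(likelihood(t, s, n) for t in draws) / m


rng = random.Random(1)
estimate = marginal_likelihood_prior_sampling(s=3, n=10, m=20000, rng=rng)
exact = math.exp(math.lgamma(4) + math.lgamma(8) - math.lgamma(12))
```

This example is deliberately easy: with only ten observations the prior and posterior overlap well. With more data most prior draws land where the likelihood is negligible, which is where the large variance comes from.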
38
Better Monte Carlo Estimates of p(D | M)
Draw iid $\theta_1, \ldots, \theta_m$ from $p(\theta \mid D)$
Importance Sampling
39

Newton and Raftery's Harmonic Mean Estimator
Unstable in practice and needs modification

40
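p(D | M) from Gibbs sampler output (Chib's method)
First note the following identity (for any $\theta^*$): $p(D) = \frac{p(D \mid \theta^*)\, p(\theta^*)}{p(\theta^* \mid D)}$
$p(D \mid \theta^*)$ and $p(\theta^*)$ are usually easy to evaluate. What about $p(\theta^* \mid D)$?
Suppose we decompose $\theta$ into $(\theta_1, \theta_2)$ such that $p(\theta_1 \mid D, \theta_2)$ and $p(\theta_2 \mid D, \theta_1)$ are available in closed form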
41
The Gibbs sampler gives (dependent) draws from $p(\theta_1, \theta_2 \mid D)$ and hence marginally from $p(\theta_2 \mid D)$
Rao-Blackwellization
42
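What about three parameter blocks?
$p(\theta^* \mid D) = p(\theta_1^* \mid D)\, p(\theta_2^* \mid D, \theta_1^*)\, p(\theta_3^* \mid D, \theta_1^*, \theta_2^*)$
First factor: OK (Rao-Blackwellize over the Gibbs output). Last factor: OK (closed form). Middle factor: ?
$p(\theta_2^* \mid D, \theta_1^*) = \int p(\theta_2^* \mid D, \theta_1^*, \theta_3)\, p(\theta_3 \mid D, \theta_1^*)\, d\theta_3$ needs draws of $\theta_3$ from $p(\theta_3 \mid D, \theta_1^*)$
To get these draws, continue the Gibbs sampler, sampling in turn from $p(\theta_2 \mid D, \theta_1^*, \theta_3)$ and $p(\theta_3 \mid D, \theta_1^*, \theta_2)$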
43
p(D | M) from Metropolis sampler output (Chib & Jeliazkov)
44
E1 with respect to $\theta \mid y$
E2 with respect to $q(\theta^*, \theta)$
45
Savage-Dickey Density Ratio

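Suppose $M_0$ simplifies $M_1$ by setting one parameter (say $\theta_1$) to some constant (typically zero)
If $p_1(\theta_2 \mid \theta_1 = 0) = p_0(\theta_2)$ then

$\frac{p(\text{data} \mid M_0)}{p(\text{data} \mid M_1)} = \frac{p(\theta_1 = 0 \mid M_1, \text{data})}{p(\theta_1 = 0 \mid M_1)}$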
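
A small numerical check of the density ratio in a setting where both sides are available in closed form. The setup is an assumption for illustration: observations are N(theta, 1), M1 places a N(0, tau2) prior on theta, M0 fixes theta = 0, and there are no nuisance parameters.

```python
import math


def normal_pdf(x, mean, var):
    return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)


def bf01_savage_dickey(ybar, n, tau2):
    """Posterior density over prior density of theta at 0, both under M1."""
    post_var = 1.0 / (n + 1.0 / tau2)          # conjugate normal update
    post_mean = post_var * n * ybar
    return normal_pdf(0.0, post_mean, post_var) / normal_pdf(0.0, 0.0, tau2)


def bf01_direct(ybar, n, tau2):
    """Ratio of marginal likelihoods, written via the sufficient statistic ybar."""
    return normal_pdf(ybar, 0.0, 1.0 / n) / normal_pdf(ybar, 0.0, tau2 + 1.0 / n)


sd_ratio = bf01_savage_dickey(ybar=0.3, n=25, tau2=1.0)
direct_ratio = bf01_direct(ybar=0.3, n=25, tau2=1.0)
```

The two functions agree to floating-point accuracy, which is the content of the identity: the Bayes factor for the restricted model can be read off the posterior density under the larger model.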
46
Bayesian Information Criterion (BIC)
(SL is the negative log-likelihood)

BIC is an O(1) approximation to log p(D | M)
Circumvents explicit prior
Approximation is $O(n^{-1/2})$ for a class of priors called "unit information priors".
No free lunch (Weakliem (1998) example)

50
Computing Variable Selection via Stepwise Methods

Efroymson's 1960 algorithm is still the most widely used

51
Efroymson

F-to-Enter
F-to-Remove
Guaranteed to converge
Not guaranteed to converge to the right model

Distribution not even remotely like F
52
Trouble

$Y = X_1 - X_2$
Y almost orthogonal to $X_1$ and $X_2$
Forward selection and Efroymson pick $X_3$ alone

53
More Trouble

Berk Example with 4 variables

The forward and backward sequence is ($X_1$; $X_1, X_2$; $X_1, X_2, X_4$)
The $R^2$ for $X_1, X_2$ is 0.015

54
Even More Trouble

Detroit example, N = 13, d = 11
First variable selected in forward selection is the first variable eliminated by backward elimination
Best subset of size 3 gives RSS of 6.8
Forward's best set of 3 has RSS 21.2
Backward's gets 23.5

55
Alternatives to all-subsets

Simulated Annealing, Tabu Search, etc.

56
MCMC for Bayesian Variable Selection (Ioannis Ntzoufras)
http://www.ba.aegean.gr/ntzoufras/courses/bugs2/handouts/modelsel/4_1_tutorial_handouts.pdf
57
Why MCMC?
58
Reversible Jump MCMC
59
Reversible Jump MCMC
60
Stochastic Search Variable Selection (George & McCulloch)
61
SSVS Procedure
62
Estimating Spina Bifida Numbers with
Capture-Recapture Methods
(Regal & Hook, 1991; Madigan & York, 1996)

Model    Pr(Model)   N     95% HPD
B D R    0.37        731   (701, 767)
B D R    0.30        756   (714, 811)
B R D    0.28        712   (681, 751)
BMA                  728   (682, 797)
63
Spina Bifida 95% HPDs
[Figure: 95% HPD intervals for M1, M2, M3, and BMA]
72
Model Uncertainty

Posterior Variance = Within-Model Variance + Between-Model Variance
Data-driven model selection is risky: "Part of the evidence is spent to specify the model" (Leamer, 1978)
Model-based inferences can be over-precise
Model-based predictions can be badly calibrated
Draper (1995), Chatfield (1995)

Bayesian Model Averaging (BMA) can help
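
The variance decomposition can be sketched numerically as below. The posterior model probabilities and the within-model means and variances are hypothetical inputs chosen only for illustration.

```python
def bma_mean_and_variance(post_probs, means, variances):
    """Model-averaged mean and variance of a scalar quantity.

    post_probs[k] = Pr(M_k | D); means[k] and variances[k] are the posterior
    mean and variance of the quantity under model M_k.
    """
    bma_mean = sum(p * m for p, m in zip(post_probs, means))
    within = sum(p * v for p, v in zip(post_probs, variances))
    between = sum(p * (m - bma_mean) ** 2 for p, m in zip(post_probs, means))
    return bma_mean, within, between, within + between


mean, within, between, total = bma_mean_and_variance(
    [0.5, 0.3, 0.2], [10.0, 12.0, 9.0], [1.0, 1.5, 0.8]
)
```

Conditioning on a single selected model reports only that model's own variance and drops the between-model term, which is one way inferences become over-precise.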
73
Bayesian Model Averaging
76
Out-of-Sample Predictive Performance

Data                         Model Class                 Average Improvement in Predictive
                                                         Probability over MAP Model (%)
1. Coronary Heart Disease    Decomposable UDGs           2.2
2. Women and Mathematics     Decomposable UDGs           0.6
3. Scrotal Swellings         Decomposable UDGs           5.1
4. Crime and Punishment      Linear Regression           61.3
5. Lung Cancer               Exponential Regression      1.8
6. Cirrhosis                 Cox Survival Regression     1.8
7. Coronary Heart Disease    Essential graphs            1.5
8. Women and Mathematics     Essential graphs            3.0
9. Stroke                    Cox Survival Regression     15.0

77
BMA Computing
Occam's Window: Find parsimonious models with large Pr(M | D) and average over those. Madigan and Raftery (1991)
Importance Sampling (Clyde et al., JASA, 1996)
MCMCMC: Use MCMC to draw from Pr(M | D). Madigan and York (1992)
78
Gibbs MC3
e.g. Undirected graphical models
SSVS (George and McCulloch, 1993)

Choose vi, vj at random (or systematically) and
draw from

79
Metropolis MC3
e.g. SVO Regression Outliers (Hoeting, Raftery, Madigan, 1994,5,6)
Possible Predictors: a, b, c, d
Possible Outliers: 13, 20, 40
Current model: (b,c)(13,20)
Candidate Models: (b)(13,20), (b,c)(13), (c)(13,20), (b,c)(20), (b,c,d)(13,20), (b,c)(13,20,40), (a,b,c)(13,20)
Accept the Candidate Model with Prob $\min\{1, \Pr(M' \mid D) / \Pr(M \mid D)\}$, where M is the current model and M' the candidate
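
A sketch of the neighborhood and one Metropolis move for the example on this slide. The neighborhood (models differing from the current one by a single predictor or a single outlier) reproduces the seven candidates listed above. The log-posterior function is a hypothetical stand-in, since the real one depends on the data and priors.

```python
import math
import random


def neighbors(predictors, outliers, all_predictors, all_outliers):
    """Models that differ from the current one by exactly one predictor or one outlier."""
    flips = [(predictors ^ {p}, outliers) for p in all_predictors]
    flips += [(predictors, outliers ^ {o}) for o in all_outliers]
    return flips


def mc3_step(current, log_post, all_predictors, all_outliers, rng):
    """Propose a random neighbor; accept with probability min(1, posterior ratio)."""
    candidate = rng.choice(neighbors(*current, all_predictors, all_outliers))
    log_ratio = log_post(candidate) - log_post(current)
    if log_ratio >= 0 or rng.random() < math.exp(log_ratio):
        return candidate
    return current


def toy_log_post(model):
    # Hypothetical unnormalized log posterior: favors models with three terms in total.
    predictors, outliers = model
    return -abs(len(predictors) + len(outliers) - 3)


ALL_PREDICTORS = "abcd"
ALL_OUTLIERS = (13, 20, 40)
current = (frozenset("bc"), frozenset({13, 20}))
candidates = neighbors(*current, ALL_PREDICTORS, ALL_OUTLIERS)
rng = random.Random(0)
next_model = mc3_step(current, toy_log_post, ALL_PREDICTORS, ALL_OUTLIERS, rng)
```

Every model has the same number of neighbors under this scheme, so the proposal is symmetric and the plain posterior ratio is enough for the acceptance step.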
80
Augmented MC3
e.g. Bayesian Networks

A total ordering T of V is said to be compatible with a directed graph M if the orientation of the arrows in M is consistent with T.
Draw from Pr(T, M | D)
Pr(M | T, D)    Pr(T | M, D)

Uniform on compatible T's
Metropolis accept/reject

Generate M' by adding or deleting an edge from M consistent with T
Metropolis accept/reject

81
More Augmented MC3

Draw from Pr(Z, M | D)
Pr(M | Z, D)    Pr(Z | M, D)

e.g. Double Sampling Missing Data (York et al., 1994)
$\Pr(Z, \theta_M \mid M, D)$
$\Pr(Z \mid \theta_M, M, D)$    $\Pr(\theta_M \mid Z, M, D)$
Reversible Jump MCMC (Green, 1995)
82
Linear Regression SVT, SVO, SVOT
Hoeting, Raftery, Madigan (1994,5,6)

Normal-gamma conjugate priors
Box-Cox and ACE Transformations
Outliers (pre-process with LMS regression)

93
$\Pr(\beta_i \neq 0 \mid D)$
94
Generalized Linear Models
Raftery (1996)

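Laplace Approximation

$\Sigma$ is minus the inverse Hessian of $\log\{p(D \mid \theta, M)\, p(\theta \mid M)\}$ evaluated at the posterior mode $\tilde\theta$

Idea: approximate $\tilde\theta$ by one Newton step starting from the MLE
Gives an approximation using only GLIM output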

95
Prior Distribution on Models
Madigan, Gavrin, Raftery (1995)

Start with uniform pre-prior, PPr(M)
Elicit Imaginary Data from the Expert
Update pre-prior using imaginary data to get the
real prior, Pr(M)
Provided improved predictive performance in a
particular medical example
Ibrahim and Laud (1994)

96
Predicting Strokes
(Volinsky, Madigan, Raftery, Kronmal, 1996)

Stroke is the third leading cause of death amongst adults in the US
Gorelick (1995) estimated that 80% of strokes are preventable
Cardiovascular Health Study (CHS)
Ongoing. Started 1989. 5,201 in four counties
65+: risk factors different for older people?
172/4501 strokes

97
Measured Covariates
Follow-up ranged from 3.5 to 4.5 years (Average
4.1)
98
Cox Model
Estimation usually based on the Partial
Likelihood
(Taplin, 1993, Draper, 1995)
99
Finding the Models in Occam's Window

Need models with posterior probability within a factor of C of MAP model
Approximate Leaps and Bounds
Furnival and Wilson (1974); Lawless and Singhal (1978)
Finds top q models of each size
Find models within factor of $C^2$
Compute Exact BIC for these models
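
The final filtering step can be sketched as follows, assuming equal prior model probabilities and the convention that smaller BIC is better, so that Pr(M | D) is roughly proportional to exp(-BIC / 2). The model labels, BIC values, and the default C = 20 are hypothetical.

```python
import math


def occams_window(bic_by_model, c=20.0):
    """Keep models whose approximate posterior probability is within a factor c of the best."""
    best = min(bic_by_model.values())
    # Relative posterior weight of each model compared with the best-scoring one.
    weights = {m: math.exp(-(b - best) / 2.0) for m, b in bic_by_model.items()}
    kept = {m: w for m, w in weights.items() if w >= 1.0 / c}
    total = sum(kept.values())
    return {m: w / total for m, w in kept.items()}   # renormalize over the window


window = occams_window({"A": 100.0, "B": 101.2, "C": 104.0, "D": 112.0})
```

Model D is dropped here because its weight relative to the best model falls below 1/20; the remaining weights are renormalized for averaging.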

100
Picture of Probs vs P-values
102
Predictive Performance
Assigned        Model Averaging       Stepwise              Top PMP
Risk Group      No stroke   Stroke    No stroke   Stroke    No stroke   Stroke
Low             751         7         750         8         724         10
Medium          770         24        799         27        801         28
High            645                   617         51        641         48
104
Ridge Lasso LARS Case Study
Shrinkage Methods

Subset selection is a discrete process: individual variables are either in or out. Combinatorial nightmare.
This method can have high variance: a different dataset from the same source can result in a totally different model
Shrinkage methods allow a variable to be partly included in the model. That is, the variable is included but with a shrunken coefficient

105
Ridge Regression
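$\hat\beta^{ridge} = \arg\min_\beta \sum_{i=1}^{N} \big(y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j\big)^2$ subject to $\sum_{j=1}^{p} \beta_j^2 \le s$
Equivalently $\hat\beta^{ridge} = \arg\min_\beta \big\{ \sum_{i=1}^{N} \big(y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j\big)^2 + \lambda \sum_{j=1}^{p} \beta_j^2 \big\}$
This leads to $\hat\beta^{ridge} = (X^T X + \lambda I)^{-1} X^T y$. Choose $\lambda$ by cross-validation.
Works even when $X^T X$ is singular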
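
A tiny sketch of the last point, assuming two centered predictors and no intercept so the 2 × 2 system can be solved by hand. The second predictor is an exact copy of the first, so least squares has no unique solution, yet the ridge system is still solvable. Data and names are hypothetical.

```python
def ridge_two_predictors(x1, x2, y, lam):
    """Solve (X^T X + lam I) beta = X^T y for two centered predictors, no intercept."""
    a = sum(v * v for v in x1) + lam
    b = sum(u * v for u, v in zip(x1, x2))
    d = sum(v * v for v in x2) + lam
    r1 = sum(u * v for u, v in zip(x1, y))
    r2 = sum(u * v for u, v in zip(x2, y))
    det = a * d - b * b          # zero when lam == 0 and x1, x2 are collinear
    return ((d * r1 - b * r2) / det, (a * r2 - b * r1) / det)


x1 = [-1.0, 0.0, 1.0]
x2 = [-1.0, 0.0, 1.0]            # exact copy of x1, so X^T X is singular
y = [-2.0, 0.0, 2.0]
beta = ridge_two_predictors(x1, x2, y, lam=0.1)
```

The penalty splits the effect equally between the two identical columns and shrinks their sum slightly below the least-squares total of 2.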
107
Ridge Regression = Bayesian MAP Regression
108
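Least Absolute Shrinkage and Selection Operator (LASSO)
$\hat\beta^{lasso} = \arg\min_\beta \sum_{i=1}^{N} \big(y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j\big)^2$ subject to $\sum_{j=1}^{p} |\beta_j| \le s$
Quadratic programming algorithm needed to solve for the parameter estimates
Penalty $\sum_j |\beta_j|^q$: q = 0 variable selection, q = 1 lasso, q = 2 ridge. Learn q?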
112
Ridge & LASSO - Theory

Lasso estimates are consistent
But, Lasso does not have the "oracle property." That is, it does not deliver the correct model with probability 1
Fan & Li's SCAD penalty function has the oracle property

113
LARS

New geometrical insights into Lasso and
Stagewise
Leads to a highly efficient Lasso algorithm for
linear regression

114
LARS

Start with all coefficients $b_j = 0$
Find the predictor $x_j$ most correlated with y
Increase $b_j$ in the direction of the sign of its correlation with y. Take residuals $r = y - \hat y$ along the way. Stop when some other predictor $x_k$ has as much correlation with r as $x_j$ has
Increase $(b_j, b_k)$ in their joint least squares direction until some other predictor $x_m$ has as much correlation with the residual r.
Continue until all predictors are in the model
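
The sketch below is not LARS itself. It is incremental forward stagewise regression, the small-step relative mentioned on the previous slide, which follows the same idea of always moving the coefficient of the predictor most correlated with the current residual. It assumes the predictor columns are centered and on a common scale; the data are hypothetical.

```python
def forward_stagewise(columns, y, step=0.01, n_steps=2000):
    """Incremental forward stagewise regression with a fixed small step size."""
    coefs = [0.0] * len(columns)
    residual = list(y)
    for _ in range(n_steps):
        # Inner product with the residual stands in for correlation (columns share a scale).
        corrs = [sum(x * r for x, r in zip(col, residual)) for col in columns]
        j = max(range(len(columns)), key=lambda k: abs(corrs[k]))
        if abs(corrs[j]) < 1e-12:
            break                                  # residual is orthogonal to every predictor
        delta = step if corrs[j] > 0 else -step    # move in the direction of the sign
        coefs[j] += delta
        residual = [r - delta * x for r, x in zip(residual, columns[j])]
    return coefs


x1 = [1.0, -1.0, 1.0, -1.0]
x2 = [1.0, 1.0, -1.0, -1.0]
y = [2.5, -1.5, 1.5, -2.5]      # equals 2 * x1 + 0.5 * x2
coefs = forward_stagewise([x1, x2], y)
```

Only `x1` moves at first. Once its correlation with the residual has dropped to that of `x2`, the two coefficients advance together, mirroring the "joint direction" step described above.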

116
Statistical Analysis of Textual Data

Statistical text analysis has a long history in
literary analysis and in solving disputed
authorship problems
First (?) is Thomas C. Mendenhall in 1887

117

Mendenhall was Professor of Physics at Ohio State and at the University of Tokyo, Superintendent of the U.S. Coast and Geodetic Survey, and later, President of Worcester Polytechnic Institute

Mendenhall Glacier, Juneau, Alaska
118
$\chi^2 = 127.2$, df = 12
119

Used Naïve Bayes with Poisson and Negative
Binomial model
Out-of-sample predictive performance

120
today

Statistical methods routinely used for textual
analyses of all kinds
Machine translation, part-of-speech tagging,
information extraction, question-answering, text
categorization, etc.
Not reported in the statistical literature (no
statisticians?)

121
Text categorization

Automatic assignment of documents with respect to a manually defined set of categories
Applications: automated indexing, spam filtering, content filters, medical coding, CRM, essay grading
Dominant technology is supervised machine
learning
Manually classify some documents, then learn a
classification rule from them (possibly with
manual intervention)

122
Document Representation

Documents usually represented as "bag of words"

$x_i$'s might be 0/1, counts, or weights (e.g. tf/idf, LSI)
Many text processing choices: stopwords, stemming, phrases, synonyms, NLP, etc.

123
Classifier Representation

For instance, a linear classifier

x_i's are derived from the text of the document
y_i indicates whether to put the document in the category
β_j are parameters chosen to give good
classification effectiveness
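
One common way to write a linear classifier in this notation, assuming x_{ij} denotes the value of feature j in document i, labels are coded +1 / -1, and the decision threshold is 0 (all three are assumptions made for illustration):

```latex
f(x_i) = \sum_j \beta_j x_{ij}, \qquad
\hat{y}_i =
\begin{cases}
+1 & \text{if } f(x_i) \ge 0 \\
-1 & \text{otherwise}
\end{cases}
```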

124
Logistic Regression Model

Linear model for log odds of category membership

Equivalent to

Conditional probability model
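
A standard way to write this model, assuming labels y_i in {+1, -1} and one coefficient β_j per feature (the notation is an assumption, chosen to match the symbols used on the neighbouring slides):

```latex
% Linear model for the log odds of category membership
\ln \frac{P(y_i = +1 \mid \beta, x_i)}{P(y_i = -1 \mid \beta, x_i)}
  = \beta^{T} x_i = \sum_j \beta_j x_{ij}

% Equivalent conditional probability model
P(y_i = +1 \mid \beta, x_i)
  = \frac{\exp(\beta^{T} x_i)}{1 + \exp(\beta^{T} x_i)}
```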

125
Logistic Regression as a Linear Classifier

If estimated probability of category membership
is greater than p, assign document to category

Choose p to optimize expected value of your
effectiveness measure
Can change the measure without changing the model
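
A small sketch of choosing the cutoff on held-out data, assuming F1 as the effectiveness measure and a search over the observed probabilities. The function names and toy numbers are hypothetical; the model's probabilities are fixed and only the cutoff moves.

```python
def f1_score(y_true, y_pred):
    """F1 for 0/1 labels; defined as 0 when there are no true positives."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

def best_threshold(probs, y_true, measure=f1_score):
    """Try each observed probability (and 0) as the cutoff; assign when prob > cutoff."""
    best_t, best_val = None, -1.0
    for t in [0.0] + sorted(set(probs)):
        preds = [1 if p > t else 0 for p in probs]
        val = measure(y_true, preds)
        if val > best_val:
            best_t, best_val = t, val
    return best_t, best_val

# Toy held-out probabilities and true labels (made up for illustration)
probs = [0.9, 0.8, 0.4, 0.3, 0.1]
labels = [1, 1, 0, 1, 0]
threshold, score = best_threshold(probs, labels)
```

Passing a different `measure` function re-tunes the cutoff without refitting the classifier.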

126
Maximum Likelihood Training

Choose parameters (β_j's) that maximize the
probability (likelihood) of the class labels (y_i's)
given the documents (x_i's)

Maximizing (log-)likelihood can be viewed as
minimizing a loss function
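
Written out under the assumption that labels are coded y_i in {+1, -1}, so that P(y_i | β, x_i) = 1 / (1 + exp(-y_i β^T x_i)), maximizing the log-likelihood is the same as minimizing the logistic loss:

```latex
\hat{\beta}
  = \arg\max_{\beta} \sum_i \ln P(y_i \mid \beta, x_i)
  = \arg\min_{\beta} \sum_i \ln\!\left(1 + \exp(-y_i \, \beta^{T} x_i)\right)
```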

127
Hastie, Friedman & Tibshirani
128
Avoiding Overfitting

Text is high-dimensional
Maximum likelihood can give infinite parameter
values (e.g. when the training data are separable) and poor effectiveness
Solution: penalize large β_j's, e.g. maximize

Called ridge logistic regression
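
A minimal sketch of the effect, assuming the penalized objective is the log-likelihood minus lam times the sum of squared coefficients, fitted by plain gradient ascent. The penalty form, the parameter name `lam`, the step size, and the toy data are all assumptions for illustration; for simplicity the intercept is penalized too.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def fit_logistic(X, y, lam=0.0, lr=0.1, n_iter=5000):
    """Gradient ascent on sum_i log P(y_i | x_i, beta) - lam * sum_j beta_j**2, y_i in {0, 1}."""
    beta = [0.0] * len(X[0])
    for _ in range(n_iter):
        grad = [-2.0 * lam * b for b in beta]  # gradient of the ridge penalty
        for xi, yi in zip(X, y):
            p = sigmoid(sum(b * x for b, x in zip(beta, xi)))
            for j, x in enumerate(xi):
                grad[j] += (yi - p) * x  # gradient of the log-likelihood
        beta = [b + lr * g for b, g in zip(beta, grad)]
    return beta

# Perfectly separable toy data: first column is an intercept term
X = [[1.0, 2.0], [1.0, 1.0], [1.0, -1.0], [1.0, -2.0]]
y = [1, 1, 0, 0]
mle = fit_logistic(X, y, lam=0.0)    # slope keeps growing with more iterations
ridge = fit_logistic(X, y, lam=1.0)  # slope settles at a finite value
```

On separable data the unpenalized slope has no finite maximizer, so it simply grows for as long as the optimizer runs; the penalty gives the objective a finite optimum.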

129
A Bayesian Interpretation of Ridge Logistic
Regression

Suppose
we believe each β_j is a small value near 0
and encode this belief as separate Gaussian
probability distributions over values of β_j
Bayes' rule specifies our new (posterior) belief
about β after seeing the training data
Choosing the maximum a posteriori (MAP) value of β
gives the same result as ridge logistic regression
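
A short derivation of that equivalence, assuming independent priors β_j ~ N(0, τ²) and a ridge objective of the form log-likelihood minus λ Σ_j β_j² (the symbols τ² and λ are introduced here for illustration):

```latex
\ln p(\beta \mid \text{data})
  = \sum_i \ln P(y_i \mid \beta, x_i) + \sum_j \ln N(\beta_j \mid 0, \tau^2) + \text{const}
  = \sum_i \ln P(y_i \mid \beta, x_i) - \frac{1}{2\tau^2} \sum_j \beta_j^2 + \text{const}

% So the MAP estimate maximizes the ridge objective with
\lambda = \frac{1}{2\tau^2}
```

A smaller prior variance τ² therefore corresponds to a heavier penalty.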

130
Zhang & Oles Results

Reuters-21578 collection
Ridge logistic regression plus feature selection

131
Bayes!

MAP logistic regression with a Gaussian prior gives
state-of-the-art text classification
effectiveness
But the Bayesian framework is more flexible than SVM for
combining knowledge with data:
Feature selection
Stopwords, IDF
Domain knowledge
Number of classes
(and kernels.)

132
Bayesian Supervised Feature Selection

Results on ridge logistic regression for text
classification use ad hoc feature selection
Use of feature selection reflects a belief (before seeing
data) that many coefficients are 0
Put that belief into our prior on coefficients:
Laplace prior, i.e. lasso logistic regression
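
The link between the Laplace prior and the lasso penalty can be written as follows, assuming independent zero-mean Laplace priors with a common rate λ (the parameterization is an assumption made for illustration):

```latex
p(\beta_j) = \frac{\lambda}{2} \exp(-\lambda \lvert \beta_j \rvert)

\ln p(\beta \mid \text{data})
  = \sum_i \ln P(y_i \mid \beta, x_i) - \lambda \sum_j \lvert \beta_j \rvert + \text{const}
```

Because the L1 penalty is not differentiable at 0, the MAP estimate can set coefficients to exactly 0, so feature selection happens as part of fitting.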

133
Data Sets

ModApte subset of Reuters-21578
90 categories, 9,603 training docs, 18,978 features
Reuters RCV1-v2
103 categories, 23,149 training docs, 47,152 features
OHSUMED heart disease categories
77 categories, 83,944 training docs, 122,076 features
Cosine-normalized TF×IDF weights

134
Dense vs. Sparse Models (Macroaveraged F1,
Preliminary)
135
136
137
Bayesian Unsupervised Feature Selection and
Weighting

Stopwords: low-content words that are typically
discarded
Give them a prior with mean 0 and low variance
Inverse document frequency (IDF) weighting
Rare words are more likely to be content indicators
Make the variance of the prior inversely proportional to
frequency in the collection
Experiments in progress
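
A hypothetical sketch of how such per-word prior variances might be set up. The function name, the default constants, and the exact proportionality rule (base variance times N / document frequency) are assumptions for illustration, not the scheme used in the experiments mentioned above.

```python
def prior_variances(doc_freq, n_docs, stopwords, base_var=1.0, stop_var=1e-4):
    """Map each term to the variance of its zero-mean prior.

    Stopwords get a tight prior (coefficient pinned near 0); other terms get
    a variance inversely proportional to their document frequency.
    """
    variances = {}
    for term, df in doc_freq.items():
        if term in stopwords:
            variances[term] = stop_var
        else:
            variances[term] = base_var * n_docs / df
    return variances

# Toy document frequencies in a 100-document collection (made up)
doc_freq = {"the": 100, "oil": 5, "market": 50}
variances = prior_variances(doc_freq, n_docs=100, stopwords={"the"})
```

Rare words end up with loose priors, so the data can move their coefficients far from 0, while common words are held close to 0.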

138
Bayesian Use of Domain Knowledge

We often believe that certain words are positively
or negatively associated with a category
Prior mean can encode strength of positive or
negative association
Prior variance encodes confidence
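
A minimal sketch of MAP estimation with such a prior, assuming independent Gaussian priors β_j ~ N(m_j, v_j) and plain gradient ascent. The function names, step size, and toy example are assumptions for illustration.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def fit_map(X, y, means, variances, lr=0.05, n_iter=2000):
    """Gradient ascent on sum_i log P(y_i | x_i, beta) - sum_j (beta_j - m_j)**2 / (2 v_j).

    Labels y_i are 0/1. The step size lr must be small relative to the
    smallest prior variance, or the iteration will diverge.
    """
    beta = list(means)  # start at the prior mean
    for _ in range(n_iter):
        # Gradient of the log-prior pulls each coefficient toward its mean
        grad = [-(b - m) / v for b, m, v in zip(beta, means, variances)]
        for xi, yi in zip(X, y):
            p = sigmoid(sum(b * x for b, x in zip(beta, xi)))
            for j, x in enumerate(xi):
                grad[j] += (yi - p) * x
        beta = [b + lr * g for b, g in zip(beta, grad)]
    return beta

# One training example that contradicts a positive prior mean of 2.0
X, y = [[1.0]], [0]
confident = fit_map(X, y, means=[2.0], variances=[0.1])   # stays near the prior mean
tentative = fit_map(X, y, means=[2.0], variances=[10.0])  # moves toward the data
```

With no training data the estimate is exactly the prior mean; as the variance grows, the same data pull the coefficient further away from it.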

139
First Experiments

27 RCV1-v2 Region categories
CIA World Factbook entry for country
Give content words higher mean and/or variance
Only 10 training examples per category
Shows off prior knowledge
Limited data is often the case in applications

140
Results (Preliminary)
141
Polytomous Logistic Regression

Logistic regression trivially generalizes to
1-of-k problems
Cleaner than SVMs, error correcting codes, etc.
Laplace prior particularly cool here
Suppose there are 100 classes and a word that predicts class
17
The word gets used 100 times if we build 100 models, or
if we use polytomous with a Gaussian prior
With a Laplace prior and polytomous it's used only
once
Experiments in progress, particularly on author
id
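
A small sketch of the 1-of-k class probabilities, assuming the usual softmax form with one coefficient vector per class (the function name and toy numbers are illustrative). It also shows the sparsity point: a single nonzero coefficient, in the row of the predicted class, is enough for a word to favor that class.

```python
import math

def class_probabilities(x, betas):
    """Return P(y = k | x) for each class k, given one coefficient vector per class."""
    scores = [sum(b * xj for b, xj in zip(beta_k, x)) for beta_k in betas]
    m = max(scores)  # subtract the max score for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Three classes, two features; only class 1 has a nonzero weight on feature 0
betas = [[0.0, 0.0], [3.0, 0.0], [0.0, 0.0]]
probs = class_probabilities([1.0, 0.0], betas)
```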

142
1-of-K Sample Results: brittany-l
89 authors with at least 50 postings. 10,076
training documents, 3,322 test documents.
BMR-Laplace classification, default
hyperparameter
143
1-of-K Sample Results: brittany-l
4.6 million parameters
89 authors with at least 50 postings. 10,076
training documents, 3,322 test documents.
BMR-Laplace classification, default
hyperparameter
144
Future

Choose exact number of features desired
Faster training algorithm for polytomous
Currently using cyclic coordinate descent
Hierarchical models
Sharing strength among categories
Hierarchical relationships among features
Stemming, thesaurus classes, phrases, etc.
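
To illustrate the idea of cyclic coordinate descent with an L1 penalty, here is a sketch for the simpler squared-error lasso, where each one-dimensional update has a closed form (soft-thresholding). This is a swapped-in simplification for illustration, not the logistic-regression algorithm referred to above, and all names are hypothetical.

```python
def soft_threshold(z, t):
    """Shrink z toward 0 by t, returning exactly 0 when |z| <= t."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0

def lasso_cd(X, y, lam, n_sweeps=100):
    """Minimize 0.5 * sum_i (y_i - x_i . beta)**2 + lam * sum_j |beta_j|."""
    n, d = len(X), len(X[0])
    beta = [0.0] * d
    resid = list(y)  # residual y - X beta, with beta starting at 0
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) for j in range(d)]
    for _ in range(n_sweeps):
        for j in range(d):  # cycle through the coordinates in order
            if col_sq[j] == 0.0:
                continue
            # Inner product of column j with the residual that excludes feature j
            rho = sum(X[i][j] * (resid[i] + X[i][j] * beta[j]) for i in range(n))
            new_bj = soft_threshold(rho, lam) / col_sq[j]
            delta = new_bj - beta[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= X[i][j] * delta
                beta[j] = new_bj
    return beta

# Orthogonal toy design: feature 0 is strongly predictive, feature 1 barely
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]
y = [2.0, 0.1, 2.0, 0.1]
beta = lasso_cd(X, y, lam=0.5)
```

The weak feature's coefficient is set to exactly 0 rather than merely shrunk, which is the behaviour that makes L1 penalties perform variable selection.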

145
Text Categorization Summary

Conditional probability models (logistic,
probit, etc.)
As powerful as other discriminative models (SVM,
boosting, etc.)
Bayesian framework provides much richer ability
to insert task knowledge
Code: http://stat.rutgers.edu/madigan/BBR
Polytomous, domain-specific priors soon

146
For high-dimensional predictive modeling

Regularized regression methods are better than
ad-hoc variable selection algorithms
Regularized regression methods are more practical
than discrete model averaging (and probably make
more sense)
L1-regularization is the best way to do variable
selection
