# Model Selection and Related Topics - PowerPoint PPT Presentation

## Model Selection and Related Topics

Slides: 149
Provided by: statRu
Transcript and Presenter's Notes

Title: Model Selection and Related Topics

1
Overview Model Selection Theory Computing Continuous Model Selection
Model Selection and Related Topics
A mostly Bayesian perspective
2
Bayesian basics Bayes factors
Bayesian Basics
• The Bayesian approach to statistical inference
computes probability distributions for all
unknowns (model parameters, future observables,
etc.) conditional on the observed data
• Thus, denoting the unknowns by $\theta$, we compute
• $p(\theta \mid \text{data}) \propto p(\text{data} \mid \theta) \times p(\theta)$
• To play this game you need a prior, $p(\theta)$, and a
likelihood, $p(\text{data} \mid \theta)$.
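
The proportionality above can be illustrated numerically. The following is a minimal sketch, assuming a single unknown mean (called `theta` here), a normal likelihood with known standard deviation, and a flat prior over a grid of candidate values; the data, grid, and function name are all hypothetical.

```python
import math

def grid_posterior(data, grid, prior, sigma=1.0):
    """Posterior over a grid of candidate means for N(theta, sigma^2) data."""
    log_post = []
    for theta, p in zip(grid, prior):
        log_lik = sum(-0.5 * ((y - theta) / sigma) ** 2 for y in data)
        log_post.append(log_lik + math.log(p))  # log likelihood + log prior
    top = max(log_post)
    weights = [math.exp(v - top) for v in log_post]  # stabilized exponentiation
    total = sum(weights)
    return [w / total for w in weights]  # normalize: posterior sums to 1

grid = [i / 10 for i in range(-30, 31)]   # candidate values of theta
prior = [1 / len(grid)] * len(grid)       # flat prior (an assumption)
data = [0.8, 1.1, 1.3, 0.9]               # made-up observations
posterior = grid_posterior(data, grid, prior)
post_mean = sum(t * p for t, p in zip(grid, posterior))
```

With a flat prior the posterior mean lands close to the sample mean; a more informative prior would pull it toward the prior's center.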

3
Bayesian Priors (per D.M. Titterington)
4
Bayesian Basics
• This presentation will focus mostly on the
model $p(\text{data} \mid \theta)$
• An idealization of the probabilistic process by
which mother nature generates the data
• Frequently, a data analyst will entertain several
models for the data: $p(\text{data} \mid \theta_1, M_1)$,
$p(\text{data} \mid \theta_2, M_2)$, etc.
• This gives rise to the model selection problem
(including model composition)

5
Bayes Factors comparing two models/hypotheses
• Bayes Factors compare the posterior to prior odds
of one hypothesis to the posterior to prior odds
of another hypothesis
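
Written out, the comparison described above can be sketched as follows. The labels $M_0$ and $M_1$ and the subscript convention $BF_{10}$ are assumptions, since the slide's own formula did not survive extraction.

```latex
BF_{10}
  = \frac{p(M_1 \mid \text{data}) \,/\, p(M_1)}{p(M_0 \mid \text{data}) \,/\, p(M_0)}
  = \frac{p(\text{data} \mid M_1)}{p(\text{data} \mid M_0)}
```

The second equality follows from Bayes' rule: each posterior model probability equals $p(\text{data} \mid M_k)\, p(M_k) / p(\text{data})$, so the prior probabilities and $p(\text{data})$ cancel.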

6
Interpretation of Bayes Factors
• Jeffreys suggested the following scale for
interpreting the numerical value of a Bayes
Factor

7
Interpretation of Bayes Factors
• Note that the Bayes Factor involves model
probabilities, both prior, $p(M)$, and posterior,
$p(M \mid \text{data})$
• $p(M)$ is the probability that model $M$ generated
the data
• What if we don't believe that any of the models
generated the data?

8
Bayes Factors Example (Draper)
• Here is a density estimate for 100 univariate
observations, $y_1, \ldots, y_{100}$
• $M_0$: $y_i \sim N(\mu, \tau)$
• $M_1$: $y_i \sim t(\mu, \tau, \nu)$

9
Bayes Factors Example (cont.)
• Need to specify priors for everything

10
Bayes Factors Example (Draper)
• $M_0$: $y_i \sim N(\mu, \tau)$
• $M_1$: $y_i \sim t(\mu, \tau, \nu)$
• Interesting tidbit:
• posterior standard deviation of $\mu$ given $M_0$: 0.165
• posterior standard deviation of $\mu$ given $M_1$: 0.153
• (so model averaging can reduce the posterior
standard deviation)

11
Bayes Factors and Improper Priors
• Using improper parameter priors means the Bayes
factor contains a ratio of unknown constants
• Lots of work trying to get around this:
fractional Bayes factors, partial Bayes factors,
intrinsic Bayes factors, etc.
• Simpler solution: use proper priors

12
Bayes Factors and Model Probabilities
• Note that posterior model probabilities can
be derived from all pairwise Bayes factors and
pairwise prior odds
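
As a small illustration of this point, the sketch below turns Bayes factors against a reference model, together with prior model probabilities, into posterior model probabilities. The function name and the numbers are hypothetical.

```python
def posterior_model_probs(bayes_factors, prior_probs):
    """bayes_factors[k] = p(D | M_k) / p(D | M_0), so bayes_factors[0] is 1.

    Posterior probabilities are proportional to Bayes factor times prior.
    """
    weights = [bf * p for bf, p in zip(bayes_factors, prior_probs)]
    total = sum(weights)
    return [w / total for w in weights]

# Three candidate models with equal prior probability (made-up Bayes factors).
probs = posterior_model_probs([1.0, 3.0, 0.5], [1 / 3, 1 / 3, 1 / 3])
```

Only ratios matter here, so any common scaling of the Bayes factors or of the prior weights leaves the result unchanged.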

13
Bayesian model selection Model scores Variable
selection Model averaging
Bayesian Model Selection
• If we believe that one of the candidate models
generated the data, then the predictively optimal
model has highest posterior probability
• This is also true for variable selection with
standard linear models when $X^T X$ is diagonal, $\sigma^2$
is known, and suitable priors are used (Clyde and
Parmigiani, 1996)

14
The Median Probability Model (Barbieri and
Berger, 2004)
15
The Median Probability Model (Barbieri and
Berger, 2004)
16
Deviance Information Criterion (DIC)
• Deviance is a standard measure of model fit
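• Can summarize in two ways: at the posterior mean or
mode
• $D(\bar\theta) = -2 \log p(\text{data} \mid \bar\theta)$   (1)
• or by averaging over the posterior
• $\bar{D} = E_{\theta \mid \text{data}}[-2 \log p(\text{data} \mid \theta)]$   (2)
• (2) will be bigger (i.e., worse) than (1)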

17
Deviance Information Criterion (DIC)
• $p_D$ = (2) − (1), the posterior mean deviance minus the deviance at the posterior mean, is a measure of model complexity.
• In the normal linear model pD(1) equals the
number of parameters
• More generally pD(1) equals the number of
unconstrained parameters
• DIC = (1) + 2 $p_D$, i.e. the deviance at the posterior mean plus twice the complexity
• Approximately optimizes predictive log loss
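
The two deviance summaries and the resulting DIC can be computed directly from posterior draws. This is a minimal sketch, assuming i.i.d. normal data with known unit variance and a flat prior on the mean (so the posterior is available in closed form and can be sampled directly); the data and function names are hypothetical.

```python
import math
import random

def deviance(theta, data, sigma=1.0):
    """Deviance -2 * log p(data | theta) for i.i.d. N(theta, sigma^2) data."""
    n = len(data)
    log_lik = (-0.5 * n * math.log(2 * math.pi * sigma ** 2)
               - sum((y - theta) ** 2 for y in data) / (2 * sigma ** 2))
    return -2.0 * log_lik

def dic(draws, data):
    """Return (DIC, p_D) from posterior draws of theta."""
    theta_bar = sum(draws) / len(draws)
    d_at_mean = deviance(theta_bar, data)                        # summary (1)
    d_bar = sum(deviance(t, data) for t in draws) / len(draws)   # summary (2)
    p_d = d_bar - d_at_mean                                      # complexity
    return d_at_mean + 2.0 * p_d, p_d

random.seed(1)
data = [0.8, 1.1, 1.3, 0.9, 1.4, 0.7, 1.0, 1.2]   # made-up observations
n = len(data)
y_bar = sum(data) / n
# Assumption: flat prior, so the posterior of theta is N(y_bar, 1/n).
draws = [random.gauss(y_bar, 1 / math.sqrt(n)) for _ in range(20000)]
dic_value, p_d = dic(draws, data)
```

In this one-parameter model `p_d` comes out close to 1, matching the claim that the complexity term counts parameters in the normal linear model.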

18
Other Selection Criteria
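• The training error rate $\overline{\text{err}}$
• will usually be less than the true error
• Typically work with error estimates of the form
$\widehat{\text{Err}} = \overline{\text{err}} + \hat{\omega}$
• where $\hat{\omega}$ is an estimate of the optimism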

19
Specific Selection Criteria
(squared error loss)
20
Selection Criteria - Theory
• BIC is consistent when the true model is fixed
• AIC is consistent if the dimensionality of the
true model increases with N at an appropriate
rate
• For standard linear models with known variance
AIC and Cp are essentially equivalent
• Folklore is that AIC tends to overfit
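
To make the difference between the two penalties concrete, here is a small sketch using the common Gaussian-linear-model forms of AIC and BIC (up to a shared additive constant). These forms are assumed, since the slide's own formulas were lost, and the residual sums of squares are made up.

```python
import math

def aic_bic(rss, n, k):
    """AIC and BIC, up to a shared constant, for a Gaussian linear model with
    n observations, k fitted coefficients, and residual sum of squares rss."""
    fit = n * math.log(rss / n)
    return fit + 2 * k, fit + k * math.log(n)

# Hypothetical fits on n = 100 points: a 3-coefficient and a 6-coefficient model.
aic_small, bic_small = aic_bic(rss=52.0, n=100, k=3)
aic_big, bic_big = aic_bic(rss=48.0, n=100, k=6)
```

With these numbers AIC prefers the larger model while BIC prefers the smaller one, because BIC charges $\log n$ per coefficient rather than 2.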

21
Cross-validation
• Since we don't usually believe that one of the
candidate models generated the data, and
predictive accuracy on future data is key, many
authors argue in favor of cross-validation
• For example (Bernardo and Smith, 1994, 6.1.6)
select the model that maximizes
$\frac{1}{k} \sum_{j=1}^{k} \log p(x_j \mid x_{n-1}(j), M)$
• where $x_{n-1}(j)$ represents the data with
observation $x_j$ removed and $x_1, \ldots, x_k$ is a random
sample from the data
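
The leave-one-out idea can be sketched in a few lines. This example swaps in squared prediction error for the log predictive density used above (a simplification, named here plainly) and compares a constant-mean model with a straight-line model on made-up data; all names are hypothetical.

```python
def fit_constant(xs, ys):
    """Model A: predict the training mean everywhere."""
    mean = sum(ys) / len(ys)
    return lambda x: mean

def fit_line(xs, ys):
    """Model B: simple least-squares line."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
    return lambda x: y_bar + slope * (x - x_bar)

def loo_cv_mse(fit, xs, ys):
    """Refit with observation j removed, then score the prediction for j."""
    errors = []
    for j in range(len(xs)):
        predictor = fit(xs[:j] + xs[j + 1:], ys[:j] + ys[j + 1:])
        errors.append((ys[j] - predictor(xs[j])) ** 2)
    return sum(errors) / len(errors)

xs = [float(i) for i in range(10)]
noise = [0.3, -0.2, 0.1, -0.4, 0.2, -0.1, 0.4, -0.3, 0.0, 0.1]
ys = [2 * x + 1 + e for x, e in zip(xs, noise)]   # made-up data
cv_const = loo_cv_mse(fit_constant, xs, ys)
cv_line = loo_cv_mse(fit_line, xs, ys)
```

The model with the smaller cross-validated error (here the line) would be selected.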

22
Cross-validation
• Cross-validation gives a slightly biased estimate
of future accuracy because it does not use all
the data
• Burman (1989) provides a bias correction

23
Variable Selection
• Important special case of model selection
• Which subset of $X_1, \ldots, X_d$ to use as predictors of
a response variable $Y$?
• $2^d$ possible models. For $d = 30$, there are about $10^9$
models. For $d = 50$, there are $> 10^{15}$

24
Variable Selection for Linear Models
Here the X's might be:
• Raw predictor variables (continuous or
coded-categorical)
• Transformed predictors ($X_4 = \log X_3$)
• Basis expansions ($X_4 = X_3^2$, $X_5 = X_3^3$, etc.)
• Interactions ($X_4 = X_2 X_3$)

Popular choice for estimation is least squares
25
Variable Selection for Linear Models
• Standard "all-subsets" finds the subset of size
$k$, $k = 1, \ldots, p$, that minimizes RSS

• Choice of subset size requires a tradeoff: AIC,
BIC, marginal likelihood, cross-validation, etc.
• "Leaps and bounds" is an efficient algorithm to
do all-subsets up to about 40 variables
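
For a handful of predictors, all-subsets can be done by brute force. The sketch below is not leaps and bounds; it simply enumerates every subset and solves the normal equations with a small Gaussian-elimination helper. The data and all function names are hypothetical.

```python
from itertools import combinations

def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[p] = m[p], m[c]
        for r in range(c + 1, n):
            f = m[r][c] / m[c][c]
            for k in range(c, n + 1):
                m[r][k] -= f * m[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][k] * x[k] for k in range(r + 1, n))) / m[r][r]
    return x

def subset_rss(cols, y):
    """RSS of the least-squares fit of y on an intercept plus the given columns."""
    n = len(y)
    design = [[1.0] + [c[i] for c in cols] for i in range(n)]
    p = len(design[0])
    xtx = [[sum(row[a] * row[b] for row in design) for b in range(p)] for a in range(p)]
    xty = [sum(design[i][a] * y[i] for i in range(n)) for a in range(p)]
    beta = solve(xtx, xty)
    return sum((y[i] - sum(b * v for b, v in zip(beta, design[i]))) ** 2 for i in range(n))

def best_subsets(x_cols, y):
    """For each size k, the (RSS, subset) pair with the smallest RSS."""
    d = len(x_cols)
    return {k: min((subset_rss([x_cols[j] for j in s], y), s)
                   for s in combinations(range(d), k))
            for k in range(1, d + 1)}

x0 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
x1 = [2.0, 1.0, 4.0, 3.0, 6.0, 5.0, 8.0, 7.0]
x2 = [1.0, 0.0, 0.0, 1.0, 1.0, 0.0, 1.0, 0.0]
eps = [0.1, -0.1, 0.05, -0.05, 0.1, -0.1, 0.05, -0.05]
y = [3 * a - 2 * c + e for a, c, e in zip(x0, x2, eps)]   # made-up response
best = best_subsets([x0, x1, x2], y)
```

The cost grows as $2^d$, which is why exact enumeration is only practical for small numbers of predictors; the RSS-minimizing subset of each size still has to be compared across sizes with one of the criteria listed above.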

26
Bayesian Variable Selection
• Two key challenges
later)
• - Choosing a $p(M) \equiv p(\gamma)$ where $\gamma$ indexes models
• Many applications use $p(M) \propto 1$ but this induces a
binomial distribution over model size
• Denison et al (1998) use a truncated Poisson
prior distribution for model size

27
Selection Bias
• Selection bias is a significant unresolved issue
• Searching model space to find the best model
tends to overfit the data
• This holds even when using the close-to-unbiased
estimate of predictive performance that
cross-validation provides

28
Vehtari and Lampinen example
• Data
• Model

29
Vehtari and Lampinen example
30
Vehtari and Lampinen's Solution
• Select the simplest model that gives a predictive
distribution that is close to the BMA
predictive distribution
• Not obvious how to conduct this search in
high-dimensional problems

31
Cross-validation and model complexity

One standard error rule: pick the simplest model
within one standard error of the minimum
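
The rule is easy to state in code. In this sketch the cross-validation means and standard errors are made-up numbers, and the function name is hypothetical.

```python
def one_se_rule(complexities, cv_means, cv_ses):
    """Index of the simplest model whose CV error is within one standard
    error of the minimum CV error."""
    best = min(range(len(cv_means)), key=lambda i: cv_means[i])
    threshold = cv_means[best] + cv_ses[best]
    eligible = [i for i in range(len(cv_means)) if cv_means[i] <= threshold]
    return min(eligible, key=lambda i: complexities[i])

complexities = [1, 2, 3, 4, 5]               # e.g. number of predictors
cv_means = [1.00, 0.62, 0.55, 0.53, 0.56]    # hypothetical CV error estimates
cv_ses = [0.10, 0.08, 0.07, 0.07, 0.08]      # hypothetical standard errors
chosen = one_se_rule(complexities, cv_means, cv_ses)
```

Here the minimum is at complexity 4, but complexity 3 is within one standard error of it, so the rule picks the simpler model.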
32
Post-Model Selection Statistical Inference
• Conducting a data-driven model search and then
proceeding as if the search never took place
leads to biased and overconfident inferences
• Some non-Bayesian work on adjustment for model
selection (e.g., current issue of JASA)

34
Bayesian Model Averaging
• If we believe that one of the candidate models
generated the data, then the predictively optimal
strategy is to average over all the models.
• If $Q$ is the inferential target, Bayesian Model
Averaging (BMA) computes
$p(Q \mid D) = \sum_k p(Q \mid D, M_k)\, p(M_k \mid D)$
• Substantial empirical evidence that BMA provides
better prediction than model selection

35
Model probabilities BIC Variable selection Model
averaging
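Laplace's Method for $p(D \mid M) = \int p(D \mid \theta, M)\, p(\theta \mid M)\, d\theta$
Let $l(\theta) = \frac{1}{n} \log\{p(D \mid \theta, M)\, p(\theta \mid M)\}$
(i.e., the log of the integrand divided by $n$)
then
$p(D \mid M) = \int \exp\{n\, l(\theta)\}\, d\theta$
and
$p(D \mid M) \approx (2\pi)^{d/2}\, |\Sigma|^{1/2}\, n^{-d/2} \exp\{n\, l(\tilde\theta)\}$
where $\Sigma = [-\nabla^2 l(\tilde\theta)]^{-1}$, $d$ is the dimension of $\theta$, and
$\tilde\theta$ is the posterior mode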
36
• Tierney & Kadane (1986, JASA) show the
approximation is $O(n^{-1})$
• Using the MLE instead of the posterior mode is
also $O(n^{-1})$
• Using the expected information matrix in $\Sigma$ (the inverse-Hessian term) is
$O(n^{-1/2})$ but convenient since often computed by
standard software
• Raftery (1993) suggested approximating the posterior mode by a
single Newton step starting at the MLE
• Note the prior is explicit in these approximations
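
A one-dimensional sketch of the Laplace approximation follows. It assumes a toy model (unit-variance normal data with a zero-mean normal prior on the mean) chosen because the integrand is exactly Gaussian, so the approximation can be checked against the closed-form marginal likelihood. The function names, step size, and data are hypothetical.

```python
import math

def log_joint(theta, data, tau):
    """log p(D | theta) + log p(theta): N(theta, 1) data, N(0, tau^2) prior."""
    log_lik = sum(-0.5 * math.log(2 * math.pi) - 0.5 * (y - theta) ** 2 for y in data)
    log_prior = -0.5 * math.log(2 * math.pi * tau ** 2) - 0.5 * theta ** 2 / tau ** 2
    return log_lik + log_prior

def laplace_log_evidence(data, tau, theta0=0.0, h=1e-3, steps=20):
    """log p(D) ~ log_joint(mode) + 0.5*log(2*pi) - 0.5*log(-second derivative)."""
    f = lambda t: log_joint(t, data, tau)
    theta = theta0
    for _ in range(steps):  # Newton iterations with finite-difference derivatives
        grad = (f(theta + h) - f(theta - h)) / (2 * h)
        curv = (f(theta + h) - 2 * f(theta) + f(theta - h)) / h ** 2
        theta -= grad / curv
    curv = (f(theta + h) - 2 * f(theta) + f(theta - h)) / h ** 2
    return f(theta) + 0.5 * math.log(2 * math.pi) - 0.5 * math.log(-curv)

def exact_log_evidence(data, tau):
    """Closed-form log marginal likelihood for the same toy model."""
    n, s, ss = len(data), sum(data), sum(y * y for y in data)
    return (-0.5 * n * math.log(2 * math.pi) - 0.5 * math.log(1 + n * tau ** 2)
            - 0.5 * (ss - tau ** 2 * s ** 2 / (1 + n * tau ** 2)))

data = [0.8, 1.1, 1.3, 0.9]   # made-up observations
approx = laplace_log_evidence(data, tau=1.0)
exact = exact_log_evidence(data, tau=1.0)
```

For non-Gaussian integrands the two numbers would differ, with the error shrinking as the sample size grows.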

37
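Monte Carlo Estimates of $p(D \mid M)$
Draw iid $\theta_1, \ldots, \theta_m$ from $p(\theta)$ and average the likelihoods:
$\hat{p}(D \mid M) = \frac{1}{m} \sum_{i=1}^{m} p(D \mid \theta_i)$
In practice has large variance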
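
The simple estimator that averages the likelihood over prior draws can be sketched as follows. The toy model (unit-variance normal data, zero-mean normal prior) is an assumption chosen so that the estimate can be compared with a closed-form answer; names and data are hypothetical.

```python
import math
import random

def log_lik(theta, data):
    """Log-likelihood of i.i.d. N(theta, 1) data."""
    return sum(-0.5 * math.log(2 * math.pi) - 0.5 * (y - theta) ** 2 for y in data)

def mc_log_evidence(data, tau, m, rng):
    """log of (1/m) * sum_i p(D | theta_i), theta_i drawn i.i.d. from N(0, tau^2)."""
    logs = [log_lik(rng.gauss(0.0, tau), data) for _ in range(m)]
    top = max(logs)  # subtract the maximum before exponentiating, for stability
    return top + math.log(sum(math.exp(v - top) for v in logs) / m)

def exact_log_evidence(data, tau):
    """Closed-form log marginal likelihood for the same toy model."""
    n, s, ss = len(data), sum(data), sum(y * y for y in data)
    return (-0.5 * n * math.log(2 * math.pi) - 0.5 * math.log(1 + n * tau ** 2)
            - 0.5 * (ss - tau ** 2 * s ** 2 / (1 + n * tau ** 2)))

data = [0.8, 1.1, 1.3, 0.9]   # made-up observations
estimate = mc_log_evidence(data, tau=1.0, m=20000, rng=random.Random(0))
exact = exact_log_evidence(data, tau=1.0)
```

The estimate is reasonable here because the prior and likelihood overlap well. When the likelihood is concentrated relative to the prior, most draws contribute almost nothing and the variance becomes large.
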
38
Better Monte Carlo Estimates of $p(D \mid M)$
Draw iid $\theta_1, \ldots, \theta_m$ from $p(\theta \mid D)$
Importance Sampling
39
• Newton and Raftery's Harmonic Mean Estimator
• Unstable in practice and needs modification

40
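$p(D \mid M)$ from Gibbs sampler output (Chib's method)
First note the following identity (for any $\theta^*$):
$p(D) = \frac{p(D \mid \theta^*)\, p(\theta^*)}{p(\theta^* \mid D)}$
$p(D \mid \theta^*)$ and $p(\theta^*)$ are usually easy to evaluate.
Suppose we decompose $\theta$ into $(\theta_1, \theta_2)$ such that
$p(\theta_1 \mid D, \theta_2)$ and $p(\theta_2 \mid D, \theta_1)$ are available in
closed-form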
41
The Gibbs sampler gives (dependent) draws from
$p(\theta_1, \theta_2 \mid D)$ and hence marginally from $p(\theta_2 \mid D)$
Rao-Blackwellization
42
OK
OK
?
To get these draws, continue the Gibbs sampler
sampling in turn from
and
43
$p(D \mid M)$ from Metropolis sampler output (Chib &
Jeliazkov)
44
$E_1$ with respect to $\theta \mid y$
$E_2$ with respect to $q(\theta^*, \theta)$
45
Savage-Dickey Density Ratio
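• Suppose $M_0$ simplifies $M_1$ by setting one parameter
(say $\theta_1$) to some constant (typically zero)
• If $p_1(\theta_2 \mid \theta_1 = 0) = p_0(\theta_2)$ then

$$\frac{p(\text{data} \mid M_0)}{p(\text{data} \mid M_1)} = \frac{p(\theta_1 = 0 \mid M_1, \text{data})}{p(\theta_1 = 0 \mid M_1)}$$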
46
Bayesian Information Criterion (BIC)
($S_L$ is the negative log-likelihood)
• BIC is an $O(1)$ approximation to $p(D \mid M)$
• Circumvents explicit prior
• Approximation is $O(n^{-1/2})$ for a class of priors
called "unit information priors."
• No free lunch (Weakliem (1998) example)

50
Computing Variable Selection via Stepwise Methods
• Efroymson's 1960 algorithm is still the most widely
used

51
Efroymson
• F-to-Enter
• F-to-Remove
• Guaranteed to converge
• Not guaranteed to converge to the right model

Distribution not even remotely like F
52
Trouble
• Y X1 X2
• Y almost orthogonal to X1 and X2
• Forward selection and Efroymson pick X3 alone

53
More Trouble
• Berk Example with 4 variables
• The forward and backward sequence is ($X_1$; $X_1 X_2$;
$X_1 X_2 X_4$)
• The $R^2$ for $X_1 X_2$ is 0.015

54
Even More Trouble
• Detroit example, $N = 13$, $d = 11$
• First variable selected in forward selection is
the first variable eliminated by backward
elimination
• Best subset of size 3 gives RSS of 6.8
• Forward's best set of 3 has RSS 21.2;
backward's gets 23.5

55
Alternatives to all-subsets
• Simulated Annealing, Tabu Search, etc.

56
MCMC for Bayesian Variable Selection (Ioannis
Ntzoufras)
http://www.ba.aegean.gr/ntzoufras/courses/bugs2/handouts/modelsel/4_1_tutorial_handouts.pdf
57
Why MCMC?
58
Reversible Jump MCMC
59
Reversible Jump MCMC
60
Stochastic Search Variable Selection (George &
McCulloch)
61
SSVS Procedure
62
Estimating Spina Bifida Numbers with
Capture-Recapture Methods
(Regal & Hook, 1991; Madigan & York, 1996)

| Model | Pr(Model) | N | 95% HPD |
|---|---|---|---|
| B D R | 0.37 | 731 | (701, 767) |
| B D R | 0.30 | 756 | (714, 811) |
| B R D | 0.28 | 712 | (681, 751) |
| BMA | | 728 | (682, 797) |
63
Spina Bifida 95% HPDs
[Figure: 95% HPD intervals for M1, M2, M3, and BMA]
72
Model Uncertainty
• Posterior Variance = Within-Model
Variance + Between-Model Variance
• Data-driven model selection is risky: "Part of
the evidence is spent specifying the model" (Leamer,
1978)
• Model-based inferences can be over-precise
• Model-based predictions can be badly calibrated
• Draper (1995), Chatfield (1995)

Bayesian Model Averaging (BMA) can help
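
The variance decomposition on this slide can be sketched numerically. The model probabilities, posterior means, and posterior variances below are made-up numbers, and the function name is hypothetical.

```python
def bma_moments(probs, means, variances):
    """BMA posterior mean plus the within- and between-model variance parts."""
    mean = sum(p * m for p, m in zip(probs, means))
    within = sum(p * v for p, v in zip(probs, variances))
    between = sum(p * (m - mean) ** 2 for p, m in zip(probs, means))
    return mean, within, between

# Two equally probable models that disagree about the quantity of interest.
mean, within, between = bma_moments([0.5, 0.5], [0.0, 2.0], [1.0, 1.0])
total_variance = within + between
```

Conditioning on a single selected model would report only that model's own variance and ignore the between-model term, which is one way model-based inferences become over-precise.
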
73
Bayesian Model Averaging
76
Out-of-Sample Predictive Performance
| Data | Model Class | Average Improvement in Predictive Probability over MAP Model |
|---|---|---|
| 1. Coronary Heart Disease | Decomposable UDGs | 2.2 |
| 2. Women and Mathematics | Decomposable UDGs | 0.6 |
| 3. Scrotal Swellings | Decomposable UDGs | 5.1 |
| 4. Crime and Punishment | Linear Regression | 61.3 |
| 5. Lung Cancer | Exponential Regression | 1.8 |
| 6. Cirrhosis | Cox Survival Regression | 1.8 |
| 7. Coronary Heart Disease | Essential graphs | 1.5 |
| 8. Women and Mathematics | Essential graphs | 3.0 |
| 9. Stroke | Cox Survival Regression | 15.0 |

77
BMA Computing
• Occam's Window: find parsimonious models with
large $\Pr(M \mid D)$ and average over those.
• Importance Sampling (Clyde et al., JASA, 1996)
• MCMCMC: use MCMC to draw from $\Pr(M \mid D)$.
78
Gibbs MC3
e.g. Undirected graphical models
SSVS (George and McCulloch, 1993)
• Choose vi, vj at random (or systematically) and
draw from

79
Metropolis MC3
e.g. SVO: Regression Outliers (Hoeting,
Possible Predictors: a, b, c, d
Possible Outliers: 13, 20, 40
Current model: (b,c)(13,20)
Candidate Models: (b)(13,20), (b,c)(13), (c)(13,20), (b,c)(20), (b,c,d)(13,20), (b,c)(13,20,40), (a,b,c)(13,20)
Accept the Candidate Model with Prob
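
The acceptance formula did not survive extraction. The following is a minimal sketch of a Metropolis walk over models, assuming the standard rule for a symmetric proposal (accept with probability equal to the smaller of 1 and the ratio of posterior scores) and considering only inclusion of predictors, not outliers. The scoring function, weights, and names are hypothetical.

```python
import math
import random

def mc3(log_score, n_vars, steps, rng):
    """Metropolis walk over subsets of n_vars predictors.

    Each step proposes flipping one inclusion indicator chosen at random
    (a symmetric proposal) and accepts with probability min(1, score ratio).
    Returns the fraction of steps spent in each visited model.
    """
    current = (0,) * n_vars
    visits = {}
    for _ in range(steps):
        j = rng.randrange(n_vars)
        candidate = current[:j] + (1 - current[j],) + current[j + 1:]
        log_ratio = log_score(candidate) - log_score(current)
        if rng.random() < math.exp(min(0.0, log_ratio)):
            current = candidate
        visits[current] = visits.get(current, 0) + 1
    return {model: count / steps for model, count in visits.items()}

WEIGHTS = [2.0, -1.0, 0.0]   # hypothetical log posterior odds for each predictor

def log_score(model):
    """Unnormalized log posterior of a model (made up for illustration)."""
    return sum(w for w, included in zip(WEIGHTS, model) if included)

freqs = mc3(log_score, n_vars=3, steps=50000, rng=random.Random(0))
incl_0 = sum(f for model, f in freqs.items() if model[0] == 1)
```

The visit frequencies estimate posterior model probabilities, and summing them over models that contain a predictor gives its posterior inclusion probability.
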
80
Augmented MC3
e.g. Bayesian Networks
• A total ordering T of V is said to be compatible
with a directed graph M, if the orientation of
the arrows in M is consistent with T.
• Draw from $\Pr(T, M \mid D)$
• $\Pr(M \mid T, D)$, $\Pr(T \mid M, D)$
• Uniform on compatible T's
• Metropolis accept/reject
• Generate M by adding or deleting an edge from M
consistent with T
• Metropolis accept/reject

81
More Augmented MC3
• Draw from $\Pr(Z, M \mid D)$
• $\Pr(M \mid Z, D)$, $\Pr(Z \mid M, D)$

e.g. Double Sampling Missing Data (York et al.,
1994)
$\Pr(Z, \theta_M \mid M, D)$
$\Pr(Z \mid \theta_M, M, D)$, $\Pr(\theta_M \mid Z, M, D)$
Reversible Jump MCMC (Green, 1995)
82
Linear Regression SVT, SVO, SVOT
• Normal-gamma conjugate priors
• Box-Cox and ACE Transformations
• Outliers (pre-process with LMS
regression)

93
$\Pr(\beta_i \neq 0 \mid D)$
94
Generalized Linear Models
Raftery (1996)
• Laplace Approximation

is minus the inverse Hessian of
evaluated at
• Idea: approximate by one Newton step starting
from
• approximation using only GLIM output

95
Prior Distribution on Models
• Elicit Imaginary Data from the Expert
• Update pre-prior using imaginary data to get the
real prior, Pr(M)
• Provided improved predictive performance in a
particular medical example
• Ibrahim and Laud (1994)

96
Predicting Strokes
• Stroke is the third leading cause of death
• Gorelick (1995) estimated that 80% of strokes are
preventable
• Cardiovascular Health Study (CHS)
• Ongoing; started 1989; 5,201 participants in four counties
• Age 65+: risk factors different for older people?
• 172/4501 strokes

97
Measured Covariates
Follow-up ranged from 3.5 to 4.5 years (Average
4.1)
98
Cox Model
Estimation usually based on the Partial
Likelihood
(Taplin, 1993, Draper, 1995)
99
Finding the Models in Occam's Window
• Need models with posterior probability within a
factor of C of MAP model
• Approximate Leaps and Bounds
• Furnival and Wilson (1974); Lawless and Singhal
(1978)
• Finds top q models of each size
• Find models within factor of $C^2$
• Compute Exact BIC for these models

100
Picture of Probs vs P-values
102
Predictive Performance
| Assigned Risk Group | Model Averaging | Stroke | Stepwise | Stroke | Top PMP | Stroke |
|---|---|---|---|---|---|---|
| Low | 751 | 7 | 750 | 8 | 724 | 10 |
| Medium | 770 | 24 | 799 | 27 | 801 | 28 |
| High | 645 | | 617 | 51 | 641 | 48 |
104
Ridge Lasso LARS Case Study
Shrinkage Methods
• Subset selection is a discrete process:
individual variables are either in or out.
Combinatorial nightmare.
• This method can have high variance: a different
dataset from the same source can result in a
totally different model
• Shrinkage methods allow a variable to be partly
included in the model. That is, the variable is
included but with a shrunken coefficient

105
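Ridge Regression
$\hat\beta^{\text{ridge}} = \arg\min_\beta \sum_{i=1}^{N} \left(y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j\right)^2$
subject to $\sum_{j=1}^{p} \beta_j^2 \le s$
Equivalently
$\hat\beta^{\text{ridge}} = \arg\min_\beta \left\{ \sum_{i=1}^{N} \left(y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j\right)^2 + \lambda \sum_{j=1}^{p} \beta_j^2 \right\}$
This leads to $\hat\beta^{\text{ridge}} = (X^T X + \lambda I)^{-1} X^T y$
Choose $\lambda$ by cross-validation.
Works even when $X^T X$ is singular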
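
The point about singular designs can be checked on a tiny case. This sketch assumes two predictors, no intercept, and the usual closed-form ridge solution, solved by Cramer's rule; the second predictor is an exact copy of the first, so ordinary least squares has no unique solution. Names and data are hypothetical.

```python
def ridge_2d(x1, x2, y, lam):
    """Ridge coefficients for two predictors (no intercept): solves
    (X^T X + lam * I) beta = X^T y by Cramer's rule."""
    a = sum(u * u for u in x1) + lam
    b = sum(u * v for u, v in zip(x1, x2))
    d = sum(v * v for v in x2) + lam
    r1 = sum(u * w for u, w in zip(x1, y))
    r2 = sum(v * w for v, w in zip(x2, y))
    det = a * d - b * b   # strictly positive whenever lam > 0
    return (d * r1 - b * r2) / det, (a * r2 - b * r1) / det

x1 = [1.0, 2.0, 3.0]
x2 = [1.0, 2.0, 3.0]      # exact copy of x1, so X^T X is singular
y = [2.0, 4.0, 6.0]       # y = 2 * x1
b1, b2 = ridge_2d(x1, x2, y, lam=1.0)
```

The penalty splits the effect evenly between the two identical predictors and shrinks their sum slightly below the unpenalized value of 2.
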
107
Ridge Regression = Bayesian MAP Regression
108
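Least Absolute Shrinkage and Selection Operator
(LASSO)
$\hat\beta^{\text{lasso}} = \arg\min_\beta \sum_{i=1}^{N} \left(y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j\right)^2$
subject to $\sum_{j=1}^{p} |\beta_j| \le s$
Quadratic programming algorithm needed to solve
for the parameter estimates
With penalty $\sum_j |\beta_j|^q$: $q = 0$ variable selection, $q = 1$ lasso, $q = 2$ ridge. Learn $q$?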
112
Ridge & LASSO - Theory
• Lasso estimates are consistent
• But, Lasso does not have the oracle property.
That is, it does not deliver the correct model
with probability 1
• Fan & Li's SCAD penalty function has the oracle
property

113
LARS
• New geometrical insights into Lasso and
Stagewise
• Leads to a highly efficient Lasso algorithm for
linear regression

114
LARS
• Find the predictor $x_j$ most correlated with $y$
• Increase $\beta_j$ in the direction of the sign of its
correlation with $y$. Take residuals $r = y - \hat{y}$ along
the way. Stop when some other predictor $x_k$ has as
much correlation with $r$ as $x_j$ has
• Increase $(\beta_j, \beta_k)$ in their joint least squares
direction until some other predictor $x_m$ has as
much correlation with the residual $r$.
• Continue until all predictors are in the model

116
Statistical Analysis of Textual Data
• Statistical text analysis has a long history in
literary analysis and in solving disputed
authorship problems
• First (?) is Thomas C. Mendenhall in 1887

117
• Mendenhall was Professor of Physics at Ohio State
and at University of Tokyo, Superintendent of the
USA Coast and Geodetic Survey, and later,
President of Worcester Polytechnic Institute

118
$\chi^2 = 127.2$, df = 12
119
• Used Naïve Bayes with Poisson and Negative
Binomial model
• Out-of-sample predictive performance

120
today
• Statistical methods routinely used for textual
analyses of all kinds
• Machine translation, part-of-speech tagging,
categorization, etc.
• Not reported in the statistical literature (no
statisticians?)

121
Text categorization
• Automatic assignment of documents with respect to
manually defined set of categories
• Applications: automated indexing, spam filtering,
content filters, medical coding, CRM, essay
• Dominant technology is supervised machine
learning
• Manually classify some documents, then learn a
classification rule from them (possibly with
manual intervention)

122
Document Representation
• Documents are usually represented as a bag of words
• The xi's might be 0/1, counts, or weights (e.g.
TF-IDF, LSI)
• Many text processing choices: stopwords,
stemming, phrases, synonyms, NLP, etc.
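
As a rough illustration of the bag-of-words representation, the sketch below builds cosine-normalized TF×IDF weights for a few toy documents. The function name `tfidf_vectors`, the whitespace tokenizer, and the particular weighting variant, (1 + log tf) × log(N / df), are assumptions made for this example; the slides do not say which variant was used.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one cosine-normalized TF x IDF dict per document (illustrative)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # Assumed variant: (1 + log tf) * log(N / df).
        weights = {t: (1.0 + math.log(c)) * math.log(n_docs / df[t])
                   for t, c in counts.items()}
        # Cosine normalization: scale the vector to unit Euclidean length.
        norm = math.sqrt(sum(w * w for w in weights.values()))
        if norm > 0:
            weights = {t: w / norm for t, w in weights.items()}
        vectors.append(weights)
    return vectors

docs = ["oil prices rise", "oil exports fall", "wheat prices fall"]
vecs = tfidf_vectors(docs)
```

Each entry of `vecs` is a sparse vector keyed by term; terms that occur in fewer documents receive larger weights.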

123
Classifier Representation
• For instance, a linear classifier
• xi's derived from text of document
• yi indicates whether to put document in category
• βj are parameters chosen to give good
classification effectiveness
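
The classifier formula on this slide did not survive extraction. One common way to write a linear classifier in this notation is sketched below; the feature index j = 1, ..., d and the decision threshold of 0 are assumptions, and any intercept is taken to be absorbed into the features.

```latex
\hat{y}_i =
\begin{cases}
1 & \text{if } \sum_{j=1}^{d} \beta_j x_{ij} > 0 \\
0 & \text{otherwise}
\end{cases}
```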

124
Logistic Regression Model
• Linear model for log odds of category membership
• Equivalent to
• Conditional probability model
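
The two formulas on this slide were lost in extraction. The standard logistic regression identities they most likely correspond to are given below, written with the same symbols (x, y, β) used on the neighboring slides; the exact layout of the original is an assumption.

```latex
\ln \frac{P(y_i = 1 \mid x_i)}{1 - P(y_i = 1 \mid x_i)} = \sum_{j} \beta_j x_{ij}
\quad\Longleftrightarrow\quad
P(y_i = 1 \mid x_i) =
\frac{\exp\bigl(\sum_{j} \beta_j x_{ij}\bigr)}{1 + \exp\bigl(\sum_{j} \beta_j x_{ij}\bigr)}
```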

125
Logistic Regression as a Linear Classifier
• If estimated probability of category membership
is greater than p, assign document to category
• Choose p to optimize expected value of your
effectiveness measure
• Can change measure w/o changing model
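
A minimal sketch of choosing the threshold p after the model is fitted. It assumes F1 is the effectiveness measure and that p is picked by scanning a grid on held-out data; the function names and the toy probabilities are hypothetical.

```python
def f1_at_threshold(probs, labels, p):
    """F1 when a document is assigned to the category iff its probability > p."""
    tp = sum(1 for q, y in zip(probs, labels) if q > p and y == 1)
    fp = sum(1 for q, y in zip(probs, labels) if q > p and y == 0)
    fn = sum(1 for q, y in zip(probs, labels) if q <= p and y == 1)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def best_threshold(probs, labels, candidates):
    """Return the candidate threshold with the highest F1 (first one on ties)."""
    return max(candidates, key=lambda p: f1_at_threshold(probs, labels, p))

# Toy held-out set: estimated probabilities and true labels.
probs = [0.9, 0.8, 0.35, 0.3, 0.1]
labels = [1, 1, 1, 0, 0]
candidates = [k / 20 for k in range(1, 20)]
p_star = best_threshold(probs, labels, candidates)
```

Swapping `f1_at_threshold` for another measure changes the chosen threshold but leaves the fitted probabilities untouched, which is the point of the last bullet.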

126
Maximum Likelihood Training
• Choose parameters (βj's) that maximize
probability (likelihood) of class labels (yi's)
given documents (xi's)
• Maximizing (log-)likelihood can be viewed as
minimizing a loss function
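
To make the loss-function view concrete, here is a small sketch of the negative log-likelihood for binary logistic regression. It assumes labels coded 0/1 and plain Python lists for the data; the function name is hypothetical.

```python
import math

def neg_log_likelihood(beta, X, y):
    """Logistic loss: minus the log-likelihood of labels y given rows of X."""
    total = 0.0
    for xi, yi in zip(X, y):
        z = sum(b * x for b, x in zip(beta, xi))
        # -log p(y_i | x_i) equals log(1 + exp(-z)) if y_i = 1
        # and log(1 + exp(z)) if y_i = 0.
        margin = z if yi == 1 else -z
        total += math.log1p(math.exp(-margin))
    return total

X = [[1.0, 2.0], [1.0, -1.0]]
y = [1, 0]
loss_at_zero = neg_log_likelihood([0.0, 0.0], X, y)
loss_better = neg_log_likelihood([0.0, 1.0], X, y)
```

Maximizing the likelihood over β is the same as minimizing this quantity.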

127
Hastie, Friedman & Tibshirani
128
Avoiding Overfitting
• Text is high dimensional
• Maximum likelihood can give infinite parameter
values and poor effectiveness
• Solution: penalize large βj's, e.g. maximize the
log-likelihood minus a penalty on the squared βj's
• Called ridge logistic regression
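
The sketch below illustrates why the penalty matters. On linearly separable toy data, plain gradient ascent on the log-likelihood keeps pushing the coefficient upward, while the ridge-penalized objective settles at a finite value. The penalty weight `lam`, the step size, and the use of simple gradient ascent are assumptions for illustration, not the training method from the slides.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_ridge_logistic(X, y, lam, steps=2000, lr=0.1):
    """Gradient ascent on sum_i log p(y_i | x_i, beta) - lam * sum_j beta_j^2."""
    beta = [0.0] * len(X[0])
    for _ in range(steps):
        # Gradient of the penalty term.
        grad = [-2.0 * lam * b for b in beta]
        # Gradient of the log-likelihood term.
        for xi, yi in zip(X, y):
            p = sigmoid(sum(b * x for b, x in zip(beta, xi)))
            for j, x in enumerate(xi):
                grad[j] += (yi - p) * x
        beta = [b + lr * g for b, g in zip(beta, grad)]
    return beta

# Perfectly separable one-feature data: positive x means category member.
X = [[1.0], [2.0], [-1.0], [-2.0]]
y = [1, 1, 0, 0]
unpenalized = fit_ridge_logistic(X, y, lam=0.0)
ridge = fit_ridge_logistic(X, y, lam=1.0)
```

Running the unpenalized fit for more steps yields an ever larger coefficient, whereas the ridge estimate stays put.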

129
A Bayesian Interpretation of Ridge Logistic
Regression
• Suppose
• we believe each βj is a small value near 0
• and encode this belief as separate Gaussian
probability distributions over values of βj
• Bayes rule specifies our new (posterior) belief
about β after seeing training data
• Choosing the maximum a posteriori value of β
gives the same result as ridge logistic regression
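
The equivalence can be spelled out in one line. Assuming independent priors βj ~ N(0, τ²), where the symbols τ², λ, and D (the training data) are introduced here for illustration, the log posterior is:

```latex
\ln p(\beta \mid D)
= \sum_{i} \ln p(y_i \mid x_i, \beta)
- \frac{1}{2\tau^2} \sum_{j} \beta_j^2
+ \text{const}
```

Maximizing this over β is ridge logistic regression with penalty weight λ = 1/(2τ²), so a smaller prior variance means a stronger penalty.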

130
Zhang & Oles Results
• Reuters-21578 collection
• Ridge logistic regression plus feature selection

131
Bayes!
• MAP logistic regression with Gaussian prior gives
state of the art text classification
effectiveness
• But Bayesian framework more flexible than SVM for
combining knowledge with data
• Feature selection
• Stopwords, IDF
• Domain knowledge
• Number of classes
• (and kernels.)

132
Bayesian Supervised Feature Selection
• Results on ridge logistic regression for text
classification use ad hoc feature selection
• Use of feature selection reflects a belief (before
seeing data) that many coefficients are 0
• Put that belief into our prior on coefficients...
• Laplace prior, i.e. lasso logistic regression
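
For reference, the link between the Laplace prior and the lasso can be written out as follows. The scale parameter λ and the symbol D for the training data are assumptions introduced for this sketch.

```latex
p(\beta_j) = \frac{\lambda}{2} \exp\bigl(-\lambda \lvert \beta_j \rvert\bigr)
\quad\Longrightarrow\quad
\ln p(\beta \mid D)
= \sum_{i} \ln p(y_i \mid x_i, \beta)
- \lambda \sum_{j} \lvert \beta_j \rvert
+ \text{const}
```

The absolute-value penalty is what allows the maximum a posteriori estimate to set many coefficients exactly to 0.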

133
Data Sets
• ModApte subset of Reuters-21578
• 90 categories, 9,603 training docs, 18,978 features
• Reuters RCV1-v2
• 103 categories, 23,149 training docs, 47,152 features
• OHSUMED heart disease categories
• 77 categories, 83,944 training docs, 122,076 features
• Cosine-normalized TF×IDF weights

134
Dense vs. Sparse Models (Macroaveraged F1,
Preliminary)
135
136
137
Bayesian Unsupervised Feature Selection and
Weighting
• Stopwords: low-content words that typically are
• Give them a prior with mean 0 and low variance
• Inverse document frequency (IDF) weighting
• Rare words more likely to be content indicators
• Make variance of prior inversely proportional to
frequency in collection
• Experiments in progress
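
One way to read these bullets is as a recipe for per-feature prior variances. The sketch below is a guess at how that could look: the function name, the `base_variance` scale, and the tiny `stopword_variance` are all hypothetical, since the slides report the experiments as still in progress.

```python
def prior_variances(doc_freq, n_docs, stopwords=(),
                    base_variance=1.0, stopword_variance=1e-4):
    """Assign a prior variance to each term's coefficient (illustrative)."""
    variances = {}
    for term, df in doc_freq.items():
        if term in stopwords:
            # Mean-0, low-variance prior: coefficient is pinned near 0.
            variances[term] = stopword_variance
        else:
            # Variance inversely proportional to relative document frequency,
            # so rare terms are allowed larger coefficients.
            variances[term] = base_variance * n_docs / df
    return variances

doc_freq = {"the": 950, "price": 200, "bauxite": 5}
variances = prior_variances(doc_freq, n_docs=1000, stopwords={"the"})
```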

138
Bayesian Use of Domain Knowledge
• Often believe that certain words are positively
or negatively associated with category
• Prior mean can encode strength of positive or
negative association
• Prior variance encodes confidence
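
A small sketch of how a prior mean and variance per word enter the objective, assuming independent Gaussian priors. The function name and the toy numbers are hypothetical.

```python
def gaussian_log_prior(beta, means, variances):
    """Log density of independent N(mean_j, variance_j) priors, up to a constant."""
    return -sum((b - m) ** 2 / (2.0 * v)
                for b, m, v in zip(beta, means, variances))

# Word 0 is believed to be positively associated with the category
# (mean 2, fairly confident); word 1 has no prior knowledge (mean 0).
means = [2.0, 0.0]
variances = [0.5, 1.0]
at_prior_mean = gaussian_log_prior([2.0, 0.0], means, variances)
at_zero = gaussian_log_prior([0.0, 0.0], means, variances)
```

Adding this term to the log-likelihood pulls each coefficient toward its prior mean, more strongly when the variance is small.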

139
First Experiments
• 27 RCV1-v2 Region categories
• CIA World Factbook entry for country
• Give content words higher mean and/or variance
• Only 10 training examples per category
• Shows off prior knowledge
• Limited data often the case in applications

140
Results (Preliminary)
141
Polytomous Logistic Regression
• Logistic regression trivially generalizes to
1-of-k problems
• Cleaner than SVMs, error correcting codes, etc.
• Laplace prior particularly cool here
• Suppose there are 99 classes and a word that
predicts class 17
• The word gets used 100 times if we build 100
models, or if we use polytomous regression with a
Gaussian prior
• With a Laplace prior and polytomous regression
it is used only once
• Experiments in progress, particularly on author
id
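
A minimal sketch of the 1-of-k generalization, assuming the usual softmax form with one coefficient vector per class. The function name and the toy coefficients are hypothetical.

```python
import math

def class_probabilities(B, x):
    """B[k] is the coefficient vector for class k; returns P(y = k | x) for all k."""
    scores = [sum(b * v for b, v in zip(bk, x)) for bk in B]
    m = max(scores)  # subtract the max score for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Three classes, two features. Only class 1 uses feature 0; every other
# coefficient is exactly 0, as a sparse (Laplace-prior) fit might produce.
B = [[0.0, 0.0],
     [3.0, 0.0],
     [0.0, 0.0]]
probs = class_probabilities(B, [1.0, 1.0])
```

With a sparsity-inducing prior, a word that predicts a single class only needs a nonzero coefficient in that class's row.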

142
1-of-K Sample Results: brittany-l
89 authors with at least 50 postings. 10,076
training documents, 3,322 test documents.
BMR-Laplace classification, default
hyperparameter
143
1-of-K Sample Results: brittany-l
4.6 million parameters
89 authors with at least 50 postings. 10,076
training documents, 3,322 test documents.
BMR-Laplace classification, default
hyperparameter
144
Future
• Choose exact number of features desired
• Faster training algorithm for polytomous
• Currently using cyclic coordinate descent
• Hierarchical models
• Sharing strength among categories
• Hierarchical relationships among features
• Stemming, thesaurus classes, phrases, etc.
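
To give a flavor of cyclic coordinate descent, the sketch below applies it to the simpler lasso linear regression problem, where each coordinate update has a closed form (soft-thresholding). This is a stand-in chosen for brevity, not the logistic-regression algorithm used in the experiments; the function names and toy data are hypothetical.

```python
def soft_threshold(z, t):
    """Shrink z toward 0 by t, returning exactly 0 when |z| <= t."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0

def lasso_cd(X, y, lam, sweeps=100):
    """Minimize 0.5 * sum_i (y_i - x_i . beta)^2 + lam * sum_j |beta_j|."""
    n, d = len(X), len(X[0])
    beta = [0.0] * d
    resid = list(y)  # current residual y - X beta
    for _ in range(sweeps):
        for j in range(d):  # cycle through the coordinates in order
            col = [X[i][j] for i in range(n)]
            norm_sq = sum(c * c for c in col)
            if norm_sq == 0.0:
                continue
            # Inner product of column j with the residual that excludes beta_j.
            rho = sum(c * (r + c * beta[j]) for c, r in zip(col, resid))
            new = soft_threshold(rho, lam) / norm_sq
            delta = new - beta[j]
            if delta != 0.0:
                resid = [r - c * delta for c, r in zip(col, resid)]
                beta[j] = new
    return beta

X = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]
y = [2.0, 0.1, 2.0, 0.1]
beta = lasso_cd(X, y, lam=0.5)
```

The weakly related second feature is dropped entirely, while the first is kept but shrunk.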

145
Text Categorization Summary
• Conditional probability models (logistic,
probit, etc.)
• As powerful as other discriminative models (SVM,
boosting, etc.)
• Bayesian framework provides much richer ability
• Polytomous, domain-specific priors soon

146
For high-dimensional predictive modeling
• Regularized regression methods are better than
• Regularized regression methods are more practical
than discrete model averaging (and probably make
more sense)
• L1-regularization is the best way to do variable
selection

148
For high-dimensional predictive modeling
• Regularized regression methods are better than