
Overview | Model Selection | Theory | Computing | Continuous Model Selection

Model Selection and Related Topics

A mostly Bayesian perspective

David Madigan, Rutgers


Bayesian basics | Bayes factors

Bayesian Basics

- The Bayesian approach to statistical inference computes probability distributions for all unknowns (model parameters, future observables, etc.) conditional on the observed data
- Thus, denoting by θ the unknowns, we compute p(θ | data) ∝ p(data | θ) × p(θ)
- To play this game you need a prior, p(θ), and a likelihood, p(data | θ)


Bayesian Priors (per D.M. Titterington)


Bayesian Basics

- This presentation will focus mostly on the model p(data | θ)
- An idealization of the probabilistic process by which mother nature generates the data
- Frequently, a data analyst will entertain several models for the data: p(data | θ1, M1), p(data | θ2, M2), etc.
- This gives rise to the model selection problem (including model composition)


Bayes Factors: comparing two models/hypotheses

- Bayes factors compare the posterior-to-prior odds of one hypothesis with the posterior-to-prior odds of another:
- B01 = [p(M0 | data) / p(M0)] ÷ [p(M1 | data) / p(M1)] = p(data | M0) / p(data | M1)
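For two point hypotheses the marginal likelihoods are just likelihoods, so the Bayes factor reduces to a likelihood ratio. A minimal sketch with invented coin-toss numbers (not from the talk):

```python
from math import comb

# Hypothetical data: 12 coin tosses, 9 heads (numbers invented for illustration).
n, k = 12, 9

# Point hypotheses, so each marginal likelihood is just a binomial likelihood.
def binom_lik(p):
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

m0 = binom_lik(0.5)   # M0: fair coin
m1 = binom_lik(0.7)   # M1: biased coin with p = 0.7
bf_10 = m1 / m0       # Bayes factor for M1 over M0

# With equal prior probabilities the prior odds are 1, so the posterior odds
# equal the Bayes factor and the posterior probability of M1 is:
post_m1 = bf_10 / (1 + bf_10)
```

Because both hypotheses are simple, no prior on parameters is needed; with composite hypotheses each likelihood would be integrated against its parameter prior.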


Interpretation of Bayes Factors

- Jeffreys suggested a scale for interpreting the numerical value of a Bayes factor B01: roughly, 1-3 barely worth mentioning, 3-10 substantial, 10-30 strong, 30-100 very strong, and above 100 decisive evidence for M0, with the reciprocals (e.g. 1/30 ≈ 0.03) interpreted as the corresponding evidence for M1


Interpretation of Bayes Factors

- Note that the Bayes factor involves model probabilities, both prior, p(M), and posterior, p(M | data)
- p(M) is the probability that model M generated the data
- What if we don't believe that any of the models generated the data?


Bayes Factors Example (Draper)

- Here is a density estimate for 100 univariate observations, y1, …, y100
- M0: yi ~ N(μ, τ)
- M1: yi ~ t(μ, τ, ν)


Bayes Factors Example (cont.)

- Need to specify priors for everything


Bayes Factors Example (Draper)

- M0: yi ~ N(μ, τ)
- M1: yi ~ t(μ, τ, ν)
- K is about 0.04
- Interesting tidbit:
- posterior standard deviation of μ given M0 is 0.165
- posterior standard deviation of μ given M1 is 0.153
- (so model averaging can reduce the posterior standard deviation)


Bayes Factors and Improper Priors

- Using improper parameter priors means the Bayes factor contains a ratio of unknown constants
- Lots of work trying to get around this: fractional Bayes factors, partial Bayes factors, intrinsic Bayes factors, etc.
- Simpler solution: use proper priors


Bayes Factors and Model Probabilities

- Note that posterior model probabilities can be derived from all pairwise Bayes factors and pairwise prior odds:
- p(Mk | data) = αk Bk1 / Σj αj Bj1, where Bk1 = p(data | Mk) / p(data | M1) and αk = p(Mk) / p(M1)
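The slide's relation can be sketched numerically. The Bayes factors and prior odds below are invented for illustration; each model is compared against an arbitrary reference model M1:

```python
# Invented pairwise Bayes factors of each model against reference model M1,
# and prior odds against the same reference.
bayes_factors = {"M1": 1.0, "M2": 4.0, "M3": 0.5}   # B_k1 = p(D|Mk) / p(D|M1)
prior_odds = {"M1": 1.0, "M2": 1.0, "M3": 2.0}      # p(Mk) / p(M1)

# p(Mk | D) is proportional to prior odds times Bayes factor; normalize.
weights = {k: prior_odds[k] * bayes_factors[k] for k in bayes_factors}
total = sum(weights.values())
posterior = {k: w / total for k, w in weights.items()}
```

The choice of reference model does not matter: rescaling all Bayes factors and prior odds by a common factor leaves the normalized posterior probabilities unchanged.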


Bayesian model selection | Model scores | Variable selection | Model averaging

Bayesian Model Selection

- If we believe that one of the candidate models generated the data, then the predictively optimal model has highest posterior probability
- This is also true for variable selection with standard linear models when XᵀX is diagonal, σ² is known, and suitable priors are used (Clyde and Parmigiani, 1996)


The Median Probability Model (Barbieri and Berger, 2004)

- The median probability model includes exactly those variables whose posterior inclusion probability is at least 1/2
- Barbieri and Berger show that, for prediction under squared error loss, it can outperform the highest posterior probability model



Deviance Information Criterion (DIC)

- Deviance, D(θ) = −2 log p(data | θ), is a standard measure of model fit
- Can summarize in two ways:
- (1) at the posterior mean or mode: D(θ̄)
- (2) by averaging over the posterior: E[D(θ) | data]
- (2) will be bigger (i.e., worse) than (1)

Overview Model Selection Theory Computing Continuo

us Model Selection

Bayesian model selection Model scores Variable

selection Model averaging

Deviance Information Criterion (DIC)

- pD = E[D(θ) | data] − D(θ̄) is a measure of model complexity
- In the normal linear model, pD equals the number of parameters
- More generally, pD approximately equals the number of unconstrained parameters
- DIC = D(θ̄) + 2 pD
- Approximately optimizes predictive log loss
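A minimal sketch of the two deviance summaries and pD, for a N(μ, 1) model with a flat prior so that the posterior for μ is available in closed form (data and draw count invented):

```python
import math, random
random.seed(0)

# Toy data, assumed iid N(mu, 1).
y = [0.2, -0.5, 1.0, 0.3, -0.1]
n = len(y)
ybar = sum(y) / n
# With a flat prior, mu | data ~ N(ybar, 1/n); draw from it.
draws = [random.gauss(ybar, (1 / n) ** 0.5) for _ in range(20000)]

def deviance(mu):
    # D(mu) = -2 * log-likelihood under N(mu, 1)
    return sum((yi - mu) ** 2 + math.log(2 * math.pi) for yi in y)

d_bar = sum(deviance(mu) for mu in draws) / len(draws)  # (2): posterior mean of D
d_hat = deviance(sum(draws) / len(draws))               # (1): D at posterior mean
p_d = d_bar - d_hat     # effective number of parameters (about 1 here)
dic = d_hat + 2 * p_d
```

With one free parameter, pD comes out close to 1, matching the normal-linear-model statement above.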


Other Selection Criteria

- The training error rate will usually be less than the true error rate
- Typically work with error estimates of the form: training error + ω̂, where ω̂ is an estimate of the optimism


Specific Selection Criteria (squared error loss)


Selection Criteria - Theory

- BIC is consistent when the true model is fixed
- AIC is consistent if the dimensionality of the true model increases with N at an appropriate rate
- For standard linear models with known variance, AIC and Cp are essentially equivalent
- Folklore is that AIC tends to overfit


Cross-validation

- Since we don't usually believe that one of the candidate models generated the data, and predictive accuracy on future data is key, many authors argue in favor of cross-validation
- For example (Bernardo and Smith, 1994, section 6.1.6): select the model that maximizes Σj log p(xj | xn−1(j)), where xn−1(j) represents the data with observation xj removed and x1, …, xk is a random sample from the data
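The leave-one-out loop can be sketched as follows; to keep everything closed-form this toy version scores models by held-out squared error rather than the log predictive density of the slide, and the data are invented:

```python
# Toy data; compare two plug-in predictive models by leave-one-out error.
x = [1.1, 1.9, 3.2, 3.9, 5.1, 6.0]
y = [2.0, 4.1, 5.9, 8.2, 9.8, 12.1]   # roughly y = 2x

def loo_score(predict):
    """Sum over j of squared error predicting y[j] from the data with j removed."""
    total = 0.0
    for j in range(len(x)):
        xs = [x[i] for i in range(len(x)) if i != j]
        ys = [y[i] for i in range(len(y)) if i != j]
        total += (y[j] - predict(xs, ys, x[j])) ** 2
    return total

def mean_model(xs, ys, xnew):
    return sum(ys) / len(ys)          # M0: ignore x entirely

def line_model(xs, ys, xnew):
    # Least-squares line fitted on the training fold only.
    xbar, ybar = sum(xs) / len(xs), sum(ys) / len(ys)
    b = sum((a - xbar) * (c - ybar) for a, c in zip(xs, ys)) / \
        sum((a - xbar) ** 2 for a in xs)
    return ybar + b * (xnew - xbar)   # M1: simple linear regression

score0, score1 = loo_score(mean_model), loo_score(line_model)
```

Since y is almost exactly linear in x here, the linear model wins the held-out comparison by a wide margin.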


Cross-validation

- Cross-validation gives a slightly biased estimate of future accuracy because it does not use all the data
- Burman (1989) provides a bias correction


Variable Selection

- Important special case of model selection
- Which subset of X1, …, Xd to use as predictors of a response variable Y?
- 2^d possible models. For d = 30, there are about 10^9 models; for d = 50, there are more than 10^15


Variable Selection for Linear Models

Here the X's might be

- Raw predictor variables (continuous or coded-categorical)
- Transformed predictors (X4 = log X3)
- Basis expansions (X4 = X3², X5 = X3³, etc.)
- Interactions (X4 = X2·X3)

Popular choice for estimation is least squares


Variable Selection for Linear Models

- Standard all-subsets finds the subset of size k, k = 1, …, p, that minimizes RSS
- Choice of subset size requires a tradeoff: AIC, BIC, marginal likelihood, cross-validation, etc.
- "Leaps and bounds" is an efficient algorithm to do all-subsets up to about 40 variables
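All-subsets search can be sketched by brute force on a toy dataset (invented numbers); a hand-rolled normal-equations solver stands in here for leaps and bounds, which prunes the search rather than enumerating it:

```python
import itertools

# Toy design matrix (3 candidate predictors) and response.
X = [[1.0, 0.2, 0.5],
     [2.0, 0.1, 1.9],
     [3.0, 0.4, 3.2],
     [4.0, 0.3, 3.8],
     [5.0, 0.9, 5.1]]
y = [2.1, 4.0, 6.2, 7.9, 10.1]        # essentially 2 * column 0

def solve(a, b):
    # Gauss-Jordan elimination with partial pivoting.
    n = len(a)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[p] = m[p], m[c]
        for r in range(n):
            if r != c:
                f = m[r][c] / m[c][c]
                m[r] = [m[r][j] - f * m[c][j] for j in range(n + 1)]
    return [m[i][n] / m[i][i] for i in range(n)]

def rss(cols):
    # Least-squares fit (with intercept) on the chosen columns; return RSS.
    Z = [[1.0] + [row[c] for c in cols] for row in X]
    k = len(Z[0])
    XtX = [[sum(Z[r][i] * Z[r][j] for r in range(len(Z))) for j in range(k)]
           for i in range(k)]
    Xty = [sum(Z[r][i] * y[r] for r in range(len(Z))) for i in range(k)]
    beta = solve(XtX, Xty)
    return sum((y[r] - sum(b * z for b, z in zip(beta, Z[r]))) ** 2
               for r in range(len(Z)))

# For each subset size k, keep the RSS-minimizing subset.
best = {k: min(itertools.combinations(range(3), k), key=rss)
        for k in range(1, 4)}
```

RSS is non-increasing in subset size, which is exactly why choosing the size itself needs a penalized criterion or cross-validation.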


Bayesian Variable Selection

- Two key challenges:
- Exploring a space of 2^d models (more about this later)
- Choosing a p(M) ≡ p(γ), where γ indexes models
- Many applications use p(M) ∝ 1, but this induces a binomial distribution over model size
- Denison et al. (1998) use a truncated Poisson prior distribution for model size


Selection Bias

- Selection bias is a significant unresolved issue
- Searching model space to find the best model tends to overfit the data
- This holds even when using the close-to-unbiased estimate of predictive performance that cross-validation provides


Vehtari and Lampinen example

- Data

- Model


Vehtari and Lampinen's Solution

- Select the simplest model that gives a predictive distribution that is close to the BMA predictive distribution
- Not obvious how to conduct this search in high-dimensional problems


Cross-validation and model complexity

One standard error rule: pick the simplest model within one standard error of the minimum


Post-Model Selection Statistical Inference

- Conducting a data-driven model search and then proceeding as if the search never took place leads to biased and overconfident inferences
- Some non-Bayesian work on adjustment for model selection (e.g., current issue of JASA)


Bayesian Model Averaging

- If we believe that one of the candidate models generated the data, then the predictively optimal strategy is to average over all the models
- If Q is the inferential target, Bayesian Model Averaging (BMA) computes p(Q | data) = Σk p(Q | Mk, data) p(Mk | data)
- Substantial empirical evidence that BMA provides better prediction than model selection


Model probabilities | BIC | Variable selection | Model averaging

Laplace's Method for p(D | M) = ∫ p(D | θ, M) p(θ | M) dθ

Write log { p(D | θ, M) p(θ | M) } = n h(θ) (i.e., the log of the integrand divided by n); then expanding h to second order about the posterior mode θ̃ gives

p(D | M) ≈ (2π)^(d/2) |A|^(−1/2) p(D | θ̃, M) p(θ̃ | M)

where A is minus the Hessian of log { p(D | θ, M) p(θ | M) } evaluated at the posterior mode θ̃


- Tierney & Kadane (1986, JASA) show the approximation is O(n^-1)
- Using the MLE instead of the posterior mode is also O(n^-1)
- Using the expected information matrix in place of the Hessian is O(n^-1/2), but convenient since it is often computed by standard software
- Raftery (1993) suggested approximating by a single Newton step starting at the MLE
- Note the prior is explicit in these approximations


Monte Carlo Estimates of p(D | M)

Draw iid θ(1), …, θ(m) from the prior p(θ) and estimate p(D | M) by (1/m) Σi p(D | θ(i))

In practice this estimator has large variance


Better Monte Carlo Estimates of p(D | M)

Draw iid θ(1), …, θ(m) from the posterior p(θ | D)

Importance sampling


- Newton and Raftery's Harmonic Mean Estimator: p̂(D | M) = [ (1/m) Σi 1 / p(D | θ(i)) ]⁻¹, with θ(1), …, θ(m) drawn from the posterior
- Unstable in practice and needs modification
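Both estimators can be sketched in a Beta(1,1)-Bernoulli model (invented counts), where the exact marginal likelihood is available for comparison; the naive prior-sampling average behaves well here, while the harmonic mean has much heavier-tailed behavior:

```python
import math, random
random.seed(1)

# Conjugate model where the marginal likelihood is known exactly:
# Beta(1,1) prior, k successes in n Bernoulli trials (likelihood kernel).
n, k = 20, 14
true_ml = math.exp(math.lgamma(k + 1) + math.lgamma(n - k + 1)
                   - math.lgamma(n + 2))

def lik(p):
    return p ** k * (1 - p) ** (n - k)

m = 50000
# Naive estimator: average the likelihood over draws from the prior.
prior_draws = [random.betavariate(1, 1) for _ in range(m)]
naive = sum(lik(p) for p in prior_draws) / m

# Harmonic mean estimator: posterior draws, reciprocal likelihoods.
post_draws = [random.betavariate(k + 1, n - k + 1) for _ in range(m)]
harmonic = m / sum(1 / lik(p) for p in post_draws)
```

The harmonic mean's second moment involves 1/p(D | θ), which blows up in the tails of the posterior; that is the instability the slide refers to.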


p(D | M) from Gibbs sampler output (Chib's method)

First note the following identity (for any θ*):

p(D) = p(D | θ*) p(θ*) / p(θ* | D)

p(D | θ*) and p(θ*) are usually easy to evaluate. What about p(θ* | D)?

Suppose we decompose θ into (θ1, θ2) such that p(θ1 | D, θ2) and p(θ2 | D, θ1) are available in closed form
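The basic identity can be checked directly in a conjugate Beta-Bernoulli model (invented counts), where likelihood, prior, and posterior are all available in closed form, so no Gibbs output is needed:

```python
import math

# Beta(a, b) prior; data: k successes in n Bernoulli trials.
a, b = 2.0, 2.0
n, k = 30, 9

def log_beta(x, y):
    return math.lgamma(x) + math.lgamma(y) - math.lgamma(x + y)

def log_beta_pdf(p, x, y):
    return (x - 1) * math.log(p) + (y - 1) * math.log(1 - p) - log_beta(x, y)

theta_star = 0.3    # any point with positive posterior density works
log_lik = k * math.log(theta_star) + (n - k) * math.log(1 - theta_star)
log_prior = log_beta_pdf(theta_star, a, b)
log_post = log_beta_pdf(theta_star, a + k, b + n - k)  # conjugate posterior

# Chib's identity: log p(D) = log p(D|theta*) + log p(theta*) - log p(theta*|D)
log_ml_chib = log_lik + log_prior - log_post

# Exact marginal likelihood (binomial kernel) for comparison.
log_ml_exact = log_beta(a + k, b + n - k) - log_beta(a, b)
```

In a real Gibbs application, only p(θ* | D) is unavailable in closed form, and it is estimated from the sampler output as described on the next slides.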


The Gibbs sampler gives (dependent) draws from p(θ1, θ2 | D), and hence marginally from p(θ2 | D)

Rao-Blackwellization: p̂(θ1* | D) = (1/m) Σi p(θ1* | D, θ2(i))


What about three parameter blocks? Decompose p(θ* | D) = p(θ1* | D) · p(θ2* | θ1*, D) · p(θ3* | θ1*, θ2*, D)

- p(θ1* | D): OK — Rao-Blackwellize using draws from the main Gibbs run
- p(θ3* | θ1*, θ2*, D): OK — available in closed form
- p(θ2* | θ1*, D): ? — needs draws from p(θ3 | θ1*, D)

To get these draws, continue the Gibbs sampler, sampling in turn from p(θ2 | D, θ1*, θ3) and p(θ3 | D, θ1*, θ2)


p(D | M) from Metropolis sampler output (Chib & Jeliazkov)


- E1 is an expectation with respect to the posterior π(θ | y)
- E2 is with respect to the proposal density q(θ*, θ)


Savage-Dickey Density Ratio

- Suppose M0 simplifies M1 by setting one parameter (say θ1) to some constant (typically zero)
- If p1(θ2 | θ1 = 0) = p0(θ2), then

p(data | M0) / p(data | M1) = p(θ1 = 0 | M1, data) / p(θ1 = 0 | M1)
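The ratio can be sketched for a normal mean with known unit variance (invented data): M0 sets μ = 0, M1 puts μ ~ N(0, τ²), so the posterior ordinate at zero is available in closed form and the result can be checked against a brute-force marginal likelihood:

```python
import math

def norm_logpdf(x, mean, var):
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)

# Invented data, assumed iid N(mu, 1); M0: mu = 0, M1: mu ~ N(0, tau2).
y = [0.8, 1.4, -0.2, 0.9, 1.1, 0.5]
n, tau2 = len(y), 4.0
ybar = sum(y) / n

# Conjugate posterior for mu under M1.
v = 1 / (n + 1 / tau2)
m = v * n * ybar

# Savage-Dickey: BF_01 = p(mu = 0 | M1, data) / p(mu = 0 | M1)
log_bf_sd = norm_logpdf(0.0, m, v) - norm_logpdf(0.0, 0.0, tau2)

# Check against grid integration of the M1 marginal likelihood.
log_m0 = sum(norm_logpdf(yi, 0.0, 1.0) for yi in y)
step = 0.001
m1_num = sum(
    math.exp(sum(norm_logpdf(yi, mu0, 1.0) for yi in y)
             + norm_logpdf(mu0, 0.0, tau2)) * step
    for mu0 in (i * step - 5 for i in range(10001)))
log_bf_direct = log_m0 - math.log(m1_num)
```

The appeal of the ratio is that it needs only the posterior density of θ1 at zero under the larger model, with no marginal-likelihood integration at all.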


Bayesian Information Criterion (BIC)

BIC = 2 SL(θ̂) + d log N (SL is the negative log-likelihood; d is the number of parameters)

- BIC gives an O(1) approximation to log p(D | M)
- Circumvents an explicit prior
- Approximation is O(n^-1/2) for a class of priors called unit information priors
- No free lunch (Weakliem (1998) example)


Computing: Variable Selection via Stepwise Methods

- Efroymson's 1960 algorithm is still the most widely used


Efroymson

- F-to-Enter
- F-to-Remove
- Guaranteed to converge
- Not guaranteed to converge to the right model
- The "F" statistic's distribution is not even remotely like an F distribution


Trouble

- Y = X1 + X2
- Y almost orthogonal to X1 and X2 individually
- Forward selection and Efroymson pick X3 alone
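A constructed illustration of this failure mode (not the slide's actual data): Y = X1 + X2 exactly, but X1 and X2 are strongly negatively correlated, so each alone is nearly uncorrelated with Y, while a noisy proxy X3 of X1 + X2 wins the first forward-selection step:

```python
import random
random.seed(2)

n = 200
x1 = [random.gauss(0, 1) for _ in range(n)]
x2 = [-a + random.gauss(0, 0.1) for a in x1]    # x2 is roughly -x1
y = [a + b for a, b in zip(x1, x2)]             # y = x1 + x2 (small in scale)
x3 = [c + random.gauss(0, 0.02) for c in y]     # x3 is roughly x1 + x2

def corr(u, v):
    mu, mv = sum(u) / n, sum(v) / n
    su = sum((a - mu) ** 2 for a in u) ** 0.5
    sv = sum((b - mv) ** 2 for b in v) ** 0.5
    return sum((a - mu) * (b - mv) for a, b in zip(u, v)) / (su * sv)

candidates = {"x1": x1, "x2": x2, "x3": x3}
# Forward selection's first step: pick the predictor most correlated with y.
first_pick = max(candidates, key=lambda name: abs(corr(candidates[name], y)))
```

Greedy one-at-a-time search never sees that X1 and X2 are jointly perfect, which is exactly the pathology the slide describes.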


More Trouble

- Berk: example with 4 variables
- The forward and backward sequence is (X1), (X1, X2), (X1, X2, X4)
- The R² for (X1, X2) is 0.015


Even More Trouble

- Detroit example, N = 13, d = 11
- The first variable selected by forward selection is the first variable eliminated by backward elimination
- Best subset of size 3 gives RSS of 6.8
- Forward selection's best set of 3 has RSS 21.2; backward elimination gets 23.5


Alternatives to all-subsets

- Simulated Annealing, Tabu Search, etc.


MCMC for Bayesian Variable Selection (Ioannis Ntzoufras)

http://www.ba.aegean.gr/ntzoufras/courses/bugs2/handouts/modelsel/4_1_tutorial_handouts.pdf


Why MCMC?


Reversible Jump MCMC



Stochastic Search Variable Selection (George & McCulloch)


SSVS Procedure

Estimating Spina Bifida Numbers with Capture-Recapture Methods


(Regal & Hook, 1991; Madigan & York, 1996)

Model   Pr(Model)   N     95% HPD
B D R   0.37        731   (701, 767)
B D R   0.30        756   (714, 811)
B R D   0.28        712   (681, 751)
BMA     —           728   (682, 797)


Spina Bifida 95% HPDs

[Figure: 95% HPD intervals for models M1, M2, M3, and BMA]


Model Uncertainty


- Posterior Variance = Within-Model Variance + Between-Model Variance
- Data-driven model selection is risky: "part of the evidence is spent to specify the model" (Leamer, 1978)
- Model-based inferences can be over-precise
- Model-based predictions can be badly calibrated
- Draper (1995), Chatfield (1995)

Bayesian Model Averaging (BMA) can help

Bayesian Model Averaging


Out-of-Sample Predictive Performance


- Average improvement in predictive probability over the MAP model:

Data                         Model Class               Improvement
1. Coronary Heart Disease    Decomposable UDGs         2.2
2. Women and Mathematics     Decomposable UDGs         0.6
3. Scrotal Swellings         Decomposable UDGs         5.1
4. Crime and Punishment      Linear Regression         61.3
5. Lung Cancer               Exponential Regression    1.8
6. Cirrhosis                 Cox Survival Regression   1.8
7. Coronary Heart Disease    Essential graphs          1.5
8. Women and Mathematics     Essential graphs          3.0
9. Stroke                    Cox Survival Regression   15.0

BMA Computing


Occam's Window: find parsimonious models with large Pr(M | D) and average over those (Madigan and Raftery, 1991)

Importance Sampling (Clyde et al., JASA, 1996)

MC³: use MCMC to draw from Pr(M | D) (Madigan and York, 1992)

Gibbs MC³


e.g. Undirected graphical models; SSVS (George and McCulloch, 1993)

- Choose vi, vj at random (or systematically) and draw from the corresponding full conditional distribution

Metropolis MC³


e.g. SVO: Regression + Outliers (Hoeting, Raftery, Madigan, 1994-6)

Possible predictors: a, b, c, d. Possible outliers: 13, 20, 40.

Current model: (b,c)(13,20)

Candidate models: (b)(13,20), (b,c)(13), (c)(13,20), (b,c)(20), (b,c,d)(13,20), (b,c)(13,20,40), (a,b,c)(13,20)

Accept the candidate model M' with probability min{1, Pr(M' | D) / Pr(M | D)}

Augmented MC³


e.g. Bayesian Networks

- A total ordering T of V is said to be compatible with a directed graph M if the orientation of the arrows in M is consistent with T
- Draw from Pr(T, M | D) via Pr(M | T, D) and Pr(T | M, D)
- Pr(T | M, D): uniform on compatible T's; Metropolis accept/reject
- Pr(M | T, D): generate M' by adding or deleting an edge from M consistent with T; Metropolis accept/reject

More Augmented MC³


e.g. Double Sampling + Missing Data (York et al., 1994)

- Draw from Pr(Z, M | D) via Pr(M | Z, D) and Pr(Z | M, D)
- Pr(Z, θM | M, D) = Pr(Z | θM, M, D) Pr(θM | Z, M, D)

Reversible Jump MCMC (Green, 1995)

Linear Regression: SVT, SVO, SVOT


Hoeting, Raftery, Madigan (1994-6)

- Normal-gamma conjugate priors
- Box-Cox and ACE transformations
- Outliers (pre-process with LMS regression)


[Figure: posterior inclusion probabilities Pr(βi ≠ 0 | D)]

Generalized Linear Models


Raftery (1996)

- Laplace approximation: p(D | M) ≈ (2π)^(d/2) |Ψ|^(1/2) p(D | θ̃, M) p(θ̃ | M), where Ψ is minus the inverse Hessian of log { p(D | θ, M) p(θ | M) } evaluated at the posterior mode θ̃
- Idea: approximate θ̃ and Ψ by one Newton step starting from the MLE
- Gives the approximation using only GLIM output

Prior Distribution on Models


Madigan, Gavrin, Raftery (1995)

- Start with a uniform pre-prior, PPr(M)
- Elicit imaginary data from the expert
- Update the pre-prior using the imaginary data to get the real prior, Pr(M)
- Provided improved predictive performance in a particular medical example
- See also Ibrahim and Laud (1994)

Predicting Strokes


(Volinsky, Madigan, Raftery, Kronmal, 1996)

- Stroke is the third leading cause of death amongst adults in the US
- Gorelick (1995) estimated that 80% of strokes are preventable
- Cardiovascular Health Study (CHS): on-going; started 1989; 5,201 participants in four counties
- Participants are 65+; are risk factors different for older people?
- 172 strokes among 4,501 subjects

Measured Covariates


Follow-up ranged from 3.5 to 4.5 years (average 4.1)

Cox Model


Estimation is usually based on the partial likelihood (Taplin, 1993; Draper, 1995)

Finding the Models in Occam's Window


- Need models with posterior probability within a factor of C of the MAP model
- Approximate leaps and bounds: Furnival and Wilson (1974); Lawless and Singhal (1978)
- Finds the top q models of each size
- Find models within a factor of C², then compute exact BIC for these models
Picture of Probs vs P-values


Predictive Performance


Assigned Risk Group: number of subjects (strokes in parentheses)

Risk Group   Model Averaging   Stepwise    Top PMP
Low          751 (7)           750 (8)     724 (10)
Medium       770 (24)          799 (27)    801 (28)
High         645 (—)           617 (51)    641 (48)


Ridge | Lasso | LARS | Case Study

Shrinkage Methods

- Subset selection is a discrete process: individual variables are either in or out. Combinatorial nightmare.
- This method can have high variance: a different dataset from the same source can result in a totally different model
- Shrinkage methods allow a variable to be partly included in the model. That is, the variable is included but with a shrunken coefficient


Ridge Regression

Minimize Σi (yi − β0 − Σj xij βj)², subject to Σj βj² ≤ s

Equivalently: minimize Σi (yi − β0 − Σj xij βj)² + λ Σj βj²

This leads to β̂ridge = (XᵀX + λI)⁻¹ Xᵀy. Choose λ by cross-validation.

Works even when XᵀX is singular
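The closed form can be sketched on invented, centered data (two predictors, so the 2x2 normal-equations system is solved by Cramer's rule, and the intercept is omitted):

```python
# Ridge regression in closed form: beta = (X^T X + lambda I)^{-1} X^T y.
# Columns of X and y are (approximately) centered, so no intercept is fitted.
X = [[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [-1.0, -3.0], [-5.0, -4.0]]
y = [3.1, 2.9, 7.2, -3.9, -9.1]     # roughly y = x1 + x2

def ridge(lam):
    xtx = [[sum(r[i] * r[j] for r in X) + (lam if i == j else 0.0)
            for j in range(2)] for i in range(2)]
    xty = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(2)]
    # Solve the 2x2 system by Cramer's rule.
    det = xtx[0][0] * xtx[1][1] - xtx[0][1] * xtx[1][0]
    b0 = (xty[0] * xtx[1][1] - xtx[0][1] * xty[1]) / det
    b1 = (xtx[0][0] * xty[1] - xty[0] * xtx[1][0]) / det
    return [b0, b1]

ols = ridge(0.0)          # lambda = 0 recovers ordinary least squares
shrunk = ridge(100.0)     # a large lambda pulls both coefficients toward zero
```

Adding λI also makes the system invertible even when XᵀX itself is singular, which is the last point on the slide.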


Ridge Regression = Bayesian MAP Regression: the ridge estimate is the posterior mode for β under independent mean-zero normal priors on the coefficients


Least Absolute Shrinkage and Selection Operator (LASSO)

Minimize Σi (yi − β0 − Σj xij βj)², subject to Σj |βj| ≤ s

A quadratic programming algorithm is needed to solve for the parameter estimates

More generally, penalize Σj |βj|^q: q = 0 is variable selection, q = 1 the lasso, q = 2 ridge. Learn q?
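One common alternative to the quadratic-programming formulation is cyclic coordinate descent with soft-thresholding; a sketch on invented data where the response depends on only the first two columns:

```python
# Soft-thresholding: the scalar solution to the one-variable lasso problem.
def soft(z, g):
    return z - g if z > g else z + g if z < -g else 0.0

def lasso(X, y, lam, sweeps=200):
    # Cyclic coordinate descent for 0.5 * ||y - X b||^2 + lam * ||b||_1.
    n, d = len(X), len(X[0])
    beta = [0.0] * d
    for _ in range(sweeps):
        for j in range(d):
            # Residual excluding coordinate j's contribution.
            r = [y[i] - sum(beta[k] * X[i][k] for k in range(d) if k != j)
                 for i in range(n)]
            zj = sum(X[i][j] * r[i] for i in range(n))
            beta[j] = soft(zj, lam) / sum(X[i][j] ** 2 for i in range(n))
    return beta

# Toy data: y is exactly 2 * col0 + 1 * col1; col2 is irrelevant.
X = [[1, 0, 1], [2, 1, 0], [3, 2, 1], [-1, -1, 0], [-5, -2, -2]]
y = [2.0, 5.0, 8.0, -3.0, -12.0]

beta_ls = lasso(X, y, lam=0.0)       # no penalty: plain least squares
beta_big = lasso(X, y, lam=1000.0)   # huge penalty: everything hits zero
```

The soft-threshold step is what produces exact zeros, giving the lasso its variable-selection behavior; ridge's quadratic penalty only shrinks.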


Ridge & LASSO - Theory

- Lasso estimates are consistent
- But the lasso does not have the oracle property; that is, it does not deliver the correct model with probability tending to 1
- Fan & Li's SCAD penalty function has the oracle property


LARS

- New geometrical insights into the Lasso and Stagewise fitting
- Leads to a highly efficient Lasso algorithm for linear regression


LARS

- Start with all coefficients bj = 0
- Find the predictor xj most correlated with y
- Increase bj in the direction of the sign of its correlation with y. Take residuals r = y − ŷ along the way. Stop when some other predictor xk has as much correlation with r as xj has
- Increase (bj, bk) in their joint least squares direction until some other predictor xm has as much correlation with the residual r
- Continue until all predictors are in the model


Statistical Analysis of Textual Data

- Statistical text analysis has a long history in literary analysis and in solving disputed authorship problems
- First (?) was Thomas C. Mendenhall in 1887


- Mendenhall was Professor of Physics at Ohio State and at the University of Tokyo, Superintendent of the US Coast and Geodetic Survey, and later President of Worcester Polytechnic Institute

Mendenhall Glacier, Juneau, Alaska


χ² = 127.2, df = 12


- Used naïve Bayes with Poisson and negative binomial models
- Out-of-sample predictive performance


Today

- Statistical methods are routinely used for textual analyses of all kinds
- Machine translation, part-of-speech tagging, information extraction, question-answering, text categorization, etc.
- Not reported in the statistical literature (no statisticians?)


Text categorization

- Automatic assignment of documents with respect to a manually defined set of categories
- Applications: automated indexing, spam filtering, content filters, medical coding, CRM, essay grading
- Dominant technology is supervised machine learning
- Manually classify some documents, then learn a classification rule from them (possibly with manual intervention)

Document Representation

- Documents usually represented as a bag of words
- The xi's might be 0/1, counts, or weights (e.g. tf/idf, LSI)
- Many text processing choices: stopwords, stemming, phrases, synonyms, NLP, etc.
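The weighting options above can be made concrete with a tiny pure-Python sketch (my own toy corpus; one common variant of the tf/idf scheme):

```python
import math
from collections import Counter

docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "bayes factors compare models",
]

# Term frequencies per document, document frequencies across the corpus
tfs = [Counter(d.split()) for d in docs]
df = Counter(w for tf in tfs for w in tf)
N = len(docs)

def tfidf(tf):
    # weight = tf * log(N / df): frequent-in-document but rare-in-corpus
    # words get the largest weights
    return {w: c * math.log(N / df[w]) for w, c in tf.items()}

vectors = [tfidf(tf) for tf in tfs]
```

Replacing `tfidf` with `lambda tf: {w: 1 for w in tf}` or with raw counts gives the 0/1 and count representations mentioned above.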

Classifier Representation

- For instance, a linear classifier

- The xi's are derived from the text of the document
- yi indicates whether to put the document in the category
- The ßj's are parameters chosen to give good classification effectiveness

Logistic Regression Model

- Linear model for the log odds of category membership
- Equivalent to a conditional probability model
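The equivalence of the two forms is easy to check numerically: the linear score ß·x is the log odds, and the logistic transform of it is the conditional probability (coefficients and inputs below are arbitrary illustrative values):

```python
import math

def logit_prob(beta, x):
    """P(y = 1 | x) under the linear log-odds model."""
    z = sum(b * xi for b, xi in zip(beta, x))   # log odds = beta . x
    return 1.0 / (1.0 + math.exp(-z))

beta = [0.5, -1.2, 2.0]
x = [1.0, 0.3, 0.7]
p = logit_prob(beta, x)
# Inverting the probability recovers the linear score exactly
log_odds = math.log(p / (1 - p))
```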

Logistic Regression as a Linear Classifier

- If the estimated probability of category membership is greater than p, assign the document to the category
- Choose p to optimize the expected value of your effectiveness measure
- Can change the measure w/o changing the model
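For one simple effectiveness measure (my example, not necessarily the one intended above): if false positives and false negatives carry fixed costs, the expected-cost-minimizing threshold has a closed form, and changing the costs changes only the threshold, not the fitted model:

```python
def optimal_threshold(cost_fp, cost_fn):
    # Expected cost of assigning:  (1 - p) * cost_fp
    # Expected cost of skipping:   p * cost_fn
    # Assign exactly when the first is smaller, i.e. when p > t with:
    return cost_fp / (cost_fp + cost_fn)
```

So equal costs give the familiar 0.5 cutoff, while making false negatives 4x worse lowers the cutoff to 0.2.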

Maximum Likelihood Training

- Choose parameters (ßj's) that maximize the probability (likelihood) of the class labels (yi's) given the documents (xi's)
- Maximizing the (log-)likelihood can be viewed as minimizing a loss function

Hastie, Friedman & Tibshirani

Avoiding Overfitting

- Text is high dimensional
- Maximum likelihood gives infinite parameter values and poor effectiveness
- Solution: penalize large ßj's, i.e. maximize the log-likelihood minus a quadratic penalty on the ßj's
- Called ridge logistic regression
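A minimal numpy sketch of ridge logistic regression by gradient ascent on the penalized log-likelihood, on simulated data (dimensions, penalty strength, and step size are my own choices):

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 300, 20
X = rng.normal(size=(n, p))
beta_true = np.zeros(p)
beta_true[:3] = [2.0, -2.0, 1.0]          # only 3 informative features
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-(X @ beta_true)))).astype(float)

lam = 1.0                                  # ridge penalty strength
beta = np.zeros(p)
for _ in range(3000):
    prob = 1.0 / (1.0 + np.exp(-(X @ beta)))
    # gradient of [log-likelihood - (lam/2) * sum(beta^2)], averaged over n
    grad = (X.T @ (y - prob) - lam * beta) / n
    beta += 0.5 * grad
```

The penalty keeps all coefficients finite (no separation blow-up) and shrinks the uninformative ones toward, but not exactly to, zero.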

A Bayesian Interpretation of Ridge Logistic

Regression

- Suppose we believe each ßj is a small value near 0, and encode this belief as separate Gaussian probability distributions over values of ßj
- Bayes rule specifies our new (posterior) belief about ß after seeing the training data
- Choosing the maximum a posteriori value of ß gives the same result as ridge logistic regression
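The equivalence can be written out: with independent N(0, τ²) priors on the ßj's, the log posterior is the log-likelihood plus a quadratic term (constants dropped), so its maximizer is the ridge estimate with penalty weight 1/τ² (up to how the penalty is parameterized):

```latex
\log p(\beta \mid \text{data})
  = \ell(\beta) + \sum_j \log p(\beta_j) + \text{const}
  = \ell(\beta) - \frac{1}{2\tau^2} \sum_j \beta_j^2 + \text{const}
```

A small prior variance τ² means a strong belief in small coefficients, i.e. heavy shrinkage.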

Zhang & Oles Results

- Reuters-21578 collection
- Ridge logistic regression plus feature selection

Bayes!

- MAP logistic regression with a Gaussian prior gives state-of-the-art text classification effectiveness
- But the Bayesian framework is more flexible than SVMs for combining knowledge with data:
- Feature selection
- Stopwords, IDF
- Domain knowledge
- Number of classes
- (and kernels.)

Bayesian Supervised Feature Selection

- Results on ridge logistic regression for text classification use ad hoc feature selection
- Use of feature selection ⇒ a belief (before seeing the data) that many coefficients are 0
- Put that belief into our prior on the coefficients...
- Laplace prior, i.e. lasso logistic regression
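The Laplace-prior (lasso) version can be sketched with proximal gradient steps, where the soft-threshold step is exactly what produces coefficients that are identically zero (simulated data; sizes, penalty, and step size are my own):

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 300, 20
X = rng.normal(size=(n, p))
beta_true = np.zeros(p)
beta_true[:2] = [2.5, -2.5]               # only 2 informative features
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-(X @ beta_true)))).astype(float)

lam = 0.1                                  # L1 penalty weight
lr = 0.1
beta = np.zeros(p)
for _ in range(3000):
    prob = 1.0 / (1.0 + np.exp(-(X @ beta)))
    beta += lr * (X.T @ (y - prob)) / n    # gradient step on the log-likelihood
    # soft-threshold: the proximal operator of the L1 penalty; sets
    # small coefficients exactly to zero
    beta = np.sign(beta) * np.maximum(np.abs(beta) - lr * lam, 0.0)

n_zero = int((beta == 0.0).sum())          # sparsity of the fitted model
```

Unlike the Gaussian prior, which only shrinks, the Laplace prior performs feature selection as a byproduct of fitting.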

Data Sets

- ModApte subset of Reuters-21578
- 90 categories; 9,603 training docs; 18,978 features
- Reuters RCV1-v2
- 103 categories; 23,149 training docs; 47,152 features
- OHSUMED heart disease categories
- 77 categories; 83,944 training docs; 122,076 features
- Cosine-normalized TF×IDF weights

Dense vs. Sparse Models (Macroaveraged F1, Preliminary)

Bayesian Unsupervised Feature Selection and

Weighting

- Stopwords: low-content words that typically are discarded
- Give them a prior with mean 0 and low variance
- Inverse document frequency (IDF) weighting
- Rare words are more likely to be content indicators
- Make the variance of the prior inversely proportional to frequency in the collection
- Experiments in progress
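One literal reading of the last bullet (the actual scheme used in the experiments is not specified here, so this is purely a hypothetical sketch):

```python
def idf_prior_variance(doc_freq, n_docs, base_var=1.0):
    """Hypothetical scheme: prior variance inversely proportional to a
    word's document frequency, so rare (content-bearing) words are
    allowed to move further from the prior mean of 0, while very common
    words are held near 0, like soft stopword removal."""
    return base_var * n_docs / doc_freq
```

A word in 1 of 100 documents would get a prior variance 50x that of a word appearing in half the collection.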

Bayesian Use of Domain Knowledge

- Often we believe that certain words are positively or negatively associated with a category
- The prior mean can encode the strength of positive or negative association
- The prior variance encodes confidence
First Experiments

- 27 RCV1-v2 Region categories
- CIA World Factbook entry for each country
- Give content words higher prior mean and/or variance
- Only 10 training examples per category
- Shows off prior knowledge
- Limited data is often the case in applications

Results (Preliminary)

Polytomous Logistic Regression

- Logistic regression trivially generalizes to 1-of-k problems
- Cleaner than SVMs, error-correcting codes, etc.
- Laplace prior particularly cool here
- Suppose 99 classes and a word that predicts class 17
- The word gets used 100 times if you build 100 models, or if you use polytomous with a Gaussian prior
- With a Laplace prior and polytomous it's used only once
- Experiments in progress, particularly on author id
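The polytomous (multinomial) model keeps one linear score per class and turns them into probabilities with a softmax; a minimal sketch with an illustrative coefficient matrix of my own:

```python
import numpy as np

def polytomous_probs(B, x):
    """P(y = k | x) for K classes, given a (K, p) coefficient matrix B."""
    z = B @ x                 # one linear score per class
    z = z - z.max()           # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()        # softmax: scores -> class probabilities

# 3 classes, 2 features; the last row (all zeros) is a reference class
B = np.array([[2.0, 0.0],
              [0.0, 2.0],
              [0.0, 0.0]])
probs = polytomous_probs(B, np.array([3.0, -1.0]))
```

With a Laplace prior on the rows of B, a feature predictive of only one class can keep a nonzero coefficient in that row alone, rather than appearing in every one-vs-rest model.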

1-of-K Sample Results brittany-l

89 authors with at least 50 postings. 10,076 training documents, 3,322 test documents.

BMR-Laplace classification, default hyperparameter

1-of-K Sample Results brittany-l (continued)

4.6 million parameters

Future

- Choose exact number of features desired
- Faster training algorithm for polytomous
- Currently using cyclic coordinate descent
- Hierarchical models
- Sharing strength among categories
- Hierarchical relationships among features
- Stemming, thesaurus classes, phrases, etc.
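The cyclic coordinate descent mentioned above is easiest to see in the linear-lasso setting, where each coordinate update is a closed-form soft-threshold (a sketch with my own simulated data; the actual training code uses the logistic likelihood):

```python
import numpy as np

def ccd_lasso(X, y, lam, n_sweeps=200):
    """Cyclic coordinate descent for the linear lasso: sweep the
    coordinates in order; each update is a closed-form soft-threshold."""
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0)
    for _ in range(n_sweeps):
        for j in range(p):
            # partial residual with coordinate j's contribution removed
            r_j = y - X @ beta + X[:, j] * beta[j]
            rho = X[:, j] @ r_j
            beta[j] = np.sign(rho) * max(abs(rho) - lam, 0.0) / col_sq[j]
    return beta

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 8))
y = 2.0 * X[:, 0] + rng.normal(scale=0.1, size=100)
beta = ccd_lasso(X, y, lam=5.0)
```

Each pass touches one coefficient at a time, which keeps memory use low even when p is in the millions, at the price of many sweeps.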

Text Categorization Summary

- Conditional probability models (logistic, probit, etc.)
- As powerful as other discriminative models (SVMs, boosting, etc.)
- The Bayesian framework provides a much richer ability to insert task knowledge
- Code: http://stat.rutgers.edu/madigan/BBR
- Polytomous and domain-specific priors coming soon

For high-dimensional predictive modeling

- Regularized regression methods are better than ad hoc variable selection algorithms
- Regularized regression methods are more practical than discrete model averaging (and probably make more sense)
- L1 regularization is the best way to do variable selection
