
Why I am a Bayesian (and why you should become one, too), or, Classical statistics considered harmful

- Kevin Murphy
- UBC CS Stats
- 9 February 2005

Where does the title come from?

- Why I am not a Bayesian, Glymour, 1981
- Why Glymour is a Bayesian, Rosenkrantz, 1983
- Why isn't everyone a Bayesian?, Efron, 1986
- Bayesianism and causality, or, why I am only a half-Bayesian, Pearl, 2001

Many other such philosophical essays

Frequentist vs Bayesian

Frequentist:
- Prob = objective relative frequencies
- Params are fixed unknown constants, so cannot write e.g. P(θ > 0.5 | D)
- Estimators should be good when averaged across many trials

Bayesian:
- Prob = degrees of belief (uncertainty)
- Can write P(anything | D)
- Estimators should be good for the available data

Source: "All of Statistics", Larry Wasserman

Outline

- Hypothesis testing: Bayesian approach
- Hypothesis testing: classical approach
- What's wrong with the classical approach?

Coin flipping

HHTHT

HHHHH

What process produced these sequences?

(The following slides are from Tenenbaum & Griffiths)

Hypotheses in coin flipping

Describe processes by which D could be generated

D = HHTHT

- Fair coin, P(H) = 0.5
- Coin with P(H) = p
- Markov model
- Hidden Markov model
- ...


Representing generative models

- Graphical model notation
- Pearl (1988), Jordan (1998)
- Variables are nodes; edges indicate dependency
- Directed edges show the causal process of data generation

Models with latent structure

- Not all nodes in a graphical model need to be observed
- Some variables reflect latent structure, used in generating D but unobserved

How do we select the best model?

Bayes' rule:

P(H | D) = P(D | H) P(H) / Σ_H' P(D | H') P(H')

The denominator sums over the space of hypotheses.

The origin of Bayes' rule

- A simple consequence of using probability to represent degrees of belief
- For any two random variables, P(x, y) = P(x) P(y | x) = P(y) P(x | y), hence P(y | x) = P(x | y) P(y) / P(x)
Why represent degrees of belief with probabilities?

- Good statistics
- consistency, and worst-case error bounds
- Cox axioms
- necessary to cohere with common sense
- Dutch Book + survival of the fittest
- "if your beliefs do not accord with the laws of probability, then you can always be out-gambled by someone whose beliefs do so accord."
- Provides a theory of incremental learning
- a common currency for combining prior knowledge and the lessons of experience

Hypotheses in Bayesian inference

- Hypotheses H refer to processes that could have generated the data D
- Bayesian inference provides a distribution over these hypotheses, given D
- P(D | H) is the probability of D being generated by the process identified by H
- Hypotheses H are mutually exclusive: only one process could have generated D

Coin flipping

- Comparing two simple hypotheses
- P(H) = 0.5 vs. P(H) = 1.0
- Comparing simple and complex hypotheses
- P(H) = 0.5 vs. P(H) = p


Comparing two simple hypotheses

- Contrast simple hypotheses
- H1: fair coin, P(H) = 0.5
- H2: always heads, P(H) = 1.0
- Bayes' rule
- With two hypotheses, use the odds form

Bayes' rule in odds form

P(H1 | D) / P(H2 | D) = [P(D | H1) / P(D | H2)] × [P(H1) / P(H2)]

posterior odds = Bayes factor (likelihood ratio) × prior odds

Data = HHTHT

- P(H1 | D) / P(H2 | D) = [P(D | H1) / P(D | H2)] × [P(H1) / P(H2)]
- D = HHTHT
- H1, H2 = fair coin, always heads
- P(D | H1) = 1/2^5;  P(H1) = 999/1000
- P(D | H2) = 0;  P(H2) = 1/1000
- P(H1 | D) / P(H2 | D) = infinity

Data = HHHHH

- P(H1 | D) / P(H2 | D) = [P(D | H1) / P(D | H2)] × [P(H1) / P(H2)]
- D = HHHHH
- H1, H2 = fair coin, always heads
- P(D | H1) = 1/2^5;  P(H1) = 999/1000
- P(D | H2) = 1;  P(H2) = 1/1000
- P(H1 | D) / P(H2 | D) ≈ 30

Data = HHHHHHHHHH

- P(H1 | D) / P(H2 | D) = [P(D | H1) / P(D | H2)] × [P(H1) / P(H2)]
- D = HHHHHHHHHH
- H1, H2 = fair coin, always heads
- P(D | H1) = 1/2^10;  P(H1) = 999/1000
- P(D | H2) = 1;  P(H2) = 1/1000
- P(H1 | D) / P(H2 | D) ≈ 1
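The three posterior-odds calculations above can be checked with a short script (a minimal sketch; the function name is mine, the priors P(H1) = 999/1000 and P(H2) = 1/1000 are from the slides):

```python
def posterior_odds(n_heads, n_tails, prior_h1=999/1000, prior_h2=1/1000):
    """Posterior odds P(H1|D)/P(H2|D) for H1: fair coin vs. H2: always heads."""
    lik_h1 = 0.5 ** (n_heads + n_tails)       # P(D | H1) = 1/2^N
    lik_h2 = 1.0 if n_tails == 0 else 0.0     # P(D | H2): all-heads sequences only
    if lik_h2 == 0:
        return float('inf')                    # a single tail rules out H2
    return (lik_h1 * prior_h1) / (lik_h2 * prior_h2)

print(posterior_odds(3, 2))    # HHTHT: infinity
print(posterior_odds(5, 0))    # HHHHH: about 31
print(posterior_odds(10, 0))   # ten heads: about 1
```

Note how the strong prior odds of 999:1 in favour of the fair coin are overwhelmed by a run of only ten heads.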

Coin flipping

- Comparing two simple hypotheses
- P(H) = 0.5 vs. P(H) = 1.0
- Comparing simple and complex hypotheses
- P(H) = 0.5 vs. P(H) = p

Comparing simple and complex hypotheses

- Which provides a better account of the data: the simple hypothesis of a fair coin, or the complex hypothesis that P(H) = p?

Comparing simple and complex hypotheses

- P(H) = p is more complex than P(H) = 0.5 in two ways:
- P(H) = 0.5 is a special case of P(H) = p
- for any observed sequence X, we can choose p such that X is more probable than if P(H) = 0.5

Comparing simple and complex hypotheses

[Figure: likelihood P(D | p) as a function of p. For D = HHHHH the likelihood peaks at p = 1.0; for D = HHTHT it peaks at p = 0.6.]

Comparing simple and complex hypotheses

- P(H) = p is more complex than P(H) = 0.5 in two ways:
- P(H) = 0.5 is a special case of P(H) = p
- for any observed sequence X, we can choose p such that X is more probable than if P(H) = 0.5
- How can we deal with this?
- frequentist: hypothesis testing
- information theorist: minimum description length
- Bayesian: just use probability theory!

Comparing simple and complex hypotheses

- P(H1 | D) / P(H2 | D) = [P(D | H1) / P(D | H2)] × [P(H1) / P(H2)]
- Computing P(D | H1) is easy: P(D | H1) = 1/2^N
- Compute P(D | H2) by averaging over p: P(D | H2) = ∫ P(D | p) P(p | H2) dp


Marginal likelihood

P(D | H2) = ∫ P(D | p) P(p) dp, i.e. the likelihood averaged over the prior.

Likelihood and prior

- Likelihood: P(D | p) = p^NH (1-p)^NT
- NH = number of heads
- NT = number of tails
- Prior: P(p) ∝ p^(FH-1) (1-p)^(FT-1)

A simple method of specifying priors

- Imagine some fictitious trials, reflecting a set of previous experiences
- a strategy often used with neural networks
- e.g., F = 1000 heads, 1000 tails: strong expectation that any new coin will be fair
- In fact, this is a sensible statistical idea...

Likelihood and prior

- Likelihood: P(D | p) = p^NH (1-p)^NT
- NH = number of heads
- NT = number of tails
- Prior: P(p) ∝ p^(FH-1) (1-p)^(FT-1), i.e. Beta(FH, FT)
- FH = fictitious observations of heads
- FT = fictitious observations of tails ("pseudo-counts")

Posterior ∝ prior × likelihood

- Prior: P(p) ∝ p^(FH-1) (1-p)^(FT-1)
- Likelihood: P(D | p) = p^NH (1-p)^NT
- Posterior: P(p | D) ∝ p^(NH+FH-1) (1-p)^(NT+FT-1)

Same form!

Conjugate priors

- Exist for many standard distributions
- formula for exponential family conjugacy
- Define prior in terms of fictitious observations
- Beta is conjugate to the Bernoulli (coin flipping)

[Figure: Beta(FH, FT) densities for FH = FT = 1, FH = FT = 3, FH = FT = 1000.]
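Conjugacy makes the posterior update a matter of adding observed counts to pseudo-counts; a minimal sketch (the function name is mine):

```python
def beta_posterior(n_heads, n_tails, f_heads=1, f_tails=1):
    """Beta(FH, FT) prior + Bernoulli data -> Beta(NH+FH, NT+FT) posterior.

    Returns the posterior parameters and the posterior mean of p.
    """
    a = n_heads + f_heads
    b = n_tails + f_tails
    mean = a / (a + b)   # posterior mean of p
    return a, b, mean

# D = HHTHT (3 heads, 2 tails) with a uniform Beta(1, 1) prior
a, b, mean = beta_posterior(3, 2)
```

With the uniform prior this gives a Beta(4, 3) posterior with mean 4/7, the familiar Laplace "add one" rule.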

Normalizing constants

- Prior: P(p) = p^(FH-1) (1-p)^(FT-1) / B(FH, FT)
- Normalizing constant for the Beta distribution: B(a, b) = Γ(a) Γ(b) / Γ(a+b)
- Posterior: P(p | D) = Beta(NH+FH, NT+FT)
- Hence the marginal likelihood is P(D) = B(NH+FH, NT+FT) / B(FH, FT)
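The closed-form marginal likelihood above can be evaluated in log space with the gamma function (a sketch; the helper names are mine):

```python
from math import lgamma, exp

def log_beta_fn(a, b):
    """log B(a, b) = log Gamma(a) + log Gamma(b) - log Gamma(a + b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def marginal_likelihood(n_heads, n_tails, f_heads=1, f_tails=1):
    """P(D) = B(NH+FH, NT+FT) / B(FH, FT) for the Beta-Bernoulli model."""
    return exp(log_beta_fn(n_heads + f_heads, n_tails + f_tails)
               - log_beta_fn(f_heads, f_tails))

# D = HHTHT under H2 with a uniform prior: B(4, 3) / B(1, 1) = 1/60
m = marginal_likelihood(3, 2)
```

Compare m = 1/60 with P(D | H1) = 1/2^5 = 1/32: for this sequence the simple fair-coin hypothesis has the higher evidence.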


[Figure: likelihood for H1 and marginal likelihood (evidence) for H2 as a function of the number of heads; the marginal likelihood is an average over all values of p. A further plot shows sensitivity to the hyper-parameters FH, FT.]

Bayesian model selection

- Simple and complex hypotheses can be compared directly using Bayes' rule
- requires summing over latent variables
- Complex hypotheses are penalized for their greater flexibility: the "Bayesian Occam's razor"
- Maximum likelihood cannot be used for model selection (it always prefers the hypothesis with the largest number of parameters)

Outline

- Hypothesis testing: Bayesian approach
- Hypothesis testing: classical approach
- What's wrong with the classical approach?

Example: Belgian euro coins

- A Belgian euro spun N = 250 times came up heads X = 140.
- "It looks very suspicious to me. If the coin were unbiased the chance of getting a result as extreme as that would be less than 7%." (Barry Blight, LSE, reported in The Guardian, 2002)

Source: MacKay, exercise 3.15

Classical hypothesis testing

- Null hypothesis H0, e.g. θ = 0.5 (unbiased coin)
- For a classical analysis, we don't need to specify an alternative hypothesis, but later we will use H1: θ ≠ 0.5
- Need a decision rule that maps data D to accept/reject of H0
- Define a scalar measure of deviance d(D) from the null hypothesis, e.g., Nh or χ²

P-values

- Define the p-value of threshold τ as p(τ) = P(d(D) ≥ τ | H0)
- Intuitively, the p-value of the data is the probability of getting data at least that extreme, given H0
- Usually choose τ so that the false rejection rate of H0 is below the significance level α = 0.05
- Often use an asymptotic approximation to the distribution of d(D) under H0 as N → ∞

P-value for euro coins

- N = 250 trials, X = 140 heads
- P-value is less than 7%
- If N = 250 and X = 141, pval = 0.0497, so we can reject the null hypothesis at the significance level of 5%.
- This does not mean P(H0 | D) = 0.07!

pval = (1 - binocdf(139, n, 0.5)) + binocdf(110, n, 0.5)
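The Matlab one-liner above can be reproduced exactly in pure Python: the two-sided p-value sums both tails of the Binomial(250, 0.5) distribution (a sketch; the helper name is mine):

```python
from math import comb

def binom_cdf(k, n, p=0.5):
    """P(X <= k) for X ~ Binomial(n, p), computed exactly."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

n = 250
# Two-sided p-value for X = 140 heads: P(X >= 140) + P(X <= 110)
pval_140 = (1 - binom_cdf(139, n)) + binom_cdf(110, n)
# The same for X = 141 heads: P(X >= 141) + P(X <= 109)
pval_141 = (1 - binom_cdf(140, n)) + binom_cdf(109, n)
```

pval_140 comes out just under 7%, matching Blight's figure, while pval_141 falls just under the 5% threshold.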

Bayesian analysis of the euro coin

- Assume P(H0) = P(H1) = 0.5
- Assume P(p) = Beta(α, α)
- Setting α = 1 yields a uniform (non-informative) prior.

Bayesian analysis of the euro coin

- If α = 1, the Bayes factor slightly favours H0, so H0 (unbiased) is (slightly) more probable than H1 (biased).
- By varying α over a large range, the best we can do is make B = 1.9, which does not strongly support the biased-coin hypothesis.
- Other priors yield similar results.
- The Bayesian analysis contradicts the classical analysis.
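The α = 1 case can be computed directly: the marginal likelihood under H1 with a uniform prior is B(X+1, N-X+1), while H0 gives (1/2)^N (a sketch; the variable names are mine, and only the uniform-prior case from the slides is shown):

```python
from math import lgamma, log, exp

def log_beta_fn(a, b):
    """log B(a, b) = log Gamma(a) + log Gamma(b) - log Gamma(a + b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)

N, X = 250, 140
# H0: fair coin
log_p_d_h0 = N * log(0.5)
# H1: p ~ Beta(1, 1) (uniform); marginal likelihood = B(X+1, N-X+1) / B(1, 1)
log_p_d_h1 = log_beta_fn(X + 1, N - X + 1)
# Bayes factor in favour of H0 (computed in log space to avoid underflow)
bf_01 = exp(log_p_d_h0 - log_p_d_h1)
```

bf_01 comes out around 2, i.e. the data mildly favour the unbiased coin, in sharp contrast to the "significant" classical rejection of H0.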


Outline

- Hypothesis testing: Bayesian approach
- Hypothesis testing: classical approach
- What's wrong with the classical approach?
- Violates the likelihood principle
- Violates the stopping rule principle
- Violates common sense

The likelihood principle

- In order to choose between hypotheses H0 and H1 given observed data, one should ask how likely the observed data are; do not ask questions about data that we might have observed but did not, such as tail areas P(d(D) ≥ d(Dobs) | H0).
- This principle can be proved from two simpler principles, called conditionality and sufficiency.

Frequentist statistics violates the likelihood principle

- "The use of P-values implies that a hypothesis that may be true can be rejected because it has not predicted observable results that have not actually occurred." (Jeffreys, 1961)

Another example

- Suppose X ~ N(μ, σ²); we observe x = 3
- Compare H0: μ = 0 with H1: μ > 0
- P-value = P(X ≥ 3 | H0) ≈ 0.001, so reject H0
- Bayesian approach: update P(μ | X) using conjugate analysis; compute the Bayes factor to compare H0 and H1

When are P-values valid?

- Suppose X ~ N(μ, σ²); we observe X = x.
- One-sided hypothesis test: H0: μ ≤ μ0 vs. H1: μ > μ0
- If P(μ) ∝ 1, then P(μ | x) = N(x, σ²), so P(H0 | x) = P(μ ≤ μ0 | x) = Φ((μ0 - x)/σ)
- The p-value P(X ≥ x | μ0) = 1 - Φ((x - μ0)/σ) is the same in this case, since the Gaussian is symmetric in its arguments
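The coincidence of p-value and flat-prior posterior probability is easy to verify numerically (a sketch; the numbers σ = 1, x = 3 match the earlier example):

```python
from math import erf, sqrt

def normal_cdf(z):
    """Standard normal CDF Phi(z), via the error function."""
    return 0.5 * (1 + erf(z / sqrt(2)))

mu0, x, sigma = 0.0, 3.0, 1.0
# One-sided p-value: P(X >= x | mu = mu0)
p_value = 1 - normal_cdf((x - mu0) / sigma)
# Posterior probability of H0 under a flat prior: P(mu <= mu0 | x)
posterior_h0 = normal_cdf((mu0 - x) / sigma)
```

Both quantities equal 1 - Φ(3) ≈ 0.00135, which is why the one-sided Gaussian test is the benign case.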

Outline

- Hypothesis testing: Bayesian approach
- Hypothesis testing: classical approach
- What's wrong with the classical approach?
- Violates the likelihood principle
- Violates the stopping rule principle
- Violates common sense

Stopping rule principle

- Inferences you make should depend only on the observed data, not on the reasons why this data was collected.
- If you look at your data to decide when to stop collecting, this should not change any conclusions you draw.
- Follows from the likelihood principle.

Frequentist statistics violates stopping rule

principle

- Observe DHHHTHHHHTHHT. Is there evidence of bias

(Pt gt Ph)? - Let X3 heads be observed random variable and

N12 trials be fixed constant. Define H0 Ph0.5.

Then, at the 5 level, there is no significant

evidence of bias

Frequentist statistics violates the stopping rule principle

- Suppose the data were instead generated by tossing the coin until we got X = 3 tails.
- Now X = 3 is a fixed constant and N = 12 is a random variable. Now there IS significant evidence of bias!
- Under this stopping rule, the first n-1 trials contain x-1 tails, and the last trial is always a tail.
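The two analyses of the same 12-toss sequence (3 tails) can be computed exactly; a sketch, counting tails as the deviance measure:

```python
from math import comb

n, x = 12, 3    # 12 tosses, 3 tails observed
p = 0.5         # H0: fair coin

# Binomial sampling: N = 12 fixed, X random.
# Evidence of heads bias <=> few tails: p-value = P(X <= 3 | n = 12)
# (p^k (1-p)^(n-k) = 0.5^n for every k when p = 0.5)
p_binomial = sum(comb(n, k) * p**n for k in range(x + 1))

# Negative binomial sampling: stop at the 3rd tail, so X = 3 fixed, N random.
# P(N = m) = C(m-1, x-1) * 0.5^m; p-value = P(N >= 12) = 1 - P(N <= 11)
p_negbinom = 1 - sum(comb(m - 1, x - 1) * p**m for m in range(x, n))
```

The binomial p-value is 299/4096 ≈ 0.073 (not significant), while the negative-binomial p-value is 67/2048 ≈ 0.033 (significant at 5%): identical data, opposite conclusions, depending only on the experimenter's intentions.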

Ignoring the stopping criterion can mislead classical estimators

- Let Xi ~ Bernoulli(θ)
- Maximum likelihood estimator: θ̂ = fraction of heads
- For fixed N, the MLE is unbiased
- Now toss a coin; if heads, stop, else toss a second coin. Then P(H) = θ, P(TH) = (1-θ)θ, P(TT) = (1-θ)²
- Under this stopping rule the MLE is biased!
- Many classical rules exist for assessing significance when complex stopping rules are used.
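The bias can be shown by enumerating the three possible outcomes of the stop-on-heads rule (a sketch; θ = 0.6 is an arbitrary illustrative value):

```python
theta = 0.6

# Outcomes under "toss; if heads, stop, else toss a second coin":
# (probability of sequence, MLE = fraction of heads in the sequence)
outcomes = [
    (theta,               1.0),   # H
    ((1 - theta) * theta, 0.5),   # TH
    ((1 - theta) ** 2,    0.0),   # TT
]
expected_mle = sum(prob * est for prob, est in outcomes)
# expected_mle = theta + theta*(1-theta)/2, which exceeds theta for 0 < theta < 1
```

For θ = 0.6 the expected MLE is 0.72, so the estimator systematically overstates the heads probability under this stopping rule.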

Outline

- Hypothesis testing: Bayesian approach
- Hypothesis testing: classical approach
- What's wrong with the classical approach?
- Violates the likelihood principle
- Violates the stopping rule principle
- Violates common sense

Confidence intervals

- An interval (θmin(D), θmax(D)) is a 95% CI if θ lies inside this interval 95% of the time across repeated draws D ~ P(· | θ)
- This does not mean P(θ ∈ CI | D) = 0.95!

MacKay, sec. 37.3

Example

- Draw 2 integers from P(x | θ) = 1/2 if x = θ, 1/2 if x = θ+1, 0 otherwise
- If θ = 39, we would expect (x1, x2) to be (39, 39), (39, 40), (40, 39), or (40, 40), each with probability 1/4
- Define the confidence interval as CI(D) = (min(x1, x2), min(x1, x2))
- e.g. (x1, x2) = (40, 39) gives CI = (39, 39)
- 75% of the time, this CI will contain the true θ

CIs violate common sense

- If (x1, x2) = (39, 39), then CI = (39, 39) at level 75%. But the data are equally consistent with θ = 39 and θ = 38, so clearly P(θ = 39 | D) = P(θ = 38 | D) = 0.5
- If (x1, x2) = (39, 40), then CI = (39, 39), and here clearly P(θ = 39 | D) = 1.0
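The 75% coverage claim can be checked by exhaustive enumeration of the four equally likely datasets (a sketch of MacKay's example; variable names are mine):

```python
from itertools import product

theta = 39
support = [theta, theta + 1]   # each draw is theta or theta+1, prob 1/2 each

covered = 0
total = 0
for x1, x2 in product(support, repeat=2):   # four equally likely datasets
    ci = (min(x1, x2), min(x1, x2))         # the 75% confidence procedure
    total += 1
    if ci[0] <= theta <= ci[1]:
        covered += 1

coverage = covered / total   # fraction of datasets whose CI contains theta
```

Coverage is exactly 3/4: the procedure is a valid 75% CI, even though for the particular dataset (39, 39) the interval is right only half the time, and for (39, 40) it is certainly right.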

What's wrong with the classical approach?

- Violates the likelihood principle
- Violates the stopping rule principle
- Violates common sense

What's right about the Bayesian approach?

- Simple and natural
- An optimal mechanism for reasoning under uncertainty
- A generalization of Aristotelian logic that reduces to deductive logic if our hypotheses are either true or false
- Supports interesting (human-like) kinds of learning


Bayesian humor

- "A Bayesian is one who, vaguely expecting a horse, and catching a glimpse of a donkey, strongly believes he has seen a mule."