# Basic Bayes: model fitting, model selection, model averaging

1
Basic Bayes: model fitting, model selection, model averaging
Josh Tenenbaum, MIT
2
Bayes rule
For any hypothesis h and data d,

    P(h|d) = P(d|h) P(h) / Σh' P(d|h') P(h')

where the sum in the denominator ranges over the space of alternative hypotheses.
3
Bayesian inference
• Bayes rule
• An example
• Data John is coughing
• Some hypotheses
• John has a cold
• John has lung cancer
• John has a stomach flu
• Prior P(h) favors 1 and 3 over 2
• Likelihood P(d|h) favors 1 and 2 over 3
• Posterior P(h|d) favors 1 over 2 and 3
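The worked example above can be sketched in a few lines of Python. The specific prior and likelihood numbers below are hypothetical, chosen only to reproduce the qualitative ordering on the slide:

```python
# Posterior over hypotheses for the data "John is coughing".
# Priors and likelihoods are illustrative assumptions, not slide values.
hypotheses = ["cold", "lung cancer", "stomach flu"]
prior      = {"cold": 0.5, "lung cancer": 0.01, "stomach flu": 0.49}   # P(h)
likelihood = {"cold": 0.8, "lung cancer": 0.7,  "stomach flu": 0.1}    # P(d|h)

# Bayes' rule: P(h|d) = P(d|h) P(h) / sum_h' P(d|h') P(h')
evidence = sum(likelihood[h] * prior[h] for h in hypotheses)
posterior = {h: likelihood[h] * prior[h] / evidence for h in hypotheses}

for h in hypotheses:
    print(f"P({h} | coughing) = {posterior[h]:.3f}")
```

With these numbers, "cold" dominates the posterior because it scores well on both prior and likelihood, exactly as the slide's ordering suggests.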

4
Plan for this lecture
• Some basic aspects of Bayesian statistics
• Model fitting
• Model averaging
• Model selection
• A case study in Bayesian cognitive modeling
• The number game

5
Coin flipping
• Comparing two hypotheses
• data: HHTHT or HHHHH
• compare two simple hypotheses
• P(H) = 0.5 vs. P(H) = 1.0
• Parameter estimation (model fitting)
• compare many hypotheses in a parameterized family
• P(H) = q: infer q
• Model selection
• compare qualitatively different hypotheses, often varying in complexity
• P(H) = 0.5 vs. P(H) = q

6
Coin flipping
HHTHT
HHHHH
What process produced these sequences?
7
Comparing two simple hypotheses
• Contrast simple hypotheses
• h1: fair coin, P(H) = 0.5
• h2: always heads, P(H) = 1.0
• Bayes rule
• With two hypotheses, use the odds form:

    P(h1|D) / P(h2|D) = [P(D|h1) / P(D|h2)] × [P(h1) / P(h2)]

8
Comparing two simple hypotheses
• D = HHTHT
• H1 = fair coin, H2 = always heads
• P(D|H1) = 1/2^5    P(H1) = ?
• P(D|H2) = 0    P(H2) = 1 - ?

9
Comparing two simple hypotheses
• D = HHTHT
• H1 = fair coin, H2 = always heads
• P(D|H1) = 1/2^5    P(H1) = 999/1000
• P(D|H2) = 0    P(H2) = 1/1000

10
Comparing two simple hypotheses
• D = HHHHH
• H1 = fair coin, H2 = always heads
• P(D|H1) = 1/2^5    P(H1) = 999/1000
• P(D|H2) = 1    P(H2) = 1/1000

11
Comparing two simple hypotheses
• D = HHHHHHHHHH
• H1 = fair coin, H2 = always heads
• P(D|H1) = 1/2^10    P(H1) = 999/1000
• P(D|H2) = 1    P(H2) = 1/1000
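The three cases above can be run through the odds form directly. This is a minimal sketch using the slides' priors of 999/1000 and 1/1000:

```python
# Posterior odds for H1 = "fair coin" vs. H2 = "always heads",
# using the priors from the slides: P(H1) = 999/1000, P(H2) = 1/1000.
def posterior_odds(seq, p_h1=0.999, p_h2=0.001):
    """Odds P(H1|D)/P(H2|D) via the odds form of Bayes' rule."""
    lik_h1 = 0.5 ** len(seq)                      # fair coin
    lik_h2 = 1.0 if set(seq) == {"H"} else 0.0    # always heads
    if lik_h2 == 0.0:
        return float("inf")                        # H2 is ruled out
    return (lik_h1 / lik_h2) * (p_h1 / p_h2)

print(posterior_odds("HHTHT"))        # H2 impossible: odds are infinite
print(posterior_odds("HHHHH"))        # 999/32 ≈ 31: still favors fair
print(posterior_odds("H" * 10))       # 999/1024 ≈ 0.98: now favors trick
```

Five heads is not yet enough to overcome a 999:1 prior in favor of a fair coin, but ten heads tips the odds.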

12
The role of priors
• The fact that HHTHT looks representative of a
fair coin, and HHHHH does not, depends entirely
on our hypothesis space and prior probabilities.
• Should we be worried about that, or happy?

13
The role of intuitive theories
• The fact that HHTHT looks representative of a
fair coin, and HHHHH does not, reflects our
implicit theories of how the world works.
• Easy to imagine how a trick all-heads coin could work: high prior probability.
• Hard to imagine how a trick HHTHT coin could work: low prior probability.

14
Coin flipping
• Basic Bayes
• data: HHTHT or HHHHH
• compare two hypotheses
• P(H) = 0.5 vs. P(H) = 1.0
• Parameter estimation (model fitting)
• compare many hypotheses in a parameterized family
• P(H) = q: infer q
• Model selection
• compare qualitatively different hypotheses, often varying in complexity
• P(H) = 0.5 vs. P(H) = q

15
Parameter estimation
• Assume data are generated from a parameterized model
• What is the value of q?
• each value of q is a hypothesis H
• requires inference over infinitely many hypotheses

(Graphical model: q → d1, d2, d3, d4, with P(H) = q.)
16
Model selection
• Assume hypothesis space of possible models
• Which model generated the data?
• requires summing out hidden variables
• requires some form of Occam's razor to trade off complexity with fit to the data.

(Figure: three candidate models for data d1 … d4 — a fair coin with P(H) = 0.5, a coin of unknown bias P(H) = q, and a hidden Markov model whose hidden state si switches between a fair and a trick coin.)
17
Parameter estimation vs. Model selection across
learning and development
• Causality: learning the strength of a relation vs. learning the existence and form of a relation
• Language acquisition: learning a speaker's accent, or frequencies of different words, vs. learning a new tense or syntactic rule (or learning a new language, or the existence of different languages)
• Concepts: learning what horses look like vs. learning that there is a new species (or learning that there are species)
• Intuitive physics: learning the mass of an object vs. learning about gravity or angular momentum

18
A hierarchical learning framework
model
parameter setting
data
19
A hierarchical learning framework
model class
model
parameter setting
data
20
Bayesian parameter estimation
• Assume data are generated from a model
• What is the value of q?
• each value of q is a hypothesis H
• requires inference over infinitely many hypotheses

(Graphical model: q → d1, d2, d3, d4, with P(H) = q.)
21
Some intuitions
• D = 10 flips, with 5 heads and 5 tails.
• q = P(H) on next flip? 50%
• Why? 50% = 5 / (5+5) = 5/10.
• Why? The future will be like the past.
• Suppose instead D = 4 heads, 6 tails. P(H) on next flip? Closer to 50% than to 40%.
• Why? Prior knowledge.

22
Integrating prior knowledge and data
• Posterior distribution P(q|D) is a probability density over q = P(H)
• Need to specify likelihood P(D|q) and prior distribution P(q).

23
Likelihood and prior
• Likelihood: Bernoulli distribution
• P(D|q) = q^NH (1-q)^NT
• NH = number of heads, NT = number of tails
• Prior
• P(q) = ?
24
Some intuitions
• D = 10 flips, with 5 heads and 5 tails.
• q = P(H) on next flip? 50%
• Why? 50% = 5 / (5+5) = 5/10.
• Why? Maximum likelihood.
• Suppose instead D = 4 heads, 6 tails. P(H) on next flip? Closer to 50% than to 40%.
• Why? Prior knowledge.

25
A simple method of specifying priors
• Imagine some fictitious trials, reflecting a set of previous experiences
• a strategy often used with neural networks, or for building invariance into machine vision.
• e.g., F = 1000 heads, 1000 tails: strong expectation that any new coin will be fair
• In fact, this is a sensible statistical idea...

26
Likelihood and prior
• Likelihood: Bernoulli(q) distribution
• P(D|q) = q^NH (1-q)^NT
• NH = number of heads, NT = number of tails
• Prior: Beta(FH,FT) distribution
• P(q) ∝ q^(FH-1) (1-q)^(FT-1)
• FH = fictitious observations of heads
• FT = fictitious observations of tails

27
Shape of the Beta prior
28
Shape of the Beta prior
(Figure: Beta(FH,FT) densities for FH = 0.5, FT = 0.5; FH = 0.5, FT = 2; FH = 2, FT = 0.5; FH = 2, FT = 2.)
29
Bayesian parameter estimation
P(q|D) ∝ P(D|q) P(q) = q^(NH+FH-1) (1-q)^(NT+FT-1)

• Posterior is Beta(NH+FH, NT+FT)
• same form as the prior!

30
Conjugate priors
• A prior p(q) is conjugate to a likelihood function p(D|q) if the posterior has the same functional form as the prior.
• Parameter values in the prior can be thought of
as a summary of fictitious observations.
• Different parameter values in the prior and
posterior reflect the impact of observed data.
• Conjugate priors exist for many standard models
(e.g., all exponential family models)
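The conjugacy claim for the Beta–Bernoulli pair can be checked numerically. This sketch uses illustrative counts of my own (FH = FT = 3, four heads and six tails): the grid-normalized posterior should match the Beta(NH+FH, NT+FT) density pointwise.

```python
# Numerical check of Beta-Bernoulli conjugacy: updating a Beta(FH, FT)
# prior with NH heads and NT tails gives a Beta(NH+FH, NT+FT) posterior.
import math

def beta_pdf(q, a, b):
    """Beta(a, b) density, via log-gamma for numerical stability."""
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return math.exp(log_norm + (a - 1) * math.log(q) + (b - 1) * math.log(1 - q))

FH, FT = 3.0, 3.0      # fictitious heads/tails (prior) -- illustrative
NH, NT = 4, 6          # observed heads/tails -- illustrative

# Unnormalized posterior on a grid: likelihood q^NH (1-q)^NT times prior.
grid = [i / 1000 for i in range(1, 1000)]
unnorm = [q**NH * (1 - q)**NT * beta_pdf(q, FH, FT) for q in grid]
Z = sum(unnorm) / 1000  # Riemann-sum normalizer

# The normalized grid posterior should match Beta(NH+FH, NT+FT).
for q in (0.25, 0.5, 0.75):
    i = int(q * 1000) - 1
    numeric = unnorm[i] / Z
    exact = beta_pdf(q, NH + FH, NT + FT)
    print(f"q={q}: grid={numeric:.4f}  Beta({NH+FH:.0f},{NT+FT:.0f})={exact:.4f}")
```

The two columns agree to within grid error, confirming that no numerical integration is actually needed once the conjugate form is known.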

31
Bayesian parameter estimation
P(q|D) ∝ P(D|q) P(q) = q^(NH+FH-1) (1-q)^(NT+FT-1)

(Graphical model: FH,FT → q → data D = NH,NT heads/tails in d1 d2 d3 d4 → next flip H.)

• Posterior predictive distribution:

    P(H|D, FH, FT) = ∫ P(H|q) P(q|D, FH, FT) dq

• Bayesian model averaging
32
Bayesian parameter estimation
P(q|D) ∝ P(D|q) P(q) = q^(NH+FH-1) (1-q)^(NT+FT-1)

(Graphical model: FH,FT → q → data D = NH,NT heads/tails in d1 d2 d3 d4 → next flip H.)

• Posterior predictive distribution:

    P(H|D, FH, FT) = (NH + FH) / (NH + FH + NT + FT)
33
Some examples
• e.g., F = 1000 heads, 1000 tails: strong expectation that any new coin will be fair
• After seeing 4 heads, 6 tails, P(H) on next flip = 1004 / (1004+1006) = 49.95%
• e.g., F = 3 heads, 3 tails: weak expectation that any new coin will be fair
• After seeing 4 heads, 6 tails, P(H) on next flip = 7 / (7+9) = 43.75%
• Prior knowledge too weak

34
But flipping thumbtacks...
• e.g., F = 4 heads, 3 tails: weak expectation that tacks are slightly biased towards heads
• After seeing 2 heads, 0 tails, P(H) on next flip = 6 / (6+3) = 67%
• Some prior knowledge is always necessary to avoid jumping to hasty conclusions...
• Suppose F = 0: after seeing 1 head, 0 tails, P(H) on next flip = 1 / (1+0) = 100%
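The predictive rule from slide 32 and the worked examples from the last two slides can be checked directly; the final line is the degenerate F = 0 case:

```python
# Posterior predictive for the Beta-Bernoulli model:
# P(H | D, FH, FT) = (NH + FH) / (NH + FH + NT + FT).
def predict_heads(NH, NT, FH, FT):
    return (NH + FH) / (NH + FH + NT + FT)

# Coins, strong prior: F = 1000 heads, 1000 tails; data = 4 heads, 6 tails.
print(predict_heads(4, 6, 1000, 1000))   # 1004/2010 ≈ 0.4995
# Coins, weak prior: F = 3 heads, 3 tails.
print(predict_heads(4, 6, 3, 3))         # 7/16 = 0.4375
# Thumbtacks: F = 4 heads, 3 tails; data = 2 heads, 0 tails.
print(predict_heads(2, 0, 4, 3))         # 6/9 ≈ 0.67
# No prior at all (F = 0): one head jumps straight to certainty.
print(predict_heads(1, 0, 0, 0))         # 1/1 = 1.0
```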

35
Origin of prior knowledge
• Suppose you have previously seen 2000 coin flips

36
Problems with simple empiricism
• Haven't really seen 2000 coin flips, or any flips of a thumbtack
• Prior knowledge is stronger than raw experience justifies
• Haven't seen an exactly equal number of heads and tails
• Prior knowledge is smoother than raw experience justifies
• Should be a difference between observing 2000 flips of a single coin versus observing 10 flips each for 200 coins, or 1 flip each for 2000 coins
• Prior knowledge is more structured than raw experience

37
A simple theory
• Coins are manufactured by a standardized
procedure that is effective but not perfect, and
symmetric with respect to heads and tails. Tacks
are asymmetric, and manufactured to less exacting
standards.
• Justifies generalizing from previous coins to the present coin.
• Justifies a smoother and stronger prior than raw experience alone.
• Explains why seeing 10 flips each for 200 coins is more valuable than seeing 2000 flips of one coin.

38
A hierarchical Bayesian model
(Figure: hierarchical model — physical knowledge → population-level parameters FH, FT → per-coin biases q1, q2, …, q200, each qi ~ Beta(FH,FT) → flips d1 … d4 for each coin.)
• Qualitative physical knowledge (symmetry) can influence estimates of continuous parameters (FH, FT).
• Explains why 10 flips each for 200 coins are better than 2000 flips of a single coin.
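A generative sketch of this hierarchical model, with illustrative hyperparameters (the slides do not fix FH, FT for this figure):

```python
# Generative sketch of the hierarchical model: shared hyperparameters
# (FH, FT) describe the coin-manufacturing process, each coin draws its
# own bias q_i ~ Beta(FH, FT), and its flips are Bernoulli(q_i).
# The specific numbers are illustrative assumptions, not from the slides.
import random

random.seed(0)

FH, FT = 1000.0, 1000.0      # near-symmetric manufacturing process
n_coins, n_flips = 200, 10

coins = []
for _ in range(n_coins):
    q = random.betavariate(FH, FT)                 # per-coin bias
    flips = ["H" if random.random() < q else "T" for _ in range(n_flips)]
    coins.append((q, flips))

# Seeing many coins constrains (FH, FT) -- how variable biases are --
# far more than many flips of one coin: each coin is one draw from Beta(FH, FT).
biases = [q for q, _ in coins]
print(min(biases), max(biases))   # all close to 0.5 under this strong prior
```

Observing 200 coins gives 200 independent draws from the Beta(FH, FT) population, which is what pins down the hyperparameters; 2000 flips of one coin only pin down that coin's q.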

39
Summary Bayesian parameter estimation
• Learning the parameters of a generative model as
Bayesian inference.
• Prediction by Bayesian model averaging.
• Conjugate priors
• an elegant way to represent simple kinds of prior
knowledge.
• Hierarchical Bayesian models
• integrate knowledge across instances of a system,
or different systems within a domain, to explain
the origins of priors.

40
A hierarchical learning framework
model class
Model selection
model
parameter setting
data
41
Stability versus Flexibility
• Can all domain knowledge be represented with conjugate priors?
• Suppose you flip a coin 25 times and get all heads. Something funny is going on...
• Yet with F = 1000 heads, 1000 tails, P(H) on next flip = 1025 / (1025+1000) = 50.6%. Looks like nothing unusual.
• How do we balance stability and flexibility?
• Stability: 6 heads, 4 tails → q ≈ 0.5
• Flexibility: 25 heads, 0 tails → q ≈ 1

42
Bayesian model selection
P(H) = 0.5  vs.  P(H) = q
• Which provides a better account of the data: the simple hypothesis of a fair coin, or the complex hypothesis that P(H) = q?

43
Comparing simple and complex hypotheses
• P(H) = q is more complex than P(H) = 0.5 in two ways
• P(H) = 0.5 is a special case of P(H) = q
• for any observed sequence D, we can choose q such that D is more probable than if P(H) = 0.5

44
Comparing simple and complex hypotheses
(Figure: probability of each possible sequence under q = 0.5.)
45
Comparing simple and complex hypotheses
(Figure: probability of each possible sequence under q = 1.0 and q = 0.5.)
46
Comparing simple and complex hypotheses
(Figure: probability of D = HHTHT under q = 0.6 and q = 0.5.)
47
Comparing simple and complex hypotheses
• P(H) = q is more complex than P(H) = 0.5 in two ways
• P(H) = 0.5 is a special case of P(H) = q
• for any observed sequence X, we can choose q such that X is more probable than if P(H) = 0.5
• How can we deal with this?
• Some version of Occam's razor?
• Bayes: an automatic version of Occam's razor follows from the law of conservation of belief.

48
Comparing simple and complex hypotheses
• P(h1|D) ∝ P(D|h1) P(h1)
• P(h0|D) ∝ P(D|h0) P(h0)

P(D|h1) = ∫ P(D|q) p(q|h1) dq — the evidence or marginal likelihood: the probability that randomly selected parameters from the prior would generate the data.
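For the coin example both marginal likelihoods have closed forms: 0.5^(NH+NT) for the fair coin, and the Beta function B(NH+1, NT+1) = NH! NT! / (NH+NT+1)! for the unknown-bias model. A sketch (the uniform prior on q is my assumption; the slide does not specify p(q|h1)):

```python
# Marginal likelihood ("evidence") comparison:
# h0 = fair coin (P(H) = 0.5) vs. h1 = unknown bias q with a uniform prior.
# For the uniform prior, the integral of q^NH (1-q)^NT over [0,1] is
# the Beta function B(NH+1, NT+1) = NH! NT! / (NH+NT+1)!.
from math import factorial

def evidence_fair(NH, NT):
    return 0.5 ** (NH + NT)

def evidence_unknown_bias(NH, NT):
    return factorial(NH) * factorial(NT) / factorial(NH + NT + 1)

for NH, NT in [(3, 2), (5, 0), (10, 0)]:
    e0, e1 = evidence_fair(NH, NT), evidence_unknown_bias(NH, NT)
    print(f"{NH}H/{NT}T: fair={e0:.5f}  unknown-bias={e1:.5f}  ratio={e0/e1:.2f}")
```

For HHTHT (3H/2T) the evidence favors the fair coin even though some q fits the data better pointwise; for five or ten straight heads it favors the flexible model. That is Bayesian Occam's razor at work.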
49
(No Transcript)
50
Stability versus Flexibility revisited
• Model class hypothesis: is this coin fair or unfair?
• Example probabilities:
• P(fair) = 0.999, P(unfair) = 0.001
• P(q|fair) is Beta(1000,1000)
• P(q|unfair) is Beta(1,1)
• 25 heads in a row propagates up, affecting q and then P(fair|D)

(Graphical model: fair/unfair? → FH,FT → q → d1 d2 d3 d4.)
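Putting the slide's numbers together, this sketch computes P(fair | 25 heads) by comparing marginal likelihoods under the two model classes:

```python
# Model-class inference for the slide's example: fair or unfair, after
# 25 heads in a row? Marginal likelihoods use the Beta function:
# P(D | class) = B(NH+FH, NT+FT) / B(FH, FT).
from math import lgamma, exp

def log_beta(a, b):
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def log_marginal(NH, NT, FH, FT):
    """log P(D | Beta(FH,FT) prior on q) for NH heads, NT tails."""
    return log_beta(NH + FH, NT + FT) - log_beta(FH, FT)

NH, NT = 25, 0
prior_fair = 0.999                               # P(fair) from the slide
log_e_fair = log_marginal(NH, NT, 1000, 1000)    # P(q|fair) = Beta(1000,1000)
log_e_unfair = log_marginal(NH, NT, 1, 1)        # P(q|unfair) = Beta(1,1)

num = prior_fair * exp(log_e_fair)
den = num + (1 - prior_fair) * exp(log_e_unfair)
print(f"P(fair | 25 heads) = {num / den:.6f}")   # tiny: "unfair" wins
```

Even with a 999:1 prior in favor of fairness, 25 straight heads drives P(fair|D) below one percent: the model-class level supplies the flexibility that the strong conjugate prior on q lacks.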
51
Bayesian Occam's Razor
For any model M, summing over all possible data sets D:

    Σ_D P(D|M) = 1

Law of conservation of belief: a model that can predict many possible data sets must assign each of them low probability.
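The conservation law can be verified by brute-force enumeration over all 2^5 sequences of five flips; a uniform prior on q stands in for the "complex" model, as on the previous slides:

```python
# Conservation of belief, checked by enumeration: over all 2^5 sequences
# of 5 flips, both the fair-coin model and the uniform-prior unknown-bias
# model assign total probability 1 -- the flexible model just spreads its
# mass over more sequences.
from itertools import product
from math import factorial

def p_fair(seq):
    return 0.5 ** len(seq)

def p_unknown_bias(seq):                 # marginal over q ~ Uniform(0,1)
    h = seq.count("H"); t = len(seq) - h
    return factorial(h) * factorial(t) / factorial(h + t + 1)

seqs = ["".join(s) for s in product("HT", repeat=5)]
print(sum(p_fair(s) for s in seqs))          # 1.0
print(sum(p_unknown_bias(s) for s in seqs))  # 1.0
print(max(p_fair(s) for s in seqs))          # every sequence gets 1/32
print(max(p_unknown_bias(s) for s in seqs))  # HHHHH and TTTTT get 1/6
```

Both models spend exactly one unit of belief; the flexible model concentrates more on extreme sequences and therefore has less left over for moderate ones.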
52
Occam's Razor in curve fitting
53
(No Transcript)
54
(Figure: p(D = d | M) for models M1, M2, M3, plotted over possible data sets D; the observed data fall where the intermediate model M2 is highest.)
M1: A model that is too simple is unlikely to generate the data.
M3: A model that is too complex can generate many possible data sets, so it is unlikely to generate this particular data set at random.
55
The blessing of abstraction
• Often easier to learn at higher levels of
abstraction
• Easier to learn that you have a biased coin than
to learn its precise bias, or to learn that you
have a second-order polynomial than to learn the
precise coefficients.
• Easier to learn causal structure than causal
strength.
• Easier to learn that you are hearing two
languages (vs. one), or to learn that language
has a hierarchical phrase structure, than to
learn how any one language works.
• Why? Hypothesis space gets smaller as you go up.
• But the total hypothesis space gets bigger when
we add levels of abstraction (e.g., model
selection).
• Can make better predictions by expanding the
hypothesis space, if we introduce good inductive
biases.

56
Summary
• Three kinds of Bayesian inference
• Comparing two simple hypotheses
• Parameter estimation
• The importance and subtlety of prior knowledge
• Model selection
• Bayesian Occam's razor, the blessing of abstraction
• Key concepts
• Probabilistic generative models
• Hierarchies of abstraction, with statistical
inference at all levels
• Flexibly structured representations