Basic Bayes: model fitting, model selection, model averaging
Josh Tenenbaum, MIT

Transcript and Presenter's Notes
1
Basic Bayes: model fitting, model selection,
model averaging
Josh Tenenbaum, MIT
2
Bayes' rule
For any hypothesis h and data d,

P(h|d) = P(d|h) P(h) / Σ_h′ P(d|h′) P(h′)

where the sum in the denominator runs over the space of alternative hypotheses.
3
Bayesian inference
  • Bayes' rule
  • An example (worked numerically in the sketch below)
  • Data: John is coughing
  • Some hypotheses:
  • John has a cold
  • John has lung cancer
  • John has a stomach flu
  • Prior P(h) favors 1 and 3 over 2
  • Likelihood P(d|h) favors 1 and 2 over 3
  • Posterior P(h|d) favors 1 over 2 and 3
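
A minimal Python sketch of this example; the prior and likelihood numbers below are illustrative assumptions, not values from the slides.

```python
# A worked version of the coughing example. The prior and likelihood
# numbers are illustrative assumptions, not values from the slides.
hypotheses = ["cold", "lung cancer", "stomach flu"]
prior      = {"cold": 0.50, "lung cancer": 0.01, "stomach flu": 0.49}
likelihood = {"cold": 0.80, "lung cancer": 0.90, "stomach flu": 0.10}  # P(d|h)

# Bayes' rule: P(h|d) = P(d|h) P(h) / sum over h' of P(d|h') P(h')
evidence = sum(likelihood[h] * prior[h] for h in hypotheses)
posterior = {h: likelihood[h] * prior[h] / evidence for h in hypotheses}

for h in hypotheses:
    print(f"P({h} | coughing) = {posterior[h]:.3f}")
# The cold hypothesis wins: it is favored by the prior AND the likelihood.
```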

4
Plan for this lecture
  • Some basic aspects of Bayesian statistics
  • Model fitting
  • Model averaging
  • Model selection
  • A case study in Bayesian cognitive modeling
  • The number game

5
Coin flipping
  • Comparing two hypotheses
  • data = HHTHT or HHHHH
  • compare two simple hypotheses
  • P(H) = 0.5 vs. P(H) = 1.0
  • Parameter estimation (Model fitting)
  • compare many hypotheses in a parameterized family
  • P(H) = θ; infer θ
  • Model selection
  • compare qualitatively different hypotheses, often
    varying in complexity
  • P(H) = 0.5 vs. P(H) = θ

6
Coin flipping
HHTHT
HHHHH
What process produced these sequences?
7
Comparing two simple hypotheses
  • Contrast simple hypotheses:
  • h1: fair coin, P(H) = 0.5
  • h2: always heads, P(H) = 1.0
  • Bayes' rule, with two hypotheses, in odds form:

P(h1|D) / P(h2|D) = [P(D|h1) / P(D|h2)] × [P(h1) / P(h2)]

8
Comparing two simple hypotheses
  • D = HHTHT
  • H1, H2: fair coin, always heads
  • P(D|H1) = 1/2^5        P(H1) = ?
  • P(D|H2) = 0            P(H2) = 1 − ?

9
Comparing two simple hypotheses
  • D = HHTHT
  • H1, H2: fair coin, always heads
  • P(D|H1) = 1/2^5        P(H1) = 999/1000
  • P(D|H2) = 0            P(H2) = 1/1000

10
Comparing two simple hypotheses
  • D = HHHHH
  • H1, H2: fair coin, always heads
  • P(D|H1) = 1/2^5        P(H1) = 999/1000
  • P(D|H2) = 1            P(H2) = 1/1000

11
Comparing two simple hypotheses
  • D = HHHHHHHHHH
  • H1, H2: fair coin, always heads
  • P(D|H1) = 1/2^10       P(H1) = 999/1000
  • P(D|H2) = 1            P(H2) = 1/1000
  • (The sketch below computes the posterior odds for all three data sets.)
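
A small sketch of the posterior-odds computation from the last few slides; only the priors 999/1000 and 1/1000 come from the slides.

```python
# Posterior odds P(H1|D) / P(H2|D) for H1 = fair coin, H2 = always heads,
# with the priors from the slides: P(H1) = 999/1000, P(H2) = 1/1000.
def posterior_odds(data: str) -> float:
    n = len(data)
    lik_fair = 0.5 ** n                           # P(D|H1)
    lik_always = 1.0 if data == "H" * n else 0.0  # P(D|H2)
    prior_odds = (999 / 1000) / (1 / 1000)
    if lik_always == 0.0:
        return float("inf")   # a single tail rules out "always heads"
    return (lik_fair / lik_always) * prior_odds

for d in ["HHTHT", "HHHHH", "HHHHHHHHHH"]:
    print(d, posterior_odds(d))
# HHTHT      -> inf   (all posterior mass on the fair coin)
# HHHHH      -> ~31.2 (still favors the fair coin)
# HHHHHHHHHH -> ~0.98 (ten heads now roughly cancel the 999:1 prior)
```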

12
The role of priors
  • The fact that HHTHT looks representative of a
    fair coin, and HHHHH does not, depends entirely
    on our hypothesis space and prior probabilities.
  • Should we be worried about that, or happy?

13
The role of intuitive theories
  • The fact that HHTHT looks representative of a
    fair coin, and HHHHH does not, reflects our
    implicit theories of how the world works.
  • Easy to imagine how a trick all-heads coin could
    work: high prior probability.
  • Hard to imagine how a trick HHTHT coin could
    work: low prior probability.

14
Coin flipping
  • Basic Bayes
  • data = HHTHT or HHHHH
  • compare two hypotheses
  • P(H) = 0.5 vs. P(H) = 1.0
  • Parameter estimation (Model fitting)
  • compare many hypotheses in a parameterized family
  • P(H) = θ; infer θ
  • Model selection
  • compare qualitatively different hypotheses, often
    varying in complexity
  • P(H) = 0.5 vs. P(H) = θ

15
Parameter estimation
  • Assume data are generated from a parameterized
    model
  • What is the value of q ?
  • each value of q is a hypothesis H
  • requires inference over infinitely many hypotheses

q
d1 d2 d3 d4
P(H) q
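One standard way to handle the infinitely many hypotheses is a grid approximation; a minimal sketch, assuming a uniform prior over θ (an illustrative choice, not specified on this slide).

```python
import numpy as np

# Grid approximation: discretize theta = P(H) and apply Bayes' rule
# pointwise. The uniform prior is an illustrative assumption.
theta = np.linspace(0, 1, 1001)          # grid of hypotheses
prior = np.ones_like(theta)              # uniform prior
N_H, N_T = 3, 2                          # e.g., D = HHTHT
likelihood = theta**N_H * (1 - theta)**N_T

posterior = prior * likelihood
posterior /= np.trapz(posterior, theta)  # normalize to a density

print("posterior mode:", theta[np.argmax(posterior)])  # 0.6 = 3/5
```
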
16
Model selection
  • Assume a hypothesis space of possible models
  • Which model generated the data?
  • requires summing out hidden variables
  • requires some form of Occam's razor to trade off
    complexity with fit to the data.

[Graphical models for three candidate processes, each generating observations d1, d2, d3, d4: a fair coin with P(H) = 0.5; a coin of unknown bias with P(H) = θ; and a hidden Markov model whose hidden state s_i switches between a fair coin and a trick coin.]
17
Parameter estimation vs. Model selection across
learning and development
  • Causality: learning the strength of a relation
    vs. learning the existence and form of a relation
  • Language acquisition: learning a speaker's
    accent, or frequencies of different words, vs.
    learning a new tense or syntactic rule (or
    learning a new language, or the existence of
    different languages)
  • Concepts: learning what horses look like vs.
    learning that there is a new species (or learning
    that there are species)
  • Intuitive physics: learning the mass of an object
    vs. learning about gravity or angular momentum

18
A hierarchical learning framework
model
parameter setting
data
19
A hierarchical learning framework
model class
model
parameter setting
data
20
Bayesian parameter estimation
  • Assume data are generated from a model: P(H) = θ
  • What is the value of θ?
  • each value of θ is a hypothesis H
  • requires inference over infinitely many hypotheses

[Graphical model: parameter θ generating observations d1, d2, d3, d4, with P(H) = θ.]
21
Some intuitions
  • D = 10 flips, with 5 heads and 5 tails.
  • θ = P(H) on next flip? 50%
  • Why? 50% = 5 / (5+5) = 5/10.
  • Why? The future will be like the past.
  • Suppose we had seen 4 heads and 6 tails.
  • P(H) on next flip? Closer to 50% than to 40%.
  • Why? Prior knowledge.

22
Integrating prior knowledge and data
  • Posterior distribution P(θ|D) is a probability
    density over θ = P(H)
  • Need to specify the likelihood P(D|θ) and prior
    distribution P(θ).

23
Likelihood and prior
  • Likelihood: Bernoulli distribution

    P(D|θ) = θ^N_H (1−θ)^N_T

  • N_H: number of heads
  • N_T: number of tails
  • Prior
  • P(θ) = ?
24
Some intuitions
  • D = 10 flips, with 5 heads and 5 tails.
  • θ = P(H) on next flip? 50%
  • Why? 50% = 5 / (5+5) = 5/10.
  • Why? Maximum likelihood.
  • Suppose we had seen 4 heads and 6 tails.
  • P(H) on next flip? Closer to 50% than to 40%.
  • Why? Prior knowledge.

25
A simple method of specifying priors
  • Imagine some fictitious trials, reflecting a set
    of previous experiences
  • a strategy often used with neural networks, or
    for building invariance into machine vision.
  • e.g., F = 1000 heads, 1000 tails: strong
    expectation that any new coin will be fair
  • In fact, this is a sensible statistical idea...

26
Likelihood and prior
  • Likelihood: Bernoulli(θ) distribution

    P(D|θ) = θ^N_H (1−θ)^N_T

  • N_H: number of heads
  • N_T: number of tails
  • Prior: Beta(F_H, F_T) distribution

    P(θ) ∝ θ^(F_H−1) (1−θ)^(F_T−1)

  • F_H: fictitious observations of heads
  • F_T: fictitious observations of tails
27
Shape of the Beta prior
28
Shape of the Beta prior
[Plots of the Beta prior for four parameter settings:
F_H = 0.5, F_T = 0.5;  F_H = 0.5, F_T = 2;
F_H = 2, F_T = 0.5;  F_H = 2, F_T = 2.]
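A quick way to see the four shapes without plotting is to evaluate each density at a few points; a sketch assuming SciPy's beta distribution, parameterized directly by (F_H, F_T).

```python
from scipy.stats import beta

# Evaluate each Beta(F_H, F_T) prior at a few values of theta to see
# its qualitative shape (SciPy's beta takes the two shape parameters).
for F_H, F_T in [(0.5, 0.5), (0.5, 2), (2, 0.5), (2, 2)]:
    densities = [beta.pdf(t, F_H, F_T) for t in (0.1, 0.5, 0.9)]
    print(f"Beta({F_H}, {F_T}):", [round(d, 2) for d in densities])
# (0.5, 0.5): U-shaped;    (0.5, 2): mass near theta = 0;
# (2, 0.5):  mass near 1;  (2, 2):   peaked at theta = 0.5.
```
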
29
Bayesian parameter estimation
P(θ|D) ∝ P(D|θ) P(θ) = θ^(N_H+F_H−1) (1−θ)^(N_T+F_T−1)

  • Posterior is Beta(N_H+F_H, N_T+F_T)
  • same form as prior!
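
The conjugate update is just addition of counts; a minimal sketch:

```python
# Conjugate Beta-Bernoulli update: fictitious counts (F_H, F_T) and
# observed counts (N_H, N_T) simply add.
def beta_posterior(F_H, F_T, N_H, N_T):
    """Parameters of the Beta posterior after N_H heads, N_T tails."""
    return (N_H + F_H, N_T + F_T)

print(beta_posterior(F_H=1000, F_T=1000, N_H=4, N_T=6))  # (1004, 1006)
```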

30
Conjugate priors
  • A prior p(θ) is conjugate to a likelihood
    function p(D|θ) if the posterior has the same
    functional form as the prior.
  • Parameter values in the prior can be thought of
    as a summary of fictitious observations.
  • Different parameter values in the prior and
    posterior reflect the impact of observed data.
  • Conjugate priors exist for many standard models
    (e.g., all exponential-family models)

31
Bayesian parameter estimation
P(θ|D) ∝ P(D|θ) P(θ) = θ^(N_H+F_H−1) (1−θ)^(N_T+F_T−1)

[Graphical model: hyperparameters F_H, F_T at the top, θ below, observed data D = (N_H, N_T) as flips d1, d2, d3, d4, and a new flip H to be predicted.]

  • Posterior predictive distribution:

P(H|D, F_H, F_T) = ∫ P(H|θ) P(θ|D, F_H, F_T) dθ

This averaging over all values of θ is Bayesian model averaging.
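The integral can be checked numerically against the closed form given on the next slide; a sketch using a grid over θ, with illustrative counts F = 3 heads, 3 tails and D = 4 heads, 6 tails.

```python
import numpy as np

# Numerical check of the model-averaging integral on a grid over theta,
# with illustrative counts F_H = F_T = 3 and D = 4 heads, 6 tails.
theta = np.linspace(0, 1, 100001)
F_H, F_T, N_H, N_T = 3, 3, 4, 6

post = theta**(N_H + F_H - 1) * (1 - theta)**(N_T + F_T - 1)
post /= np.trapz(post, theta)            # posterior density, Beta(7, 9)

p_heads = np.trapz(theta * post, theta)  # integral of P(H|theta) P(theta|D)
print(p_heads)                           # ~0.4375 = (4+3) / (4+3+6+3)
```
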
32
Bayesian parameter estimation
P(θ|D) ∝ P(D|θ) P(θ) = θ^(N_H+F_H−1) (1−θ)^(N_T+F_T−1)

[Same graphical model as on the previous slide.]

  • Posterior predictive distribution:

P(H|D, F_H, F_T) = (N_H + F_H) / (N_H + F_H + N_T + F_T)
33
Some examples
  • e.g., F = 1000 heads, 1000 tails: strong
    expectation that any new coin will be fair
  • After seeing 4 heads, 6 tails, P(H) on next flip
    = 1004 / (1004+1006) = 49.95%
  • e.g., F = 3 heads, 3 tails: weak expectation
    that any new coin will be fair
  • After seeing 4 heads, 6 tails, P(H) on next flip
    = 7 / (7+9) = 43.75%
  • Prior knowledge too weak.

34
But flipping thumbtacks...
  • e.g., F = 4 heads, 3 tails: weak expectation
    that tacks are slightly biased towards heads
  • After seeing 2 heads, 0 tails, P(H) on next flip
    = 6 / (6+3) = 67%
  • Some prior knowledge is always necessary to avoid
    jumping to hasty conclusions...
  • Suppose F = 0 heads, 0 tails. After seeing 1 head,
    0 tails, P(H) on next flip = 1 / (1+0) = 100%
  • (These examples are checked in the sketch below.)
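
A minimal check of the worked numbers from the last two slides, using the closed-form posterior predictive:

```python
# Closed-form posterior predictive from slide 32:
# P(H | D, F_H, F_T) = (N_H + F_H) / (N_H + F_H + N_T + F_T).
def predict_heads(F_H, F_T, N_H, N_T):
    return (N_H + F_H) / (N_H + F_H + N_T + F_T)

print(predict_heads(1000, 1000, 4, 6))  # 0.4995 (strong fair-coin prior)
print(predict_heads(3, 3, 4, 6))        # 0.4375 (weak fair-coin prior)
print(predict_heads(4, 3, 2, 0))        # 0.667  (thumbtack prior)
print(predict_heads(0, 0, 1, 0))        # 1.0    (no prior: hasty conclusion)
```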

35
Origin of prior knowledge
  • Tempting answer: prior experience
  • Suppose you have previously seen 2000 coin flips:
    1000 heads, 1000 tails

36
Problems with simple empiricism
  • Haven't really seen 2000 coin flips, or any flips
    of a thumbtack
  • Prior knowledge is stronger than raw experience
    justifies
  • Haven't seen exactly equal numbers of heads and
    tails
  • Prior knowledge is smoother than raw experience
    justifies
  • Should be a difference between observing 2000
    flips of a single coin versus observing 10 flips
    each for 200 coins, or 1 flip each for 2000 coins
  • Prior knowledge is more structured than raw
    experience

37
A simple theory
  • Coins are manufactured by a standardized
    procedure that is effective but not perfect, and
    symmetric with respect to heads and tails. Tacks
    are asymmetric, and manufactured to less exacting
    standards.
  • Justifies generalizing from previous coins to the
    present coin.
  • Justifies smoother and stronger prior than raw
    experience alone.
  • Explains why seeing 10 flips each for 200 coins
    is more valuable than seeing 2000 flips of one
    coin.

38
A hierarchical Bayesian model
[Hierarchical graphical model: physical knowledge at the top determines hyperparameters F_H, F_T; each coin i = 1, ..., 200 has its own bias θ_i ~ Beta(F_H, F_T); each θ_i generates that coin's flips d1, d2, d3, d4.]

  • Qualitative physical knowledge (symmetry) can
    influence estimates of continuous parameters (F_H,
    F_T).
  • Explains why 10 flips of 200 coins are better
    than 2000 flips of a single coin: more
    informative about F_H, F_T. (A generative sketch
    follows.)
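
A generative sketch of this hierarchy; the value F_H = F_T = 50 is an illustrative stand-in for near-symmetric but imperfect manufacturing, not a number from the slides.

```python
import numpy as np

# Generative sketch of the hierarchy: shared (F_H, F_T) at the top,
# a per-coin bias theta_i ~ Beta(F_H, F_T), and flips from each theta_i.
rng = np.random.default_rng(0)
F_H, F_T = 50, 50                      # illustrative manufacturing prior
thetas = rng.beta(F_H, F_T, size=200)  # one bias per coin
heads = rng.binomial(n=10, p=thetas)   # 10 flips per coin

# 200 coins x 10 flips constrain (F_H, F_T) far better than 2000 flips
# of one coin, which pin down only a single theta.
print("per-coin heads counts range:", heads.min(), "-", heads.max())
```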

39
Summary Bayesian parameter estimation
  • Learning the parameters of a generative model as
    Bayesian inference.
  • Prediction by Bayesian model averaging.
  • Conjugate priors
  • an elegant way to represent simple kinds of prior
    knowledge.
  • Hierarchical Bayesian models
  • integrate knowledge across instances of a system,
    or different systems within a domain, to explain
    the origins of priors.

40
A hierarchical learning framework
model class
Model selection
model
parameter setting
data
41
Stability versus Flexibility
  • Can all domain knowledge be represented with
    conjugate priors?
  • Suppose you flip a coin 25 times and get all
    heads. Something funny is going on...
  • But with F = 1000 heads, 1000 tails, P(heads) on
    next flip = 1025 / (1025+1000) = 50.6%. Looks
    like nothing unusual.
  • How do we balance stability and flexibility?
  • Stability: 6 heads, 4 tails → θ ≈ 0.5
  • Flexibility: 25 heads, 0 tails → θ ≈ 1

42
Bayesian model selection
[Two candidate models side by side: a fair coin, P(H) = 0.5, vs. a coin of unknown bias, P(H) = θ.]

  • Which provides a better account of the data: the
    simple hypothesis of a fair coin, or the complex
    hypothesis that P(H) = θ?

43
Comparing simple and complex hypotheses
  • P(H) = θ is more complex than P(H) = 0.5 in two
    ways:
  • P(H) = 0.5 is a special case of P(H) = θ
  • for any observed sequence D, we can choose θ such
    that D is more probable than if P(H) = 0.5

44
Comparing simple and complex hypotheses
[Plot: probability of each possible sequence under the fair-coin model, θ = 0.5.]
45
Comparing simple and complex hypotheses
[Plot: probability of each possible sequence under θ = 0.5 and under θ = 1.0.]
46
Comparing simple and complex hypotheses
[Plot: probability of D = HHTHT under θ = 0.5 and under θ = 0.6, the maximum-likelihood value for this sequence.]
47
Comparing simple and complex hypotheses
  • P(H) = θ is more complex than P(H) = 0.5 in two
    ways:
  • P(H) = 0.5 is a special case of P(H) = θ
  • for any observed sequence X, we can choose θ such
    that X is more probable than if P(H) = 0.5
  • How can we deal with this?
  • Some version of Occam's razor?
  • Bayes: an automatic version of Occam's razor
    follows from the law of conservation of belief.

48
Comparing simple and complex hypotheses
  • P(h1|D) ∝ P(D|h1) P(h1)
  • P(h0|D) ∝ P(D|h0) P(h0)

P(D|h1) = ∫ P(D|θ) P(θ|h1) dθ

The evidence or marginal likelihood: the probability
that randomly selected parameters from the prior
would generate the data. (A numerical comparison
follows.)
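A sketch comparing the two marginal likelihoods for the sequences from earlier slides; the uniform Beta(1,1) prior on θ for the complex hypothesis is an assumption (it matches the "unfair" prior used two slides below).

```python
from scipy.special import beta as B  # the Beta function

# Marginal likelihood ("evidence") of a sequence with N_H heads, N_T tails:
#   simple h0 (fair coin):            P(D|h0) = (1/2)^(N_H + N_T)
#   complex h1 (theta ~ Beta(1,1)):   P(D|h1) = B(N_H + 1, N_T + 1),
# i.e. the integral of theta^N_H (1 - theta)^N_T over the uniform prior.
def evidence_fair(N_H, N_T):
    return 0.5 ** (N_H + N_T)

def evidence_complex(N_H, N_T):
    return B(N_H + 1, N_T + 1)

for N_H, N_T in [(3, 2), (5, 0)]:    # HHTHT and HHHHH
    print((N_H, N_T), evidence_fair(N_H, N_T), evidence_complex(N_H, N_T))
# HHTHT: 1/32 vs 1/60 -> the simple model wins.
# HHHHH: 1/32 vs 1/6  -> the complex model wins.
```
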
49
(No Transcript)
50
Stability versus Flexibility revisited
[Graphical model: a fair/unfair? variable at the top selects the prior over F_H, F_T; θ below generates the flips d1, d2, d3, d4.]

  • Model class hypothesis: is this coin fair or
    unfair?
  • Example probabilities:
  • P(fair) = 0.999
  • P(θ|fair) is Beta(1000,1000)
  • P(θ|unfair) is Beta(1,1)
  • 25 heads in a row propagates up, affecting θ and
    then P(fair|D):

P(fair|25 heads) / P(unfair|25 heads)
  = [P(25 heads|fair) / P(25 heads|unfair)] × [P(fair) / P(unfair)]
  ≈ 0.001
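A numerical check of this computation; the marginal likelihood of n straight heads under a Beta(a, b) prior on θ is B(a+n, b) / B(a, b), computed in log space for stability.

```python
import numpy as np
from scipy.special import betaln  # log of the Beta function

# Marginal likelihood of n straight heads under a Beta(a, b) prior on
# theta is B(a + n, b) / B(a, b).
n = 25
log_fair = betaln(1000 + n, 1000) - betaln(1000, 1000)  # Beta(1000,1000)
log_unfair = betaln(1 + n, 1) - betaln(1, 1)            # Beta(1,1)

prior_odds = 0.999 / 0.001
posterior_odds = np.exp(log_fair - log_unfair) * prior_odds
print(posterior_odds)  # ~0.001: 25 straight heads overwhelm the 999:1 prior
```
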
51
Bayesian Occam's Razor

For any model M, Σ_d P(D = d | M) = 1, where the sum
ranges over all possible data sets d.

Law of conservation of belief: a model that can
predict many possible data sets must assign each
of them low probability.
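The conservation law can be verified directly for sequences of 5 flips, for both the fair model and the uniform-prior model from the earlier sketch:

```python
from itertools import product
from scipy.special import beta as B

# Check conservation of belief over all 2^5 sequences of 5 flips:
# each model's P(D|M) must sum to 1 over the full space of data sets.
total_fair = total_complex = 0.0
for seq in product("HT", repeat=5):
    N_H = seq.count("H")
    total_fair += 0.5 ** 5                    # fair coin
    total_complex += B(N_H + 1, 5 - N_H + 1)  # theta ~ Beta(1,1) marginal

print(total_fair, total_complex)  # both 1.0 (up to floating-point error)
```
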
52
Occam's Razor in curve fitting
53
(No Transcript)
54
[Plot: the evidence p(D = d | M) for three models M1, M2, M3 across all possible data sets D, with the observed data set marked.]

M1: A model that is too simple is unlikely to
generate the data. M3: A model that is too complex
can generate many possible data sets, so it is
unlikely to generate this particular data set at
random.
55
The blessing of abstraction
  • Often easier to learn at higher levels of
    abstraction
  • Easier to learn that you have a biased coin than
    to learn its precise bias, or to learn that you
    have a second-order polynomial than to learn the
    precise coefficients.
  • Easier to learn causal structure than causal
    strength.
  • Easier to learn that you are hearing two
    languages (vs. one), or to learn that language
    has a hierarchical phrase structure, than to
    learn how any one language works.
  • Why? Hypothesis space gets smaller as you go up.
  • But the total hypothesis space gets bigger when
    we add levels of abstraction (e.g., model
    selection).
  • Can make better predictions by expanding the
    hypothesis space, if we introduce good inductive
    biases.

56
Summary
  • Three kinds of Bayesian inference
  • Comparing two simple hypotheses
  • Parameter estimation
  • The importance and subtlety of prior knowledge
  • Model selection
  • Bayesian Occam's razor, the blessing of
    abstraction
  • Key concepts
  • Probabilistic generative models
  • Hierarchies of abstraction, with statistical
    inference at all levels
  • Flexibly structured representations