Bayesian models of inductive learning - PowerPoint PPT Presentation

About This Presentation
Title:

Bayesian models of inductive learning

Description:

In-depth examples of basic and advanced models: how the math works & what it buys you. ... Basics of Bayesian inference (Josh), graphical models, causal ... – PowerPoint PPT presentation


Transcript and Presenter's Notes

Title: Bayesian models of inductive learning


1
Bayesian models of inductive learning
Tom Griffiths UC Berkeley
Josh Tenenbaum MIT
Charles Kemp MIT
2
What to expect
  • What you'll get out of this tutorial
  • Our view of what Bayesian models have to offer
    cognitive science.
  • In-depth examples of basic and advanced models:
    how the math works and what it buys you.
  • Some (not extensive) comparison to other
    approaches.
  • Opportunities to ask questions.
  • What you won't get
  • Detailed, hands-on how-to.
  • Where you can learn more
  • http://bayesiancognition.com
  • Trends in Cognitive Sciences, July 2006, special
    issue on Probabilistic Models of Cognition.

3
Outline
  • Morning
  • Introduction: Why Bayes? (Josh)
  • Basics of Bayesian inference (Josh)
  • Graphical models, causal inference and learning
    (Tom)
  • Afternoon
  • Hierarchical Bayesian models, property induction,
    and learning domain structures (Charles)
  • Methods of approximate learning and inference,
    probabilistic models of semantic memory (Tom)

4
Why Bayes?
  • The problem of induction
  • How does the mind form inferences,
    generalizations, models or theories about the
    world from impoverished data?
  • Induction is ubiquitous in cognition
  • Vision (+ audition, touch, or other perceptual
    modalities)
  • Language (understanding, production)
  • Concepts (semantic knowledge, common sense)
  • Causal learning and reasoning
  • Decision-making and action (production,
    understanding)
  • Bayes gives a general framework for explaining
    how induction can work in principle, and perhaps,
    how it does work in the mind.

5
Grammar G
P(S | G)
Phrase structure S
P(U | S)
Utterance U
P(S | U, G) ∝ P(U | S) × P(S | G)
              bottom-up   top-down
6
Universal Grammar
Hierarchical phrase structure grammars (e.g.,
CFG, HPSG, TAG)
Grammar
Phrase structure
Utterance
Speech signal
7
The approach
  • Key concepts
  • Inference in probabilistic generative models
  • Hierarchical probabilistic models, with inference
    at all levels of abstraction
  • Structured knowledge representations: graphs,
    grammars, predicate logic, schemas, theories
  • Flexible structures, with complexity constrained
    by the Bayesian Occam's razor
  • Approximate methods of learning and inference:
    Expectation-Maximization (EM), Markov chain Monte
    Carlo (MCMC)
  • Much recent progress!
  • Computational resources to implement and test
    models that we could dream up but not
    realistically imagine working with
  • New theoretical tools let us develop models that
    we could not clearly conceive of before.

8
Vision as probabilistic parsing
(Han and Zhu, 2006)
9
(No Transcript)
10
Word learning on planet Gazoob
  • Can you pick out the tufas?

11
Learning word meanings
Principles: whole-object principle, shape bias, taxonomic principle, contrast principle, basic-level bias
Structure
Data
12
Causal learning and reasoning
Principles
Structure
Data
13
Goal-directed action (production and
comprehension)
(Wolpert et al., 2003)
14
Marr's Three Levels of Analysis
  • Computation
  • What is the goal of the computation, why is it
    appropriate, and what is the logic of the
    strategy by which it can be carried out?
  • Algorithm
  • Cognitive psychology
  • Implementation
  • Neurobiology

15
Alternative approaches to inductive learning and
inference
  • Associative learning
  • Connectionist networks
  • Similarity to examples
  • Toolkit of simple heuristics
  • Constraint satisfaction
  • Analogical mapping

16
Summary: Why Bayes?
  • A unifying framework for explaining cognition.
  • How people can learn so much from such limited
    data.
  • Strong quantitative models with minimal ad hoc
    assumptions.
  • Why algorithmic-level models work the way they
    do.
  • A framework for understanding how structured
    knowledge and statistical inference interact.
  • How structured knowledge guides statistical
    inference, and may itself be acquired through
    statistical means.
  • What forms knowledge takes, at multiple levels of
    abstraction.
  • What knowledge must be innate, and what can be
    learned.
  • How flexible knowledge structures may grow as
    required by the data, with complexity controlled
    by Occam's razor.

17
Outline
  • Morning
  • Introduction: Why Bayes? (Josh)
  • Basics of Bayesian inference (Josh)
  • Graphical models, causal inference and learning
    (Tom)
  • Afternoon
  • Hierarchical Bayesian models, property induction,
    and learning domain structures (Charles)
  • Methods of approximate learning and inference,
    probabilistic models of semantic memory (Tom)

18
Bayes' rule
For any hypothesis h and data d,
P(h | d) = P(d | h) P(h) / Σ_h' P(d | h') P(h')
where the denominator sums over the space of alternative hypotheses.
19
Bayesian inference
  • Bayes' rule
  • An example
  • Data: John is coughing
  • Some hypotheses:
  • 1. John has a cold
  • 2. John has emphysema
  • 3. John has a stomach flu
  • Prior P(h) favors 1 and 3 over 2
  • Likelihood P(d | h) favors 1 and 2 over 3
  • Posterior P(h | d) favors 1 over 2 and 3 (a short
    numerical sketch follows)
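
The arithmetic of this example can be made concrete with a short Python sketch. The specific prior and likelihood values below are invented for illustration (the slide only fixes their ordering), and the hypothesis names are shortened.

    # Bayes' rule over three hypotheses for the datum "John is coughing".
    # The numbers are hypothetical; only their ordering matches the slide.
    prior      = {"cold": 0.50, "emphysema": 0.01, "flu": 0.49}   # favors cold and flu
    likelihood = {"cold": 0.80, "emphysema": 0.90, "flu": 0.10}   # P(d | h) favors cold and emphysema

    unnormalized = {h: likelihood[h] * prior[h] for h in prior}
    evidence = sum(unnormalized.values())          # sum over the space of alternative hypotheses
    posterior = {h: p / evidence for h, p in unnormalized.items()}
    print(posterior)                               # "cold" wins: high prior and high likelihood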

20
Coin flipping
  • Basic Bayes
  • data = HHTHT or HHHHH
  • compare two simple hypotheses
  • P(H) = 0.5 vs. P(H) = 1.0
  • Parameter estimation (Model fitting)
  • compare many hypotheses in a parameterized family
  • P(H) = θ. Infer θ.
  • Model selection
  • compare qualitatively different hypotheses, often
    varying in complexity
  • P(H) = 0.5 vs. P(H) = θ

21
Coin flipping
HHTHT
HHHHH
What process produced these sequences?
22
Comparing two simple hypotheses
  • Contrast simple hypotheses
  • h1 = fair coin, P(H) = 0.5
  • h2 = always heads, P(H) = 1.0
  • Bayes' rule
  • With two hypotheses, use the odds form

23
Comparing two simple hypotheses
  • D = HHTHT
  • H1, H2 = fair coin, always heads
  • P(D | H1) = 1/2^5     P(H1) = ?
  • P(D | H2) = 0         P(H2) = 1 - ?

24
Comparing two simple hypotheses
  • D = HHTHT
  • H1, H2 = fair coin, always heads
  • P(D | H1) = 1/2^5     P(H1) = 999/1000
  • P(D | H2) = 0         P(H2) = 1/1000

25
Comparing two simple hypotheses
  • D = HHHHH
  • H1, H2 = fair coin, always heads
  • P(D | H1) = 1/2^5     P(H1) = 999/1000
  • P(D | H2) = 1         P(H2) = 1/1000

26
Comparing two simple hypotheses
  • D = HHHHHHHHHH
  • H1, H2 = fair coin, always heads
  • P(D | H1) = 1/2^10    P(H1) = 999/1000
  • P(D | H2) = 1         P(H2) = 1/1000
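
A minimal Python sketch of this odds-form comparison, using the priors from these slides (999/1000 for the fair coin, 1/1000 for the always-heads coin):

    # Posterior odds in favor of h1 (fair coin) over h2 (always heads).
    def posterior_odds(sequence, p_h1=999/1000, p_h2=1/1000):
        n = len(sequence)
        lik_h1 = 0.5 ** n                                   # P(D | fair coin)
        lik_h2 = 1.0 if set(sequence) == {"H"} else 0.0     # P(D | always heads)
        if lik_h2 == 0.0:
            return float("inf")                             # a single tail rules out h2
        return (lik_h1 * p_h1) / (lik_h2 * p_h2)

    for seq in ["HHTHT", "HHHHH", "H" * 10]:
        print(seq, posterior_odds(seq))
    # HHTHT      -> inf   (always-heads is ruled out entirely)
    # HHHHH      -> ~31   (the fair coin is still favored, thanks to its prior)
    # HHHHHHHHHH -> ~1    (ten heads roughly cancel the prior; the odds are near even)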

27
The role of intuitive theories
  • The fact that HHTHT looks representative of a
    fair coin, and HHHHH does not, reflects our
    implicit theories of how the world works.
  • Easy to imagine how a trick all-heads coin could
    work: high prior probability.
  • Hard to imagine how a trick HHTHT coin could
    work: low prior probability.

28
Coin flipping
  • Basic Bayes
  • data = HHTHT or HHHHH
  • compare two hypotheses
  • P(H) = 0.5 vs. P(H) = 1.0
  • Parameter estimation (Model fitting)
  • compare many hypotheses in a parameterized family
  • P(H) = θ. Infer θ.
  • Model selection
  • compare qualitatively different hypotheses, often
    varying in complexity
  • P(H) = 0.5 vs. P(H) = θ

29
Parameter estimation
  • Assume data are generated from a parameterized
    model
  • What is the value of θ?
  • each value of θ is a hypothesis H
  • requires inference over infinitely many hypotheses

[Graphical model: θ generates observations d1, d2, d3, d4, with P(H) = θ]
30
Model selection
  • Assume a hypothesis space of possible models
  • Which model generated the data?
  • requires summing out hidden variables
  • requires some form of Occam's razor to trade off
    complexity against fit to the data.

[Three candidate models, each generating flips d1 ... d4:
 a fair coin with P(H) = 0.5; a coin with unknown bias, P(H) = θ;
 and a hidden Markov model whose hidden state si ∈ {Fair coin, Trick coin}]
31
Parameter estimation vs. Model selection across
learning and development
  • Causality: learning the strength of a relation
    vs. learning the existence and form of a relation
  • Language acquisition: learning a speaker's
    accent, or frequencies of different words vs.
    learning a new tense or syntactic rule (or
    learning a new language, or the existence of
    different languages)
  • Concepts: learning what horses look like vs.
    learning that there is a new species (or learning
    that there are species)
  • Intuitive physics: learning the mass of an object
    vs. learning about gravity or angular momentum
  • Intuitive psychology: learning a person's beliefs
    or goals vs. learning that there can be false
    beliefs, or that visual access is valuable for
    establishing true beliefs

32
A hierarchical learning framework
model
parameters
data
33
A hierarchical learning framework
model class
model
parameters
data
34
Bayesian parameter estimation
  • Assume data are generated from a model
  • What is the value of θ?
  • each value of θ is a hypothesis H
  • requires inference over infinitely many hypotheses

[Graphical model: θ generates observations d1, d2, d3, d4, with P(H) = θ]
35
Some intuitions
  • D = 10 flips, with 5 heads and 5 tails.
  • θ = P(H) on next flip? 50%
  • Why? 50% = 5 / (5+5) = 5/10.
  • Why? The future will be like the past
  • Suppose we had seen 4 heads and 6 tails.
  • P(H) on next flip? Closer to 50% than to 40%.
  • Why? Prior knowledge.

36
Integrating prior knowledge and data
  • Posterior distribution P(θ | D) is a probability
    density over θ = P(H)
  • Need to work out likelihood P(D | θ) and specify
    prior distribution P(θ)

37
Likelihood and prior
  • Likelihood: Bernoulli distribution
  • P(D | θ) = θ^NH (1-θ)^NT
  • NH = number of heads
  • NT = number of tails
  • Prior
  • P(θ) = ?

38
Some intuitions
  • D = 10 flips, with 5 heads and 5 tails.
  • θ = P(H) on next flip? 50%
  • Why? 50% = 5 / (5+5) = 5/10.
  • Why? Maximum likelihood: θ = NH / (NH + NT)
    maximizes P(D | θ)
  • Suppose we had seen 4 heads and 6 tails.
  • P(H) on next flip? Closer to 50% than to 40%.
  • Why? Prior knowledge.

39
A simple method of specifying priors
  • Imagine some fictitious trials, reflecting a set
    of previous experiences
  • a strategy often used with neural networks, or for
    building invariance into machine vision.
  • e.g., F = {1000 heads, 1000 tails}: strong
    expectation that any new coin will be fair
  • In fact, this is a sensible statistical idea...

40
Likelihood and prior
  • Likelihood: Bernoulli(θ) distribution
  • P(D | θ) = θ^NH (1-θ)^NT
  • NH = number of heads
  • NT = number of tails
  • Prior: Beta(FH, FT) distribution
  • P(θ) ∝ θ^(FH-1) (1-θ)^(FT-1)
  • FH = fictitious observations of heads
  • FT = fictitious observations of tails

41
Shape of the Beta prior
42
Shape of the Beta prior
[Plots of the Beta(FH, FT) prior for (FH, FT) = (0.5, 0.5), (0.5, 2), (2, 0.5), and (2, 2)]
43
Bayesian parameter estimation
P(θ | D) ∝ P(D | θ) P(θ) ∝ θ^(NH+FH-1) (1-θ)^(NT+FT-1)
  • Posterior is Beta(NH+FH, NT+FT)
  • same form as prior!
  • expected P(H) = (NH+FH) / (NH+FH+NT+FT)
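
A minimal Python sketch of this conjugate update; the calls reproduce the worked examples on the next two slides.

    # Posterior is Beta(NH + FH, NT + FT), so the expected P(H) on the next flip is
    # (NH + FH) / (NH + FH + NT + FT).
    def expected_p_heads(NH, NT, FH, FT):
        return (NH + FH) / (NH + FH + NT + FT)

    print(expected_p_heads(4, 6, 1000, 1000))   # 0.4995  (F = 1000 heads, 1000 tails)
    print(expected_p_heads(4, 6, 3, 3))         # 0.4375  (F = 3 heads, 3 tails)
    print(expected_p_heads(2, 0, 4, 3))         # 0.667   (thumbtack, F = 4 heads, 3 tails)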

44
Conjugate priors
  • A prior p(θ) is conjugate to a likelihood
    function p(D | θ) if the posterior has the same
    functional form as the prior.
  • Parameter values in the prior can be thought of
    as a summary of fictitious observations.
  • Different parameter values in the prior and
    posterior reflect the impact of observed data.
  • Conjugate priors exist for many standard models
    (e.g., all exponential family models)

45
Some examples
  • e.g., F = {1000 heads, 1000 tails}: strong
    expectation that any new coin will be fair
  • After seeing 4 heads, 6 tails, P(H) on next flip
    = 1004 / (1004+1006) = 49.95%
  • e.g., F = {3 heads, 3 tails}: weak expectation
    that any new coin will be fair
  • After seeing 4 heads, 6 tails, P(H) on next flip
    = 7 / (7+9) = 43.75%
  • Prior knowledge too weak

46
But flipping thumbtacks ...
  • e.g., F = {4 heads, 3 tails}: weak expectation
    that tacks are slightly biased towards heads
  • After seeing 2 heads, 0 tails, P(H) on next flip
    = 6 / (6+3) = 67%
  • Some prior knowledge is always necessary to avoid
    jumping to hasty conclusions...
  • Suppose F = {} (no fictitious observations). After
    seeing 1 head, 0 tails, P(H) on next flip
    = 1 / (1+0) = 100%

47
Origin of prior knowledge
  • Tempting answer: prior experience
  • Suppose you have previously seen 2000 coin flips:
    1000 heads, 1000 tails

48
Problems with simple empiricism
  • Haven't really seen 2000 coin flips, or any flips
    of a thumbtack
  • Prior knowledge is stronger than raw experience
    justifies
  • Haven't seen exactly equal numbers of heads and
    tails
  • Prior knowledge is smoother than raw experience
    justifies
  • Should be a difference between observing 2000
    flips of a single coin versus observing 10 flips
    each for 200 coins, or 1 flip each for 2000 coins
  • Prior knowledge is more structured than raw
    experience

49
A simple theory
  • Coins are manufactured by a standardized
    procedure that is effective but not perfect, and
    symmetric with respect to heads and tails. Tacks
    are asymmetric, and manufactured to less exacting
    standards.
  • Justifies generalizing from previous coins to the
    present coin.
  • Justifies smoother and stronger prior than raw
    experience alone.
  • Explains why seeing 10 flips each for 200 coins
    is more valuable than seeing 2000 flips of one
    coin.

50
A hierarchical Bayesian model
[Hierarchical model: physical knowledge about coins → (FH, FT), with each coin's
 bias θ ~ Beta(FH, FT) → per-coin parameters θ1, θ2, ..., θ200 for Coin 1,
 Coin 2, ..., Coin 200 → flip data d1 d2 d3 d4 for each coin]
  • Qualitative physical knowledge (symmetry) can
    influence estimates of continuous parameters (FH,
    FT).
  • Explains why 10 flips of 200 coins are better
    than 2000 flips of a single coin: more
    informative about FH, FT.

51
Stability versus Flexibility
  • Can all domain knowledge be represented with
    conjugate priors?
  • Suppose you flip a coin 25 times and get all
    heads. Something funny is going on ...
  • But with F = {1000 heads, 1000 tails}, P(heads) on
    next flip = 1025 / (1025+1000) = 50.6%. Looks
    like nothing unusual.
  • How do we balance stability and flexibility?
  • Stability: 6 heads, 4 tails → θ ≈ 0.5
  • Flexibility: 25 heads, 0 tails → θ ≈ 1

52
A hierarchical Bayesian model
fair/unfair?
  • Higher-order hypothesis: is this coin fair or
    unfair?
  • Example probabilities
  • P(fair) = 0.99
  • P(θ | fair) is Beta(1000,1000)
  • P(θ | unfair) is Beta(1,1)
  • 25 heads in a row propagates up, affecting θ and
    then P(fair | D) (a short numerical sketch follows)

[Graphical model: fair/unfair? → (FH, FT) → θ → d1 d2 d3 d4]
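
A sketch of this two-level model in Python (using scipy), with the probabilities given on the slide. It relies on the standard result that the marginal likelihood of a specific sequence with NH heads and NT tails under a Beta(a, b) prior on θ is B(a + NH, b + NT) / B(a, b).

    import numpy as np
    from scipy.special import betaln

    def log_marginal(NH, NT, a, b):
        # log P(sequence | theta ~ Beta(a, b)) = log B(a + NH, b + NT) - log B(a, b)
        return betaln(a + NH, b + NT) - betaln(a, b)

    def p_fair_given_data(NH, NT, p_fair=0.99):
        log_fair   = np.log(p_fair)     + log_marginal(NH, NT, 1000, 1000)   # theta | fair
        log_unfair = np.log(1 - p_fair) + log_marginal(NH, NT, 1, 1)         # theta | unfair
        return 1.0 / (1.0 + np.exp(log_unfair - log_fair))

    print(p_fair_given_data(6, 4))    # ~0.996: still looks like a fair coin
    print(p_fair_given_data(25, 0))   # ~1e-4: 25 heads in a row flips the verdict to "unfair"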
53
Summary: Bayesian parameter estimation
  • Learning the parameters of a generative model as
    Bayesian inference.
  • Conjugate priors
  • an elegant way to represent simple kinds of prior
    knowledge.
  • Hierarchical Bayesian models
  • integrate knowledge across instances of a system,
    or different systems within a domain.
  • can represent richer, more abstract knowledge

54
Some questions
  • Learning isn't just about parameter estimation
  • How do we learn the functional form of a
    variable's distribution?
  • How do we learn model structure, or theories with
    the expressiveness of predicate logic?
  • Can we grow levels of abstraction?

55
Coin flipping
  • Basic Bayes
  • data = HHTHT or HHHHH
  • compare two hypotheses
  • P(H) = 0.5 vs. P(H) = 1.0
  • Parameter estimation
  • compare many hypotheses in a parameterized family
  • P(H) = θ. Infer θ.
  • Model selection
  • compare qualitatively different hypotheses, often
    varying in complexity
  • P(H) = 0.5 vs. P(H) = θ

56
A hierarchical learning framework
model class
Model selection
model
parameters
data
57
Bayesian model selection
  • Which provides a better account of the data: the
    simple hypothesis of a fair coin (P(H) = 0.5), or
    the complex hypothesis that P(H) = θ?

58
Comparing simple and complex hypotheses
  • P(H) = θ is more complex than P(H) = 0.5 in two
    ways:
  • P(H) = 0.5 is a special case of P(H) = θ
  • for any observed sequence D, we can choose θ such
    that D is more probable than if P(H) = 0.5

59
Comparing simple and complex hypotheses
[Plot: probability of each possible sequence under θ = 0.5]
60
Comparing simple and complex hypotheses
[Plot: probability of each possible sequence under θ = 0.5 and θ = 1.0]
61
Comparing simple and complex hypotheses
[Plot: probability of D = HHTHT under θ = 0.5 and θ = 0.6]
62
Comparing simple and complex hypotheses
  • P(H) = θ is more complex than P(H) = 0.5 in two
    ways:
  • P(H) = 0.5 is a special case of P(H) = θ
  • for any observed sequence X, we can choose θ such
    that X is more probable than if P(H) = 0.5
  • How can we deal with this?
  • Some version of Occam's razor?
  • Bayes: an automatic version of Occam's razor
    follows from the law of conservation of belief.

63
Comparing simple and complex hypotheses
  • P(h1 | D) / P(h0 | D) = [ P(D | h1) / P(D | h0) ] × [ P(h1) / P(h0) ]
  • P(D | h1) is the evidence or marginal likelihood: the
    probability that randomly selected parameters from the
    prior would generate the data.
64
(No Transcript)
65
Bayesian Occam's Razor
For any model M, the probabilities it assigns to all
possible data sets must sum to one: Σ_D P(D | M) = 1.
Law of conservation of belief: a model that can
predict many possible data sets must assign each
of them low probability. (A short numerical sketch follows.)
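
A minimal sketch for the coin example. Under the simple model h0 the probability of any specific length-n sequence is 0.5^n; under the flexible model h1, with a uniform prior on θ assumed here for illustration, the marginal likelihood of a sequence with NH heads and NT tails is the Beta function B(NH + 1, NT + 1).

    from scipy.special import beta as beta_fn

    def evidence_ratio(NH, NT):
        p_d_h0 = 0.5 ** (NH + NT)           # simple model: P(H) = 0.5
        p_d_h1 = beta_fn(NH + 1, NT + 1)    # flexible model: theta ~ Uniform(0, 1)
        return p_d_h0 / p_d_h1

    print(evidence_ratio(3, 2))   # HHTHT: ~1.9, the simple model is favored
    print(evidence_ratio(5, 0))   # HHHHH: ~0.19, the flexible model is favored

The flexible model spreads its one unit of belief over every possible data set, so it only wins when the data land where it concentrates probability (e.g., near all heads).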
66
Ockham's Razor in curve fitting
67
(No Transcript)
68
[Plot: the marginal likelihood p(D = d | M) of each possible data set D under
three models M1, M2, M3 of increasing complexity, with the observed data marked]
M1: a model that is too simple is unlikely to generate the data.
M3: a model that is too complex can generate many possible data sets,
so it is unlikely to generate this particular data set at random.
69
[Same plot for three curve-fitting models, assuming Gaussian parameter
priors and Gaussian likelihoods (noise)]
70
For the best-fitting version of each model:

Model   Prior                      Likelihood
M1      high                       low
M2      medium                     high
M3      very very very very low    very high
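
The same tradeoff can be shown numerically for curve fitting. The sketch below computes the log marginal likelihood (evidence) of Bayesian polynomial regression under the Gaussian assumptions named on these slides; the synthetic data set, the polynomial degrees compared, and the hyperparameters alpha and sigma are invented for illustration.

    import numpy as np
    from scipy.stats import multivariate_normal

    rng = np.random.default_rng(0)
    x = np.linspace(-1, 1, 20)
    y = 0.5 * x + 1.0 * x**2 + rng.normal(0, 0.1, size=x.shape)   # roughly quadratic data

    def log_evidence(degree, alpha=1.0, sigma=0.1):
        # With weights w ~ N(0, I / alpha) and noise N(0, sigma^2), the data are
        # marginally Gaussian: y ~ N(0, sigma^2 I + Phi Phi^T / alpha).
        Phi = np.vander(x, degree + 1, increasing=True)
        cov = sigma**2 * np.eye(len(x)) + Phi @ Phi.T / alpha
        return multivariate_normal(mean=np.zeros(len(x)), cov=cov).logpdf(y)

    for d in (1, 2, 9):
        print(d, log_evidence(d))
    # The evidence typically peaks at an intermediate degree: the degree-1 model
    # cannot fit the curvature, while the degree-9 model spreads its predictions
    # over far more data sets than it needs to.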
71
(assuming Gaussian noise, and Gaussian priors on
parameters)
(Ghahramani)
72
(Ghahramani)
73
Hierarchical Bayesian learning with flexibly
structured models
  • Learning context-free grammars for natural
    language (Stolcke & Omohundro; Griffiths & Johnson;
    Perfors et al.).
  • Learning complex concepts.

fruit
fruit ← or( and( color > 0.2, color < 0.4,
                 size > 1, size < 4 ),
            and( color > 0.5, color < 0.65,
                 size > 2, size < 7 ),
          )
Navarro (2006): nonparametric model
Goodman et al. (2006): probabilistic
context-free grammar for rule-based concepts
74
The blessing of abstraction
  • Often easier to learn at higher levels of
    abstraction
  • Easier to learn that you have a biased coin than
    to learn its bias.
  • Easier to learn causal structure than causal
    strength.
  • Easier to learn that you are hearing two
    languages (vs. one), or to learn that language
    has a hierarchical phrase structure, than to
    learn how any one language works.
  • Why? Hypothesis space gets smaller as you go up.
  • But the total hypothesis space gets bigger when
    we add levels of abstraction (e.g., model
    selection).
  • Can make better (more confident, more accurate)
    predictions by increasing the size of the
    hypothesis space, if we introduce good inductive
    biases.

75
Summary
  • Three kinds of Bayesian inference
  • Comparing two simple hypotheses
  • Parameter estimation
  • The importance and subtlety of prior knowledge
  • Model selection
  • Bayesian Occam's razor, the blessing of
    abstraction
  • Key concepts
  • Probabilistic generative models
  • Hierarchies of abstraction, with statistical
    inference at all levels
  • Flexibly structured representations