Title: Bayesian models of inductive learning
1Bayesian models of inductive learning
Tom Griffiths, UC Berkeley
Josh Tenenbaum, MIT
Charles Kemp, MIT
2What to expect
- What you'll get out of this tutorial
- Our view of what Bayesian models have to offer cognitive science.
- In-depth examples of basic and advanced models: how the math works and what it buys you.
- Some (not extensive) comparison to other approaches.
- Opportunities to ask questions.
- What you won't get
- Detailed, hands-on how-to.
- Where you can learn more
- http://bayesiancognition.com
- Trends in Cognitive Sciences, July 2006, special issue on Probabilistic Models of Cognition.
3Outline
- Morning
- Introduction: Why Bayes? (Josh)
- Basics of Bayesian inference (Josh)
- Graphical models, causal inference and learning (Tom)
- Afternoon
- Hierarchical Bayesian models, property induction, and learning domain structures (Charles)
- Methods of approximate learning and inference, probabilistic models of semantic memory (Tom)
4Why Bayes?
- The problem of induction
- How does the mind form inferences, generalizations, models or theories about the world from impoverished data?
- Induction is ubiquitous in cognition
- Vision (+ audition, touch, or other perceptual modalities)
- Language (understanding, production)
- Concepts (semantic knowledge, common sense)
- Causal learning and reasoning
- Decision-making and action (production, understanding)
- Bayes gives a general framework for explaining how induction can work in principle, and perhaps, how it does work in the mind.
5Grammar G
$P(S \mid G)$
Phrase structure S
$P(U \mid S)$
Utterance U
$P(S \mid U, G) \propto P(U \mid S) \times P(S \mid G)$
Bottom-up: $P(U \mid S)$. Top-down: $P(S \mid G)$.
6Universal Grammar
Hierarchical phrase structure grammars (e.g.,
CFG, HPSG, TAG)
Grammar
Phrase structure
Utterance
Speech signal
7The approach
- Key concepts
- Inference in probabilistic generative models
- Hierarchical probabilistic models, with inference at all levels of abstraction
- Structured knowledge representations: graphs, grammars, predicate logic, schemas, theories
- Flexible structures, with complexity constrained by Bayesian Occam's razor
- Approximate methods of learning and inference: Expectation-Maximization (EM), Markov chain Monte Carlo (MCMC)
- Much recent progress!
- Computational resources to implement and test models that we could dream up but not realistically imagine working with
- New theoretical tools let us develop models that we could not clearly conceive of before.
8Vision as probabilistic parsing
(Han and Zhu, 2006)
10Word learning on planet Gazoob
- Can you pick out the tufas?
11Learning word meanings
Whole-object principle, shape bias, taxonomic principle, contrast principle, basic-level bias
Principles
Structure
Data
12Causal learning and reasoning
Principles
Structure
Data
13Goal-directed action (production and
comprehension)
(Wolpert et al., 2003)
14Marr's Three Levels of Analysis
- Computation
- What is the goal of the computation, why is it appropriate, and what is the logic of the strategy by which it can be carried out?
- Algorithm
- Cognitive psychology
- Implementation
- Neurobiology
15Alternative approaches to inductive learning and
inference
- Associative learning
- Connectionist networks
- Similarity to examples
- Toolkit of simple heuristics
- Constraint satisfaction
- Analogical mapping
16Summary: Why Bayes?
- A unifying framework for explaining cognition.
- How people can learn so much from such limited data.
- Strong quantitative models with minimal ad hoc assumptions.
- Why algorithmic-level models work the way they do.
- A framework for understanding how structured knowledge and statistical inference interact.
- How structured knowledge guides statistical inference, and may itself be acquired through statistical means.
- What forms knowledge takes, at multiple levels of abstraction.
- What knowledge must be innate, and what can be learned.
- How flexible knowledge structures may grow as required by the data, with complexity controlled by Occam's razor.
17Outline
- Morning
- Introduction: Why Bayes? (Josh)
- Basics of Bayesian inference (Josh)
- Graphical models, causal inference and learning (Tom)
- Afternoon
- Hierarchical Bayesian models, property induction, and learning domain structures (Charles)
- Methods of approximate learning and inference, probabilistic models of semantic memory (Tom)
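18Bayes' rule
For any hypothesis $h$ and data $d$,
$$P(h \mid d) = \frac{P(d \mid h)\, P(h)}{\sum_{h' \in H} P(d \mid h')\, P(h')}$$
The denominator is a sum over the space $H$ of alternative hypotheses.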
19Bayesian inference
- Bayes' rule: $P(h \mid d) \propto P(d \mid h)\, P(h)$
- An example
- Data: John is coughing
- Some hypotheses:
- (1) John has a cold
- (2) John has emphysema
- (3) John has a stomach flu
- Prior $P(h)$ favors 1 and 3 over 2
- Likelihood $P(d \mid h)$ favors 1 and 2 over 3
- Posterior $P(h \mid d)$ favors 1 over 2 and 3
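A minimal sketch of the coughing example in Python. The slide gives only the qualitative ordering, so every number below is a made-up assumption chosen to respect that ordering, and the function name `posterior` is hypothetical.

```python
# Hypothetical numbers: the slide only says the prior favors 1 and 3 over 2,
# and the likelihood favors 1 and 2 over 3.
prior = {"cold": 0.45, "emphysema": 0.10, "stomach flu": 0.45}
likelihood = {"cold": 0.80, "emphysema": 0.80, "stomach flu": 0.10}  # P(coughing | h)


def posterior(prior, likelihood):
    """Apply Bayes' rule: multiply prior by likelihood, then normalize."""
    unnormalized = {h: prior[h] * likelihood[h] for h in prior}
    evidence = sum(unnormalized.values())  # sum over alternative hypotheses
    return {h: value / evidence for h, value in unnormalized.items()}


post = posterior(prior, likelihood)
for h, p in sorted(post.items(), key=lambda item: -item[1]):
    print(f"P({h} | coughing) = {p:.3f}")
```

With these assumed numbers the cold hypothesis ends up with most of the posterior mass, because it is the only hypothesis favored by both the prior and the likelihood.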
20Coin flipping
- Basic Bayes
- data = HHTHT or HHHHH
- compare two simple hypotheses:
- $P(H) = 0.5$ vs. $P(H) = 1.0$
- Parameter estimation (Model fitting)
- compare many hypotheses in a parameterized family
- $P(H) = q$: Infer $q$
- Model selection
- compare qualitatively different hypotheses, often varying in complexity:
- $P(H) = 0.5$ vs. $P(H) = q$
21Coin flipping
HHTHT
HHHHH
What process produced these sequences?
22Comparing two simple hypotheses
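- Contrast simple hypotheses:
- $h_1$: "fair coin", $P(H) = 0.5$
- $h_2$: "always heads", $P(H) = 1.0$
- Bayes' rule: $P(h \mid d) = \frac{P(d \mid h)\, P(h)}{\sum_{h'} P(d \mid h')\, P(h')}$
- With two hypotheses, use odds form: $\frac{P(h_1 \mid d)}{P(h_2 \mid d)} = \frac{P(d \mid h_1)}{P(d \mid h_2)} \times \frac{P(h_1)}{P(h_2)}$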
23Comparing two simple hypotheses
- $D$: HHTHT
- $H_1$, $H_2$: "fair coin", "always heads"
- $P(D \mid H_1) = 1/2^5$, $P(H_1) = ?$
- $P(D \mid H_2) = 0$, $P(H_2) = 1 - ?$
24Comparing two simple hypotheses
- $D$: HHTHT
- $H_1$, $H_2$: "fair coin", "always heads"
- $P(D \mid H_1) = 1/2^5$, $P(H_1) = 999/1000$
- $P(D \mid H_2) = 0$, $P(H_2) = 1/1000$
25Comparing two simple hypotheses
- $D$: HHHHH
- $H_1$, $H_2$: "fair coin", "always heads"
- $P(D \mid H_1) = 1/2^5$, $P(H_1) = 999/1000$
- $P(D \mid H_2) = 1$, $P(H_2) = 1/1000$
26Comparing two simple hypotheses
- $D$: HHHHHHHHHH
- $H_1$, $H_2$: "fair coin", "always heads"
- $P(D \mid H_1) = 1/2^{10}$, $P(H_1) = 999/1000$
- $P(D \mid H_2) = 1$, $P(H_2) = 1/1000$
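The three comparisons above can be worked through numerically. This is a sketch in Python using the likelihoods and the 999/1000 prior from the slides. The helper names `likelihood` and `posterior_odds` are hypothetical, and returning `None` for infinite odds is just a convention chosen here.

```python
from fractions import Fraction


def likelihood(seq, p_heads):
    """Probability of an exact flip sequence given P(H) = p_heads."""
    result = Fraction(1)
    for flip in seq:
        result *= p_heads if flip == "H" else 1 - p_heads
    return result


def posterior_odds(seq, prior_h1=Fraction(999, 1000)):
    """Posterior odds P(H1 | D) / P(H2 | D) for fair coin vs. always heads.

    Returns None when the data rule out H2 entirely (infinite odds for H1).
    """
    like_h1 = likelihood(seq, Fraction(1, 2))
    like_h2 = likelihood(seq, Fraction(1))
    if like_h2 == 0:
        return None
    return (like_h1 / like_h2) * (prior_h1 / (1 - prior_h1))


for seq in ["HHTHT", "HHHHH", "HHHHHHHHHH"]:
    odds = posterior_odds(seq)
    label = "infinite (H2 ruled out)" if odds is None else f"{float(odds):.3f}"
    print(f"{seq}: posterior odds for fair coin = {label}")
```

Five heads in a row still leave the fair coin favored by roughly 31 to 1, while ten heads bring the odds to just under 1, so the strong prior is only overturned once the data become sufficiently surprising.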
27The role of intuitive theories
- The fact that HHTHT looks representative of a fair coin and HHHHH does not, reflects our implicit theories of how the world works.
- Easy to imagine how a trick all-heads coin could work: high prior probability.
- Hard to imagine how a trick "HHTHT" coin could work: low prior probability.
28Coin flipping
- Basic Bayes
- data = HHTHT or HHHHH
- compare two hypotheses:
- $P(H) = 0.5$ vs. $P(H) = 1.0$
- Parameter estimation (Model fitting)
- compare many hypotheses in a parameterized family
- $P(H) = q$: Infer $q$
- Model selection
- compare qualitatively different hypotheses, often varying in complexity:
- $P(H) = 0.5$ vs. $P(H) = q$
29Parameter estimation
- Assume data are generated from a parameterized model
- What is the value of $q$?
- each value of $q$ is a hypothesis $H$
- requires inference over infinitely many hypotheses
[Figure: graphical model in which the parameter $q$ generates the flips $d_1, d_2, d_3, d_4$, with $P(H) = q$]
30Model selection
- Assume hypothesis space of possible models
- Which model generated the data?
- requires summing out hidden variables
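- requires some form of Occam's razor to trade off complexity with fit to the data.
[Figure: three candidate graphical models over the flips $d_1, d_2, d_3, d_4$. Fair coin: $P(H) = 0.5$. Parameterized coin: $P(H) = q$. Hidden Markov model: $s_i \in \{\text{Fair coin}, \text{Trick coin}\}$]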
31Parameter estimation vs. Model selection across
learning and development
- Causality: learning the strength of a relation vs. learning the existence and form of a relation
- Language acquisition: learning a speaker's accent, or frequencies of different words vs. learning a new tense or syntactic rule (or learning a new language, or the existence of different languages)
- Concepts: learning what horses look like vs. learning that there is a new species (or learning that there are species)
- Intuitive physics: learning the mass of an object vs. learning about gravity or angular momentum
- Intuitive psychology: learning a person's beliefs or goals vs. learning that there can be false beliefs, or that visual access is valuable for establishing true beliefs
32A hierarchical learning framework
model
parameters
data
33A hierarchical learning framework
model class
model
parameters
data
34Bayesian parameter estimation
- Assume data are generated from a model
- What is the value of $q$?
- each value of $q$ is a hypothesis $H$
- requires inference over infinitely many hypotheses
[Figure: graphical model in which the parameter $q$ generates the flips $d_1, d_2, d_3, d_4$, with $P(H) = q$]
35Some intuitions
- $D$ = 10 flips, with 5 heads and 5 tails.
- $q = P(H)$ on next flip? 50%
- Why? 50% = 5 / (5+5) = 5/10.
- Why? "The future will be like the past"
- Suppose we had seen 4 heads and 6 tails.
- $P(H)$ on next flip? Closer to 50% than to 40%.
- Why? Prior knowledge.
36Integrating prior knowledge and data
- Posterior distribution $P(q \mid D)$ is a probability density over $q = P(H)$
- Need to work out likelihood $P(D \mid q)$ and specify prior distribution $P(q)$
37Likelihood and prior
- Likelihood: Bernoulli distribution
- $P(D \mid q) = q^{N_H} (1-q)^{N_T}$
- $N_H$: number of heads
- $N_T$: number of tails
- Prior:
- $P(q) = ?$
38Some intuitions
- $D$ = 10 flips, with 5 heads and 5 tails.
- $q = P(H)$ on next flip? 50%
- Why? 50% = 5 / (5+5) = 5/10.
- Why? Maximum likelihood
- Suppose we had seen 4 heads and 6 tails.
- $P(H)$ on next flip? Closer to 50% than to 40%.
- Why? Prior knowledge.
39A simple method of specifying priors
- Imagine some fictitious trials, reflecting a set of previous experiences
- strategy often used with neural networks or building invariance into machine vision.
- e.g., $F$ = {1000 heads, 1000 tails}: strong expectation that any new coin will be fair
- In fact, this is a sensible statistical idea...
40Likelihood and prior
- Likelihood: Bernoulli($q$) distribution
- $P(D \mid q) = q^{N_H} (1-q)^{N_T}$
- $N_H$: number of heads
- $N_T$: number of tails
- Prior: Beta($F_H$, $F_T$) distribution
- $P(q) \propto q^{F_H - 1} (1-q)^{F_T - 1}$
- $F_H$: fictitious observations of heads
- $F_T$: fictitious observations of tails
41Shape of the Beta prior
42Shape of the Beta prior
[Figure: Beta density panels for $F_H = 0.5, F_T = 0.5$; $F_H = 0.5, F_T = 2$; $F_H = 2, F_T = 0.5$; $F_H = 2, F_T = 2$]
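To get a feel for the four shapes, the Beta density can be evaluated on a small grid. This is a sketch in Python using only the standard library. `beta_pdf` is a hypothetical helper name and the grid points are arbitrary choices.

```python
import math


def beta_pdf(q, f_h, f_t):
    """Density of Beta(f_h, f_t) at q, for 0 < q < 1."""
    log_norm = math.lgamma(f_h + f_t) - math.lgamma(f_h) - math.lgamma(f_t)
    log_kernel = (f_h - 1) * math.log(q) + (f_t - 1) * math.log(1 - q)
    return math.exp(log_norm + log_kernel)


grid = [0.1, 0.3, 0.5, 0.7, 0.9]  # arbitrary evaluation points
for f_h, f_t in [(0.5, 0.5), (0.5, 2), (2, 0.5), (2, 2)]:
    densities = ", ".join(f"{beta_pdf(q, f_h, f_t):.2f}" for q in grid)
    print(f"Beta({f_h}, {f_t}) at q = {grid}: {densities}")
```

Values of $F_H$ and $F_T$ below 1 push mass toward the endpoints (a U shape when both are 0.5), while values above 1 pull mass toward the interior, so Beta(2, 2) peaks at $q = 0.5$.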
43Bayesian parameter estimation
$P(q \mid D) \propto P(D \mid q)\, P(q) \propto q^{N_H + F_H - 1} (1-q)^{N_T + F_T - 1}$
- Posterior is Beta($N_H + F_H$, $N_T + F_T$)
- same form as prior!
- expected $P(H) = (N_H + F_H) / (N_H + F_H + N_T + F_T)$
44Conjugate priors
- A prior $p(q)$ is conjugate to a likelihood function $p(D \mid q)$ if the posterior has the same functional form as the prior.
- Parameter values in the prior can be thought of as a summary of "fictitious observations".
- Different parameter values in the prior and posterior reflect the impact of observed data.
- Conjugate priors exist for many standard models (e.g., all exponential family models)
45Some examples
- e.g., $F$ = {1000 heads, 1000 tails}: strong expectation that any new coin will be fair
- After seeing 4 heads, 6 tails, $P(H)$ on next flip = 1004 / (1004+1006) = 49.95%
- e.g., $F$ = {3 heads, 3 tails}: weak expectation that any new coin will be fair
- After seeing 4 heads, 6 tails, $P(H)$ on next flip = 7 / (7+9) = 43.75%
- Prior knowledge too weak
46But flipping thumbtacks
- e.g., $F$ = {4 heads, 3 tails}: weak expectation that tacks are slightly biased towards heads
- After seeing 2 heads, 0 tails, $P(H)$ on next flip = 6 / (6+3) = 67%
- Some prior knowledge is always necessary to avoid jumping to hasty conclusions...
- Suppose $F$ is empty (no fictitious observations): After seeing 1 heads, 0 tails, $P(H)$ on next flip = 1 / (1+0) = 100%
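The arithmetic in the coin and thumbtack examples follows one formula, the posterior mean of a Beta distribution after adding real counts to fictitious counts. A short Python sketch, where `predictive_heads` and the case labels are hypothetical names introduced for illustration:

```python
from fractions import Fraction


def predictive_heads(n_h, n_t, f_h, f_t):
    """P(H) on the next flip: (N_H + F_H) / (N_H + F_H + N_T + F_T)."""
    return Fraction(n_h + f_h, n_h + f_h + n_t + f_t)


# (label, observed heads, observed tails, fictitious heads, fictitious tails)
cases = [
    ("strong fair-coin prior", 4, 6, 1000, 1000),
    ("weak fair-coin prior", 4, 6, 3, 3),
    ("thumbtack prior", 2, 0, 4, 3),
    ("no fictitious observations", 1, 0, 0, 0),  # degenerate limit, not a proper Beta
]
for label, n_h, n_t, f_h, f_t in cases:
    p = predictive_heads(n_h, n_t, f_h, f_t)
    print(f"{label}: P(H) on next flip = {float(p):.4f}")
```

The last case shows why some prior is needed: with no fictitious observations, a single head yields a prediction of 100% heads.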
47Origin of prior knowledge
- Tempting answer: prior experience
- Suppose you have previously seen 2000 coin flips: 1000 heads, 1000 tails
48Problems with simple empiricism
- Haven't really seen 2000 coin flips, or any flips of a thumbtack
- Prior knowledge is stronger than raw experience justifies
- Haven't seen exactly equal number of heads and tails
- Prior knowledge is smoother than raw experience justifies
- Should be a difference between observing 2000 flips of a single coin versus observing 10 flips each for 200 coins, or 1 flip each for 2000 coins
- Prior knowledge is more structured than raw experience
49A simple theory
- Coins are manufactured by a standardized procedure that is effective but not perfect, and symmetric with respect to heads and tails. Tacks are asymmetric, and manufactured to less exacting standards.
- Justifies generalizing from previous coins to the present coin.
- Justifies smoother and stronger prior than raw experience alone.
- Explains why seeing 10 flips each for 200 coins is more valuable than seeing 2000 flips of one coin.
50A hierarchical Bayesian model
[Figure: hierarchical graphical model. Physical knowledge about coins sets $F_H, F_T$. Each of Coin 1, Coin 2, ..., Coin 200 has its own parameter $q_i \sim \text{Beta}(F_H, F_T)$, which generates that coin's flips $d_1, d_2, d_3, d_4$]
- Qualitative physical knowledge (symmetry) can influence estimates of continuous parameters ($F_H$, $F_T$).
- Explains why 10 flips of 200 coins are better than 2000 flips of a single coin: more informative about $F_H$, $F_T$.
51Stability versus Flexibility
- Can all domain knowledge be represented with conjugate priors?
- Suppose you flip a coin 25 times and get all heads. Something funny is going on...
- But with $F$ = {1000 heads, 1000 tails}, $P(\text{heads})$ on next flip = 1025 / (1025+1000) = 50.6%. Looks like nothing unusual.
- How do we balance stability and flexibility?
- Stability: 6 heads, 4 tails, $q \approx 0.5$
- Flexibility: 25 heads, 0 tails, $q \approx 1$
52A hierarchical Bayesian model
- Higher-order hypothesis: is this coin fair or unfair?
- Example probabilities:
- $P(\text{fair}) = 0.99$
- $P(q \mid \text{fair})$ is Beta(1000, 1000)
- $P(q \mid \text{unfair})$ is Beta(1, 1)
- 25 heads in a row propagates up, affecting $q$ and then $P(\text{fair} \mid D)$
[Figure: graphical model with the chain fair/unfair? → $F_H, F_T$ → $q$ → $d_1, d_2, d_3, d_4$]
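A sketch of how the data "propagate up" in this model, using the example probabilities from the slide. It relies on the standard Beta-Bernoulli marginal likelihood $P(D \mid F_H, F_T) = B(N_H + F_H, N_T + F_T) / B(F_H, F_T)$ for a specific flip sequence. The function names are hypothetical.

```python
import math


def log_beta(a, b):
    """Log of the Beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def log_marginal(n_h, n_t, f_h, f_t):
    """log P(D | Beta(f_h, f_t) prior) for one specific sequence of flips."""
    return log_beta(n_h + f_h, n_t + f_t) - log_beta(f_h, f_t)


def posterior_fair(n_h, n_t, p_fair=0.99):
    """P(fair | D) with q ~ Beta(1000, 1000) if fair and Beta(1, 1) if unfair."""
    joint_fair = p_fair * math.exp(log_marginal(n_h, n_t, 1000, 1000))
    joint_unfair = (1 - p_fair) * math.exp(log_marginal(n_h, n_t, 1, 1))
    return joint_fair / (joint_fair + joint_unfair)


print(f"P(fair | 25 heads, 0 tails) = {posterior_fair(25, 0):.5f}")
print(f"P(fair | 6 heads, 4 tails)  = {posterior_fair(6, 4):.5f}")
```

Under these assumptions, 25 heads in a row drive $P(\text{fair} \mid D)$ close to zero despite the 0.99 prior, while 6 heads and 4 tails leave it close to one. This is the stability and flexibility balance that a single conjugate prior could not provide.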
53Summary: Bayesian parameter estimation
- Learning the parameters of a generative model as Bayesian inference.
- Conjugate priors
- an elegant way to represent simple kinds of prior knowledge.
- Hierarchical Bayesian models
- integrate knowledge across instances of a system, or different systems within a domain.
- can represent richer, more abstract knowledge
54Some questions
- Learning isn't just about parameter estimation
- How do we learn the functional form of a variable's distribution?
- How do we learn model structure, or theories with the expressiveness of predicate logic?
- Can we grow levels of abstraction?
55Coin flipping
- Basic Bayes
- data = HHTHT or HHHHH
- compare two hypotheses:
- $P(H) = 0.5$ vs. $P(H) = 1.0$
- Parameter estimation
- compare many hypotheses in a parameterized family
- $P(H) = q$: Infer $q$
- Model selection
- compare qualitatively different hypotheses, often varying in complexity:
- $P(H) = 0.5$ vs. $P(H) = q$
56A hierarchical learning framework
model class
Model selection
model
parameters
data
57Bayesian model selection
[Figure: the fair-coin model vs. the $P(H) = q$ model]
- Which provides a better account of the data: the simple hypothesis of a fair coin, or the complex hypothesis that $P(H) = q$?
58Comparing simple and complex hypotheses
- $P(H) = q$ is more complex than $P(H) = 0.5$ in two ways:
- $P(H) = 0.5$ is a special case of $P(H) = q$
- for any observed sequence $D$, we can choose $q$ such that $D$ is at least as probable as under $P(H) = 0.5$
59Comparing simple and complex hypotheses
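[Figure: probability of the data, shown for $q = 0.5$]
60Comparing simple and complex hypotheses
[Figure: probability of the data, shown for $q = 0.5$ and $q = 1.0$]
61Comparing simple and complex hypotheses
[Figure: probability of the data, shown for $q = 0.5$ and $q = 0.6$, with $D$ = HHTHT]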
62Comparing simple and complex hypotheses
- $P(H) = q$ is more complex than $P(H) = 0.5$ in two ways:
- $P(H) = 0.5$ is a special case of $P(H) = q$
- for any observed sequence $X$, we can choose $q$ such that $X$ is at least as probable as under $P(H) = 0.5$
- How can we deal with this?
- Some version of Occam's razor?
- Bayes: automatic version of Occam's razor follows from the "law of conservation of belief".
63Comparing simple and complex hypotheses
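$$\frac{P(h_1 \mid D)}{P(h_0 \mid D)} = \frac{P(D \mid h_1)}{P(D \mid h_0)} \times \frac{P(h_1)}{P(h_0)}$$
The "evidence" or "marginal likelihood" $P(D \mid h_1) = \int_0^1 P(D \mid q)\, p(q \mid h_1)\, dq$: the probability that randomly selected parameters from the prior would generate the data.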
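A sketch of this comparison in Python. The slides do not state the prior on $q$ under $h_1$, so a uniform prior on $[0, 1]$ is assumed here. In that case the integral has the closed form $N_H!\, N_T! / (N_H + N_T + 1)!$. The function names are hypothetical.

```python
from fractions import Fraction
from math import factorial


def evidence_fair(n_h, n_t):
    """P(D | h0): every flip has probability 1/2."""
    return Fraction(1, 2) ** (n_h + n_t)


def evidence_free_q(n_h, n_t):
    """P(D | h1) under an ASSUMED uniform prior on q.

    Integral of q^n_h (1 - q)^n_t over [0, 1] = n_h! n_t! / (n_h + n_t + 1)!
    """
    return Fraction(factorial(n_h) * factorial(n_t), factorial(n_h + n_t + 1))


for seq in ["HHTHT", "HHHHH"]:
    n_h, n_t = seq.count("H"), seq.count("T")
    ratio = evidence_free_q(n_h, n_t) / evidence_fair(n_h, n_t)
    print(f"{seq}: P(D | h1) / P(D | h0) = {float(ratio):.3f}")
```

With this assumed prior, HHTHT favors the simpler fair-coin hypothesis (ratio below 1) even though some value of $q$ fits the sequence better, while HHHHH favors the flexible hypothesis.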
65Bayesian Occam's Razor
For any model $M$,
$$\sum_{\text{all } d} p(D = d \mid M) = 1$$
Law of "conservation of belief": A model that can predict many possible data sets must assign each of them low probability.
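Conservation of belief can be checked by brute force for five coin flips. This is a sketch in Python, again assuming a uniform prior on $q$ for the flexible model (an assumption, not something the slides specify), with hypothetical function names.

```python
from fractions import Fraction
from itertools import product
from math import factorial


def evidence_fair(seq):
    """P(D | fair coin) for an exact sequence."""
    return Fraction(1, 2) ** len(seq)


def evidence_free_q(seq):
    """P(D | P(H) = q) with q integrated out under an ASSUMED uniform prior."""
    n_h, n_t = seq.count("H"), seq.count("T")
    return Fraction(factorial(n_h) * factorial(n_t), factorial(n_h + n_t + 1))


# Enumerate every possible data set of five flips.
sequences = ["".join(flips) for flips in product("HT", repeat=5)]
total_fair = sum(evidence_fair(s) for s in sequences)
total_free = sum(evidence_free_q(s) for s in sequences)
lowest = min(evidence_free_q(s) for s in sequences)
highest = max(evidence_free_q(s) for s in sequences)

print("sum over all data sets, fair coin:", total_fair)
print("sum over all data sets, free q:   ", total_free)
print("free-q evidence ranges from", lowest, "to", highest)
```

Both models distribute exactly one unit of probability over the 32 possible sequences. The fair coin gives each sequence 1/32, whereas the flexible model puts extra mass on extreme sequences such as HHHHH and pays for it with less mass on balanced ones.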
66Ockham's Razor in curve fitting
68[Figure: $p(D = d \mid M)$ plotted over possible data sets $D$ for models $M_1$, $M_2$, $M_3$, with the observed data marked]
$M_1$: A model that is too simple is unlikely to generate the data.
$M_3$: A model that is too complex can generate many possible data sets, so it is unlikely to generate this particular data set at random.
69[Figure: $p(D = d \mid M)$ plotted over possible data sets $D$ for models $M_1$, $M_2$, $M_3$, with the observed data marked]
(assume Gaussian parameter priors, Gaussian likelihoods (noise))
70For best fitting version of each model:
Model   Prior                      Likelihood
M1      high                       low
M2      medium                     high
M3      very very very very low    very high
71(assuming Gaussian noise, and Gaussian priors on
parameters)
(Ghahramani)
72(Ghahramani)
73Hierarchical Bayesian learning with flexibly
structured models
- Learning context-free grammars for natural language (Stolcke & Omohundro; Griffiths and Johnson; Perfors et al.).
- Learning complex concepts.
[Figure: learned concept "fruit"]
fruit ⇐ or(and(color > 0.2, color < 0.4, size > 1, size < 4), and(color > 0.5, color < 0.65, size > 2, size < 7))
Navarro (2006): Nonparametric model
Goodman et al. (2006): Probabilistic context-free grammar for rule-based concepts
74The blessing of abstraction
- Often easier to learn at higher levels of abstraction
- Easier to learn that you have a biased coin than to learn its bias.
- Easier to learn causal structure than causal strength.
- Easier to learn that you are hearing two languages (vs. one), or to learn that language has a hierarchical phrase structure, than to learn how any one language works.
- Why? Hypothesis space gets smaller as you go up.
- But the total hypothesis space gets bigger when we add levels of abstraction (e.g., model selection).
- Can make better (more confident, more accurate) predictions by increasing the size of the hypothesis space, if we introduce good inductive biases.
75Summary
- Three kinds of Bayesian inference
- Comparing two simple hypotheses
- Parameter estimation
- The importance and subtlety of prior knowledge
- Model selection
- Bayesian Occam's razor, the blessing of abstraction
- Key concepts
- Probabilistic generative models
- Hierarchies of abstraction, with statistical inference at all levels
- Flexibly structured representations