Bayesian models of inductive learning

Tom Griffiths UC Berkeley

Josh Tenenbaum MIT

Charles Kemp MIT

What to expect

- What you'll get out of this tutorial
- Our view of what Bayesian models have to offer cognitive science.
- In-depth examples of basic and advanced models: how the math works, what it buys you.
- Some (not extensive) comparison to other approaches.
- Opportunities to ask questions.
- What you won't get
- Detailed, hands-on how-to.
- Where you can learn more
- http://bayesiancognition.com
- Trends in Cognitive Sciences, July 2006, special issue on Probabilistic Models of Cognition.

Outline

- Morning
- Introduction: Why Bayes? (Josh)
- Basics of Bayesian inference (Josh)
- Graphical models, causal inference and learning (Tom)
- Afternoon
- Hierarchical Bayesian models, property induction, and learning domain structures (Charles)
- Methods of approximate learning and inference, probabilistic models of semantic memory (Tom)

Why Bayes?

- The problem of induction
- How does the mind form inferences, generalizations, models, or theories about the world from impoverished data?
- Induction is ubiquitous in cognition
- Vision (also audition, touch, or other perceptual modalities)
- Language (understanding, production)
- Concepts (semantic knowledge, common sense)
- Causal learning and reasoning
- Decision-making and action (production, understanding)
- Bayes gives a general framework for explaining how induction can work in principle, and perhaps how it does work in the mind.

Grammar G
  |  P(S | G)
Phrase structure S
  |  P(U | S)
Utterance U

P(S | U, G) ∝ P(U | S) x P(S | G)
              (bottom-up)  (top-down)

Universal Grammar
  |
Grammar (hierarchical phrase structure grammars, e.g., CFG, HPSG, TAG)
  |
Phrase structure
  |
Utterance
  |
Speech signal

The approach

- Key concepts
- Inference in probabilistic generative models
- Hierarchical probabilistic models, with inference at all levels of abstraction
- Structured knowledge representations: graphs, grammars, predicate logic, schemas, theories
- Flexible structures, with complexity constrained by Bayesian Occam's razor
- Approximate methods of learning and inference: Expectation-Maximization (EM), Markov chain Monte Carlo (MCMC)
- Much recent progress!
- Computational resources to implement and test models that we could dream up but not realistically imagine working with
- New theoretical tools let us develop models that we could not clearly conceive of before.

Vision as probabilistic parsing

(Han and Zhu, 2006)


Word learning on planet Gazoob

- Can you pick out the tufas?

Learning word meanings

Whole-object principle, shape bias, taxonomic principle, contrast principle, basic-level bias

Principles → Structure → Data

Causal learning and reasoning

Principles → Structure → Data

Goal-directed action (production and comprehension)

(Wolpert et al., 2003)

Marr's Three Levels of Analysis

- Computation
- What is the goal of the computation, why is it appropriate, and what is the logic of the strategy by which it can be carried out?
- Algorithm
- Cognitive psychology
- Implementation
- Neurobiology

Alternative approaches to inductive learning and inference

- Associative learning
- Connectionist networks
- Similarity to examples
- Toolkit of simple heuristics
- Constraint satisfaction
- Analogical mapping

Summary: Why Bayes?

- A unifying framework for explaining cognition.
- How people can learn so much from such limited data.
- Strong quantitative models with minimal ad hoc assumptions.
- Why algorithmic-level models work the way they do.
- A framework for understanding how structured knowledge and statistical inference interact.
- How structured knowledge guides statistical inference, and may itself be acquired through statistical means.
- What forms knowledge takes, at multiple levels of abstraction.
- What knowledge must be innate, and what can be learned.
- How flexible knowledge structures may grow as required by the data, with complexity controlled by Occam's razor.

Outline

- Morning
- Introduction: Why Bayes? (Josh)
- Basics of Bayesian inference (Josh)
- Graphical models, causal inference and learning (Tom)
- Afternoon
- Hierarchical Bayesian models, property induction, and learning domain structures (Charles)
- Methods of approximate learning and inference, probabilistic models of semantic memory (Tom)

Bayes' rule

For any hypothesis h and data d, the posterior is the likelihood times the prior, normalized by a sum over the space of alternative hypotheses.
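Written out (in LaTeX), the rule is:

  P(h \mid d) = \frac{P(d \mid h)\, P(h)}{\sum_{h'} P(d \mid h')\, P(h')}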

Bayesian inference

- Bayes' rule
- An example
- Data: John is coughing
- Some hypotheses:
- John has a cold
- John has emphysema
- John has a stomach flu
- Prior P(h) favors 1 and 3 over 2
- Likelihood P(d | h) favors 1 and 2 over 3
- Posterior P(h | d) favors 1 over 2 and 3
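A minimal sketch of this example in Python. The slide gives only the qualitative orderings, so the numbers below are made up for illustration; only their relative sizes matter.

  # Made-up numbers consistent with the slide's qualitative orderings:
  # the prior favors cold and stomach flu over emphysema, and the
  # likelihood of coughing favors cold and emphysema over stomach flu.
  priors = {"cold": 0.50, "emphysema": 0.05, "stomach flu": 0.45}
  likelihoods = {"cold": 0.80, "emphysema": 0.90, "stomach flu": 0.10}  # P(coughing | h)

  unnormalized = {h: likelihoods[h] * priors[h] for h in priors}
  evidence = sum(unnormalized.values())                                 # P(coughing)
  posterior = {h: p / evidence for h, p in unnormalized.items()}

  for h, p in sorted(posterior.items(), key=lambda kv: -kv[1]):
      print(f"P({h} | coughing) = {p:.3f}")
  # "cold" comes out on top, as on the slide.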

Coin flipping

- Basic Bayes
- data: HHTHT or HHHHH
- compare two simple hypotheses
- P(H) = 0.5 vs. P(H) = 1.0
- Parameter estimation (model fitting)
- compare many hypotheses in a parameterized family
- P(H) = θ: infer θ
- Model selection
- compare qualitatively different hypotheses, often varying in complexity
- P(H) = 0.5 vs. P(H) = θ

Coin flipping

HHTHT

HHHHH

What process produced these sequences?

Comparing two simple hypotheses

- Contrast simple hypotheses:
- h1: fair coin, P(H) = 0.5
- h2: always heads, P(H) = 1.0
- Bayes' rule
- With two hypotheses, use the odds form
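The odds-form equation (shown as a graphic on the slide) is, in LaTeX:

  \frac{P(h_1 \mid D)}{P(h_2 \mid D)} = \frac{P(D \mid h_1)}{P(D \mid h_2)} \times \frac{P(h_1)}{P(h_2)}

i.e., posterior odds = likelihood ratio x prior odds.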

Comparing two simple hypotheses

- D = HHTHT
- H1, H2 = fair coin, always heads
- P(D | H1) = 1/2^5     P(H1) = ?
- P(D | H2) = 0         P(H2) = 1 − ?

Comparing two simple hypotheses

- D = HHTHT
- H1, H2 = fair coin, always heads
- P(D | H1) = 1/2^5     P(H1) = 999/1000
- P(D | H2) = 0         P(H2) = 1/1000

Comparing two simple hypotheses

- D = HHHHH
- H1, H2 = fair coin, always heads
- P(D | H1) = 1/2^5     P(H1) = 999/1000
- P(D | H2) = 1         P(H2) = 1/1000

Comparing two simple hypotheses

- D = HHHHHHHHHH
- H1, H2 = fair coin, always heads
- P(D | H1) = 1/2^10    P(H1) = 999/1000
- P(D | H2) = 1         P(H2) = 1/1000
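A quick check of these three cases in Python (a sketch; the 999/1000 prior is the one used on the slides):

  def posterior_odds_fair_vs_all_heads(n_flips, all_heads, prior_fair=0.999):
      """Posterior odds P(fair | D) / P(always-heads | D)."""
      like_fair = 0.5 ** n_flips                  # P(D | fair) = 1/2^n
      like_trick = 1.0 if all_heads else 0.0      # P(D | always heads)
      prior_odds = prior_fair / (1 - prior_fair)  # 999 : 1
      if like_trick == 0:
          return float("inf")                     # a single tail rules out the trick coin
      return (like_fair / like_trick) * prior_odds

  print(posterior_odds_fair_vs_all_heads(5, all_heads=False))  # HHTHT: infinite odds for fair
  print(posterior_odds_fair_vs_all_heads(5, all_heads=True))   # HHHHH: 999/32, about 31 : 1 for fair
  print(posterior_odds_fair_vs_all_heads(10, all_heads=True))  # ten heads: 999/1024, about even odds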

The role of intuitive theories

- The fact that HHTHT looks representative of a fair coin and HHHHH does not reflects our implicit theories of how the world works.
- Easy to imagine how a trick all-heads coin could work: high prior probability.
- Hard to imagine how a trick HHTHT coin could work: low prior probability.

Coin flipping

- Basic Bayes
- data: HHTHT or HHHHH
- compare two hypotheses
- P(H) = 0.5 vs. P(H) = 1.0
- Parameter estimation (model fitting)
- compare many hypotheses in a parameterized family
- P(H) = θ: infer θ
- Model selection
- compare qualitatively different hypotheses, often varying in complexity
- P(H) = 0.5 vs. P(H) = θ

Parameter estimation

- Assume data are generated from a parameterized model
- What is the value of θ?
- each value of θ is a hypothesis H
- requires inference over infinitely many hypotheses

[Graphical model: θ → d1, d2, d3, d4, with P(H) = θ]

Model selection

- Assume a hypothesis space of possible models
- Which model generated the data?
- requires summing out hidden variables
- requires some form of Occam's razor to trade off complexity with fit to the data.

[Graphical models for three candidate processes, each generating flips d1, d2, d3, d4: a fair coin with P(H) = 0.5; a coin of unknown bias, P(H) = θ; and a hidden Markov model whose state s_i on each flip is either "fair coin" or "trick coin"]

Parameter estimation vs. model selection across learning and development

- Causality: learning the strength of a relation vs. learning the existence and form of a relation
- Language acquisition: learning a speaker's accent, or the frequencies of different words, vs. learning a new tense or syntactic rule (or learning a new language, or the existence of different languages)
- Concepts: learning what horses look like vs. learning that there is a new species (or learning that there are species)
- Intuitive physics: learning the mass of an object vs. learning about gravity or angular momentum
- Intuitive psychology: learning a person's beliefs or goals vs. learning that there can be false beliefs, or that visual access is valuable for establishing true beliefs

A hierarchical learning framework

model → parameters → data

A hierarchical learning framework

model class → model → parameters → data

Bayesian parameter estimation

- Assume data are generated from a model
- What is the value of θ?
- each value of θ is a hypothesis H
- requires inference over infinitely many hypotheses

[Graphical model: θ → d1, d2, d3, d4, with P(H) = θ]

Some intuitions

- D = 10 flips, with 5 heads and 5 tails.
- θ = P(H) on next flip? 50%
- Why? 50% = 5 / (5 + 5) = 5/10.
- Why? The future will be like the past.
- Suppose we had seen 4 heads and 6 tails.
- P(H) on next flip? Closer to 50% than to 40%.
- Why? Prior knowledge.

Integrating prior knowledge and data

- Posterior distribution P(θ | D) is a probability density over θ = P(H)
- Need to work out the likelihood P(D | θ) and specify the prior distribution P(θ)

Likelihood and prior

- Likelihood: Bernoulli distribution
- P(D | θ) = θ^NH (1 − θ)^NT
- NH = number of heads
- NT = number of tails
- Prior
- P(θ) = ?

Some intuitions

- D = 10 flips, with 5 heads and 5 tails.
- θ = P(H) on next flip? 50%
- Why? 50% = 5 / (5 + 5) = 5/10.
- Why? Maximum likelihood.
- Suppose we had seen 4 heads and 6 tails.
- P(H) on next flip? Closer to 50% than to 40%.
- Why? Prior knowledge.

A simple method of specifying priors

- Imagine some fictitious trials, reflecting a set of previous experiences
- a strategy often used with neural networks or for building invariance into machine vision.
- e.g., F = 1000 heads, 1000 tails: strong expectation that any new coin will be fair
- In fact, this is a sensible statistical idea...

Likelihood and prior

- Likelihood: Bernoulli(θ) distribution
- P(D | θ) = θ^NH (1 − θ)^NT
- NH = number of heads
- NT = number of tails
- Prior: Beta(FH, FT) distribution
- P(θ) ∝ θ^(FH − 1) (1 − θ)^(FT − 1)
- FH = fictitious observations of heads
- FT = fictitious observations of tails

Shape of the Beta prior

[Figure: Beta(FH, FT) densities for FH = 0.5, FT = 0.5; FH = 0.5, FT = 2; FH = 2, FT = 0.5; FH = 2, FT = 2]
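The shapes are easy to inspect numerically. A small sketch using SciPy (an implementation choice assumed here, not part of the tutorial):

  import numpy as np
  from scipy.stats import beta

  # Beta(0.5, 0.5) piles mass near 0 and 1; Beta(2, 2) peaks at 0.5;
  # Beta(0.5, 2) favors tails-heavy coins; Beta(2, 0.5) favors heads-heavy coins.
  thetas = np.array([0.1, 0.5, 0.9])
  for FH, FT in [(0.5, 0.5), (0.5, 2), (2, 0.5), (2, 2)]:
      print(f"Beta({FH}, {FT}) density at theta = 0.1, 0.5, 0.9:",
            np.round(beta.pdf(thetas, FH, FT), 2))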

Bayesian parameter estimation

P(θ | D) ∝ P(D | θ) P(θ) = θ^(NH + FH − 1) (1 − θ)^(NT + FT − 1)

- Posterior is Beta(NH + FH, NT + FT)
- same form as the prior!
- expected P(H) = (NH + FH) / (NH + FH + NT + FT)
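A minimal sketch of this update as code, using the posterior-mean prediction rule above (the F = 1000 and F = 3 settings anticipate the examples on the next slides):

  def expected_heads(NH, NT, FH, FT):
      """Posterior-mean P(H) under a Beta(FH, FT) prior after NH heads and NT tails."""
      return (NH + FH) / (NH + FH + NT + FT)

  print(expected_heads(4, 6, 1000, 1000))  # strong fair prior: 1004/2010, about 0.4995
  print(expected_heads(4, 6, 3, 3))        # weak fair prior:   7/16 = 0.4375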

Conjugate priors

- A prior p(θ) is conjugate to a likelihood function p(D | θ) if the posterior has the same functional form as the prior.
- Parameter values in the prior can be thought of as a summary of fictitious observations.
- Different parameter values in the prior and posterior reflect the impact of the observed data.
- Conjugate priors exist for many standard models (e.g., all exponential family models)

Some examples

- e.g., F = 1000 heads, 1000 tails: strong expectation that any new coin will be fair
- After seeing 4 heads, 6 tails, P(H) on next flip = 1004 / (1004 + 1006) = 49.95%
- e.g., F = 3 heads, 3 tails: weak expectation that any new coin will be fair
- After seeing 4 heads, 6 tails, P(H) on next flip = 7 / (7 + 9) = 43.75%
- Prior knowledge too weak

But flipping thumbtacks

- e.g., F = 4 heads, 3 tails: weak expectation that tacks are slightly biased towards heads
- After seeing 2 heads, 0 tails, P(H) on next flip = 6 / (6 + 3) = 67%
- Some prior knowledge is always necessary to avoid jumping to hasty conclusions...
- Suppose F is empty (no fictitious observations): after seeing 1 head, 0 tails, P(H) on next flip = 1 / (1 + 0) = 100%

Origin of prior knowledge

- Tempting answer: prior experience
- Suppose you have previously seen 2000 coin flips: 1000 heads, 1000 tails

Problems with simple empiricism

- Haven't really seen 2000 coin flips, or any flips of a thumbtack
- Prior knowledge is stronger than raw experience justifies
- Haven't seen exactly equal numbers of heads and tails
- Prior knowledge is smoother than raw experience justifies
- Should be a difference between observing 2000 flips of a single coin versus observing 10 flips each for 200 coins, or 1 flip each for 2000 coins
- Prior knowledge is more structured than raw experience

A simple theory

- Coins are manufactured by a standardized procedure that is effective but not perfect, and symmetric with respect to heads and tails. Tacks are asymmetric, and manufactured to less exacting standards.
- Justifies generalizing from previous coins to the present coin.
- Justifies a smoother and stronger prior than raw experience alone.
- Explains why seeing 10 flips each for 200 coins is more valuable than seeing 2000 flips of one coin.

A hierarchical Bayesian model

[Diagram: physical knowledge → FH, FT → θ_1, θ_2, ..., θ_200, with θ_i ~ Beta(FH, FT) for Coin 1, Coin 2, ..., Coin 200, and flips d1, d2, d3, d4 observed for each coin]

- Qualitative physical knowledge (symmetry) can influence estimates of continuous parameters (FH, FT).
- Explains why 10 flips of 200 coins are better than 2000 flips of a single coin: more informative about FH, FT.
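A minimal generative sketch of the model in the diagram, in NumPy; the particular values of FH and FT are made up for illustration (symmetric, as the physical knowledge suggests):

  import numpy as np

  rng = np.random.default_rng(0)
  FH, FT = 20.0, 20.0                           # illustrative population-level pseudo-counts
  n_coins, flips_per_coin = 200, 10

  thetas = rng.beta(FH, FT, size=n_coins)       # theta_i ~ Beta(FH, FT), one bias per coin
  heads = rng.binomial(flips_per_coin, thetas)  # heads observed for each coin

  # The spread of head counts across coins is what carries information about
  # FH and FT; 2000 flips of a single coin would reveal only that coin's theta.
  print("mean heads per coin:", heads.mean(), " spread across coins:", heads.std())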

Stability versus Flexibility

- Can all domain knowledge be represented with conjugate priors?
- Suppose you flip a coin 25 times and get all heads. Something funny is going on...
- But with F = 1000 heads, 1000 tails, P(heads) on next flip = 1025 / (1025 + 1000) = 50.6%. Looks like nothing unusual.
- How do we balance stability and flexibility?
- Stability: 6 heads, 4 tails → θ near 0.5
- Flexibility: 25 heads, 0 tails → θ near 1

A hierarchical Bayesian model

fair/unfair?

- Higher-order hypothesis: is this coin fair or unfair?
- Example probabilities:
- P(fair) = 0.99
- P(θ | fair) is Beta(1000, 1000)
- P(θ | unfair) is Beta(1, 1)
- 25 heads in a row propagates up, affecting θ and then P(fair | D)

[Diagram: fair/unfair? → FH, FT → θ → d1, d2, d3, d4]
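Working this example through (a sketch using SciPy's log-beta function; the 0.99/0.01 prior and the two Beta distributions are the ones on the slide):

  import numpy as np
  from scipy.special import betaln

  def log_marginal(NH, NT, FH, FT):
      """log P(D | FH, FT) = log[ B(FH + NH, FT + NT) / B(FH, FT) ] for a Beta-Bernoulli model."""
      return betaln(FH + NH, FT + NT) - betaln(FH, FT)

  log_p_fair   = np.log(0.99) + log_marginal(25, 0, 1000, 1000)  # P(theta | fair)   = Beta(1000, 1000)
  log_p_unfair = np.log(0.01) + log_marginal(25, 0, 1, 1)        # P(theta | unfair) = Beta(1, 1)

  p_fair = 1.0 / (1.0 + np.exp(log_p_unfair - log_p_fair))
  print(f"P(fair | 25 heads in a row) = {p_fair:.5f}")  # about 0.0001: the unfair hypothesis wins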

Summary: Bayesian parameter estimation

- Learning the parameters of a generative model as Bayesian inference.
- Conjugate priors
- an elegant way to represent simple kinds of prior knowledge.
- Hierarchical Bayesian models
- integrate knowledge across instances of a system, or different systems within a domain.
- can represent richer, more abstract knowledge

Some questions

- Learning isn't just about parameter estimation
- How do we learn the functional form of a variable's distribution?
- How do we learn model structure, or theories with the expressiveness of predicate logic?
- Can we grow levels of abstraction?

Coin flipping

- Basic Bayes
- data: HHTHT or HHHHH
- compare two hypotheses
- P(H) = 0.5 vs. P(H) = 1.0
- Parameter estimation
- compare many hypotheses in a parameterized family
- P(H) = θ: infer θ
- Model selection
- compare qualitatively different hypotheses, often varying in complexity
- P(H) = 0.5 vs. P(H) = θ

A hierarchical learning framework

model class → model (model selection) → parameters → data

Bayesian model selection

[Figure: fair-coin model vs. P(H) = θ model]

- Which provides a better account of the data: the simple hypothesis of a fair coin, or the complex hypothesis that P(H) = θ?

Comparing simple and complex hypotheses

- P(H) = θ is more complex than P(H) = 0.5 in two ways:
- P(H) = 0.5 is a special case of P(H) = θ
- for any observed sequence D, we can choose θ such that D is more probable than if P(H) = 0.5

Comparing simple and complex hypotheses

[Figures: probability of the observed sequence D = HHTHT under θ = 0.5, under θ = 1.0, and under θ = 0.6, the value that makes D most probable]

Comparing simple and complex hypotheses

- P(H) = θ is more complex than P(H) = 0.5 in two ways:
- P(H) = 0.5 is a special case of P(H) = θ
- for any observed sequence X, we can choose θ such that X is more probable than if P(H) = 0.5
- How can we deal with this?
- Some version of Occam's razor?
- Bayes: an automatic version of Occam's razor follows from the law of conservation of belief.

Comparing simple and complex hypotheses

P(h1 | D) / P(h0 | D) = [ P(D | h1) / P(D | h0) ] x [ P(h1) / P(h0) ]

The likelihood term for a model with parameters is its evidence, or marginal likelihood: the probability that randomly selected parameters from the prior would generate the data.
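In LaTeX, for a model h with parameters θ, the evidence is the prior-weighted average of the likelihood:

  P(D \mid h) = \int P(D \mid \theta, h)\, p(\theta \mid h)\, d\theta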


Bayesian Occam's Razor

For any model M,
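the probabilities it assigns to all possible data sets must sum to one:

  \sum_{\text{all data sets } D} P(D \mid M) = 1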

Law of conservation of belief: a model that can predict many possible data sets must assign each of them low probability.
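The coin example makes the razor concrete. A sketch comparing the evidence for the fair coin with the evidence for P(H) = θ, assuming (for illustration) a uniform prior on θ:

  import math

  def log_evidence_uniform_theta(NH, NT):
      """log P(D | theta model) with theta ~ Uniform[0, 1]: log B(NH + 1, NT + 1)."""
      return math.lgamma(NH + 1) + math.lgamma(NT + 1) - math.lgamma(NH + NT + 2)

  def log_evidence_fair(NH, NT):
      """log P(D | fair coin) = (NH + NT) * log(1/2)."""
      return (NH + NT) * math.log(0.5)

  for name, (NH, NT) in [("HHTHT", (3, 2)), ("HHHHH", (5, 0))]:
      bf = math.exp(log_evidence_fair(NH, NT) - log_evidence_uniform_theta(NH, NT))
      print(f"{name}: P(D | fair) / P(D | theta model) = {bf:.2f}")
  # HHTHT favors the simple fair-coin model (ratio about 1.9);
  # HHHHH favors the flexible model (ratio about 0.19) despite its extra complexity.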

Ockham's Razor in curve fitting


[Figure: p(D = d | M) plotted over all possible data sets D for models M1 (simple), M2, and M3 (complex), with the observed data set marked]

M1: a model that is too simple is unlikely to generate the data. M3: a model that is too complex can generate many possible data sets, so it is unlikely to generate this particular data set at random.

[Figure repeated: p(D = d | M) for M1, M2, M3, now assuming Gaussian parameter priors and Gaussian likelihoods (noise)]

For the best-fitting version of each model:

        Prior                       Likelihood
  M1    high                        low
  M2    medium                      high
  M3    very very very very low     very high

(assuming Gaussian noise, and Gaussian priors on parameters)

(Ghahramani)

Hierarchical Bayesian learning with flexibly structured models

- Learning context-free grammars for natural language (Stolcke & Omohundro; Griffiths and Johnson; Perfors et al.).
- Learning complex concepts.

fruit ← or( and(color > 0.2, color < 0.4, size > 1, size < 4),
            and(color > 0.5, color < 0.65, size > 2, size < 7),
            ... )
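As code, the rule is just a disjunction of conjunctions over the two features; a minimal sketch (feature names and thresholds copied from the rule above, units arbitrary):

  def is_fruit(color, size):
      # First conjunct covers one region of color-size space; the second covers another.
      return ((0.2 < color < 0.4) and (1 < size < 4)) or \
             ((0.5 < color < 0.65) and (2 < size < 7))

  print(is_fruit(color=0.3, size=2))   # True: satisfies the first conjunction
  print(is_fruit(color=0.7, size=3))   # False: satisfies neither conjunction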

Navarro (2006): nonparametric model. Goodman et al. (2006): probabilistic context-free grammar for rule-based concepts.

The blessing of abstraction

- Often easier to learn at higher levels of abstraction
- Easier to learn that you have a biased coin than to learn its bias.
- Easier to learn causal structure than causal strength.
- Easier to learn that you are hearing two languages (vs. one), or that language has a hierarchical phrase structure, than to learn how any one language works.
- Why? The hypothesis space gets smaller as you go up.
- But the total hypothesis space gets bigger when we add levels of abstraction (e.g., model selection).
- Can make better (more confident, more accurate) predictions by increasing the size of the hypothesis space, if we introduce good inductive biases.

Summary

- Three kinds of Bayesian inference
- Comparing two simple hypotheses
- Parameter estimation
- The importance and subtlety of prior knowledge
- Model selection
- Bayesian Occam's razor, the blessing of abstraction
- Key concepts
- Probabilistic generative models
- Hierarchies of abstraction, with statistical inference at all levels
- Flexibly structured representations