Title: Bayesian Models of Human Learning and Inference. Josh Tenenbaum, MIT Department of Brain and Cognitive Sciences
1Bayesian Models of Human Learning and Inference
Josh Tenenbaum, MIT, Department of Brain and
Cognitive Sciences
2Shiffrin Says
- Progress in science is driven by new tools, not
great insights.
3Outline
- Part I. Brief survey of Bayesian modeling in
cognitive science. - Part II. Bayesian models of everyday inductive
leaps.
4Collaborators
- Tom Griffiths Neville Sanjana
- Charles Kemp Mark Steyvers
- Tevye Krynski Sean Stromsten
- Sourabh Niyogi
- Fei Xu Dave Sobel
- Wheeler Ruml Alison Gopnik
5Collaborators
- Tom Griffiths Neville Sanjana
- Charles Kemp Mark Steyvers
- Tevye Krynski Sean Stromsten
- Sourabh Niyogi
- Fei Xu Dave Sobel
- Wheeler Ruml Alison Gopnik
6Outline
- Part I. Brief survey of Bayesian modeling in
cognitive science. - Rational benchmark for descriptive models of
probability judgment. - Rational analysis of cognition
- Rational tools for fitting cognitive models
7Normative benchmark for descriptive models
- How does human probability judgment compare to
the Bayesian ideal? - Peterson & Beach, Edwards, Tversky &
Kahneman, . . . - Explicit probability judgment tasks
- Drawing balls from an urn, rolling dice, medical
diagnosis, . . . . - Alternative descriptive models
- Heuristics and Biases, Support Theory, . . . .
8Rational analysis of cognition
- Develop Bayesian models for core aspects of
cognition not traditionally thought of in terms
of statistical inference. - Examples
- Memory retrieval (Anderson; Shiffrin et al., . . .)
- Reasoning with rules (Oaksford & Chater, . . .)
9Rational analysis of cognition
- Often can explain a wider range of phenomena than
previous models, with fewer free parameters.
Spacing effects on retention
Power laws of practice and retention
10Rational analysis of cognition
- Often can explain a wider range of phenomena than
previous models, with fewer free parameters. - Anderson's rational analysis of memory
- For each item in memory, estimate the probability
that it will be useful in the present context. - Model of need probability inspired by library
book access. Corresponds to statistics of
natural information sources
11Rational analysis of cognition
- Often can explain a wider range of phenomena than
previous models, with fewer free parameters. - Anderson's rational analysis of memory
- For each item in memory, estimate the probability
that it will be useful in the present context. - Model of need probability inspired by library
book access. Corresponds to statistics of
natural information sources
[Figure: log need odds vs. log days since last occurrence, for short and long retention lags]
12Rational analysis of cognition
- Often can show that apparently irrational
behavior is actually rational.
Which cards do you have to turn over to test this
rule? If there is an A on one side, then there
is a 2 on the other side
13Rational analysis of cognition
- Often can show that apparently irrational
behavior is actually rational. - Oaksford & Chater's rational analysis
- Optimal data selection based on maximizing
expected information gain. - Test the rule "If p, then q" against the null
hypothesis that p and q are independent. - Assuming p and q are rare predicts people's
choices
14Rational tools for fitting cognitive models
- Use Bayesian Occam's Razor to solve the problem
of model selection: trade off fit to the data
with model complexity. - Examples
- Comparing alternative cognitive models Myung,
Pitt, . . . . - Fitting nested families of models of mental
representation Lee, Navarro, . . . .
15Rational tools for fitting cognitive models
- Comparing alternative cognitive models via an MDL
approximation to the Bayesian Occam's Razor takes
into account the functional form of a model as
well as the number of free parameters.
16Rational tools for fitting cognitive models
- Fit models of mental representation to similarity
data, e.g. additive clustering, additive trees,
common and distinctive feature models.
- Want to choose the complexity of the model
(number of features, depth of tree) in a
principled way, and search efficiently through
the space of nested models, using Bayesian
Occam's Razor.
17Outline
- Part I. Brief survey of Bayesian modeling in
cognitive science. - Part II. Bayesian models of everyday inductive
leaps.
Rational models of cognition where Bayesian model
selection and the Bayesian Occam's Razor play a
central explanatory role.
18Everyday inductive leaps
- How can we learn so much about . . .
- Properties of natural kinds
- Meanings of words
- Future outcomes of a dynamic process
- Hidden causal properties of an object
- Causes of a person's action (beliefs, goals)
- Causal laws governing a domain
- . . . from such limited data?
19Learning concepts and words
20Learning concepts and words
- Can you pick out the tufas?
21Inductive reasoning
Input
(premises)
(conclusion)
Task Judge how likely conclusion is to be
true, given that premises are true.
22Inferring causal relations
Input
Took vitamin B23?  Headache?
Day 1:  yes   no
Day 2:  yes   yes
Day 3:  no    yes
Day 4:  yes   no
. . .
Does vitamin B23 cause headaches?
Task Judge probability of a causal link
given several joint observations.
23The Challenge
- How do we generalize successfully from very
limited data? - Just one or a few examples
- Often only positive examples
- Philosophy
- Induction is a problem, a riddle, a
paradox, a scandal, or a myth. - Machine learning and statistics
- Focus on generalization from many examples, both
positive and negative.
24Rational statistical inference (Bayes, Laplace)
25History of Bayesian Approaches to Human Inductive
Learning
26History of Bayesian Approaches to Human Inductive
Learning
- Hunt
- Suppes
- Observable changes of hypotheses under positive
reinforcement, Science (1965), w/ M. Schlag-Rey.
- A tentative interpretation is that, when the set
of hypotheses is large, the subject samples or
attends to several hypotheses simultaneously. . .
. It is also conceivable that a subject might
sample spontaneously, at any time, or under
stimulations other than those planned by the
experimenter. A more detailed exploration of
these ideas, including a test of Bayesian
approaches to information processing, is now
being made.
28History of Bayesian Approaches to Human Inductive
Learning
- Hunt
- Suppes
- Shepard
- Analysis of one-shot stimulus generalization, to
explain the universal exponential law. - Anderson
- Rational analysis of categorization.
29Theory-Based Bayesian Models
- Explain the success of everyday inductive leaps
based on rational statistical inference
mechanisms constrained by domain theories
well-matched to the structure of the world. - Rational statistical inference (Bayes)
- Domain theories generate the necessary
ingredients hypothesis space H, priors p(h).
30Questions about theories
- What is a theory?
- Working definition: an ontology and a system of
abstract (causal) principles that generates a
hypothesis space of candidate world structures
(e.g., Newton's laws). - How is a theory used to learn about the structure
of the world? - How is a theory acquired?
- Probabilistic generative model statistical
learning.
31Alternative approaches to inductive generalization
- Associative learning
- Connectionist networks
- Similarity to examples
- Toolkit of simple heuristics
- Constraint satisfaction
32Marr's Three Levels of Analysis
- Computation
- What is the goal of the computation, why is it
appropriate, and what is the logic of the
strategy by which it can be carried out? - Representation and algorithm
- Cognitive psychology
- Implementation
- Neurobiology
33Descriptive Goals
- Principled mathematical models, with a minimum of
arbitrary assumptions. - Close quantitative fits to behavioral data.
- Unified models of cognition across domains.
34Explanatory Goals
- How do we reliably acquire knowledge about the
structure of the world, from such limited
experience? - Which processing models work, and why?
- New views on classic questions in cognitive
science - Symbols (rules, logic, hierarchies, relations)
versus Statistics. - Theory-based inference versus Similarity-based
inference. - Domain-specific knowledge versus Domain-general
mechanisms. - Provides a route to studying people's hidden
(implicit or unconscious) knowledge about the
world.
35The plan
- Basic causal learning
- Inferring number concepts
- Reasoning with biological properties
- Acquisition of domain theories
- Intuitive biology Taxonomic structure
- Intuitive physics Causal law
36The plan
- Basic causal learning
- Inferring number concepts
- Reasoning with biological properties
- Acquisition of domain theories
- Intuitive biology Taxonomic structure
- Intuitive physics Causal law
37Learning a single causal relation
Given a random sample of mice
- To what extent does chemical X cause gene Y
to be expressed?
- Or, what is the probability that X causes Y?
38Associative models of causal strength judgment
- Delta-P (or Asymptotic Rescorla-Wagner)
- Power PC (Cheng, 1997)
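Both measures can be written as simple estimators from trial counts. A minimal sketch (the counts below are hypothetical, not from the slides):

```python
def delta_p(n_e_c, n_c, n_e_notc, n_notc):
    """Delta-P: P(E|C) - P(E|~C), estimated from contingency counts."""
    return n_e_c / n_c - n_e_notc / n_notc

def causal_power(n_e_c, n_c, n_e_notc, n_notc):
    """Power PC (Cheng, 1997): Delta-P / (1 - P(E|~C)), for generative causes."""
    p_e_notc = n_e_notc / n_notc
    return (n_e_c / n_c - p_e_notc) / (1 - p_e_notc)

# e.g., effect on 6 of 8 cause-present trials, 2 of 8 cause-absent trials
print(delta_p(6, 8, 2, 8))       # 0.5
print(causal_power(6, 8, 2, 8))  # 0.666...
```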
39Some behavioral data (Buehner & Cheng, 1997)
People
ΔP
Power PC
- Independent effects of both causal power and ΔP.
- Neither theory explains the trend for ΔP = 0.
40Bayesian causal inference
w0, w1 strength parameters for B, C
41Bayesian causal inference
- Hypotheses h1 h0
-
- Probabilistic model noisy-OR
w0, w1 strength parameters for B, C
P(E=1 | C, B)
C  B     h1                  h0
0  0     0                   0
1  0     w1                  0
0  1     w0                  w0
1  1     w1 + w0 - w1 w0     w0
42Bayesian causal inference
- Hypotheses h1 h0
-
- Probabilistic model noisy-OR
Background cause B: unobserved, always present
(B = 1)
w0, w1 strength parameters for B, C
P(E=1 | C, B)
C  B     h1                  h0
0  0     0                   0
1  0     w1                  0
0  1     w0                  w0
1  1     w1 + w0 - w1 w0     w0
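The noisy-OR parameterization in the table can be checked directly; a small sketch (the strength values are arbitrary illustrations):

```python
def noisy_or(w0, w1, b, c):
    """P(E=1 | B=b, C=c) under h1: each present cause independently
    fails to produce E with probability (1 - strength)."""
    return 1 - (1 - w0) ** b * (1 - w1) ** c

# Reproduce the h1 column of the table (B = 1 in every trial):
w0, w1 = 0.3, 0.5
print(noisy_or(w0, w1, 1, 0))  # w0
print(noisy_or(w0, w1, 1, 1))  # w0 + w1 - w0*w1
```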
43Inferring structure versus estimating strength
- Hypotheses h1, h0
- Both causal power and ΔP correspond to maximum
likelihood estimates of the strength parameter
w1, under different parameterizations for
p(E|B,C): linear (ΔP), noisy-OR (causal power)
- Causal support model people are judging the
probability that a causal link exists, rather
than assuming it exists and estimating its
strength.
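Under the uniform-priors assumption made later (slide 46), causal support can be approximated by averaging the likelihood over a grid of strength values; a rough numerical sketch (the data set is hypothetical):

```python
import math

def likelihood(w0, w1, data):
    """P(data | w0, w1) under noisy-OR; data = list of (c, e), B=1 on every trial."""
    p = 1.0
    for c, e in data:
        p_e = 1 - (1 - w0) * (1 - w1) ** c   # P(E=1 | C=c, B=1)
        p *= p_e if e == 1 else 1 - p_e
    return p

def causal_support(data, n=101):
    """log P(data|h1) - log P(data|h0): average likelihood over a uniform
    grid of strengths (h0 fixes w1 = 0, i.e. no C -> E link)."""
    w = [i / (n - 1) for i in range(n)]
    m1 = sum(likelihood(a, b, data) for a in w for b in w) / (n * n)
    m0 = sum(likelihood(a, 0.0, data) for a in w) / n
    return math.log(m1) - math.log(m0)

# Hypothetical data: effect on 6/8 cause-present and 2/8 cause-absent trials
data = [(1, 1)] * 6 + [(1, 0)] * 2 + [(0, 1)] * 2 + [(0, 0)] * 6
print(causal_support(data) > 0)  # True: evidence favors a C -> E link
```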
44Role of domain theory
(c.f. PRMs, ILP, Knowledge-based model
construction)
- Generates hypothesis space of causal graphical
models - Causally relevant attributes of objects
- Constrains random variables (nodes).
- Causally relevant relations between attributes
- Constrains dependence structure of variables
(arcs). - Causal mechanisms: how effects depend
functionally on their causes - Constrains local probability distribution for
each variable conditioned on its direct causes
(parents).
45Role of domain theory
- Injections may or may not cause gene expression,
but gene expression does not cause injections. - No hypotheses with E → C
- Other naturally occurring processes may also
cause gene expression. - All hypotheses include an always-present
background cause B → E - Causes are probabilistically sufficient and
independent (Cheng): each cause independently
produces the effect in some proportion of cases. - Noisy-OR causal mechanism
46Bayesian causal inference
- Hypotheses h1, h0
- Probabilistic model: noisy-OR
- Assume all priors uniform . . .
47Bayesian Occam's Razor
P(data | model)
All possible data sets
48Bayesian Occam's Razor
P(data | model)
low w1
high w1
All possible data sets
49Bayesian Occam's Razor
P(data | model)
low w1
high w1
50Bayesian Occam's Razor
P(data | model)
low w1
high w1
51Buehner & Cheng, 1997
People
DP
Power PC
Bayes
52Sensitivity analysis
- How much work does domain theory do?
- Alternative model: Bayes with arbitrary P(E|B,C)
- How much work does Bayes do?
- Alternative model: χ2 measure of independence.
Bayes without noisy-OR theory
χ2
53People
ΔP
Power PC (MLE w/ noisy-OR)
Bayes w/ noisy-OR theory
Bayes without noisy-OR theory
χ2
54Varying number of observations
People (n8)
Bayes (n8)
People (n60)
Bayes (n60)
55Data for inhibitory causes
People
ΔP
Power PC (MLE w/ noisy-AND-NOT)
Bayes w/ noisy-AND-NOT
56Causal inference with rates
People
ΔR
Power PC (N150)
Bayes w/ Poisson parameterization
57Causal induction summary
- People's judgments closely reflect optimal
Bayesian model selection, constrained by a
minimal domain theory. - Beyond elemental causal induction
- More complex inferences, with causal networks,
hidden variables, active learning. - Stronger inferences, with richer prior knowledge.
- Discovery of causal domain theories.
58Scope of Bayesian causal inference
- Causal strength judgments
- One-shot causal inferences in children and adults
(the blicket detector) - Inferring causal networks
- Inferring hidden variables
- Perception of causality
- Perception of hidden causes
- Learning causal theories
59The plan
- Basic causal learning
- Inferring number concepts
- Reasoning with biological properties
- Acquisition of domain theories
- Intuitive biology Taxonomic structure
- Intuitive physics Causal law
60The number game
- Program input: a number between 1 and 100
- Program output: yes or no
61The number game
- Learning task
- Observe one or more positive (yes) examples.
- Judge whether other numbers are yes or no.
62The number game
Examples of yes numbers
Generalization judgments (N = 20)
60
Diffuse similarity
63The number game
Examples of yes numbers
Generalization judgments (N = 20)
60
Diffuse similarity
60 80 10 30
Rule: multiples of 10
64The number game
Examples of yes numbers
Generalization judgments (N = 20)
60
Diffuse similarity
60 80 10 30
Rule: multiples of 10
Focused similarity: numbers near 50-60
60 52 57 55
65The number game
Examples of yes numbers
Generalization judgments (N = 20)
16
Diffuse similarity
16 8 2 64
Rule: powers of 2
Focused similarity: numbers near 20
16 23 19 20
66The number game
- Main phenomena to explain
- Generalization can appear either similarity-based
(graded) or rule-based (all-or-none). - Learning from just a few positive examples.
67Rule/similarity hybrid models
- Category learning
- Nosofsky, Palmeri et al. RULEX
- Erickson Kruschke ATRIUM
68Divisions into rule and similarity subsystems
- Category learning
- Nosofsky, Palmeri et al. RULEX
- Erickson Kruschke ATRIUM
- Language processing
- Pinker, Marcus et al. Past tense morphology
- Reasoning
- Sloman
- Rips
- Nisbett, Smith et al.
69Rule/similarity hybrid models
- Why two modules?
- Why do these modules work the way that they do,
and interact as they do? - How do people infer a rule or similarity metric
from just a few positive examples?
70Bayesian model
- H: hypothesis space of possible concepts.
- h1 = {2, 4, 6, 8, 10, 12, . . . , 96, 98, 100}
(even numbers) - h2 = {10, 20, 30, 40, . . . , 90, 100} (multiples
of 10) - h3 = {2, 4, 8, 16, 32, 64} (powers of 2)
- h4 = {50, 51, 52, . . . , 59, 60} (numbers between
50 and 60) - . . .
- Representational interpretations for H
- Candidate rules
- Features for similarity
- Consequential subsets (Shepard, 1987)
71Three hypothesis subspaces for number concepts
- Mathematical properties (24 hypotheses)
- Odd, even, square, cube, prime numbers
- Multiples of small integers
- Powers of small integers
- Raw magnitude (5050 hypotheses)
- All intervals of integers with endpoints between
1 and 100. - Approximate magnitude (10 hypotheses)
- Decades (1-10, 10-20, 20-30, )
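The subspace sizes quoted above can be checked by enumeration; a quick sketch (the decade construction is one reasonable construal of the overlapping endpoints listed above):

```python
# Raw magnitude: all intervals [a, b] with 1 <= a <= b <= 100
intervals = [(a, b) for a in range(1, 101) for b in range(a, 101)]
print(len(intervals))  # 5050

# Approximate magnitude: ten decade-sized intervals
decades = [(10 * i + 1, 10 * (i + 1)) for i in range(10)]
print(len(decades))  # 10
```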
72 Hypothesis spaces and theories
- Why a hypothesis space is like a domain theory
- Represents one particular way of classifying
entities in a domain. - Not just an arbitrary collection of hypotheses,
but a principled system. - What's missing?
- Explicit representation of the principles.
- Causality.
- Hypothesis space is generated by theory.
73Bayesian model
- H: hypothesis space of possible concepts.
- Mathematical properties: even, odd, square,
prime, . . . - Approximate magnitude: 1-10, 10-20, 20-30,
. . . - Raw magnitude: all intervals between 1 and 100.
- X = {x1, . . . , xn}: n examples of a concept C.
- Evaluate hypotheses given data:
- p(h): prior; domain knowledge, pre-existing
biases - p(X|h): likelihood; statistical information in
examples. - p(h|X): posterior; degree of belief that h is
the true extension of C.
74- Likelihood p(X|h)
- Size principle: smaller hypotheses receive
greater likelihood, and exponentially more so as
n increases. - Follows from the assumption of randomly sampled
examples. - Captures the intuition of a representative
sample.
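A minimal sketch of the size principle, using two hypotheses from the number game (sizes 10 and 50):

```python
def likelihood(examples, hypothesis):
    """Size principle: p(X|h) = (1/|h|)^n if all n examples fall in h, else 0."""
    if all(x in hypothesis for x in examples):
        return (1 / len(hypothesis)) ** len(examples)
    return 0.0

mult10 = set(range(10, 101, 10))   # 10 numbers
even = set(range(2, 101, 2))       # 50 numbers
X = [60, 80, 10, 30]
# The smaller hypothesis is favored by a factor of (50/10)^4 = 625:
print(likelihood(X, mult10) / likelihood(X, even))
```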
75Illustrating the size principle
2 4 6 8 10 12 14 16 18 20 22
24 26 28 30 32 34 36 38 40 42 44 46
48 50 52 54 56 58 60 62 64 66 68 70
72 74 76 78 80 82 84 86 88 90 92 94
96 98 100
h1
h2
76Illustrating the size principle
2 4 6 8 10 12 14 16 18 20 22
24 26 28 30 32 34 36 38 40 42 44 46
48 50 52 54 56 58 60 62 64 66 68 70
72 74 76 78 80 82 84 86 88 90 92 94
96 98 100
h1
h2
Data slightly more of a coincidence under h1
77Illustrating the size principle
2 4 6 8 10 12 14 16 18 20 22
24 26 28 30 32 34 36 38 40 42 44 46
48 50 52 54 56 58 60 62 64 66 68 70
72 74 76 78 80 82 84 86 88 90 92 94
96 98 100
h1
h2
Data much more of a coincidence under h1
78Bayesian Occam's Razor
M1
p(D = d | M)
M2
All possible data sets d
For any model M, Σd p(D = d | M) = 1
79- Prior p(h)
- Choice of hypothesis space embodies a strong
prior: effectively, p(h) = 0 for many logically
possible but conceptually unnatural hypotheses. - Prevents overfitting by highly specific but
unnatural hypotheses, e.g. "multiples of 10
except 50 and 70".
80A domain-general approach to priors?
- Start with a base set of regularities R and
combination operators C. - Hypothesis space: closure of R under C.
- C = {and, or}: H = unions and intersections of
regularities in R (e.g., multiples of 10 between
30 and 70). - C = {and-not}: H = regularities in R with
exceptions (e.g., multiples of 10 except 50 and
70). - Two qualitatively similar priors:
- Description length: number of combinations in C
needed to generate hypothesis from R. - Bayesian Occam's Razor, with model classes
defined by number of combinations: more
combinations → more hypotheses → lower
prior
81- Prior p(h)
- Choice of hypothesis space embodies a strong
prior: effectively, p(h) = 0 for many logically
possible but conceptually unnatural hypotheses. - Prevents overfitting by highly specific but
unnatural hypotheses, e.g. "multiples of 10
except 50 and 70". - p(h) encodes relative plausibility of alternative
theories:
- Mathematical properties: p(h) ~ 1
- Approximate magnitude: p(h) ~ 1/10
- Raw magnitude: p(h) ~ 1/50 (on
average) - Also degrees of plausibility within a theory,
- e.g., for magnitude intervals of size s:
[Figure: p(s) as a function of s]
82- Posterior
- X = {60, 80, 10, 30}
- Why prefer multiples of 10 over even numbers?
p(X|h). - Why prefer multiples of 10 over multiples of
10 except 50 and 20? p(h). - Why does a good generalization need both high
prior and high likelihood? p(h|X) ∝ p(X|h) p(h)
83Bayesian Occam's Razor
Probabilities provide a common currency for
balancing model complexity with fit to the data.
84Generalizing to new objects
Given p(h|X), how do we compute p(y ∈ C | X),
the probability that C applies to some new
stimulus y?
85Generalizing to new objects
Hypothesis averaging: compute the probability
that C applies to some new object y by averaging
the predictions of all hypotheses h, weighted by
p(h|X): p(y ∈ C | X) = Σh p(y ∈ C | h) p(h|X)
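Hypothesis averaging with the size-principle likelihood can be sketched as follows; the three-hypothesis space and uniform prior are simplifications for illustration:

```python
def posterior(X, hypotheses, prior):
    """p(h|X) proportional to p(X|h) p(h), with the size-principle likelihood."""
    def lik(h):
        return (1 / len(h)) ** len(X) if all(x in h for x in X) else 0.0
    scores = {name: lik(h) * prior[name] for name, h in hypotheses.items()}
    z = sum(scores.values())
    return {name: s / z for name, s in scores.items()}

def p_in_concept(y, X, hypotheses, prior):
    """Hypothesis averaging: sum p(h|X) over the hypotheses that contain y."""
    post = posterior(X, hypotheses, prior)
    return sum(p for name, p in post.items() if y in hypotheses[name])

hyps = {"even": set(range(2, 101, 2)),
        "mult10": set(range(10, 101, 10)),
        "50to60": set(range(50, 61))}
prior = {name: 1 / 3 for name in hyps}
X = [60, 80, 10, 30]
print(p_in_concept(70, X, hyps, prior))  # near 1: 70 fits the favored rule
print(p_in_concept(52, X, hyps, prior))  # near 0: 52 is merely even
```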
86Examples 16
87Examples 16 8 2 64
88Examples 16 23 19 20
89 Examples
Human generalization
Bayesian Model
60
60 80 10 30
60 52 57 55
16
16 8 2 64
16 23 19 20
90Summary of the Bayesian model
- How do the statistics of the examples interact
with prior knowledge to guide generalization? - Why does generalization appear rule-based or
similarity-based?
91Summary of the Bayesian model
- How do the statistics of the examples interact
with prior knowledge to guide generalization? - Why does generalization appear rule-based or
similarity-based?
92Alternative models
- Neural networks
- Supervised learning inapplicable.
- Simple unsupervised learning not sufficient
93Alternative models
- Neural networks
- Similarity to exemplars
- Average similarity
60
60 80 10 30
60 52 57 55
Data
Model (r = 0.80)
94Alternative models
- Neural networks
- Similarity to exemplars
- Average similarity
- Max similarity
60
60 80 10 30
60 52 57 55
Model (r = 0.64)
Data
95Alternative models
- Neural networks
- Similarity to exemplars
- Average similarity
- Max similarity
- Flexible similarity?
Bayes.
96Explaining similarity
- Hypothesis: a principal function of similarity is
generalization. - A theory of generalization can thus explain (some
aspects of) similarity: - The similarity of X to Y is to a significant
degree determined by the probability of
generalizing from X to Y, or from Y to X, or
both. - Opposite of the traditional approach: similarity
explains generalization.
97Explaining similarity
- Spatial models
- Why exponential decay with distance?
- Common feature models
- Why additive measure?
- What determines feature weights, and why?
- Specificity
- Relational preference
- Diagnosticity
- Context-sensitivity
- Contrast model
- Why (and when) are both common and distinctive
features relevant? - When is similarity asymmetric?
98Alternative models
- Neural networks
- Similarity to exemplars
- Average similarity
- Max similarity
- Flexible similarity? Bayes.
- Toolbox of simple heuristics
- 60 → general similarity
- 60 80 10 30 → most specific rule (subset
principle). - 60 52 57 55 → similarity in magnitude
Why these heuristics? When to use which
heuristic? Bayes.
99Numbers Summary
- Theory-based statistical inference explains
inductive generalization from one or a few
examples. - Explains the dynamics of both rule-like and
similarity-like generalization through the
interaction of - Structure of domain-specific knowledge.
- Domain-general principles of rational inference.
100Limitations of the number game
- No sense in which the theory is the right or
wrong description of world structure. - Number game is conventional, not natural.
- Purely logical structure of the theory does much
of the work, with statistics just selecting among
hypotheses. - Theory itself is not probabilistic.
- Theory just amounts to a systematization for a
set of hypotheses. - No causal mechanisms.
102Explaining similarity
- Spatial models
- Why exponential decay with distance?
- Common feature models
- Why additive measure?
- What determines feature weights, and why?
- Specificity
- Relational preference
- Diagnosticity
- Context-sensitivity
- Contrast model
- Why (and when) are both common and distinctive
features relevant? - When is similarity asymmetric?
103A hypothesis
- A principal function of similarity is
generalization. - A theory of generalization can thus explain (some
aspects of) similarity: - The similarity of X to Y is to a significant
degree determined by the probability of
generalizing from X to Y, or from Y to X, or
both. - Opposite of the traditional approach: similarity
explains generalization.
104Connection to feature-based similarity
- Additive clustering model of similarity:
sim(i, j) = Σk wk fik fjk
- Bayesian hypothesis averaging:
p(y ∈ C | X) = Σh p(h|X) [y ∈ h]
- Equivalent if we identify features fk with
hypotheses h, and weights wk with p(h|X).
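The claimed equivalence can be made concrete; in this sketch, features are sets of integers and the weights stand in for p(h|X) (the numbers are arbitrary):

```python
def addclus_sim(i, j, features, weights):
    """Additive clustering: sim(i, j) = sum of the weights of shared features."""
    return sum(w for f, w in zip(features, weights) if i in f and j in f)

# Bayesian reading: features play the role of hypotheses h,
# and the weights play the role of posteriors p(h|X)
features = [set(range(2, 101, 2)), {2, 4, 8, 16, 32, 64}]  # even numbers, powers of two
weights = [0.6, 0.4]                                       # arbitrary illustrative values
print(addclus_sim(4, 8, features, weights))   # shares both features
print(addclus_sim(4, 6, features, weights))   # shares only "even"
```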
105Explaining feature-based similarity
- What determines the relative weights of different
features? - p(h) encodes domain-specific factors
- p(X|h) encodes a domain-general factor: the
size principle - Predicts
106- Additive clustering for the integers 0-9:

Rank  Weight  Interpretation
1     .444    powers of two
2     .345    small numbers
3     .331    multiples of three
4     .291    large numbers
5     .255    middle numbers
6     .216    odd numbers
7     .214    smallish numbers
8     .172    largish numbers
108Prediction
[Figure: data with mean regression slope (s.d.) and mean VAF (r2) (s.d.)]
109Explaining feature-based similarity
- What determines the relative weights of different
features? - p(h) encodes domain-specific factors
- p(X|h) encodes a domain-general factor: the
size principle - Predicts
- Predicts relative salience of relational features
110- Why are (some but not all) relational features
more salient than surface features?
e.g.,
Hypothesis          Subset size (with vocabulary of m shapes)
all same            m
triangle on top     m^2
all different       m(m-1)(m-2)
111Feature-based similarity as Bayesian inference
- A rational account of feature weighting.
- Separates domain-general factors, p(X|h), from
domain-specific factors, p(h). - Predicts a domain-general scaling law
- Predicts some aspects of relational salience.
113The plan
- Basic causal learning
- Inferring number concepts
- Reasoning with biological properties
- Acquisition of domain theories
- Intuitive biology Taxonomic structure
- Intuitive physics Causal law
114- Which argument is stronger?
- Horses have biotinic acid in their blood
- Cows have biotinic acid in their blood
- Rhinos have biotinic acid in their blood
- All mammals have biotinic acid in their blood
- Squirrels have biotinic acid in their blood
- Dolphins have biotinic acid in their blood
- Rhinos have biotinic acid in their blood
- All mammals have biotinic acid in their blood
115- Osherson, Smith, Wilkie, Lopez, Shafir (1990)
- 20 subjects rated the strength of 45 arguments
- X1 have property P.
- X2 have property P.
- X3 have property P.
- All mammals have property P.
- 40 different subjects rated the similarity of all
pairs of 10 mammals.
116Similarity-based models (Osherson et al.)
[Figure sequence, slides 116-125: the strength of an argument is computed from the similarities of each mammal to the example set, either summed over the examples (S) or maximized over the examples (max).]
126Sum-Sim versus Max-Sim
- Two models appear functionally similar
- Both increase monotonically as new examples are
observed. - Reasons to prefer sum-sim
- Standard form of exemplar models of
categorization, memory, and object recognition. - Analogous to kernel density estimation techniques
in statistical pattern recognition. - Reasons to prefer max-sim
- Fit to generalization judgments . . . .
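The two models can be sketched as follows; the similarity values are hypothetical:

```python
def sum_sim(y, examples, sim):
    """Sum-sim: total similarity of conclusion category y to the example set."""
    return sum(sim[y][x] for x in examples)

def max_sim(y, examples, sim):
    """Max-sim: similarity of y to its single nearest example."""
    return max(sim[y][x] for x in examples)

# Hypothetical similarities of "rhino" to three premise categories
sim = {"rhino": {"horse": 0.8, "cow": 0.7, "dolphin": 0.1}}
print(sum_sim("rhino", ["horse", "cow"], sim))  # grows with near-duplicate examples
print(max_sim("rhino", ["horse", "cow"], sim))  # insensitive to redundancy
```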
127Data vs. models
Data
Model
X1 have property P. X2 have property P. X3 have
property P. All mammals have property P.
Each point represents one argument.
128Three data sets
Max-sim
Sum-sim
Conclusion kind
all mammals
horses
horses
Number of examples
3 2
1, 2, or 3
129Explaining similarity
- Why does max-sim fit so well?
- Why does sum-sim fit so poorly?
- Are there cases where max-sim will fail?
130Marr's Three Levels of Analysis
- Computation
- What is the goal of the computation, why is it
appropriate, and what is the logic of the
strategy by which it can be carried out? - Representation and algorithm
- Max-Sim, Sum-Sim
- Implementation
- Neurobiology
131Scientific theory of biology
- Species generated by an evolutionary branching
process. - A tree-structured taxonomy of species.
132Scientific theory of biology
- Species generated by an evolutionary branching
process. - A tree-structured taxonomy of species.
- Features generated by stochastic mutation process
and passed on to descendants. - Similarity a function of distance in tree.
133An intuitive theory of biology
- Species generated by an evolutionary branching
process. - A tree-structured taxonomy of species.
- Features generated by stochastic mutation process
and passed on to descendants. - Similarity a function of distance in tree.
Sources: cognitive anthropology (Atran, Medin);
cognitive development (Keil, Carey)
134A model of theory-based induction
- 1. Reconstruct intuitive taxonomy from similarity
judgments
cow
chimp
horse
rhino
seal
gorilla
dolphin
mouse
squirrel
elephant
135A model of theory-based induction
- 2. Hypothesis space H each taxonomic cluster is
a possible hypothesis for the extension of a
novel feature.
. . .
136p(h): uniform
137Bayes (taxonomic)
Max-sim
Sum-sim
Conclusion kind
all mammals
horses
horses
Number of examples
3 2
1, 2, or 3
138Bayes (taxonomic)
Max-sim
Sum-sim
Conclusion kind
all mammals
Number of examples
3
139Cows have property P. Dolphins have property
P. Squirrels have property P. All mammals have
property P.
140Seals have property P. Dolphins have property
P. Squirrels have property P. All mammals have
property P.
141Scientific theory of biology
- Species generated by an evolutionary branching
process. - A tree-structured taxonomy of species.
- Features generated by stochastic mutation process
and passed on to descendants. - Similarity a function of distance in tree.
142Scientific theory of biology
- Species generated by an evolutionary branching
process. - A tree-structured taxonomy of species.
- Features generated by stochastic mutation process
and passed on to descendants. - Similarity a function of distance in tree.
- Novel features can appear anywhere in tree, but
some distributions are more likely than others.
143A model of theory-based induction
- 2. Hypothesis space H: each taxonomic cluster is
a possible hypothesis for the extension of a
novel feature.
. . .
144A model of theory-based induction
- 2. Generate hypotheses for novel feature F via
(Poisson arrival) mutation process over branches
b
145A model of theory-based induction
2. Generate hypotheses for novel feature F via
(Poisson arrival) mutation process over branches
b
cow
chimp
rhino
horse
seal
gorilla
mouse
dolphin
squirrel
elephant
146A model of theory-based induction
- 2. Generate hypotheses for novel feature F via
(Poisson arrival) mutation process over branches
b
cow
chimp
rhino
horse
seal
gorilla
mouse
dolphin
squirrel
elephant
147A model of theory-based induction
- 2. Generate hypotheses for novel feature F via
(Poisson arrival) mutation process over branches
b
cow
chimp
rhino
horse
seal
gorilla
mouse
dolphin
squirrel
elephant
148A model of theory-based induction
- 2. Generate hypotheses for novel feature F via
(Poisson arrival) mutation process over branches
b
cow
chimp
rhino
horse
seal
gorilla
mouse
dolphin
squirrel
elephant
149A model of theory-based induction
- 2. Generate hypotheses for novel feature F via
(Poisson arrival) mutation process over branches
b
cow
chimp
rhino
horse
seal
gorilla
mouse
dolphin
squirrel
elephant
150A model of theory-based induction
- 2. Generate hypotheses for novel feature F via
(Poisson arrival) mutation process over branches
b
- Induced prior p(h):
- Every subset of objects is a possible hypothesis.
- Prior p(h) depends on the number and length of
branches needed to span h.
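The induced prior can be approximated by Monte Carlo simulation of the mutation process; a sketch over a hypothetical toy fragment of the taxonomy (branch lengths are invented):

```python
import math
import random
from collections import Counter

random.seed(0)  # reproducible Monte Carlo

# Hypothetical toy tree, as (leaves below the branch, branch length) pairs
branches = [({"horse", "cow"}, 0.5), ({"horse"}, 0.2), ({"cow"}, 0.2),
            ({"dolphin", "seal"}, 0.1), ({"dolphin"}, 0.3), ({"seal"}, 0.3),
            ({"horse", "cow", "dolphin", "seal"}, 0.4)]

def sample_feature(rate=1.0):
    """One draw from the mutation process: the feature arises on each branch
    with probability 1 - exp(-rate * length) and marks all leaves below it."""
    h = set()
    for leaves, length in branches:
        if random.random() < 1 - math.exp(-rate * length):
            h |= leaves
    return frozenset(h)

# Monte Carlo estimate of the induced prior p(h)
counts = Counter(sample_feature() for _ in range(20000))

# A set spanned by one long branch gets higher prior than a set
# that requires mutations on two short branches:
print(counts[frozenset({"horse", "cow"})] > counts[frozenset({"dolphin", "seal"})])
```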
151Bayesian Occam's Razor
Probabilities provide a common currency for
balancing model complexity with fit to the data.
152Induced prior p(h)
- Monophyletic properties more likely than
polyphyletic properties
p( {chimp, gorilla, elephant, rhino} ) >
p( {horse, cow, elephant, rhino} )
153Induced prior p(h)
- Novel properties more likely to occur on long
branches than on short branches
p( {horse, cow} ) >
p( {dolphin, seal} )
154p(h): evolutionary process (mutation +
inheritance)
155Bayes (taxonomic)
Max-sim
Sum-sim
Conclusion kind
all mammals
horses
horses
Number of examples
3 2
1, 2, or 3
156Bayes (taxonomy + mutation)
Max-sim
Sum-sim
Conclusion kind
all mammals
horses
horses
Number of examples
3 2
1, 2, or 3
157Model variants
- Version 1
- Simple taxonomic hypothesis space instead of full
hypothesis space with prior based on mutation
process. - Version 2
- Simple taxonomic hypothesis space with Hebbian
learning instead of Bayesian inference. - Version 3
- Taxonomy based on actual evolutionary tree rather
than psychological similarity.
158r = 0.51
r = 0.41
r = 0.90
Bayes (taxonomic)
r = -0.41
r = 0.88
r = 0.45
Hebb (taxonomic)
r = 0.40
r = 0.60
r = 0.61
Bayes (actual evolutionary tree)
Conclusion kind
all mammals
horses
horses
Number of examples
3 2
1, 2, or 3
159Mutation principle versus pure Occam's Razor
- Mutation principle provides a version of Occam's
Razor, by favoring hypotheses that span fewer
disjoint clusters. - Could we use a more generic Bayesian Occam's
Razor, without the biological motivation of
mutation?
160A model of theory-based induction
- 2. Generate hypotheses for novel feature F via
(Poisson arrival) mutation process over branches
b
- Induced prior p(h):
- Every subset of objects is a possible hypothesis.
- Prior p(h) depends on the number and length of
branches needed to span h.
161A model of theory-based induction
- 2. Generate hypotheses for novel feature F via
(Poisson arrival) mutation process over branches
b
- Induced prior p(h):
- Every subset of objects is a possible hypothesis.
- Prior p(h) depends on the number and length of
branches needed to span h.
162Bayes (taxonomy + Occam)
Premise typicality effect (Rips, 1975;
Osherson et al., 1990): Strong vs. Weak
Max-sim
Horses have property P. All mammals have
property P.
Sum-sim
Seals have property P. All mammals have
property P.
Conclusion kind
all mammals
Number of examples
1
163Bayes (taxonomy + mutation)
Premise typicality effect (Rips, 1975;
Osherson et al., 1990): Strong vs. Weak
Max-sim
Horses have property P. All mammals have
property P.
Sum-sim
Seals have property P. All mammals have
property P.
Conclusion kind
all mammals
Number of examples
1
164Typicality meets hierarchies
- Collins and Quillian: semantic memory structured
hierarchically - Traditional story: hierarchical structure
incompatible with typicality effects on RT. - New story: typicality effects are a consequence of
inference machinery, not knowledge representation.
165Intuitive versus scientific theories of biology
- Same structure for how species are related.
- Tree-structured taxonomy.
- Same probabilistic model for traits:
- Small probability of occurring along any branch
at any time, plus inheritance. - Different features:
- Scientist: genes
- People: coarse anatomy and behavior
166Markov Random Field
- Define neighborhood graph by threshold on
similarity. - Nodes represent binary labeling variables m(i):
- m(i) = 1 if object i is in the concept
- m(i) = 0 otherwise
- Potential function on edge (i, j):
- sim(i, j) if m(i) = m(j)
- 1 - sim(i, j) otherwise
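The potential function can be sketched directly; the neighborhood graph and similarity values below are hypothetical:

```python
def edge_potential(sim_ij, m_i, m_j):
    """Pairwise potential: reward agreeing labels on similar objects."""
    return sim_ij if m_i == m_j else 1 - sim_ij

def labeling_score(labels, sims):
    """Unnormalized MRF score: product of edge potentials over the graph."""
    score = 1.0
    for (i, j), s in sims.items():
        score *= edge_potential(s, labels[i], labels[j])
    return score

# Hypothetical two-edge neighborhood graph
sims = {("horse", "cow"): 0.9, ("horse", "dolphin"): 0.2}
agree = {"horse": 1, "cow": 1, "dolphin": 0}
print(labeling_score(agree, sims))  # 0.9 * (1 - 0.2), approximately 0.72
```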
r = 0.93
r = 0.80
r = 0.65
MRF (pairwise potentials based on similarity)
Max-sim
Sum-sim
Conclusion kind
all mammals
horses
horses
Number of examples
3 2
1, 2, or 3
168Bayes (taxonomy + mutation)
Max-sim
Sum-sim
Conclusion kind
all mammals
horses
horses
Number of examples
3 2
1, 2, or 3
169Explaining similarity
- Why does max-sim fit so well?
- Why does sum-sim fit so poorly?
- Are there cases where max-sim will fail?
170Explaining similarity
- Why does max-sim fit so well?
- An efficient and accurate approximation to the
Bayesian (evolution) model.
Correlation with Bayes on three-premise general
arguments, over 100 simulated tree structures:
mean r = 0.94
171Explaining similarity
- Why does max-sim fit so well?
- Approximation is domain-specific (c.f. the number
game)
60
60 80 10 30
60 52 57 55
Model (r = 0.64)
Data
172Explaining similarity
- Why does sum-sim fit so poorly?
- Prefers sets of the most typical examples, which
are not representative of the category as a whole.
Correlation with Bayes on three-premise general
arguments, over 100 simulated tree structures:
mean r = 0.26
173Explaining similarity
- Are there cases where max-sim will fail?
- An example from Medin et al. (in press)
Brown bears have property P Polar bears have
property P Grizzly bears have property P
Horses have property P.
Brown bears have property P Horses have property
P.
Bayesian model makes the correct prediction, due
to the size principle (assumption of examples
sampled randomly from concept).
174A more systematic test of the Size Principle
175Biology Summary
- Theory-based statistical inference explains
taxonomic inductive reasoning in folk biology. - Reveals essential principles of the domain theory.
- Category structure: taxonomic tree.
- Feature distribution: stochastic mutation process
plus inheritance. - Clarifies processing-level models.
- Why max-sim over sum-sim?
- When is max-sim a good heuristic approximation to
full Bayesian inference?