Title: Bayesian Models of Human Learning and Inference. Josh Tenenbaum, MIT Department of Brain and Cognitive Sciences
1Bayesian Models of Human Learning and Inference
Josh Tenenbaum, MIT, Department of Brain and
Cognitive Sciences
2Shiffrin Says
- Progress in science is driven by new tools, not
great insights.
3Outline
- Part I. Brief survey of Bayesian modeling in
cognitive science. - Part II. Bayesian models of everyday inductive
leaps.
4Collaborators
- Tom Griffiths Neville Sanjana
- Charles Kemp Mark Steyvers
- Tevye Krynski Sean Stromsten
- Sourabh Niyogi
- Fei Xu Dave Sobel
- Wheeler Ruml Alison Gopnik
5Collaborators
- Tom Griffiths Neville Sanjana
- Charles Kemp Mark Steyvers
- Tevye Krynski Sean Stromsten
- Sourabh Niyogi
- Fei Xu Dave Sobel
- Wheeler Ruml Alison Gopnik
6Outline
- Part I. Brief survey of Bayesian modeling in
cognitive science. - Rational benchmark for descriptive models of
probability judgment. - Rational analysis of cognition
- Rational tools for fitting cognitive models
7Normative benchmark for descriptive models
- How does human probability judgment compare to
the Bayesian ideal? - Peterson & Beach, Edwards, Tversky &
Kahneman, . . . - Explicit probability judgment tasks
- Drawing balls from an urn, rolling dice, medical
diagnosis, . . . . - Alternative descriptive models
- Heuristics and Biases, Support Theory, . . . .
8Rational analysis of cognition
- Develop Bayesian models for core aspects of
cognition not traditionally thought of in terms
of statistical inference. - Examples
- Memory retrieval (Anderson; Shiffrin et al., . . .)
- Reasoning with rules (Oaksford & Chater, . . .)
9Rational analysis of cognition
- Often can explain a wider range of phenomena than
previous models, with fewer free parameters.
Spacing effects on retention
Power laws of practice and retention
10Rational analysis of cognition
- Often can explain a wider range of phenomena than
previous models, with fewer free parameters. - Anderson's rational analysis of memory
- For each item in memory, estimate the probability
that it will be useful in the present context. - Model of need probability inspired by library
book access. Corresponds to statistics of
natural information sources
11Rational analysis of cognition
- Often can explain a wider range of phenomena than
previous models, with fewer free parameters. - Anderson's rational analysis of memory
- For each item in memory, estimate the probability
that it will be useful in the present context. - Model of need probability inspired by library
book access. Corresponds to statistics of
natural information sources
[Figure: log need odds vs. log days since last occurrence, for short and long retention lags]
12Rational analysis of cognition
- Often can show that apparently irrational
behavior is actually rational.
Which cards do you have to turn over to test this
rule? If there is an A on one side, then there
is a 2 on the other side
13Rational analysis of cognition
- Often can show that apparently irrational
behavior is actually rational. - Oaksford & Chater's rational analysis
- Optimal data selection based on maximizing
expected information gain. - Test the rule "If p, then q" against the null
hypothesis that p and q are independent. - Assuming p and q are rare predicts people's
choices
14Rational tools for fitting cognitive models
- Use Bayesian Occam's Razor to solve the problem
of model selection: trade off fit to the data
with model complexity. - Examples
- Comparing alternative cognitive models Myung,
Pitt, . . . . - Fitting nested families of models of mental
representation Lee, Navarro, . . . .
15Rational tools for fitting cognitive models
- Comparing alternative cognitive models via an MDL
approximation to the Bayesian Occam's Razor takes
into account the functional form of a model as
well as the number of free parameters.
16Rational tools for fitting cognitive models
- Fit models of mental representation to similarity
data, e.g. additive clustering, additive trees,
common and distinctive feature models.
- Want to choose the complexity of the model
(number of features, depth of tree) in a
principled way, and search efficiently through
the space of nested models, using Bayesian
Occam's Razor.
17Outline
- Part I. Brief survey of Bayesian modeling in
cognitive science. - Part II. Bayesian models of everyday inductive
leaps.
Rational models of cognition where Bayesian model
selection and the Bayesian Occam's Razor play a
central explanatory role.
18Everyday inductive leaps
- How can we learn so much about . . .
- Properties of natural kinds
- Meanings of words
- Future outcomes of a dynamic process
- Hidden causal properties of an object
- Causes of a person's action (beliefs, goals)
- Causal laws governing a domain
- . . . from such limited data?
19Learning concepts and words
20Learning concepts and words
- Can you pick out the tufas?
21Inductive reasoning
Input
(premises)
(conclusion)
Task Judge how likely conclusion is to be
true, given that premises are true.
22Inferring causal relations
Input
Took vitamin B23?  Headache?
Day 1:  yes   no
Day 2:  yes   yes
Day 3:  no    yes
Day 4:  yes   no
. . .
Does vitamin B23 cause headaches?
Task Judge probability of a causal link
given several joint observations.
23The Challenge
- How do we generalize successfully from very
limited data? - Just one or a few examples
- Often only positive examples
- Philosophy
- Induction is a problem, a riddle, a
paradox, a scandal, or a myth. - Machine learning and statistics
- Focus on generalization from many examples, both
positive and negative.
24Rational statistical inference (Bayes, Laplace)
25History of Bayesian Approaches to Human Inductive
Learning
26History of Bayesian Approaches to Human Inductive
Learning
- Hunt
- Suppes
- Observable changes of hypotheses under positive
reinforcement, Science (1965), w/ M. Schlag-Rey.
- A tentative interpretation is that, when the set
of hypotheses is large, the subject samples or
attends to several hypotheses simultaneously. . .
. It is also conceivable that a subject might
sample spontaneously, at any time, or under
stimulations other than those planned by the
experimenter. A more detailed exploration of
these ideas, including a test of Bayesian
approaches to information processing, is now
being made.
28History of Bayesian Approaches to Human Inductive
Learning
- Hunt
- Suppes
- Shepard
- Analysis of one-shot stimulus generalization, to
explain the universal exponential law. - Anderson
- Rational analysis of categorization.
29Theory-Based Bayesian Models
- Explain the success of everyday inductive leaps
based on rational statistical inference
mechanisms constrained by domain theories
well-matched to the structure of the world. - Rational statistical inference (Bayes)
- Domain theories generate the necessary
ingredients hypothesis space H, priors p(h).
30Questions about theories
- What is a theory?
- Working definition: an ontology and a system of
abstract (causal) principles that generates a
hypothesis space of candidate world structures
(e.g., Newton's laws). - How is a theory used to learn about the structure
of the world? - How is a theory acquired?
- Probabilistic generative model statistical
learning.
31Alternative approaches to inductive generalization
- Associative learning
- Connectionist networks
- Similarity to examples
- Toolkit of simple heuristics
- Constraint satisfaction
32Marr's Three Levels of Analysis
- Computation
- What is the goal of the computation, why is it
appropriate, and what is the logic of the
strategy by which it can be carried out? - Representation and algorithm
- Cognitive psychology
- Implementation
- Neurobiology
33Descriptive Goals
- Principled mathematical models, with a minimum of
arbitrary assumptions. - Close quantitative fits to behavioral data.
- Unified models of cognition across domains.
34Explanatory Goals
- How do we reliably acquire knowledge about the
structure of the world, from such limited
experience? - Which processing models work, and why?
- New views on classic questions in cognitive
science - Symbols (rules, logic, hierarchies, relations)
versus Statistics. - Theory-based inference versus Similarity-based
inference. - Domain-specific knowledge versus Domain-general
mechanisms. - Provides a route to studying people's hidden
(implicit or unconscious) knowledge about the
world.
35The plan
- Basic causal learning
- Inferring number concepts
- Reasoning with biological properties
- Acquisition of domain theories
- Intuitive biology Taxonomic structure
- Intuitive physics Causal law
36The plan
- Basic causal learning
- Inferring number concepts
- Reasoning with biological properties
- Acquisition of domain theories
- Intuitive biology Taxonomic structure
- Intuitive physics Causal law
37Learning a single causal relation
Given a random sample of mice
- To what extent does chemical X cause gene Y
to be expressed?
- Or, what is the probability that X causes Y?
38Associative models of causal strength judgment
- Delta-P (or Asymptotic Rescorla-Wagner)
- Power PC (Cheng, 1997)
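Both measures can be written as simple estimators from trial counts. A minimal sketch (the counts below are hypothetical, not from the slides):

```python
def delta_p(n_e_c, n_c, n_e_notc, n_notc):
    """Delta-P: P(E|C) - P(E|~C), estimated from contingency counts."""
    return n_e_c / n_c - n_e_notc / n_notc

def causal_power(n_e_c, n_c, n_e_notc, n_notc):
    """Power PC (Cheng, 1997): Delta-P / (1 - P(E|~C)), for generative causes."""
    p_e_notc = n_e_notc / n_notc
    return (n_e_c / n_c - p_e_notc) / (1 - p_e_notc)

# e.g., effect on 6 of 8 cause-present trials, 2 of 8 cause-absent trials
print(delta_p(6, 8, 2, 8))       # 0.5
print(causal_power(6, 8, 2, 8))  # 0.666...
```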
39Some behavioral data (Buehner & Cheng, 1997)
People
ΔP
Power PC
- Independent effects of both causal power and ΔP.
- Neither theory explains the trend for ΔP = 0.
40Bayesian causal inference
w0, w1 strength parameters for B, C
41Bayesian causal inference
- Hypotheses h1 h0
-
- Probabilistic model noisy-OR
w0, w1 strength parameters for B, C
P(E=1 | C, B)
C  B     h1                  h0
0  0     0                   0
1  0     w1                  0
0  1     w0                  w0
1  1     w1 + w0 - w1 w0     w0
42Bayesian causal inference
- Hypotheses h1 h0
-
- Probabilistic model noisy-OR
Background cause B: unobserved, always present
(B = 1)
w0, w1 strength parameters for B, C
P(E=1 | C, B)
C  B     h1                  h0
0  0     0                   0
1  0     w1                  0
0  1     w0                  w0
1  1     w1 + w0 - w1 w0     w0
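The noisy-OR parameterization in the table can be checked directly; a small sketch (the strength values are arbitrary illustrations):

```python
def noisy_or(w0, w1, b, c):
    """P(E=1 | B=b, C=c) under h1: each present cause independently
    fails to produce E with probability (1 - strength)."""
    return 1 - (1 - w0) ** b * (1 - w1) ** c

# Reproduce the h1 column of the table (B = 1 in every trial):
w0, w1 = 0.3, 0.5
print(noisy_or(w0, w1, 1, 0))  # w0
print(noisy_or(w0, w1, 1, 1))  # w0 + w1 - w0*w1
```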
43Inferring structure versus estimating strength
- Hypotheses h1, h0
- Both causal power and ΔP correspond to maximum
likelihood estimates of the strength parameter
w1, under different parameterizations for
p(E|B,C): linear (ΔP), noisy-OR (causal power)
- Causal support model people are judging the
probability that a causal link exists, rather
than assuming it exists and estimating its
strength.
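Under the uniform-priors assumption made later (slide 46), causal support can be approximated by averaging the likelihood over a grid of strength values; a rough numerical sketch (the data set is hypothetical):

```python
import math

def likelihood(w0, w1, data):
    """P(data | w0, w1) under noisy-OR; data = list of (c, e), B=1 on every trial."""
    p = 1.0
    for c, e in data:
        p_e = 1 - (1 - w0) * (1 - w1) ** c   # P(E=1 | C=c, B=1)
        p *= p_e if e == 1 else 1 - p_e
    return p

def causal_support(data, n=101):
    """log P(data|h1) - log P(data|h0): average likelihood over a uniform
    grid of strengths (h0 fixes w1 = 0, i.e. no C -> E link)."""
    w = [i / (n - 1) for i in range(n)]
    m1 = sum(likelihood(a, b, data) for a in w for b in w) / (n * n)
    m0 = sum(likelihood(a, 0.0, data) for a in w) / n
    return math.log(m1) - math.log(m0)

# Hypothetical data: effect on 6/8 cause-present and 2/8 cause-absent trials
data = [(1, 1)] * 6 + [(1, 0)] * 2 + [(0, 1)] * 2 + [(0, 0)] * 6
print(causal_support(data) > 0)  # True: evidence favors a C -> E link
```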
44Role of domain theory
(c.f. PRMs, ILP, Knowledge-based model
construction)
- Generates hypothesis space of causal graphical
models - Causally relevant attributes of objects
- Constrains random variables (nodes).
- Causally relevant relations between attributes
- Constrains dependence structure of variables
(arcs). - Causal mechanisms: how effects depend
functionally on their causes - Constrains local probability distribution for
each variable conditioned on its direct causes
(parents).
45Role of domain theory
- Injections may or may not cause gene expression,
but gene expression does not cause injections. - No hypotheses with E → C
- Other naturally occurring processes may also
cause gene expression. - All hypotheses include an always-present
background cause B → E - Causes are probabilistically sufficient and
independent (Cheng): each cause independently
produces the effect in some proportion of cases. - Noisy-OR causal mechanism
46Bayesian causal inference
- Hypotheses h1, h0
- Probabilistic model: noisy-OR
- Assume all priors uniform . . .
47Bayesian Occam's Razor
P(data | model)
All possible data sets
48Bayesian Occam's Razor
P(data | model)
low w1
high w1
All possible data sets
49Bayesian Occam's Razor
P(data | model)
low w1
high w1
50Bayesian Occam's Razor
P(data | model)
low w1
high w1
51Buehner & Cheng, 1997
People
DP
Power PC
Bayes
52Sensitivity analysis
- How much work does domain theory do?
- Alternative model: Bayes with arbitrary P(E|B,C)
- How much work does Bayes do?
- Alternative model: χ2 measure of independence.
Bayes without noisy-OR theory
χ2
53People
ΔP
Power PC (MLE w/ noisy-OR)
Bayes w/ noisy-OR theory
Bayes without noisy-OR theory
χ2
54Varying number of observations
People (n8)
Bayes (n8)
People (n60)
Bayes (n60)
55Data for inhibitory causes
People
ΔP
Power PC (MLE w/ noisy-AND-NOT)
Bayes w/ noisy-AND-NOT
56Causal inference with rates
People
ΔR
Power PC (N150)
Bayes w/ Poisson parameterization
57Causal induction summary
- People's judgments closely reflect optimal
Bayesian model selection, constrained by a
minimal domain theory. - Beyond elemental causal induction
- More complex inferences, with causal networks,
hidden variables, active learning. - Stronger inferences, with richer prior knowledge.
- Discovery of causal domain theories.
58Scope of Bayesian causal inference
- Causal strength judgments
- One-shot causal inferences in children and adults
(the blicket detector) - Inferring causal networks
- Inferring hidden variables
- Perception of causality
- Perception of hidden causes
- Learning causal theories
59The plan
- Basic causal learning
- Inferring number concepts
- Reasoning with biological properties
- Acquisition of domain theories
- Intuitive biology Taxonomic structure
- Intuitive physics Causal law
60The number game
- Program input: a number between 1 and 100
- Program output: yes or no
61The number game
- Learning task
- Observe one or more positive (yes) examples.
- Judge whether other numbers are yes or no.
62The number game
Examples of yes numbers
Generalization judgments (N = 20)
60
Diffuse similarity
63The number game
Examples of yes numbers
Generalization judgments (N = 20)
60
Diffuse similarity
60 80 10 30
Rule: multiples of 10
64The number game
Examples of yes numbers
Generalization judgments (N = 20)
60
Diffuse similarity
60 80 10 30
Rule: multiples of 10
Focused similarity: numbers near 50-60
60 52 57 55
65The number game
Examples of yes numbers
Generalization judgments (N = 20)
16
Diffuse similarity
16 8 2 64
Rule: powers of 2
Focused similarity: numbers near 20
16 23 19 20
66The number game
- Main phenomena to explain
- Generalization can appear either similarity-based
(graded) or rule-based (all-or-none). - Learning from just a few positive examples.
67Rule/similarity hybrid models
- Category learning
- Nosofsky, Palmeri et al. RULEX
- Erickson Kruschke ATRIUM
68Divisions into rule and similarity subsystems
- Category learning
- Nosofsky, Palmeri et al. RULEX
- Erickson Kruschke ATRIUM
- Language processing
- Pinker, Marcus et al. Past tense morphology
- Reasoning
- Sloman
- Rips
- Nisbett, Smith et al.
69Rule/similarity hybrid models
- Why two modules?
- Why do these modules work the way that they do,
and interact as they do? - How do people infer a rule or similarity metric
from just a few positive examples?
70Bayesian model
- H: hypothesis space of possible concepts.
- h1 = {2, 4, 6, 8, 10, 12, . . . , 96, 98, 100}
(even numbers) - h2 = {10, 20, 30, 40, . . . , 90, 100} (multiples
of 10) - h3 = {2, 4, 8, 16, 32, 64} (powers of 2)
- h4 = {50, 51, 52, . . . , 59, 60} (numbers between
50 and 60) - . . .
- Representational interpretations for H
- Candidate rules
- Features for similarity
- Consequential subsets (Shepard, 1987)
71Three hypothesis subspaces for number concepts
- Mathematical properties (24 hypotheses)
- Odd, even, square, cube, prime numbers
- Multiples of small integers
- Powers of small integers
- Raw magnitude (5050 hypotheses)
- All intervals of integers with endpoints between
1 and 100. - Approximate magnitude (10 hypotheses)
- Decades (1-10, 10-20, 20-30, )
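The subspace sizes quoted above can be checked by enumeration; a quick sketch (the decade construction is one reasonable construal of the overlapping endpoints listed above):

```python
# Raw magnitude: all intervals [a, b] with 1 <= a <= b <= 100
intervals = [(a, b) for a in range(1, 101) for b in range(a, 101)]
print(len(intervals))  # 5050

# Approximate magnitude: ten decade-sized intervals
decades = [(10 * i + 1, 10 * (i + 1)) for i in range(10)]
print(len(decades))  # 10
```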
72 Hypothesis spaces and theories
- Why a hypothesis space is like a domain theory
- Represents one particular way of classifying
entities in a domain. - Not just an arbitrary collection of hypotheses,
but a principled system. - What's missing?
- Explicit representation of the principles.
- Causality.
- Hypothesis space is generated by theory.
73Bayesian model
- H: hypothesis space of possible concepts.
- Mathematical properties: even, odd, square,
prime, . . . - Approximate magnitude: 1-10, 10-20, 20-30,
. . . - Raw magnitude: all intervals between 1 and 100.
- X = {x1, . . . , xn}: n examples of a concept C.
- Evaluate hypotheses given data:
- p(h): prior; domain knowledge, pre-existing
biases - p(X|h): likelihood; statistical information in
examples. - p(h|X): posterior; degree of belief that h is
the true extension of C.
74- Likelihood p(X|h)
- Size principle: smaller hypotheses receive
greater likelihood, and exponentially more so as
n increases. - Follows from the assumption of randomly sampled
examples. - Captures the intuition of a representative
sample.
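A minimal sketch of the size principle, using two hypotheses from the number game (sizes 10 and 50):

```python
def likelihood(examples, hypothesis):
    """Size principle: p(X|h) = (1/|h|)^n if all n examples fall in h, else 0."""
    if all(x in hypothesis for x in examples):
        return (1 / len(hypothesis)) ** len(examples)
    return 0.0

mult10 = set(range(10, 101, 10))   # 10 numbers
even = set(range(2, 101, 2))       # 50 numbers
X = [60, 80, 10, 30]
# The smaller hypothesis is favored by a factor of (50/10)^4 = 625:
print(likelihood(X, mult10) / likelihood(X, even))
```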
75Illustrating the size principle
2 4 6 8 10 12 14 16 18 20 22
24 26 28 30 32 34 36 38 40 42 44 46
48 50 52 54 56 58 60 62 64 66 68 70
72 74 76 78 80 82 84 86 88 90 92 94
96 98 100
h1
h2
76Illustrating the size principle
2 4 6 8 10 12 14 16 18 20 22
24 26 28 30 32 34 36 38 40 42 44 46
48 50 52 54 56 58 60 62 64 66 68 70
72 74 76 78 80 82 84 86 88 90 92 94
96 98 100
h1
h2
Data slightly more of a coincidence under h1
77Illustrating the size principle
2 4 6 8 10 12 14 16 18 20 22
24 26 28 30 32 34 36 38 40 42 44 46
48 50 52 54 56 58 60 62 64 66 68 70
72 74 76 78 80 82 84 86 88 90 92 94
96 98 100
h1
h2
Data much more of a coincidence under h1
78Bayesian Occam's Razor
M1
p(D = d | M)
M2
All possible data sets d
For any model M, Σd p(D = d | M) = 1
79- Prior p(h)
- Choice of hypothesis space embodies a strong
prior: effectively, p(h) = 0 for many logically
possible but conceptually unnatural hypotheses. - Prevents overfitting by highly specific but
unnatural hypotheses, e.g. "multiples of 10
except 50 and 70".
80A domain-general approach to priors?
- Start with a base set of regularities R and
combination operators C. - Hypothesis space: closure of R under C.
- C = {and, or}: H = unions and intersections of
regularities in R (e.g., multiples of 10 between
30 and 70). - C = {and-not}: H = regularities in R with
exceptions (e.g., multiples of 10 except 50 and
70). - Two qualitatively similar priors:
- Description length: number of combinations in C
needed to generate hypothesis from R. - Bayesian Occam's Razor, with model classes
defined by number of combinations: more
combinations → more hypotheses → lower
prior
81- Prior p(h)
- Choice of hypothesis space embodies a strong
prior: effectively, p(h) = 0 for many logically
possible but conceptually unnatural hypotheses. - Prevents overfitting by highly specific but
unnatural hypotheses, e.g. "multiples of 10
except 50 and 70". - p(h) encodes relative plausibility of alternative
theories:
- Mathematical properties: p(h) ~ 1
- Approximate magnitude: p(h) ~ 1/10
- Raw magnitude: p(h) ~ 1/50 (on
average) - Also degrees of plausibility within a theory,
- e.g., for magnitude intervals of size s:
[Figure: p(s) as a function of s]
82- Posterior
- X = {60, 80, 10, 30}
- Why prefer multiples of 10 over even numbers?
p(X|h). - Why prefer multiples of 10 over multiples of
10 except 50 and 20? p(h). - Why does a good generalization need both high
prior and high likelihood? p(h|X) ∝ p(X|h) p(h)
83Bayesian Occam's Razor
Probabilities provide a common currency for
balancing model complexity with fit to the data.
84Generalizing to new objects
Given p(h|X), how do we compute p(y ∈ C | X),
the probability that C applies to some new
stimulus y?
85Generalizing to new objects
Hypothesis averaging: compute the probability
that C applies to some new object y by averaging
the predictions of all hypotheses h, weighted by
p(h|X): p(y ∈ C | X) = Σh p(y ∈ C | h) p(h|X)
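Hypothesis averaging with the size-principle likelihood can be sketched as follows; the three-hypothesis space and uniform prior are simplifications for illustration:

```python
def posterior(X, hypotheses, prior):
    """p(h|X) proportional to p(X|h) p(h), with the size-principle likelihood."""
    def lik(h):
        return (1 / len(h)) ** len(X) if all(x in h for x in X) else 0.0
    scores = {name: lik(h) * prior[name] for name, h in hypotheses.items()}
    z = sum(scores.values())
    return {name: s / z for name, s in scores.items()}

def p_in_concept(y, X, hypotheses, prior):
    """Hypothesis averaging: sum p(h|X) over the hypotheses that contain y."""
    post = posterior(X, hypotheses, prior)
    return sum(p for name, p in post.items() if y in hypotheses[name])

hyps = {"even": set(range(2, 101, 2)),
        "mult10": set(range(10, 101, 10)),
        "50to60": set(range(50, 61))}
prior = {name: 1 / 3 for name in hyps}
X = [60, 80, 10, 30]
print(p_in_concept(70, X, hyps, prior))  # near 1: 70 fits the favored rule
print(p_in_concept(52, X, hyps, prior))  # near 0: 52 is merely even
```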
86Examples 16
87Examples 16 8 2 64
88Examples 16 23 19 20
89 Examples
Human generalization
Bayesian Model
60
60 80 10 30
60 52 57 55
16
16 8 2 64
16 23 19 20
90Summary of the Bayesian model
- How do the statistics of the examples interact
with prior knowledge to guide generalization? - Why does generalization appear rule-based or
similarity-based?
91Summary of the Bayesian model
- How do the statistics of the examples interact
with prior knowledge to guide generalization? - Why does generalization appear rule-based or
similarity-based?
92Alternative models
- Neural networks
- Supervised learning inapplicable.
- Simple unsupervised learning not sufficient
93Alternative models
- Neural networks
- Similarity to exemplars
- Average similarity
60
60 80 10 30
60 52 57 55
Data
Model (r = 0.80)
94Alternative models
- Neural networks
- Similarity to exemplars
- Average similarity
- Max similarity
60
60 80 10 30
60 52 57 55
Model (r = 0.64)
Data
95Alternative models
- Neural networks
- Similarity to exemplars
- Average similarity
- Max similarity
- Flexible similarity?
Bayes.
96Explaining similarity
- Hypothesis: a principal function of similarity is
generalization. - A theory of generalization can thus explain (some
aspects of) similarity: - The similarity of X to Y is to a significant
degree determined by the probability of
generalizing from X to Y, or from Y to X, or
both. - Opposite of the traditional approach: similarity
explains generalization.
97Explaining similarity
- Spatial models
- Why exponential decay with distance?
- Common feature models
- Why additive measure?
- What determines feature weights, and why?
- Specificity
- Relational preference
- Diagnosticity
- Context-sensitivity
- Contrast model
- Why (and when) are both common and distinctive
features relevant? - When is similarity asymmetric?
98Alternative models
- Neural networks
- Similarity to exemplars
- Average similarity
- Max similarity
- Flexible similarity? Bayes.
- Toolbox of simple heuristics
- 60 → general similarity
- 60 80 10 30 → most specific rule (subset
principle). - 60 52 57 55 → similarity in magnitude
Why these heuristics? When to use which
heuristic? Bayes.
99Numbers Summary
- Theory-based statistical inference explains
inductive generalization from one or a few
examples. - Explains the dynamics of both rule-like and
similarity-like generalization through the
interaction of - Structure of domain-specific knowledge.
- Domain-general principles of rational inference.
100Limitations of the number game
- No sense in which the theory is the right or
wrong description of world structure. - Number game is conventional, not natural.
- Purely logical structure of the theory does much
of the work, with statistics just selecting among
hypotheses. - Theory itself is not probabilistic.
- Theory just amounts to a systematization for a
set of hypotheses. - No causal mechanisms.
102Explaining similarity
- Spatial models
- Why exponential decay with distance?
- Common feature models
- Why additive measure?
- What determines feature weights, and why?
- Specificity
- Relational preference
- Diagnosticity
- Context-sensitivity
- Contrast model
- Why (and when) are both common and distinctive
features relevant? - When is similarity asymmetric?
103A hypothesis
- A principal function of similarity is
generalization. - A theory of generalization can thus explain (some
aspects of) similarity: - The similarity of X to Y is to a significant
degree determined by the probability of
generalizing from X to Y, or from Y to X, or
both. - Opposite of the traditional approach: similarity
explains generalization.
104Connection to feature-based similarity
- Additive clustering model of similarity:
sim(i, j) = Σk wk fik fjk
- Bayesian hypothesis averaging:
p(y ∈ C | X) = Σh p(h|X) [y ∈ h]
- Equivalent if we identify features fk with
hypotheses h, and weights wk with p(h|X).
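The claimed equivalence can be made concrete; in this sketch, features are sets of integers and the weights stand in for p(h|X) (the numbers are arbitrary):

```python
def addclus_sim(i, j, features, weights):
    """Additive clustering: sim(i, j) = sum of the weights of shared features."""
    return sum(w for f, w in zip(features, weights) if i in f and j in f)

# Bayesian reading: features play the role of hypotheses h,
# and the weights play the role of posteriors p(h|X)
features = [set(range(2, 101, 2)), {2, 4, 8, 16, 32, 64}]  # even numbers, powers of two
weights = [0.6, 0.4]                                       # arbitrary illustrative values
print(addclus_sim(4, 8, features, weights))   # shares both features
print(addclus_sim(4, 6, features, weights))   # shares only "even"
```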
105Explaining feature-based similarity
- What determines the relative weights of different
features? - p(h) encodes domain-specific factors
- p(X|h) encodes a domain-general factor: the
size principle - Predicts
106- Additive clustering for the integers 0-9:

Rank  Weight  Interpretation
1     .444    powers of two
2     .345    small numbers
3     .331    multiples of three
4     .291    large numbers
5     .255    middle numbers
6     .216    odd numbers
7     .214    smallish numbers
8     .172    largish numbers
108Prediction
[Figure: data with mean regression slope (s.d.) and mean VAF (r2) (s.d.)]
109Explaining feature-based similarity
- What determines the relative weights of different
features? - p(h) encodes domain-specific factors
- p(X|h) encodes a domain-general factor: the
size principle - Predicts
- Predicts relative salience of relational features
110- Why are (some but not all) relational features
more salient than surface features?
e.g.,
Hypothesis          Subset size (with vocabulary of m shapes)
all same            m
triangle on top     m^2
all different       m(m-1)(m-2)
111Feature-based similarity as Bayesian inference
- A rational account of feature weighting.
- Separates domain-general factors, p(X|h), from
domain-specific factors, p(h). - Predicts a domain-general scaling law
- Predicts some aspects of relational salience.
113The plan
- Basic causal learning
- Inferring number concepts
- Reasoning with biological properties
- Acquisition of domain theories
- Intuitive biology Taxonomic structure
- Intuitive physics Causal law
114- Which argument is stronger?
- Horses have biotinic acid in their blood
- Cows have biotinic acid in their blood
- Rhinos have biotinic acid in their blood
- All mammals have biotinic acid in their blood
- Squirrels have biotinic acid in their blood
- Dolphins have biotinic acid in their blood
- Rhinos have biotinic acid in their blood
- All mammals have biotinic acid in their blood
115- Osherson, Smith, Wilkie, Lopez, Shafir (1990)
- 20 subjects rated the strength of 45 arguments
- X1 have property P.
- X2 have property P.
- X3 have property P.
- All mammals have property P.
- 40 different subjects rated the similarity of all
pairs of 10 mammals.
116Similarity-based models (Osherson et al.)
[Figure sequence, slides 116-125: the strength of an argument is computed from the similarities of each mammal to the example set, either summed over the examples (S) or maximized over the examples (max).]
126Sum-Sim versus Max-Sim
- Two models appear functionally similar
- Both increase monotonically as new examples are
observed. - Reasons to prefer sum-sim
- Standard form of exemplar models of
categorization, memory, and object recognition. - Analogous to kernel density estimation techniques
in statistical pattern recognition. - Reasons to prefer max-sim
- Fit to generalization judgments . . . .
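The two models can be sketched as follows; the similarity values are hypothetical:

```python
def sum_sim(y, examples, sim):
    """Sum-sim: total similarity of conclusion category y to the example set."""
    return sum(sim[y][x] for x in examples)

def max_sim(y, examples, sim):
    """Max-sim: similarity of y to its single nearest example."""
    return max(sim[y][x] for x in examples)

# Hypothetical similarities of "rhino" to three premise categories
sim = {"rhino": {"horse": 0.8, "cow": 0.7, "dolphin": 0.1}}
print(sum_sim("rhino", ["horse", "cow"], sim))  # grows with near-duplicate examples
print(max_sim("rhino", ["horse", "cow"], sim))  # insensitive to redundancy
```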
127Data vs. models
Data
Model
X1 have property P. X2 have property P. X3 have
property P. All mammals have property P.
Each point represents one argument.
128Three data sets
Max-sim
Sum-sim
Conclusion kind
all mammals
horses
horses
Number of examples
3 2
1, 2, or 3
129Explaining similarity
- Why does max-sim fit so well?
- Why does sum-sim fit so poorly?
- Are there cases where max-sim will fail?
130Marr's Three Levels of Analysis
- Computation
- What is the goal of the computation, why is it
appropriate, and what is the logic of the
strategy by which it can be carried out? - Representation and algorithm
- Max-Sim, Sum-Sim
- Implementation
- Neurobiology
131Scientific theory of biology
- Species generated by an evolutionary branching
process. - A tree-structured taxonomy of species.
132Scientific theory of biology
- Species generated by an evolutionary branching
process. - A tree-structured taxonomy of species.
- Features generated by stochastic mutation process
and passed on to descendants. - Similarity a function of distance in tree.
133An intuitive theory of biology
- Species generated by an evolutionary branching
process. - A tree-structured taxonomy of species.
- Features generated by stochastic mutation process
and passed on to descendants. - Similarity a function of distance in tree.
Sources: cognitive anthropology (Atran, Medin);
cognitive development (Keil, Carey)
134A model of theory-based induction
- 1. Reconstruct intuitive taxonomy from similarity
judgments
cow
chimp
horse
rhino
seal
gorilla
dolphin
mouse
squirrel
elephant
135A model of theory-based induction
- 2. Hypothesis space H each taxonomic cluster is
a possible hypothesis for the extension of a
novel feature.
. . .
136p(h): uniform
137Bayes (taxonomic)
Max-sim
Sum-sim
Conclusion kind
all mammals
horses
horses
Number of examples
3 2
1, 2, or 3
138Bayes (taxonomic)
Max-sim
Sum-sim
Conclusion kind
all mammals
Number of examples
3
139Cows have property P. Dolphins have property
P. Squirrels have property P. All mammals have
property P.
140Seals have property P. Dolphins have property
P. Squirrels have property P. All mammals have
property P.
141Scientific theory of biology
- Species generated by an evolutionary branching
process. - A tree-structured taxonomy of species.
- Features generated by stochastic mutation process
and passed on to descendants. - Similarity a function of distance in tree.
142Scientific theory of biology
- Species generated by an evolutionary branching
process. - A tree-structured taxonomy of species.
- Features generated by stochastic mutation process
and passed on to descendants. - Similarity a function of distance in tree.
- Novel features can appear anywhere in tree, but
some distributions are more likely than others.
143A model of theory-based induction
- 2. Hypothesis space H: each taxonomic cluster is
a possible hypothesis for the extension of a
novel feature.
. . .
144A model of theory-based induction
- 2. Generate hypotheses for novel feature F via
(Poisson arrival) mutation process over branches
b
145A model of theory-based induction
2. Generate hypotheses for novel feature F via
(Poisson arrival) mutation process over branches
b
cow
chimp
rhino
horse
seal
gorilla
mouse
dolphin
squirrel
elephant
146A model of theory-based induction
- 2. Generate hypotheses for novel feature F via
(Poisson arrival) mutation process over branches
b
cow
chimp
rhino
horse
seal
gorilla
mouse
dolphin
squirrel
elephant
147A model of theory-based induction
- 2. Generate hypotheses for novel feature F via
(Poisson arrival) mutation process over branches
b
cow
chimp
rhino
horse
seal
gorilla
mouse
dolphin
squirrel
elephant
148A model of theory-based induction
- 2. Generate hypotheses for novel feature F via
(Poisson arrival) mutation process over branches
b
cow
chimp
rhino
horse
seal
gorilla
mouse
dolphin
squirrel
elephant
149A model of theory-based induction
- 2. Generate hypotheses for novel feature F via
(Poisson arrival) mutation process over branches
b
cow
chimp
rhino
horse
seal
gorilla
mouse
dolphin
squirrel
elephant
150A model of theory-based induction
- 2. Generate hypotheses for novel feature F via
(Poisson arrival) mutation process over branches
b
- Induced prior p(h):
- Every subset of objects is a possible hypothesis.
- Prior p(h) depends on the number and length of
branches needed to span h.
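The induced prior can be approximated by Monte Carlo simulation of the mutation process; a sketch over a hypothetical toy fragment of the taxonomy (branch lengths are invented):

```python
import math
import random
from collections import Counter

random.seed(0)  # reproducible Monte Carlo

# Hypothetical toy tree, as (leaves below the branch, branch length) pairs
branches = [({"horse", "cow"}, 0.5), ({"horse"}, 0.2), ({"cow"}, 0.2),
            ({"dolphin", "seal"}, 0.1), ({"dolphin"}, 0.3), ({"seal"}, 0.3),
            ({"horse", "cow", "dolphin", "seal"}, 0.4)]

def sample_feature(rate=1.0):
    """One draw from the mutation process: the feature arises on each branch
    with probability 1 - exp(-rate * length) and marks all leaves below it."""
    h = set()
    for leaves, length in branches:
        if random.random() < 1 - math.exp(-rate * length):
            h |= leaves
    return frozenset(h)

# Monte Carlo estimate of the induced prior p(h)
counts = Counter(sample_feature() for _ in range(20000))

# A set spanned by one long branch gets higher prior than a set
# that requires mutations on two short branches:
print(counts[frozenset({"horse", "cow"})] > counts[frozenset({"dolphin", "seal"})])
```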
151Bayesian Occam's Razor
Probabilities provide a common currency for
balancing model complexity with fit to the data.
152Induced prior p(h)
- Monophyletic properties more likely than
polyphyletic properties
p( {chimp, gorilla, elephant, rhino} ) >
p( {horse, cow, elephant, rhino} )
153Induced prior p(h)
- Novel properties more likely to occur on long
branches than on short branches
p( {horse, cow} ) >
p( {dolphin, seal} )
154p(h): evolutionary process (mutation +
inheritance)
155Bayes (taxonomic)
Max-sim
Sum-sim
Conclusion kind
all mammals
horses
horses
Number of examples
3 2
1, 2, or 3
156Bayes (taxonomy + mutation)
Max-sim
Sum-sim
Conclusion kind
all mammals
horses
horses
Number of examples
3 2
1, 2, or 3
157Model variants
- Version 1
- Simple taxonomic hypothesis space instead of full
hypothesis space with prior based on mutation
process. - Version 2
- Simple taxonomic hypothesis space with Hebbian
learning instead of Bayesian inference. - Version 3
- Taxonomy based on actual evolutionary tree rather
than psychological similarity.
158r = 0.51
r = 0.41
r = 0.90
Bayes (taxonomic)
r = -0.41
r = 0.88
r = 0.45
Hebb (taxonomic)
r = 0.40
r = 0.60
r = 0.61
Bayes (actual evolutionary tree)
Conclusion kind
all mammals
horses
horses
Number of examples
3 2
1, 2, or 3
159Mutation principle versus pure Occam's Razor
- Mutation principle provides a version of Occam's
Razor, by favoring hypotheses that span fewer
disjoint clusters. - Could we use a more generic Bayesian Occam's
Razor, without the biological motivation of
mutation?
160A model of theory-based induction
- 2. Generate hypotheses for novel feature F via
(Poisson arrival) mutation process over branches
b
- Induced prior p(h):
- Every subset of objects is a possible hypothesis.
- Prior p(h) depends on the number and length of
branches needed to span h.
161A model of theory-based induction
- 2. Generate hypotheses for novel feature F via
(Poisson arrival) mutation process over branches
b
- Induced prior p(h):
- Every subset of objects is a possible hypothesis.
- Prior p(h) depends on the number and length of
branches needed to span h.
162Bayes (taxonomy + Occam)
Premise typicality effect (Rips, 1975;
Osherson et al., 1990): Strong vs. Weak
Max-sim
Horses have property P. All mammals have
property P.
Sum-sim
Seals have property P. All mammals have
property P.
Conclusion kind
all mammals
Number of examples
1
163Bayes (taxonomy + mutation)
Premise typicality effect (Rips, 1975;
Osherson et al., 1990): Strong vs. Weak
Max-sim
Horses have property P. All mammals have
property P.
Sum-sim
Seals have property P. All mammals have
property P.
Conclusion kind
all mammals
Number of examples
1
164Typicality meets hierarchies
- Collins and Quillian: semantic memory structured
hierarchically - Traditional story: hierarchical structure
incompatible with typicality effects on RT. - New story: typicality effects are a consequence of
inference machinery, not knowledge representation.
165Intuitive versus scientific theories of biology
- Same structure for how species are related.
- Tree-structured taxonomy.
- Same probabilistic model for traits:
- Small probability of occurring along any branch
at any time, plus inheritance. - Different features:
- Scientist: genes
- People: coarse anatomy and behavior
166Markov Random Field
- Define neighborhood graph by threshold on
similarity. - Nodes represent binary labeling variables m(i):
- m(i) = 1 if object i is in the concept
- m(i) = 0 otherwise
- Potential function on edge (i, j):
- sim(i, j) if m(i) = m(j)
- 1 - sim(i, j) otherwise
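The potential function can be sketched directly; the neighborhood graph and similarity values below are hypothetical:

```python
def edge_potential(sim_ij, m_i, m_j):
    """Pairwise potential: reward agreeing labels on similar objects."""
    return sim_ij if m_i == m_j else 1 - sim_ij

def labeling_score(labels, sims):
    """Unnormalized MRF score: product of edge potentials over the graph."""
    score = 1.0
    for (i, j), s in sims.items():
        score *= edge_potential(s, labels[i], labels[j])
    return score

# Hypothetical two-edge neighborhood graph
sims = {("horse", "cow"): 0.9, ("horse", "dolphin"): 0.2}
agree = {"horse": 1, "cow": 1, "dolphin": 0}
print(labeling_score(agree, sims))  # 0.9 * (1 - 0.2), approximately 0.72
```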
r = 0.93
r = 0.80
r = 0.65
MRF (pairwise potentials based on similarity)
Max-sim
Sum-sim
Conclusion kind
all mammals
horses
horses
Number of examples
3 2
1, 2, or 3
168Bayes (taxonomy + mutation)
Max-sim
Sum-sim
Conclusion kind
all mammals
horses
horses
Number of examples
3 2
1, 2, or 3
169Explaining similarity
- Why does max-sim fit so well?
- Why does sum-sim fit so poorly?
- Are there cases where max-sim will fail?
170Explaining similarity
- Why does max-sim fit so well?
- An efficient and accurate approximation to the
Bayesian (evolution) model.
Correlation with Bayes on three-premise general
arguments, over 100 simulated tree structures:
mean r = 0.94
171Explaining similarity
- Why does max-sim fit so well?
- Approximation is domain-specific (c.f. the number
game)
60
60 80 10 30
60 52 57 55
Model (r = 0.64)
Data
172Explaining similarity
- Why does sum-sim fit so poorly?
- Prefers sets of the most typical examples, which
are not representative of the category as a whole.
Correlation with Bayes on three-premise general
arguments, over 100 simulated tree structures:
mean r = 0.26
173Explaining similarity
- Are there cases where max-sim will fail?
- An example from Medin et al. (in press)
Brown bears have property P Polar bears have
property P Grizzly bears have property P
Horses have property P.
Brown bears have property P Horses have property
P.
Bayesian model makes the correct prediction, due
to the size principle (assumption of examples
sampled randomly from concept).
174A more systematic test of the Size Principle
175Biology Summary
- Theory-based statistical inference explains
taxonomic inductive reasoning in folk biology. - Reveals essential principles of the domain theory.
- Category structure: taxonomic tree.
- Feature distribution: stochastic mutation process
plus inheritance. - Clarifies processing-level models.
- Why max-sim over sum-sim?
- When is max-sim a good heuristic approximation to
full Bayesian inference?