1
Bayesian Models of Human Learning and Inference
Josh Tenenbaum, MIT Department of Brain and
Cognitive Sciences
2
Shiffrin Says
  • Progress in science is driven by new tools, not
    great insights.

3
Outline
  • Part I. Brief survey of Bayesian modeling in
    cognitive science.
  • Part II. Bayesian models of everyday inductive
    leaps.

4
Collaborators
  • Tom Griffiths Neville Sanjana
  • Charles Kemp Mark Steyvers
  • Tevye Krynski Sean Stromsten
  • Sourabh Niyogi
  • Fei Xu Dave Sobel
  • Wheeler Ruml Alison Gopnik

5
Collaborators
  • Tom Griffiths Neville Sanjana
  • Charles Kemp Mark Steyvers
  • Tevye Krynski Sean Stromsten
  • Sourabh Niyogi
  • Fei Xu Dave Sobel
  • Wheeler Ruml Alison Gopnik

6
Outline
  • Part I. Brief survey of Bayesian modeling in
    cognitive science.
  • Rational benchmark for descriptive models of
    probability judgment.
  • Rational analysis of cognition
  • Rational tools for fitting cognitive models

7
Normative benchmark for descriptive models
  • How does human probability judgment compare to
    the Bayesian ideal?
  • Peterson & Beach, Edwards, Tversky & Kahneman, . . .
  • Explicit probability judgment tasks
  • Drawing balls from an urn, rolling dice, medical
    diagnosis, . . . .
  • Alternative descriptive models
  • Heuristics and Biases, Support Theory, . . . .

8
Rational analysis of cognition
  • Develop Bayesian models for core aspects of
    cognition not traditionally thought of in terms
    of statistical inference.
  • Examples
  • Memory retrieval: Anderson; Shiffrin et al., . . .
  • Reasoning with rules: Oaksford & Chater, . . .

9
Rational analysis of cognition
  • Often can explain a wider range of phenomena than
    previous models, with fewer free parameters.

Spacing effects on retention
Power laws of practice and retention
10
Rational analysis of cognition
  • Often can explain a wider range of phenomena than
    previous models, with fewer free parameters.
  • Anderson's rational analysis of memory
  • For each item in memory, estimate the probability
    that it will be useful in the present context.
  • Model of need probability inspired by library
    book access. Corresponds to statistics of
    natural information sources

11
Rational analysis of cognition
  • Often can explain a wider range of phenomena than
    previous models, with fewer free parameters.
  • Anderson's rational analysis of memory
  • For each item in memory, estimate the probability
    that it will be useful in the present context.
  • Model of need probability inspired by library
    book access. Corresponds to statistics of
    natural information sources

(Figure: log need odds vs. log days since last occurrence, for short and long lags.)
12
Rational analysis of cognition
  • Often can show that apparently irrational
    behavior is actually rational.

Which cards do you have to turn over to test this
rule: "If there is an A on one side, then there
is a 2 on the other side"?
13
Rational analysis of cognition
  • Often can show that apparently irrational
    behavior is actually rational.
  • Oaksford & Chater's rational analysis:
  • Optimal data selection based on maximizing
    expected information gain.
  • Test the rule "If p, then q" against the null
    hypothesis that p and q are independent.
  • Assuming p and q are rare predicts people's
    choices, as sketched below.
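
To make the information-gain idea concrete, here is a minimal sketch (my own illustration, not Oaksford & Chater's exact model): compare a dependence model ("if p then q" holds) against an independence model, and score each card in the Wason task by its expected information gain, under a rarity assumption P(p) = P(q) = 0.1. The marginals, the deterministic rule, and the uniform prior over models are all simplifying assumptions.

```python
from math import log2

P_p, P_q = 0.1, 0.1                      # rarity assumption: p and q are rare
prior = {"dep": 0.5, "ind": 0.5}         # dependence ("if p then q") vs. independence

def p_q_given_p(model):
    return 1.0 if model == "dep" else P_q      # rule holds deterministically under "dep"

def outcome_probs(card, model):
    """Distribution over what the hidden side shows when `card` is turned."""
    if card == "p":
        pq = p_q_given_p(model)
        return {"q": pq, "not-q": 1 - pq}
    if card == "not-p":                        # q independent of not-p in both models
        return {"q": P_q, "not-q": 1 - P_q}
    if card == "q":                            # Bayes: P(p | q)
        p_given_q = p_q_given_p(model) * P_p / P_q
        return {"p": p_given_q, "not-p": 1 - p_given_q}
    if card == "not-q":                        # Bayes: P(p | not-q)
        p_given_nq = (1 - p_q_given_p(model)) * P_p / (1 - P_q)
        return {"p": p_given_nq, "not-p": 1 - p_given_nq}

def entropy(dist):
    return -sum(p * log2(p) for p in dist.values() if p > 0)

def expected_information_gain(card):
    h0 = entropy(prior)
    eig = 0.0
    for o in outcome_probs(card, "dep"):
        p_o = sum(prior[m] * outcome_probs(card, m)[o] for m in prior)
        if p_o == 0:
            continue
        post = {m: prior[m] * outcome_probs(card, m)[o] / p_o for m in prior}
        eig += p_o * (h0 - entropy(post))
    return eig

for card in ["p", "not-p", "q", "not-q"]:
    print(card, round(expected_information_gain(card), 4))
# With rare p and q, the p and q cards carry the most expected information,
# matching the cards people most often choose to turn over.
```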

14
Rational tools for fitting cognitive models
  • Use the Bayesian Occam's Razor to solve the
    problem of model selection: trade off fit to the
    data against model complexity.
  • Examples
  • Comparing alternative cognitive models: Myung,
    Pitt, . . .
  • Fitting nested families of models of mental
    representation: Lee, Navarro, . . .

15
Rational tools for fitting cognitive models
  • Comparing alternative cognitive models via an MDL
    approximation to the Bayesian Occam's Razor takes
    into account the functional form of a model as
    well as the number of free parameters.

16
Rational tools for fitting cognitive models
  • Fit models of mental representation to similarity
    data, e.g. additive clustering, additive trees,
    common and distinctive feature models.
  • Want to choose the complexity of the model
    (number of features, depth of tree) in a
    principled way, and search efficiently through
    the space of nested models, using the Bayesian
    Occam's Razor.

17
Outline
  • Part I. Brief survey of Bayesian modeling in
    cognitive science.
  • Part II. Bayesian models of everyday inductive
    leaps.

Rational models of cognition in which Bayesian
model selection and the Bayesian Occam's Razor
play a central explanatory role.
18
Everyday inductive leaps
  • How can we learn so much about . . .
  • Properties of natural kinds
  • Meanings of words
  • Future outcomes of a dynamic process
  • Hidden causal properties of an object
  • Causes of a person's action (beliefs, goals)
  • Causal laws governing a domain
  • . . . from such limited data?

19
Learning concepts and words
20
Learning concepts and words
  • Can you pick out the tufas?

21
Inductive reasoning
Input
(premises)
(conclusion)
Task: judge how likely the conclusion is to be
true, given that the premises are true.
22
Inferring causal relations
Input
         Took vitamin B23   Headache
Day 1          yes              no
Day 2          yes              yes
Day 3          no               yes
Day 4          yes              no
. . .          . . .            . . .

Does vitamin B23 cause headaches?

Task: judge the probability of a causal link
given several joint observations.
23
The Challenge
  • How do we generalize successfully from very
    limited data?
  • Just one or a few examples
  • Often only positive examples
  • Philosophy
  • Induction is a problem, a riddle, a
    paradox, a scandal, or a myth.
  • Machine learning and statistics
  • Focus on generalization from many examples, both
    positive and negative.

24
Rational statistical inference(Bayes, Laplace)
25
History of Bayesian Approaches to Human Inductive
Learning
  • Hunt

26
History of Bayesian Approaches to Human Inductive
Learning
  • Hunt
  • Suppes
  • Observable changes of hypotheses under positive
    reinforcement, Science (1965), w/ M. Schlag-Rey.
  • A tentative interpretation is that, when the set
    of hypotheses is large, the subject samples or
    attends to several hypotheses simultaneously. . .
    . It is also conceivable that a subject might
    sample spontaneously, at any time, or under
    stimulations other than those planned by the
    experimenter. A more detailed exploration of
    these ideas, including a test of Bayesian
    approaches to information processing, is now
    being made.

27
(No Transcript)
28
History of Bayesian Approaches to Human Inductive
Learning
  • Hunt
  • Suppes
  • Shepard
  • Analysis of one-shot stimulus generalization, to
    explain the universal exponential law.
  • Anderson
  • Rational analysis of categorization.

29
Theory-Based Bayesian Models
  • Explain the success of everyday inductive leaps
    based on rational statistical inference
    mechanisms constrained by domain theories
    well-matched to the structure of the world.
  • Rational statistical inference (Bayes)
  • Domain theories generate the necessary
    ingredients hypothesis space H, priors p(h).

30
Questions about theories
  • What is a theory?
  • Working definition: an ontology and a system of
    abstract (causal) principles that generates a
    hypothesis space of candidate world structures
    (e.g., Newton's laws).
  • How is a theory used to learn about the structure
    of the world?
  • How is a theory acquired?
  • Probabilistic generative model + statistical
    learning.

31
Alternative approaches to inductive generalization
  • Associative learning
  • Connectionist networks
  • Similarity to examples
  • Toolkit of simple heuristics
  • Constraint satisfaction

32
Marr's Three Levels of Analysis
  • Computation
  • What is the goal of the computation, why is it
    appropriate, and what is the logic of the
    strategy by which it can be carried out?
  • Representation and algorithm
  • Cognitive psychology
  • Implementation
  • Neurobiology

33
Descriptive Goals
  • Principled mathematical models, with a minimum of
    arbitrary assumptions.
  • Close quantitative fits to behavioral data.
  • Unified models of cognition across domains.

34
Explanatory Goals
  • How do we reliably acquire knowledge about the
    structure of the world, from such limited
    experience?
  • Which processing models work, and why?
  • New views on classic questions in cognitive
    science
  • Symbols (rules, logic, hierarchies, relations)
    versus Statistics.
  • Theory-based inference versus Similarity-based
    inference.
  • Domain-specific knowledge versus Domain-general
    mechanisms.
  • Provides a route to studying people's hidden
    (implicit or unconscious) knowledge about the
    world.

35
The plan
  • Basic causal learning
  • Inferring number concepts
  • Reasoning with biological properties
  • Acquisition of domain theories
  • Intuitive biology: taxonomic structure
  • Intuitive physics: causal law

36
The plan
  • Basic causal learning
  • Inferring number concepts
  • Reasoning with biological properties
  • Acquisition of domain theories
  • Intuitive biology: taxonomic structure
  • Intuitive physics: causal law

37
Learning a single causal relation
Given a random sample of mice
  • To what extent does chemical X cause gene Y to
    be expressed?
  • Or, what is the probability that X causes Y?

38
Associative models of causal strength judgment
  • Delta-P (or Asymptotic Rescorla-Wagner)
  • Power PC (Cheng, 1997)
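
The equations for these two models did not survive the transcript; as a reference point, the standard definitions are ΔP = P(e|c) − P(e|¬c) and causal power = ΔP / (1 − P(e|¬c)) (Cheng, 1997). A small sketch computing both from contingency counts (the example counts are hypothetical):

```python
def delta_p_and_power(e_with_c, n_c, e_without_c, n_noc):
    """Counts: effect-present among cause-present (out of n_c) and cause-absent (out of n_noc)."""
    p_e_c   = e_with_c / n_c          # P(e | c)
    p_e_noc = e_without_c / n_noc     # P(e | not-c)
    delta_p = p_e_c - p_e_noc
    power   = delta_p / (1.0 - p_e_noc) if p_e_noc < 1.0 else float("nan")
    return delta_p, power

# hypothetical data: 7/8 mice express gene Y with chemical X, 3/8 without it
print(delta_p_and_power(7, 8, 3, 8))  # (0.5, 0.8)
```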

39
Some behavioral data (Buehner & Cheng, 1997)
People
ΔP
Power PC
  • Independent effects of both causal power and ΔP.
  • Neither theory explains the trend for ΔP = 0.

40
Bayesian causal inference
  • Hypotheses: h1 (B → E ← C) vs. h0 (B → E only)

w0, w1: strength parameters for B, C
41
Bayesian causal inference
  • Hypotheses: h1 (B → E ← C) vs. h0 (B → E only)
  • Probabilistic model: noisy-OR

w0, w1: strength parameters for B, C

P(E = 1 | C, B):
        C=0, B=0   C=1, B=0   C=0, B=1   C=1, B=1
  h1:      0          w1         w0      w1 + w0 − w1·w0
  h0:      0          0          w0         w0
42
Bayesian causal inference
  • Hypotheses: h1 (B → E ← C) vs. h0 (B → E only)
  • Probabilistic model: noisy-OR

Background cause B: unobserved, always present (B = 1)
w0, w1: strength parameters for B, C

P(E = 1 | C, B):
        C=0, B=0   C=1, B=0   C=0, B=1   C=1, B=1
  h1:      0          w1         w0      w1 + w0 − w1·w0
  h0:      0          0          w0         w0
43
Inferring structure versus estimating strength
  • Hypotheses: h1 (B → E ← C) vs. h0 (B → E only)
  • Both causal power and ΔP correspond to maximum
    likelihood estimates of the strength parameter
    w1, under different parameterizations for
    p(E | B, C):
  • linear → ΔP; noisy-OR → causal power
  • Causal support model: people are judging the
    probability that a causal link exists, rather
    than assuming it exists and estimating its
    strength.

44
Role of domain theory
(cf. PRMs, ILP, knowledge-based model construction)
  • Generates hypothesis space of causal graphical
    models
  • Causally relevant attributes of objects
  • Constrains random variables (nodes).
  • Causally relevant relations between attributes
  • Constrains dependence structure of variables
    (arcs).
  • Causal mechanisms: how effects depend
    functionally on their causes
  • Constrains local probability distribution for
    each variable conditioned on its direct causes
    (parents).

45
Role of domain theory
  • Injections may or may not cause gene expression,
    but gene expression does not cause injections.
  • No hypotheses with E → C
  • Other naturally occurring processes may also
    cause gene expression.
  • All hypotheses include an always-present
    background cause B → E
  • Causes are probabilistically sufficient and
    independent (Cheng): each cause independently
    produces the effect in some proportion of cases.
  • Noisy-OR causal mechanism

46
  • Hypotheses: h1 (B → E ← C) vs. h0 (B → E only)
  • Bayesian causal inference

noisy-OR
Assume all priors uniform.
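
A minimal numerical sketch of this comparison, assuming the noisy-OR parameterization above and uniform priors on w0 and w1; the grid integration and the particular counts are my own illustrative choices:

```python
import numpy as np

def noisy_or_loglik(w0, w1, data):
    """Log-likelihood of (c, e) observations; background cause B is always present."""
    ll = 0.0
    for c, e in data:
        p_e = w0 + w1 * c - w0 * w1 * c        # noisy-OR with B = 1
        ll += np.log(p_e if e else 1.0 - p_e)
    return ll

def causal_support(data, grid=51):
    w = np.linspace(1e-3, 1 - 1e-3, grid)
    # h1 (B -> E <- C): average likelihood over uniform priors on w0 and w1
    lik1 = np.mean([np.mean([np.exp(noisy_or_loglik(w0, w1, data)) for w1 in w])
                    for w0 in w])
    # h0 (B -> E only): w1 fixed at 0, average over w0
    lik0 = np.mean([np.exp(noisy_or_loglik(w0, 0.0, data)) for w0 in w])
    return float(np.log(lik1) - np.log(lik0))  # log Bayes factor in favor of h1

# hypothetical contingency data: effect on 6/8 cause-present and 2/8 cause-absent trials
data = [(1, 1)] * 6 + [(1, 0)] * 2 + [(0, 1)] * 2 + [(0, 0)] * 6
print(causal_support(data))                    # positive: evidence for a C -> E link
```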
47
Bayesian Occam's Razor
P(data | model)
All possible data sets
48
Bayesian Occam's Razor
P(data | model)
low w1
high w1
All possible data sets
49
Bayesian Occam's Razor
P(data | model)
low w1
high w1
50
Bayesian Occam's Razor
P(data | model)
low w1
high w1
51
Buehner & Cheng, 1997
People
ΔP
Power PC
Bayes
52
Sensitivity analysis
  • How much work does the domain theory do?
  • Alternative model: Bayes with arbitrary P(E | B, C)
  • How much work does Bayes do?
  • Alternative model: χ² measure of independence.

Bayes without noisy-OR theory
χ²
53
People
ΔP
Power PC (MLE w/ noisy-OR)
Bayes w/ noisy-OR theory
Bayes without noisy-OR theory
χ²
54
Varying number of observations
People (n = 8)
Bayes (n = 8)
People (n = 60)
Bayes (n = 60)
55
Data for inhibitory causes
People
ΔP
Power PC (MLE w/ noisy-AND-NOT)
Bayes w/ noisy-AND-NOT
56
Causal inference with rates
People
ΔR
Power PC (N = 150)
Bayes w/ Poisson parameterization
57
Causal induction summary
  • People's judgments closely reflect optimal
    Bayesian model selection, constrained by a
    minimal domain theory.
  • Beyond elemental causal induction:
  • More complex inferences, with causal networks,
    hidden variables, active learning.
  • Stronger inferences, with richer prior knowledge.
  • Discovery of causal domain theories.

58
Scope of Bayesian causal inference
  • Causal strength judgments
  • One-shot causal inferences in children and adults
    (the blicket detector)
  • Inferring causal networks
  • Inferring hidden variables
  • Perception of causality
  • Perception of hidden causes
  • Learning causal theories

59
The plan
  • Basic causal learning
  • Inferring number concepts
  • Reasoning with biological properties
  • Acquisition of domain theories
  • Intuitive biology: taxonomic structure
  • Intuitive physics: causal law

60
The number game
  • Program input: a number between 1 and 100
  • Program output: "yes" or "no"

61
The number game
  • Learning task
  • Observe one or more positive (yes) examples.
  • Judge whether other numbers are yes or no.

62
The number game
Examples of yes numbers
Generalization judgments (N = 20)
60
Diffuse similarity
63
The number game
Examples of yes numbers
Generalization judgments (N = 20)
60
Diffuse similarity
60 80 10 30
Rule: multiples of 10
64
The number game
Examples of yes numbers
Generalization judgments (N = 20)
60
Diffuse similarity
60 80 10 30
Rule: multiples of 10
Focused similarity: numbers near 50–60
60 52 57 55
65
The number game
Examples of yes numbers
Generalization judgments (N = 20)
16
Diffuse similarity
16 8 2 64
Rule: powers of 2
Focused similarity: numbers near 20
16 23 19 20
66
The number game
  • Main phenomena to explain
  • Generalization can appear either similarity-based
    (graded) or rule-based (all-or-none).
  • Learning from just a few positive examples.

67
Rule/similarity hybrid models
  • Category learning
  • Nosofsky, Palmeri et al.: RULEX
  • Erickson & Kruschke: ATRIUM

68
Divisions into rule and similarity subsystems
  • Category learning
  • Nosofsky, Palmeri et al.: RULEX
  • Erickson & Kruschke: ATRIUM
  • Language processing
  • Pinker, Marcus et al.: past tense morphology
  • Reasoning
  • Sloman
  • Rips
  • Nisbett, Smith et al.

69
Rule/similarity hybrid models
  • Why two modules?
  • Why do these modules work the way that they do,
    and interact as they do?
  • How do people infer a rule or similarity metric
    from just a few positive examples?

70
Bayesian model
  • H: hypothesis space of possible concepts.
  • h1 = {2, 4, 6, 8, 10, 12, . . . , 96, 98, 100}
    (even numbers)
  • h2 = {10, 20, 30, 40, . . . , 90, 100} (multiples
    of 10)
  • h3 = {2, 4, 8, 16, 32, 64} (powers of 2)
  • h4 = {50, 51, 52, . . . , 59, 60} (numbers between
    50 and 60)
  • . . .
  • Representational interpretations for H:
  • Candidate rules
  • Features for similarity
  • Consequential subsets (Shepard, 1987)

71
Three hypothesis subspaces for number concepts
  • Mathematical properties (24 hypotheses)
  • Odd, even, square, cube, prime numbers
  • Multiples of small integers
  • Powers of small integers
  • Raw magnitude (5050 hypotheses)
  • All intervals of integers with endpoints between
    1 and 100.
  • Approximate magnitude (10 hypotheses)
  • Decades (1–10, 10–20, 20–30, . . .)

72
Hypothesis spaces and theories
  • Why a hypothesis space is like a domain theory
  • Represents one particular way of classifying
    entities in a domain.
  • Not just an arbitrary collection of hypotheses,
    but a principled system.
  • What's missing?
  • Explicit representation of the principles.
  • Causality.
  • Hypothesis space is generated by theory.

73
Bayesian model
  • H: hypothesis space of possible concepts.
  • Mathematical properties: even, odd, square,
    prime, . . .
  • Approximate magnitude: 1–10, 10–20, 20–30,
    . . .
  • Raw magnitude: all intervals between 1 and 100.
  • X = {x1, . . . , xn}: n examples of a concept C.
  • Evaluate hypotheses given data:
  • p(h): prior; domain knowledge, pre-existing
    biases.
  • p(X|h): likelihood; statistical information in
    the examples.
  • p(h|X): posterior; degree of belief that h is
    the true extension of C.

74
  • Likelihood p(X|h)
  • Size principle: smaller hypotheses receive
    greater likelihood, and exponentially more so as
    n increases.
  • Follows from the assumption of randomly sampled
    examples.
  • Captures the intuition of a representative
    sample. (A short sketch follows below.)
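
The likelihood equation itself was an image in the original slides; under the random-sampling assumption it is usually written p(X|h) = (1/|h|)^n if every example falls in h, and 0 otherwise. A tiny sketch of how this favors the smaller consistent hypothesis as n grows:

```python
def size_likelihood(examples, hypothesis):
    # size principle: p(X | h) = (1 / |h|)^n if every example lies in h, else 0
    if all(x in hypothesis for x in examples):
        return (1.0 / len(hypothesis)) ** len(examples)
    return 0.0

evens  = set(range(2, 101, 2))       # |h1| = 50
mult10 = set(range(10, 101, 10))     # |h2| = 10

for X in ([60], [60, 80], [60, 80, 10, 30]):
    ratio = size_likelihood(X, mult10) / size_likelihood(X, evens)
    print(len(X), "examples:", "likelihood ratio (multiples of 10 : even) =", round(ratio))
# 1 example: 5;  2 examples: 25;  4 examples: 625
```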

75
Illustrating the size principle
2 4 6 8 10 12 14 16 18 20 22
24 26 28 30 32 34 36 38 40 42 44 46
48 50 52 54 56 58 60 62 64 66 68 70
72 74 76 78 80 82 84 86 88 90 92 94
96 98 100
h1
h2
76
Illustrating the size principle
2 4 6 8 10 12 14 16 18 20 22
24 26 28 30 32 34 36 38 40 42 44 46
48 50 52 54 56 58 60 62 64 66 68 70
72 74 76 78 80 82 84 86 88 90 92 94
96 98 100
h1
h2
Data slightly more of a coincidence under h1
77
Illustrating the size principle
2 4 6 8 10 12 14 16 18 20 22
24 26 28 30 32 34 36 38 40 42 44 46
48 50 52 54 56 58 60 62 64 66 68 70
72 74 76 78 80 82 84 86 88 90 92 94
96 98 100
h1
h2
Data much more of a coincidence under h1
78
Bayesian Occam's Razor
M1
p(D = d | M)
M2
All possible data sets d
For any model M, Σd p(D = d | M) = 1.
79
  • Prior p(h)
  • Choice of hypothesis space embodies a strong
    prior: effectively, p(h) ≈ 0 for many logically
    possible but conceptually unnatural hypotheses.
  • Prevents overfitting by highly specific but
    unnatural hypotheses, e.g. "multiples of 10
    except 50 and 70".

80
A domain-general approach to priors?
  • Start with a base set of regularities R and
    combination operators C.
  • Hypothesis space = closure of R under C.
  • C = {and, or}: H = unions and intersections of
    regularities in R (e.g., multiples of 10 between
    30 and 70).
  • C = {and-not}: H = regularities in R with
    exceptions (e.g., multiples of 10 except 50 and
    70). (See the sketch after this list.)
  • Two qualitatively similar priors:
  • Description length: number of combinations in C
    needed to generate the hypothesis from R.
  • Bayesian Occam's Razor, with model classes
    defined by number of combinations: more
    combinations → more hypotheses → lower
    prior.
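
A small sketch of the closure idea for one combination step (the base regularities and their names are illustrative choices, not the model from the talk):

```python
# base regularities R (illustrative) and one round of combination under C
base = {
    "multiples of 10": set(range(10, 101, 10)),
    "30..70":          set(range(30, 71)),
    "50 or 70":        {50, 70},
}

derived = {}
for n1, r1 in base.items():
    for n2, r2 in base.items():
        if n1 == n2:
            continue
        derived[f"({n1}) and ({n2})"]     = r1 & r2
        derived[f"({n1}) or ({n2})"]      = r1 | r2
        derived[f"({n1}) and-not ({n2})"] = r1 - r2

# e.g. "multiples of 10 between 30 and 70" and "multiples of 10 except 50 and 70"
print(sorted(derived["(multiples of 10) and (30..70)"]))
print(sorted(derived["(multiples of 10) and-not (50 or 70)"]))
# a description-length prior counts the combinations used: these depth-1 hypotheses
# get lower p(h) than the base regularities, and deeper combinations lower still.
```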

81
  • Prior p(h)
  • Choice of hypothesis space embodies a strong
    prior: effectively, p(h) ≈ 0 for many logically
    possible but conceptually unnatural hypotheses.
  • Prevents overfitting by highly specific but
    unnatural hypotheses, e.g. "multiples of 10
    except 50 and 70".
  • p(h) encodes the relative plausibility of
    alternative theories:
  • Mathematical properties: p(h) ∝ 1
  • Approximate magnitude: p(h) ∝ 1/10
  • Raw magnitude: p(h) ∝ 1/50 (on average)
  • Also degrees of plausibility within a theory,
    e.g., for magnitude intervals of size s.

(Figure: p(s) as a function of interval size s.)
82
  • Posterior p(h|X)
  • X = {60, 80, 10, 30}
  • Why prefer "multiples of 10" over "even
    numbers"? p(X|h).
  • Why prefer "multiples of 10" over "multiples of
    10 except 50 and 20"? p(h).
  • Why does a good generalization need both high
    prior and high likelihood? p(h|X) ∝ p(X|h) p(h)

83
Bayesian Occam's Razor
Probabilities provide a common currency for
balancing model complexity with fit to the data.
84
Generalizing to new objects
Given p(h|X), how do we compute p(y ∈ C | X),
the probability that C applies to some new
stimulus y?
85
Generalizing to new objects
Hypothesis averaging: compute the probability
that C applies to some new object y by averaging
the predictions of all hypotheses h, weighted by
p(h|X):

p(y ∈ C | X) = Σh p(y ∈ C | h) p(h | X)
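
A compact sketch of the whole pipeline on a toy hypothesis space; the hypotheses and prior weights below are illustrative stand-ins for the much larger space described above:

```python
# toy hypothesis space: name -> (extension, prior weight); both are illustrative
H = {
    "even":            (set(range(2, 101, 2)),   0.5),
    "multiples of 10": (set(range(10, 101, 10)), 0.5),
    "powers of 2":     ({2, 4, 8, 16, 32, 64},   0.5),
    "52..57":          (set(range(52, 58)),      0.1),
    "50..70":          (set(range(50, 71)),      0.1),
}

def posterior(X):
    # p(h | X) proportional to p(h) * (1/|h|)^n for hypotheses consistent with all examples
    scores = {}
    for name, (ext, prior) in H.items():
        consistent = all(x in ext for x in X)
        scores[name] = prior * (1.0 / len(ext)) ** len(X) if consistent else 0.0
    Z = sum(scores.values())
    return {name: s / Z for name, s in scores.items()}

def p_in_concept(y, X):
    # hypothesis averaging: p(y in C | X) = sum_h p(y in C | h) p(h | X)
    return sum(p for name, p in posterior(X).items() if y in H[name][0])

print(round(p_in_concept(20, [60]), 3))              # ~0.93: 'even' and 'multiples of 10' both contain 20
print(round(p_in_concept(67, [60]), 3))              # ~0.07: only the 50..70 interval contains 67
print(round(p_in_concept(20, [60, 80, 10, 30]), 3))  # ~1.0: 'multiples of 10' dominates
print(round(p_in_concept(20, [60, 52, 57, 55]), 3))  # 0.0: only the 50..70 interval survives, and 20 is outside it
```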
86
Examples 16
87
Examples 16 8 2 64
88
Examples 16 23 19 20
89
Examples
Human generalization
Bayesian Model
60
60 80 10 30
60 52 57 55
16
16 8 2 64
16 23 19 20
90
Summary of the Bayesian model
  • How do the statistics of the examples interact
    with prior knowledge to guide generalization?
  • Why does generalization appear rule-based or
    similarity-based?

91
Summary of the Bayesian model
  • How do the statistics of the examples interact
    with prior knowledge to guide generalization?
  • Why does generalization appear rule-based or
    similarity-based?

92
Alternative models
  • Neural networks
  • Supervised learning inapplicable.
  • Simple unsupervised learning not sufficient

93
Alternative models
  • Neural networks
  • Similarity to exemplars
  • Average similarity

60
60 80 10 30
60 52 57 55
Data
Model (r = 0.80)
94
Alternative models
  • Neural networks
  • Similarity to exemplars
  • Average similarity
  • Max similarity

60
60 80 10 30
60 52 57 55
Model (r = 0.64)
Data
95
Alternative models
  • Neural networks
  • Similarity to exemplars
  • Average similarity
  • Max similarity
  • Flexible similarity?

Bayes.
96
Explaining similarity
  • Hypothesis: a principal function of similarity
    is generalization.
  • A theory of generalization can thus explain (some
    aspects of) similarity
  • The similarity of X to Y is to a significant
    degree determined by the probability of
    generalizing from X to Y, or from Y to X, or
    both.
  • Opposite of the traditional approach: similarity
    explains generalization.

97
Explaining similarity
  • Spatial models
  • Why exponential decay with distance?
  • Common feature models
  • Why additive measure?
  • What determines feature weights, and why?
  • Specificity
  • Relational preference
  • Diagnosticity
  • Context-sensitivity
  • Contrast model
  • Why (and when) are both common and distinctive
    features relevant?
  • When is similarity asymmetric?

98
Alternative models
  • Neural networks
  • Similarity to exemplars
  • Average similarity
  • Max similarity
  • Flexible similarity? Bayes.
  • Toolbox of simple heuristics
  • 60 → general similarity
  • 60 80 10 30 → most specific rule (subset
    principle).
  • 60 52 57 55 → similarity in magnitude

Why these heuristics? When to use which
heuristic? Bayes.
99
Numbers Summary
  • Theory-based statistical inference explains
    inductive generalization from one or a few
    examples.
  • Explains the dynamics of both rule-like and
    similarity-like generalization through the
    interaction of
  • Structure of domain-specific knowledge.
  • Domain-general principles of rational inference.

100
Limitations of the number game
  • No sense in which the theory is the right or
    wrong description of world structure.
  • Number game is conventional, not natural.
  • Purely logical structure of the theory does much
    of the work, with statistics just selecting among
    hypotheses.
  • Theory itself is not probabilistic.
  • Theory just amounts to a systematization for a
    set of hypotheses.
  • No causal mechanisms.

101
(No Transcript)
102
Explaining similarity
  • Spatial models
  • Why exponential decay with distance?
  • Common feature models
  • Why additive measure?
  • What determines feature weights, and why?
  • Specificity
  • Relational preference
  • Diagnosticity
  • Context-sensitivity
  • Contrast model
  • Why (and when) are both common and distinctive
    features relevant?
  • When is similarity asymmetric?

103
A hypothesis
  • A principal function of similarity is
    generalization.
  • A theory of generalization can thus explain (some
    aspects of) similarity
  • The similarity of X to Y is to a significant
    degree determined by the probability of
    generalizing from X to Y, or from Y to X, or
    both.
  • Opposite of the traditional approach: similarity
    explains generalization.

104
Connection to feature-based similarity
  • Additive clustering model of similarity:
    sim(i, j) = Σk wk fik fjk
  • Bayesian hypothesis averaging:
    p(y ∈ C | X) = Σh p(h | X) · 1[y ∈ h]
  • Equivalent if we identify features fk with
    hypotheses h, and weights wk with p(h|X), as
    the short sketch below makes concrete.
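
A few lines making the identification concrete; the two features and their weights are illustrative (the weights play the role of a posterior p(h|X)):

```python
# features f_k = hypotheses (as sets); weights w_k = p(h_k | X)
features = {
    "multiples of 10": set(range(10, 101, 10)),
    "even numbers":    set(range(2, 101, 2)),
}
weights = {"multiples of 10": 0.83, "even numbers": 0.17}   # e.g. a posterior p(h | X)

def similarity(i, j):
    # additive clustering form: sim(i, j) = sum_k w_k * f_ik * f_jk
    return sum(w for name, w in weights.items()
               if i in features[name] and j in features[name])

print(similarity(20, 60))   # 1.0: the pair shares both features
print(similarity(22, 60))   # 0.17: the pair shares only 'even numbers'
```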

105
Explaining feature-based similarity
  • What determines the relative weights of different
    features?
  • p(h) encodes domain-specific factors.
  • p(X|h) encodes a domain-general factor: the
    size principle.
  • Predicts:

106
  • Additive clustering for the integers 0–9:

Rank   Weight   Interpretation
1      .444     powers of two
2      .345     small numbers
3      .331     multiples of three
4      .291     large numbers
5      .255     middle numbers
6      .216     odd numbers
7      .214     smallish numbers
8      .172     largish numbers

(The stimuli in each cluster were marked under the digit columns 0–9 in the original table.)
107
(No Transcript)
108
Prediction
(Table: data sets vs. mean regression slope (s.d.) and mean variance accounted for, r², (s.d.).)
109
Explaining feature-based similarity
  • What determines the relative weights of different
    features?
  • p(h) encodes domain-specific factors.
  • p(X|h) encodes a domain-general factor: the
    size principle.
  • Predicts:
  • Predicts relative salience of relational features

110
  • Why are (some but not all) relational features
    more salient than surface features?

e.g.,

Hypothesis          Subset size (with vocabulary of m shapes)
"all same"          m
"triangle on top"   m^2
"all different"     m(m−1)(m−2)
111
Feature-based similarity as Bayesian inference
  • A rational account of feature weighting.
  • Separates domain-general factors, p(X|h), from
    domain-specific factors, p(h).
  • Predicts a domain-general scaling law.
  • Predicts some aspects of relational salience.

112
(No Transcript)
113
The plan
  • Basic causal learning
  • Inferring number concepts
  • Reasoning with biological properties
  • Acquisition of domain theories
  • Intuitive biology: taxonomic structure
  • Intuitive physics: causal law

114
  • Which argument is stronger?
  • Horses have biotinic acid in their blood
  • Cows have biotinic acid in their blood
  • Rhinos have biotinic acid in their blood
  • All mammals have biotinic acid in their blood
  • Squirrels have biotinic acid in their blood
  • Dolphins have biotinic acid in their blood
  • Rhinos have biotinic acid in their blood
  • All mammals have biotinic acid in their blood

115
  • Osherson, Smith, Wilkie, Lopez, Shafir (1990)
  • 20 subjects rated the strength of 45 arguments
  • X1 have property P.
  • X2 have property P.
  • X3 have property P.
  • All mammals have property P.
  • 40 different subjects rated the similarity of all
    pairs of 10 mammals.

116
Similarity-based models (Osherson et al.)
(Figure: the examples, marked as x's, within the set of mammals.)
117
Similarity-based models (Osherson et al.)
118
Similarity-based models (Osherson et al.)
119
Similarity-based models (Osherson et al.)
120
Similarity-based models (Osherson et al.)
  • Sum-Similarity

(Figure: summing each mammal's similarity to all of the examples.)
121
Similarity-based models (Osherson et al.)
  • Max-Similarity

(Figure: each mammal's similarity to its nearest example.)
122
Similarity-based models (Osherson et al.)
  • Max-Similarity
123
Similarity-based models (Osherson et al.)
  • Max-Similarity
124
Similarity-based models (Osherson et al.)
  • Max-Similarity
125
Similarity-based models (Osherson et al.)
  • Max-Similarity
126
Sum-Sim versus Max-Sim
  • The two models appear functionally similar:
  • Both increase monotonically as new examples are
    observed.
  • Reasons to prefer sum-sim:
  • Standard form of exemplar models of
    categorization, memory, and object recognition.
  • Analogous to kernel density estimation techniques
    in statistical pattern recognition.
  • Reasons to prefer max-sim:
  • Fit to generalization judgments . . . (both
    rules are sketched below).
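
For concreteness, a sketch of the two rules as they are usually written: similarity to the examples is either summed or maximized, then summed over instances of the conclusion category. The similarity values below are placeholders, not the Osherson et al. data:

```python
def sum_sim(conclusion_items, examples, sim):
    # strength ~ sum over the conclusion category of summed similarity to the examples
    return sum(sum(sim[i][x] for x in examples) for i in conclusion_items)

def max_sim(conclusion_items, examples, sim):
    # strength ~ sum over the conclusion category of similarity to the nearest example
    return sum(max(sim[i][x] for x in examples) for i in conclusion_items)

# placeholder symmetric similarity matrix over a few mammals
sim = {
    "horse":   {"horse": 1.0,  "cow": 0.9,  "dolphin": 0.2, "seal": 0.25},
    "cow":     {"horse": 0.9,  "cow": 1.0,  "dolphin": 0.2, "seal": 0.25},
    "dolphin": {"horse": 0.2,  "cow": 0.2,  "dolphin": 1.0, "seal": 0.8},
    "seal":    {"horse": 0.25, "cow": 0.25, "dolphin": 0.8, "seal": 1.0},
}
mammals = list(sim)

# strength of "horses, cows have P -> all mammals have P" under each rule
examples = ["horse", "cow"]
print(sum_sim(mammals, examples, sim), max_sim(mammals, examples, sim))
```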

127
Data vs. models
Data
Model
X1 have property P. X2 have property P. X3 have
property P. All mammals have property P.

Each point represents one argument.
128
Three data sets
Max-sim
Sum-sim
(Three data sets: conclusion "all mammals" with 3
examples; conclusion "horses" with 2 examples;
conclusion "horses" with 1, 2, or 3 examples.)
129
Explaining similarity
  • Why does max-sim fit so well?
  • Why does sum-sim fit so poorly?
  • Are there cases where max-sim will fail?

130
Marr's Three Levels of Analysis
  • Computation
  • What is the goal of the computation, why is it
    appropriate, and what is the logic of the
    strategy by which it can be carried out?
  • Representation and algorithm
  • Max-Sim, Sum-Sim
  • Implementation
  • Neurobiology

131
Scientific theory of biology
  • Species generated by an evolutionary branching
    process.
  • A tree-structured taxonomy of species.

132
Scientific theory of biology
  • Species generated by an evolutionary branching
    process.
  • A tree-structured taxonomy of species.
  • Features generated by stochastic mutation process
    and passed on to descendants.
  • Similarity a function of distance in tree.

133
An intuitive theory of biology
  • Species generated by an evolutionary branching
    process.
  • A tree-structured taxonomy of species.
  • Features generated by stochastic mutation process
    and passed on to descendants.
  • Similarity a function of distance in tree.

Sources: cognitive anthropology (Atran, Medin);
cognitive development (Keil, Carey)
134
A model of theory-based induction
  • 1. Reconstruct intuitive taxonomy from similarity
    judgments

cow
chimp
horse
rhino
seal
gorilla
dolphin
mouse
squirrel
elephant
135
A model of theory-based induction
  • 2. Hypothesis space H each taxonomic cluster is
    a possible hypothesis for the extension of a
    novel feature.

. . .
136
p(h) uniform
137
Bayes (taxonomic)
Max-sim
Sum-sim
(Three data sets: conclusion "all mammals" with 3
examples; conclusion "horses" with 2 examples;
conclusion "horses" with 1, 2, or 3 examples.)
138
Bayes (taxonomic)
Max-sim
Sum-sim
(Conclusion "all mammals", 3 examples.)
139
Cows have property P. Dolphins have property
P. Squirrels have property P. All mammals have
property P.
140
Seals have property P. Dolphins have property
P. Squirrels have property P. All mammals have
property P.
141
Scientific theory of biology
  • Species generated by an evolutionary branching
    process.
  • A tree-structured taxonomy of species.
  • Features generated by stochastic mutation process
    and passed on to descendants.
  • Similarity a function of distance in tree.

142
Scientific theory of biology
  • Species generated by an evolutionary branching
    process.
  • A tree-structured taxonomy of species.
  • Features generated by stochastic mutation process
    and passed on to descendants.
  • Similarity a function of distance in tree.
  • Novel features can appear anywhere in tree, but
    some distributions are more likely than others.

143
A model of theory-based induction
  • 2. Hypothesis space H each taxonomic cluster is
    a possible hypothesis for the extension of a
    novel feature.

. . .
144
A model of theory-based induction
  • 2. Generate hypotheses for novel feature F via
    (Poisson arrival) mutation process over branches
    b

145
A model of theory-based induction
2. Generate hypotheses for novel feature F via
(Poisson arrival) mutation process over branches
b
cow
chimp
rhino
horse
seal
gorilla
mouse
dolphin
squirrel
elephant
146
A model of theory-based induction
  • 2. Generate hypotheses for novel feature F via
    (Poisson arrival) mutation process over branches
    b

cow
chimp
rhino
horse
seal
gorilla
mouse
dolphin
squirrel
elephant
147
A model of theory-based induction
  • 2. Generate hypotheses for novel feature F via
    (Poisson arrival) mutation process over branches
    b

cow
chimp
rhino
horse
seal
gorilla
mouse
dolphin
squirrel
elephant
148
A model of theory-based induction
  • 2. Generate hypotheses for novel feature F via
    (Poisson arrival) mutation process over branches
    b

cow
chimp
rhino
horse
seal
gorilla
mouse
dolphin
squirrel
elephant
149
A model of theory-based induction
  • 2. Generate hypotheses for novel feature F via
    (Poisson arrival) mutation process over branches
    b

cow
chimp
rhino
horse
seal
gorilla
mouse
dolphin
squirrel
elephant
150
A model of theory-based induction
  • 2. Generate hypotheses for novel feature F via
    (Poisson arrival) mutation process over branches
    b
  • Induced prior p(h): every subset of objects is
    a possible hypothesis. p(h) depends on the
    number and length of the branches needed to
    span h. (A Monte Carlo sketch follows below.)
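
A Monte Carlo sketch of this induced prior on a toy tree; the tree, branch lengths, and mutation rate are stand-ins, not the taxonomy recovered from people's similarity judgments:

```python
import math
import random

# toy tree: (parent, child, branch length); a feature arising on a branch is
# inherited by every leaf below that branch
branches = [
    ("root", "ungulates", 1.0), ("ungulates", "horse", 0.3), ("ungulates", "cow", 0.3),
    ("root", "aquatic", 2.0),   ("aquatic", "dolphin", 0.5), ("aquatic", "seal", 0.5),
]
children = {}
for parent, child, _ in branches:
    children.setdefault(parent, []).append(child)

def leaves_below(node):
    if node not in children:
        return {node}
    return set().union(*(leaves_below(c) for c in children[node]))

def sample_hypothesis(rate=0.3):
    """At least one Poisson mutation on a branch marks all leaves below it."""
    h = set()
    for _, child, length in branches:
        if random.random() < 1.0 - math.exp(-rate * length):
            h |= leaves_below(child)
    return frozenset(h)

def estimate_prior(subset, n=100_000):
    target = frozenset(subset)
    return sum(sample_hypothesis() == target for _ in range(n)) / n

random.seed(0)
print(estimate_prior({"dolphin", "seal"}))   # spanned by one long branch: higher p(h)
print(estimate_prior({"horse", "cow"}))      # spanned by one short branch: lower p(h)
```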

151
Bayesian Occam's Razor
Probabilities provide a common currency for
balancing model complexity with fit to the data.
152
Induced prior p(h)
  • Monophyletic properties are more likely than
    polyphyletic properties:

p({horse, cow, elephant, rhino})  >  p({chimp, gorilla, elephant, rhino})
153
Induced prior p(h)
  • Novel properties are more likely to occur on
    long branches than on short branches:

p({dolphin, seal})  >  p({horse, cow})
154
p(h): evolutionary process (mutation +
inheritance)
155
Bayes (taxonomic)
Max-sim
Sum-sim
(Three data sets: conclusion "all mammals" with 3
examples; conclusion "horses" with 2 examples;
conclusion "horses" with 1, 2, or 3 examples.)
156
Bayes (taxonomy + mutation)
Max-sim
Sum-sim
(Three data sets: conclusion "all mammals" with 3
examples; conclusion "horses" with 2 examples;
conclusion "horses" with 1, 2, or 3 examples.)
157
Model variants
  • Version 1
  • Simple taxonomic hypothesis space instead of full
    hypothesis space with prior based on mutation
    process.
  • Version 2
  • Simple taxonomic hypothesis space with Hebbian
    learning instead of Bayesian inference.
  • Version 3
  • Taxonomy based on actual evolutionary tree rather
    than psychological similarity.

158
Bayes (taxonomic): r = 0.51, 0.41, 0.90
Hebb (taxonomic): r = −0.41, 0.88, 0.45
Bayes (actual evolutionary tree): r = 0.40, 0.60, 0.61

(Three data sets: conclusion "all mammals" with 3
examples; conclusion "horses" with 2 examples;
conclusion "horses" with 1, 2, or 3 examples.)
159
Mutation principle versus pure Occam's Razor
  • The mutation principle provides a version of
    Occam's Razor, by favoring hypotheses that span
    fewer disjoint clusters.
  • Could we use a more generic Bayesian Occam's
    Razor, without the biological motivation of
    mutation?

160
A model of theory-based induction
  • 2. Generate hypotheses for novel feature F via
    (Poisson arrival) mutation process over branches
    b
  • Induced prior p(h): every subset of objects is
    a possible hypothesis. p(h) depends on the
    number and length of the branches needed to
    span h.

161
A model of theory-based induction
  • 2. Generate hypotheses for novel feature F via
    (Poisson arrival) mutation process over branches
    b
  • Induced prior p(h): every subset of objects is
    a possible hypothesis. p(h) depends on the
    number and length of the branches needed to
    span h.

162
Bayes (taxonomy + Occam)
Premise typicality effect (Rips, 1975; Osherson
et al., 1990):
Strong: Horses have property P. → All mammals have property P.
Weak: Seals have property P. → All mammals have property P.
Max-sim
Sum-sim
(Conclusion "all mammals", 1 example.)
163
Bayes (taxonomy + mutation)
Premise typicality effect (Rips, 1975; Osherson
et al., 1990):
Strong: Horses have property P. → All mammals have property P.
Weak: Seals have property P. → All mammals have property P.
Max-sim
Sum-sim
(Conclusion "all mammals", 1 example.)
164
Typicality meets hierarchies
  • Collins and Quillian: semantic memory is
    structured hierarchically.
  • Traditional story: hierarchical structure is
    incompatible with typicality effects on RT.
  • New story: typicality effects are a consequence of
    inference machinery, not knowledge representation.

165
Intuitive versus scientific theories of biology
  • Same structure for how species are related.
  • Tree-structured taxonomy.
  • Same probabilistic model for traits:
  • Small probability of occurring along any branch
    at any time, plus inheritance.
  • Different features:
  • Scientist: genes
  • People: coarse anatomy and behavior

166
Markov Random Field
  • Define a neighborhood graph by thresholding
    similarity.
  • Nodes represent binary labeling variables m(i):
  • m(i) = 1 if object i is in the concept
  • m(i) = 0 otherwise
  • Potential function on edge (i, j):
  • sim(i, j) if m(i) = m(j)
  • 1 − sim(i, j) otherwise
    (A brute-force sketch follows below.)
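
A brute-force sketch of this MRF on a handful of objects (the similarities and threshold are placeholders): condition on the observed examples being labeled 1 and read off the marginal probability that a query object is also labeled 1.

```python
from itertools import product

sim = {  # placeholder symmetric similarities between objects
    ("horse", "cow"): 0.9, ("horse", "dolphin"): 0.2, ("horse", "seal"): 0.25,
    ("cow", "dolphin"): 0.2, ("cow", "seal"): 0.25, ("dolphin", "seal"): 0.8,
}
objects = ["horse", "cow", "dolphin", "seal"]
edges = [pair for pair, s in sim.items() if s > 0.1]   # neighborhood graph by threshold

def potential(i, j, m):
    s = sim[(i, j)]
    return s if m[i] == m[j] else 1.0 - s              # edge potential from the slide

def p_in_concept(query, observed):
    """Marginal p(m(query) = 1) given m(x) = 1 for every observed example."""
    num = den = 0.0
    free = [o for o in objects if o not in observed]
    for labels in product([0, 1], repeat=len(free)):
        m = {o: 1 for o in observed}
        m.update(dict(zip(free, labels)))
        weight = 1.0
        for i, j in edges:
            weight *= potential(i, j, m)
        den += weight
        num += weight if m[query] == 1 else 0.0
    return num / den

print(round(p_in_concept("cow", ["horse"]), 3))        # close neighbor of the example: high
print(round(p_in_concept("dolphin", ["horse"]), 3))    # distant from the example: lower
```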

167
MRF (pairwise potentials based on similarity):
r = 0.93, 0.80, 0.65
Max-sim
Sum-sim
(Three data sets: conclusion "all mammals" with 3
examples; conclusion "horses" with 2 examples;
conclusion "horses" with 1, 2, or 3 examples.)
168
Bayes (taxonomy + mutation)
Max-sim
Sum-sim
(Three data sets: conclusion "all mammals" with 3
examples; conclusion "horses" with 2 examples;
conclusion "horses" with 1, 2, or 3 examples.)
169
Explaining similarity
  • Why does max-sim fit so well?
  • Why does sum-sim fit so poorly?
  • Are there cases where max-sim will fail?

170
Explaining similarity
  • Why does max-sim fit so well?
  • An efficient and accurate approximation to the
    Bayesian (evolution) model.

(Histogram: correlation (r) with Bayes on
three-premise general arguments, over 100
simulated tree structures; mean r = 0.94.)
171
Explaining similarity
  • Why does max-sim fit so well?
  • The approximation is domain specific (cf. the
    number game):

60
60 80 10 30
60 52 57 55
Model (r = 0.64)
Data
172
Explaining similarity
  • Why does sum-sim fit so poorly?
  • It prefers sets of the most typical examples,
    which are not representative of the category as
    a whole.

(Histogram: correlation (r) with Bayes on
three-premise general arguments, over 100
simulated tree structures; mean r = 0.26.)
173
Explaining similarity
  • Are there cases where max-sim will fail?
  • An example from Medin et al. (in press)

Brown bears have property P. Polar bears have
property P. Grizzly bears have property P.
Horses have property P.

Brown bears have property P.
Horses have property P.

The Bayesian model makes the correct prediction, due
to the size principle (assumption of examples
sampled randomly from the concept).
174
A more systematic test of the Size Principle
175
Biology Summary
  • Theory-based statistical inference explains
    taxonomic inductive reasoning in folk biology.
  • Reveals essential principles of domain theory.
  • Category structure: taxonomic tree.
  • Feature distribution: stochastic mutation process
    + inheritance.
  • Clarifies processing-level models.
  • Why max-sim over sum-sim?
  • When is max-sim a good heuristic approximation to
    full Bayesian inference?