1
Monte Carlo methods
Tom Griffiths
UC Berkeley
2
(No Transcript)
3
Two uses of Monte Carlo methods
  • For solving problems of probabilistic inference
    involved in developing computational models
  • As a source of hypotheses about how the mind
    might solve problems of probabilistic inference

4
Answers and expectations
  • For a function f(x) and distribution P(x), the
    expectation of f with respect to P is
      E_P[f] = Σ_x f(x) P(x)
  • The expectation is the average of f when x is
    drawn from the probability distribution P

5
Answers and expectations
  • Example 1: the average of spots on a die roll
    • x ∈ {1, …, 6}, f(x) = x, P(x) is uniform
  • Example 2: the probability that two observations
    belong to the same mixture component
    • x is an assignment of observations to components,
      f(x) = 1 if the observations belong to the same
      component and 0 otherwise, P(x) is the posterior
      over assignments

6
The Monte Carlo principle
  • The expectation of f with respect to P can be
    approximated by
      E_P[f] ≈ (1/n) Σ_{i=1..n} f(x_i)
    where the x_i are sampled from P(x)
  • Example 1: the average of spots on a die roll
    (see the sketch below)
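A minimal sketch of this principle in Python; the die example is from the slides, but the function name and sample sizes are mine:

```python
# Estimate E[f(x)] by simple Monte Carlo for a die roll:
# f(x) = x, P(x) uniform on {1, ..., 6}, true expectation 3.5.
import random

def monte_carlo_mean(n):
    """Average of n simulated die rolls; converges to 3.5 as n grows."""
    return sum(random.randint(1, 6) for _ in range(n)) / n

for n in [10, 100, 10_000]:
    print(n, monte_carlo_mean(n))
```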

7
The Monte Carlo principle
(figure: the law of large numbers; average number of
spots plotted against the number of rolls)
8
More formally
  • θ̂_MC is consistent: (θ̂_MC − θ) → 0 a.s. as n → ∞
  • θ̂_MC is unbiased: E[θ̂_MC] = θ
  • θ̂_MC is asymptotically normal:
      √n (θ̂_MC − θ) → N(0, Var_P[f(x)]) in distribution

9
When simple Monte Carlo fails
  • Efficient algorithms for sampling only exist for
    a relatively small number of distributions

10
Inverse cumulative distribution
(figure: a CDF rising from 0 to 1; draw u ~ Uniform(0, 1)
and return x = F⁻¹(u))
  • (requires the CDF to be invertible)
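A minimal sketch of inverse-CDF sampling, using an exponential distribution as the worked case (the choice of distribution is mine; the slide's figure is generic):

```python
# Inverse-CDF sampling for Exponential(rate): the CDF
# F(x) = 1 - exp(-rate * x) inverts to F^-1(u) = -ln(1 - u) / rate.
import math
import random

def sample_exponential(rate):
    u = random.random()               # u ~ Uniform(0, 1)
    return -math.log(1.0 - u) / rate  # x = F^-1(u)

samples = [sample_exponential(2.0) for _ in range(100_000)]
print(sum(samples) / len(samples))  # near the true mean 1/rate = 0.5
```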

11
Rejection sampling
(figure: rejection sampling from a density p(x); proposals
falling above p(x) are rejected)
12
Rejection sampling from the posterior
  • Generate samples of all variables following the
    generative process in the model
  • Reject all samples that do not match the observed
    data

(figure: graphical model with nodes X1, X2, X3, X4)
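A minimal sketch of this reject-on-mismatch scheme, assuming a hypothetical coin-flip model that is not in the slides: theta ~ Uniform(0, 1), five flips, three observed heads.

```python
# Rejection sampling from a posterior: run the full generative process
# and keep theta only when the simulated data match the observation.
import random

def posterior_samples(n_kept, n_flips=5, observed_heads=3):
    kept = []
    while len(kept) < n_kept:
        theta = random.random()  # sample from the Uniform(0,1) prior
        heads = sum(random.random() < theta for _ in range(n_flips))
        if heads == observed_heads:  # reject samples that mismatch the data
            kept.append(theta)
    return kept

samples = posterior_samples(10_000)
print(sum(samples) / len(samples))  # near the Beta(4, 3) posterior mean 4/7
```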
13
When simple Monte Carlo fails
  • Efficient algorithms for sampling only exist for
    a relatively small number of distributions
  • Sampling from distributions over large discrete
    state spaces is computationally expensive
  • a mixture model with n observations and k
    components, or an HMM with n observations and k
    states, has kⁿ possibilities

14
When simple Monte Carlo fails
  • Efficient algorithms for sampling only exist for
    a relatively small number of distributions
  • Sampling from distributions over large discrete
    state spaces is computationally expensive
  • a mixture model with n observations and k
    components, or an HMM with n observations and k
    states, has kⁿ possibilities
  • Sometimes we want to sample from distributions
    for which we only know the probability of each
    state up to a multiplicative constant

15
Why Bayesian inference is hard
Evaluating the posterior probability of a
hypothesis requires summing over all hypotheses:
  P(h|d) = P(d|h) P(h) / Σ_{h′} P(d|h′) P(h′)
(the statistical-physics analogue: computing a
partition function)
16
Modern Monte Carlo methods
  • Sampling schemes for distributions with large
    state spaces known up to a multiplicative
    constant
  • Two approaches
  • importance sampling
  • Markov chain Monte Carlo
  • (Major competitors: variational inference and
    sophisticated numerical quadrature methods)

17
Importance sampling
  • Basic idea: generate from the wrong
    distribution, then assign weights to the samples
    to correct for this

18
Importance sampling
  E_P[f] = ∫ f(x) [p(x)/q(x)] q(x) dx ≈ (1/n) Σ_i f(x_i) p(x_i)/q(x_i),  x_i ~ q(x)
works when sampling from the proposal q is easy but the
target p is hard
19
An alternative scheme
  θ̂_IS = Σ_i w_i f(x_i) / Σ_i w_i,  where w_i = p*(x_i)/q(x_i) and p*(x) ∝ p(x)
works when p(x) is known only up to a multiplicative
constant, since the constant cancels in the ratio
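A minimal sketch of this self-normalized scheme, assuming an unnormalized standard-normal target and a uniform proposal (both choices are mine):

```python
# Self-normalized importance sampling: target known only up to a constant,
# p*(x) = exp(-x^2 / 2); proposal q = Uniform(-5, 5). Estimate E[x^2]
# (true value 1) without ever normalizing p*.
import math
import random

n = 100_000
xs = [random.uniform(-5.0, 5.0) for _ in range(n)]
ws = [math.exp(-x * x / 2.0) / 0.1 for x in xs]  # w = p*(x) / q(x), q(x) = 1/10
estimate = sum(w * x * x for w, x in zip(ws, xs)) / sum(ws)
print(estimate)  # close to 1
```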
20
More formally
  • θ̂_IS is consistent: (θ̂_IS − θ) → 0 a.s. as n → ∞
  • θ̂_IS is asymptotically normal
  • θ̂_IS (in its self-normalized form) is biased, with
    a bias of order 1/n

21
Optimal importance sampling
  • The asymptotic variance of the estimator is
      (1/n) (E_q[(f(x) p(x)/q(x))²] − θ²)
  • This is minimized by q*(x) ∝ |f(x)| p(x)

22
Optimal importance sampling
23
(No Transcript)
24
Likelihood weighting
  • A particularly simple form of importance sampling
    for posterior distributions
  • Use the prior as the proposal distribution
  • Weights: w_i ∝ P(d | x_i), the probability of the
    observed data d given the sampled values

25
Likelihood weighting
  • Generate samples of all variables except observed
    variables
  • Assign weights proportional to probability of
    observed data given values in sample

(figure: graphical model with nodes X1, X2, X3, X4)
(contrast to rejection sampling from the
posterior)
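A minimal sketch of likelihood weighting on the same hypothetical coin model used for rejection sampling above: sample theta from the prior and weight it by the probability of the observed data instead of rejecting mismatches.

```python
# Likelihood weighting: weight each prior sample by P(data | theta),
# here the probability of 3 heads in 5 flips.
import math
import random

def binomial_likelihood(theta, heads=3, flips=5):
    return math.comb(flips, heads) * theta**heads * (1 - theta)**(flips - heads)

n = 100_000
thetas = [random.random() for _ in range(n)]        # samples from the prior
weights = [binomial_likelihood(t) for t in thetas]  # w_i = P(data | theta_i)
posterior_mean = sum(w * t for w, t in zip(weights, thetas)) / sum(weights)
print(posterior_mean)  # near 4/7, with no samples wasted on rejection
```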
26
Importance sampling
  • A general scheme for sampling from complex
    distributions that have simpler relatives
  • Simple methods for sampling from posterior
    distributions in some cases (easy to sample from
    prior, prior and posterior are close)
  • Can be more efficient than simple Monte Carlo
  • particularly for, e.g., tail probabilities
  • Also provides a solution to the question of how
    people can update beliefs as data come in

27
Particle filtering
(figure: HMM with hidden states s1, …, s4 and
observations d1, …, d4)
We want to generate samples from P(s4 | d1, …, d4)
We can use likelihood weighting if we can sample
from P(s4 | s3) and P(s3 | d1, …, d3)
28
Particle filtering
(figure: particles propagated and reweighted over time;
samples from P(s3 | d1, …, d3))
29
Tweaks and variations
  • If we can enumerate the values of s4, we can
    sample directly from P(s4 | s3, d4) ∝ P(d4 | s4) P(s4 | s3)
  • No need to resample at every step, since we can
    accumulate weights over multiple observations
    • resampling reduces diversity in samples
    • only necessary when the variance of the weights
      is large
  • Stratification and clever resampling schemes
    reduce variance (Fearnhead, 2001); a sketch of the
    basic filter follows
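Here is a minimal bootstrap-filter sketch under an assumed toy model (Gaussian random-walk state, Gaussian observations; the model and the observation values are mine, not the slides'):

```python
# Bootstrap particle filter: propagate particles through P(s_t | s_{t-1}),
# weight by P(d_t | s_t), then resample. Toy model:
# s_t = s_{t-1} + N(0, 1), d_t = s_t + N(0, 1).
import math
import random

def gaussian_pdf(x, mean, sd):
    return math.exp(-((x - mean) ** 2) / (2 * sd * sd)) / (sd * math.sqrt(2 * math.pi))

def particle_filter(observations, n_particles=1000):
    particles = [random.gauss(0.0, 1.0) for _ in range(n_particles)]  # assumed N(0,1) prior
    for d in observations:
        particles = [random.gauss(s, 1.0) for s in particles]       # propose from P(s_t | s_{t-1})
        weights = [gaussian_pdf(d, s, 1.0) for s in particles]      # likelihood weighting
        particles = random.choices(particles, weights, k=n_particles)  # resample
    return particles

final = particle_filter([0.5, 1.2, 2.1, 2.8])       # hypothetical d_1, ..., d_4
print(sum(final) / len(final))  # approximate posterior mean of s_4 given d_1..d_4
```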

30
The promise of particle filters
  • People need to be able to update probability
    distributions over large hypothesis spaces as
    more data become available
  • Particle filters provide a way to do this with
    limited computing resources
  • maintain a fixed finite number of samples
  • Not just for dynamic models
  • can work with a fixed set of hypotheses, although
    this requires some further tricks for maintaining
    diversity

31
Markov chain Monte Carlo
  • Basic idea: construct a Markov chain that will
    converge to the target distribution, and draw
    samples from that chain
  • Uses only something proportional to the target
    distribution (good for Bayesian inference!)
  • Can work in state spaces of arbitrary (including
    unbounded) size (good for nonparametric Bayes)

32
Markov chains
(figure: a chain of states x(1) → x(2) → … → x(t))
Transition matrix: T = P(x(t+1) | x(t))
  • Variable x(t+1) is independent of all previous
    variables given its immediate predecessor x(t)

33
An example card shuffling
  • Each state x(t) is a permutation of a deck of
    cards (there are 52! permutations)
  • The transition matrix T indicates how likely one
    permutation is to follow another
  • The transition probabilities are determined by
    the shuffling procedure
    • riffle shuffle
    • overhand
    • one card

34
Convergence of Markov chains
  • Why do we shuffle cards?
  • Convergence to a uniform distribution takes only
    7 riffle shuffles
  • Other Markov chains will also converge to a
    stationary distribution if certain simple
    conditions (together called ergodicity) are
    satisfied
    • e.g., every state can be reached from every
      other state in some number of steps

35
Markov chain Monte Carlo
(figure: a chain of states x(1) → x(2) → … → x(t))
Transition matrix: T = P(x(t+1) | x(t))
  • States of the chain are the variables of interest
  • Transition matrix chosen to give the target
    distribution as its stationary distribution

36
Metropolis-Hastings algorithm
  • Transitions have two parts:
    • proposal distribution: Q(x(t+1) | x(t))
    • acceptance: take proposals with probability
        A(x(t), x(t+1)) = min(1, [P(x(t+1)) Q(x(t) | x(t+1))] / [P(x(t)) Q(x(t+1) | x(t))])
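A minimal sketch of the algorithm with a symmetric Gaussian random-walk proposal, so the Q terms cancel (the target and proposal choices are mine):

```python
# Random-walk Metropolis-Hastings targeting an unnormalized density
# p*(x) = exp(-x^2 / 2); with a symmetric proposal the acceptance
# probability reduces to min(1, p*(x') / p*(x)).
import math
import random

def p_star(x):
    return math.exp(-x * x / 2.0)  # target, known up to a constant

x = 0.0
samples = []
for _ in range(50_000):
    proposal = random.gauss(x, 1.0)                  # Q(x' | x), symmetric
    accept = min(1.0, p_star(proposal) / p_star(x))  # A(x, x')
    if random.random() < accept:
        x = proposal
    samples.append(x)

print(sum(samples) / len(samples))  # near the target mean 0
```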
37
Metropolis-Hastings algorithm
(figures, slides 37–42: a sequence of proposals on a
density p(x); one proposal is accepted with probability
A(x(t), x(t+1)) = 0.5, another with A(x(t), x(t+1)) = 1)
43
Metropolis-Hastings in a slide
44
Metropolis-Hastings algorithm
  • For the right stationary distribution, we want
      Σ_x P(x) T(x → y) = P(y) for all y
  • A sufficient condition is detailed balance:
      P(x) T(x → y) = P(y) T(y → x)

45
Metropolis-Hastings algorithm
P(x) Q(y | x) A(x, y) = min(P(x) Q(y | x), P(y) Q(x | y))
This is symmetric in (x, y) and thus satisfies
detailed balance
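A minimal numerical check of this symmetry, assuming a small discrete target distribution and a uniform proposal (both are my choices for illustration):

```python
# Verify detailed balance for Metropolis-Hastings on a 4-state space:
# build T from a uniform proposal Q and the MH acceptance rule, then
# check P(x) T(y|x) == P(y) T(x|y) for all pairs of distinct states.
P = [0.1, 0.2, 0.3, 0.4]  # assumed target distribution
n = len(P)
Q = 1.0 / n               # uniform proposal Q(y | x)

def T(y, x):
    """MH transition probability from x to y (off-diagonal entries)."""
    return Q * min(1.0, (P[y] * Q) / (P[x] * Q))

for x in range(n):
    for y in range(n):
        if x != y:
            assert abs(P[x] * T(y, x) - P[y] * T(x, y)) < 1e-12
print("detailed balance holds for all state pairs")
```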
46
Gibbs sampling
  • A particular choice of proposal distribution
  • For variables x = (x1, x2, …, xn)
  • Draw xi(t+1) from P(xi | x−i)
  • x−i = (x1(t+1), x2(t+1), …, xi−1(t+1), xi+1(t), …, xn(t))
  • (this is called the full conditional distribution)
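A minimal sketch of Gibbs sampling where the full conditionals are available in closed form: a bivariate Gaussian with correlation rho (this worked example is mine; compare the two-variable figure two slides below):

```python
# Gibbs sampling for a standard bivariate Gaussian with correlation rho:
# x1 | x2 ~ N(rho * x2, 1 - rho^2), and symmetrically for x2.
import math
import random

rho = 0.9
sd = math.sqrt(1.0 - rho * rho)
x1, x2 = 0.0, 0.0
samples = []
for _ in range(20_000):
    x1 = random.gauss(rho * x2, sd)  # draw x1 from P(x1 | x2)
    x2 = random.gauss(rho * x1, sd)  # draw x2 from P(x2 | x1)
    samples.append((x1, x2))

mean_product = sum(a * b for a, b in samples) / len(samples)
print(mean_product)  # near the true correlation 0.9
```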

47
In a graphical model
(figures: graphical model with nodes X1, X2, X3, X4,
highlighting each variable in turn)
Sample each variable conditioned on its Markov
blanket
48
Gibbs sampling
(figure: Gibbs sampling for two correlated variables
X1 and X2, moving along one axis at a time; MacKay, 2002)
49
Gibbs sampling in mixture models
(figure: mixture-model graphical model with parameters,
assignments z, and data x)
  • sample assignments to components given data and
    parameters
  • sample parameters given data and assignments to
    components
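A minimal sketch of these two alternating steps for a toy two-component Gaussian mixture; the unit variances, equal weights, and flat prior on the means are my simplifying assumptions, not the slides':

```python
# Two-step Gibbs sampler for a toy two-component Gaussian mixture.
import math
import random

data = [random.gauss(-2.0, 1.0) for _ in range(50)] + \
       [random.gauss(3.0, 1.0) for _ in range(50)]
mu = [-1.0, 1.0]  # initial component means

def density(x, m):
    return math.exp(-((x - m) ** 2) / 2.0)  # unit-variance Gaussian, unnormalized

for sweep in range(200):
    # Step 1: sample assignments to components given data and parameters.
    z = [random.choices([0, 1], [density(x, mu[0]), density(x, mu[1])])[0]
         for x in data]
    # Step 2: sample parameters given data and assignments.
    for k in (0, 1):
        xs = [x for x, zi in zip(data, z) if zi == k]
        if xs:  # posterior of the mean under a flat prior: N(mean(xs), 1/n)
            mu[k] = random.gauss(sum(xs) / len(xs), 1.0 / math.sqrt(len(xs)))

print(sorted(mu))  # means near -2 and 3
```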
50
MCMC vs. EM
EM converges to a single solution; MCMC converges to
a distribution over solutions
51
Evaluating convergence
  • Basic formal result justifying MCMC:
    expectations over sequences of variables converge
    to expectations over the stationary distribution
  • Under this result, just run MCMC as long as
    possible to get as close as possible to the truth
  • In practice, a variety of heuristics are used to
    assess convergence
    • e.g., start several overdispersed chains and
      check the ratio of between-chain to within-chain
      variance (Gelman); a sketch follows
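A minimal sketch of that between/within-chain variance check (this omits the split-chain and degrees-of-freedom refinements of the published statistic):

```python
# Gelman-Rubin style convergence check: compare between-chain and
# within-chain variance across several chains of equal length.
import random

def r_hat(chains):
    """Chains: list of equal-length lists of draws. Near 1 suggests convergence."""
    m, n = len(chains), len(chains[0])
    means = [sum(c) / n for c in chains]
    grand = sum(means) / m
    B = n / (m - 1) * sum((mu - grand) ** 2 for mu in means)  # between-chain
    W = sum(sum((x - mu) ** 2 for x in c) / (n - 1)
            for c, mu in zip(chains, means)) / m              # within-chain
    var_plus = (n - 1) / n * W + B / n
    return (var_plus / W) ** 0.5

chains = [[random.gauss(0, 1) for _ in range(1000)] for _ in range(4)]
print(r_hat(chains))  # near 1 for these well-mixed synthetic chains
```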

52
Evaluating convergence
53
Collapsed Gibbs sampler
(figure: summing out the parameters of the mixture
model leaves a model over assignments z and data x alone)
  • sum out the parameters
  • sample assignments given data and the other
    assignments
54
Collapsed Gibbs sampler
(figure: collapsed model over assignments z and data x)
  • sample assignments given data and the other
    assignments
  • with K components and a Dirichlet(α/K) prior on the
    mixing weights, this becomes a Dirichlet process
    mixture as K → ∞
55
The magic of MCMC
  • Since we only ever need to evaluate the relative
    probabilities of two states, we can have huge
    state spaces (much of which we rarely reach)
  • In fact, our state spaces can be infinite
  • common with nonparametric Bayesian models
  • But the guarantees it provides are asymptotic
  • making algorithms that converge in practical
    amounts of time is a significant challenge

56
MCMC and cognitive science
  • The main use of MCMC is for probabilistic
    inference in complex models (for modelers and
    learners)
  • The Metropolis-Hastings algorithm seems like a
    good metaphor for aspects of development
  • A form of cultural evolution can be shown to be
    equivalent to Gibbs sampling (Griffiths & Kalish,
    2005)
  • We can also use MCMC algorithms as the basis for
    experiments with people
  • (see breakout session tomorrow!)

57
Samples from Subject 3 (projected onto a plane via LDA)
58
Three uses of Monte Carlo methods
  1. For solving problems of probabilistic inference
     involved in developing computational models
  2. As a source of hypotheses about how the mind
     might solve problems of probabilistic inference
  3. As a way to explore people's subjective
     probability distributions

59
(No Transcript)