Title: Exact and approximate inference in probabilistic graphical models
1. Exact and approximate inference in probabilistic graphical models
- Kevin Murphy, MIT CSAIL / UBC CS & Stats
- www.ai.mit.edu/murphyk/AAAI04
- AAAI 2004 tutorial
2. Recommended reading
- Cowell, Dawid, Lauritzen and Spiegelhalter, Probabilistic Networks and Expert Systems, 1999
- Jensen, Bayesian Networks and Decision Graphs, 2001
- Jordan, Probabilistic Graphical Models (due 2005)
- Koller and Friedman, Bayes Nets and Beyond (due 2005)
- Learning in Graphical Models, edited by M. Jordan
3. Outline
- Introduction
- Exact inference
- Approximate inference
  - Deterministic
  - Stochastic (sampling)
  - Hybrid deterministic/stochastic
4. Two reasons for approximate inference
- Reason 1: low treewidth BUT non-linear / non-Gaussian factors
  - Chains, e.g. a non-linear dynamical system
  - Trees (no loops), e.g. (Bayesian) parameter estimation
- Reason 2: high treewidth
  - e.g. an n x n grid, or loopy graphs in general
[Figure: a chain X1-X2-X3 with observations Y1, Y2, Y3, and a loopy grid graph]
5. Complexity of approximate inference
- Approximating P(Xq|Xe) to within a constant factor for all discrete BNs is NP-hard [Dagum93].
  - In practice, many models exhibit weak coupling, so we may safely ignore certain dependencies.
- Computing P(Xq|Xe) for all polytrees with discrete and Gaussian nodes is NP-hard [Lerner01].
  - In practice, some of the modes of the posterior will have negligible mass.
6. Two objective functions
- Approximate the true posterior P(h|v) by a simpler Q(h)
- Variational methods: globally optimize all terms wrt the simpler Q; minimize D(Q||P)
- Expectation propagation (EP): sequentially optimize each term; minimize D(P||Q)
- min D(Q||P) is zero-forcing: P(x)=0 implies Q(x)=0, so Q tends to lock onto one mode of P
- min D(P||Q) is zero-avoiding: Q(x)>0 wherever P(x)>0, so Q tends to cover all the modes of P
7. Outline
- Introduction
- Exact inference
- Approximate inference
  - Deterministic
    - Variational
    - Loopy belief propagation
    - Expectation propagation
    - Graph cuts
  - Stochastic (sampling)
  - Hybrid deterministic/stochastic
8. Free energy
- Variational goal: minimize D(Q||P) wrt Q, where Q has a simpler form than P
- P(h,v) is simpler than P(h|v), so work with the joint
- The free energy is an upper bound on the negative log-likelihood
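The slide's equations did not survive extraction; the standard form of the variational free energy referenced here is:

```latex
F(Q) \;=\; \sum_h Q(h)\,\log\frac{Q(h)}{P(h,v)}
     \;=\; D\!\left(Q(h)\,\big\|\,P(h\mid v)\right) \;-\; \log P(v)
     \;\ge\; -\log P(v)
```

Since the KL divergence is non-negative, F(Q) upper-bounds -log P(v), with equality iff Q(h) equals the true posterior P(h|v).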
9. Point estimation
- Use a degenerate (delta-function) Q centered at a point estimate
- Minimize the free energy over the point estimate
- Iterative Conditional Modes (ICM):
  - On each iteration, for each hi, optimize the factors in the Markov blanket of hi, holding the rest fixed
- Example: K-means clustering
- Ignores uncertainty in P(h|v), P(theta|v)
- Tends to get stuck in local minima
10. Expectation Maximization (EM)
- Point estimates for the parameters (ML or MAP), full posterior for the hidden vars
- E-step: minimize F(Q,P) wrt Q(h) -- exact inference
- M-step: minimize F(Q,P) wrt Q(theta) -- maximize the expected complete-data log-likelihood (plus the parameter prior, for MAP)
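In free-energy terms, the two steps can be written as follows (standard EM; a reconstruction of the formulas lost in extraction):

```latex
\text{E-step:}\quad Q^{t+1}(h) \;=\; \arg\min_{Q} F(Q,\theta^t) \;=\; P(h \mid v, \theta^t)
\qquad
\text{M-step:}\quad \theta^{t+1} \;=\; \arg\max_{\theta}\; \mathbb{E}_{Q^{t+1}(h)}\!\left[\log P(h, v \mid \theta)\right] \;\bigl(+\, \log P(\theta)\ \text{for MAP}\bigr)
```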
11. EM tricks of the trade
- Generalized EM [Neal98]
  - Partial M-step: reduce F(Q,P) wrt Q(theta), e.g., by a gradient method
  - Partial E-step: reduce F(Q,P) wrt Q(h), i.e., approximate inference
- Avoiding local optima
  - Deterministic annealing [Rose98]
  - Data resampling [Elidan02]
- Speedup tricks
  - Combine with conjugate gradient [Salakhutdinov03]
  - Online/incremental updates [Bauer97, Neal98]
12. Variational Bayes (VB) [Ghahramani00, Beal02]
- Use a factorized approximation Q(h, theta) = Q(h) Q(theta)
- For exponential family models with conjugate priors, this results in a generalized version of EM
  - E-step: modified inference that takes into account the uncertainty of the parameters
  - M-step: optimize Q(theta) using expected sufficient statistics
- Variational Message Passing [Winn04] automates this, assuming a fully factorized (mean field) Q
- See variational-Bayes.org
13. Variational inference for discrete-state models with high treewidth
- We assume the parameters are fixed.
- We assume Q(h) has a simple form, so we can easily compute its expectations
  - Mean field: fully factorized Q
  - Structured variational, e.g. a product of chains [Xing04]
[Figure: a grid MRF approximated by a mean field (fully disconnected) graph and by a product of chains]
14. Variational inference for MRFs
- Probability is exp(-energy), normalized by the partition function
- Free energy = average energy - entropy
15. Mean field for MRFs
- Fully factorized approximation: Q(x) = prod_i q_i(x_i)
- Normalization constraint: sum_{x_i} q_i(x_i) = 1 for each i
- Average energy and entropy both decompose over nodes under this Q
- Local minima satisfy a set of coupled fixed-point equations, solved by coordinate updates
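As a concrete illustration of these fixed-point updates, here is a minimal coordinate-ascent mean field sketch for a binary pairwise MRF; the model parameterization, function name, and update schedule are illustrative, not from the slides:

```python
import math

def mean_field_ising(J, b, n_iters=100):
    """Coordinate-ascent mean field for a binary pairwise MRF with
    P(x) proportional to exp(sum_i b[i]*x[i] + sum_{i<j} J[i][j]*x[i]*x[j]),
    x[i] in {0,1}, J symmetric.  Returns mu[i] = Q(x_i = 1) under the
    fully factorized Q(x) = prod_i q_i(x_i)."""
    n = len(b)
    mu = [0.5] * n                      # initialize all marginals at 0.5
    for _ in range(n_iters):
        for i in range(n):
            # Mean field fixed point: logistic of the expected local field
            field = b[i] + sum(J[i][j] * mu[j] for j in range(n) if j != i)
            mu[i] = 1.0 / (1.0 + math.exp(-field))
    return mu
```

With zero couplings the factorized Q is exact, so each mu[i] reduces to the logistic of its local bias; with non-zero J the updates are only a local minimum of the free energy.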
16. Outline
- Introduction
- Exact inference
- Approximate inference
  - Deterministic
    - Variational
    - Loopy belief propagation
    - Expectation propagation
    - Graph cuts
  - Stochastic (sampling)
  - Hybrid deterministic/stochastic
17. BP vs mean field for MRFs
- Mean field updates: each node sends the same information (its marginal) to all of its neighbors
- BP updates: every node i sends a different message to each neighbor j
- Empirically, BP is much better than MF (e.g., MF is not exact even for trees) [Weiss01]
- BP is (attempting to) minimize the Bethe free energy [Yedidia01]
18. Bethe free energy
- We assume the graph is a tree, in which case the following decomposition is exact
- Constraints on the beliefs (pseudo-marginals):
  - Normalization: sum_{x_i} b_i(x_i) = 1
  - Marginalization: sum_{x_j} b_ij(x_i, x_j) = b_i(x_i)
- Average energy: computed from the pairwise and single-node beliefs
- Entropy: pairwise entropies, corrected by single-node terms
- d_i = number of neighbors of node i
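In this notation, with E_ij = -log(psi_ij phi_i phi_j) and E_i = -log phi_i, the Bethe free energy takes the standard Yedidia-Freeman-Weiss form (a reconstruction of the formulas lost in extraction):

```latex
F_{\text{Bethe}}
\;=\; \sum_{(ij)} \sum_{x_i, x_j} b_{ij}(x_i,x_j)\,\bigl[E_{ij}(x_i,x_j) + \log b_{ij}(x_i,x_j)\bigr]
\;-\; \sum_i (d_i - 1) \sum_{x_i} b_i(x_i)\,\bigl[E_i(x_i) + \log b_i(x_i)\bigr]
```

The (d_i - 1) correction ensures each node's contribution is counted exactly once, since node i appears in d_i pairwise terms.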
19. BP minimizes Bethe free energy
- Theorem [Yedidia, Freeman, Weiss; Yedidia01]: fixed points of BP are stationary points of the Bethe free energy
- BP may not converge; other algorithms can directly minimize F_Bethe, but are slower.
- If BP does not converge, it often means F_Bethe is a poor approximation
20. Kikuchi free energy
- Cluster groups of nodes together into regions
- Define an energy per region
- Define a free energy per region
- Kikuchi free energy: a weighted sum of regional free energies, with counting numbers chosen so that each variable is counted exactly once
21Counting numbers
3
1
2
1
2
3
6
4
5
6
4
5
Bethe
Kikuchi
12 23 14 25 36 45 56
Region graphs
1245 2356
1 2 3 4 5 6
25
C -1 -2 -1 -1 -2 -1
C1-(11)-1
Fkikuchi is exact if region graph contains 2
levels (regions and intersections)and has no
cycles equivalent to junction tree!
22. Generalized BP
- Example: a 3 x 3 grid with nodes 1..9; take the 2 x 2 plaquettes as regions (2356, 4578, 5689, ...), with intersections 25, 45, 56, 58
- F_Kikuchi is no longer exact, but is more accurate than F_Bethe
- Generalized BP can be used to minimize F_Kikuchi
- This method of choosing regions is called the cluster variational method [Welling04]
- In the limit, we recover the junction tree algorithm.
23. Outline
- Introduction
- Exact inference
- Approximate inference
  - Deterministic
    - Variational
    - Loopy belief propagation
    - Expectation propagation
    - Graph cuts
  - Stochastic (sampling)
  - Hybrid deterministic/stochastic
24. Expectation Propagation (EP) [Minka01]
- EP = iterated assumed density filtering (ADF)
- ADF = recursive Bayesian estimation interleaved with a projection step
- Examples of ADF:
  - Extended Kalman filtering
  - Moment matching (weak marginalization)
  - Boyen-Koller algorithm
  - Some online learning algorithms
25. Assumed Density Filtering (ADF)
- Recursive Bayesian estimation (sequential updating of the posterior over x as each y_i arrives)
- If p(y_i|x) is not conjugate to p(x), then p(x|y_{1:i}) may not be tractably representable
- So project the posterior back to a representable family, then update with the next observation, and repeat
- When the representable family is an exponential family, the KL projection becomes moment matching
26. Expectation Propagation
- ADF is sensitive to the order of updates.
- ADF approximates each posterior myopically.
- EP iteratively re-approximates each term.
- Exact posterior: intractable. ADF: simple, non-iterative, inaccurate. EP: simple, iterative, accurate.
[After Ghahramani]
27. Expectation Propagation
- Input: the factors f_i of the target distribution
- Initialize the approximating factors, and hence q(x)
- Repeat until convergence:
  - For i = 0..N:
    - Deletion: remove factor i's contribution from q
    - Projection: moment-match against the exact factor
    - Inclusion: update factor i's approximation
- Output: q(x)
[After Ghahramani]
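The three steps can be written out as follows (standard EP notation, with term approximations f~_i and approximating family F; a reconstruction, since the slide's formulas did not survive extraction):

```latex
\text{Deletion:}\quad q^{\setminus i}(x) \;\propto\; \frac{q(x)}{\tilde f_i(x)}
\qquad
\text{Projection:}\quad q^{\text{new}}(x) \;=\; \arg\min_{q' \in \mathcal{F}} D\!\left(q^{\setminus i}(x)\, f_i(x)\,\Big\|\, q'(x)\right)
\qquad
\text{Inclusion:}\quad \tilde f_i(x) \;\leftarrow\; \frac{q^{\text{new}}(x)}{q^{\setminus i}(x)}
```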
28. BP is a special case of EP
- BP assumes a fully factorized q(x)
- At each iteration, for each factor i, for each node k, the KL projection matches moments (computes marginals by absorbing messages from the neighbors)
[Figure: a factor graph fragment with factors fi, fj connected to nodes Xn1, Xk, Xn2]
29. TreeEP [Minka03]
- TreeEP assumes q(x) is represented by a tree (regardless of the true model topology).
- We can use the junction tree algorithm to do the moment matching at each iteration.
- Faster and more accurate than loopy BP.
- Faster than, and comparably accurate to, generalized BP.
30. Outline
- Introduction
- Exact inference
- Approximate inference
  - Deterministic
    - Variational
    - Loopy belief propagation
    - Expectation propagation
    - Graph cuts
  - Stochastic (sampling)
  - Hybrid deterministic/stochastic
31. MPE in MRFs
- MAP estimation = energy minimization
- Simplifications:
  - Only pairwise potentials (E_ijk = 0, etc.)
  - Special form for the potentials
  - Binary variables: x_i in {0,1}
32. Kinds of potential
- Metric: symmetric, zero iff the arguments are equal, and satisfies the triangle inequality
- Semi-metric: satisfies the first two conditions, but not necessarily the triangle inequality
- Piecewise constant, e.g. the Potts model (metric)
- Piecewise smooth potentials; depending on their form, these can be semi-metric or metric
- Discontinuity-preserving potentials avoid oversmoothing
33. Graph cuts [Kolmogorov04]
- Theorem: we can find argmin E(x) for binary variables and pairwise potentials in at most O(N^3) time, using a max-flow/min-cut algorithm on a suitably constructed graph, iff the potentials are submodular
- Metric potentials (e.g. Potts) are always submodular.
- Theorem: the general case (e.g. non-binary or non-submodular potentials) is NP-hard.
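The submodularity condition referenced above is, for each pairwise term V in the binary case:

```latex
V(0,0) + V(1,1) \;\le\; V(0,1) + V(1,0)
```

That is, agreeing label pairs must cost no more in total than disagreeing ones, which the Potts model clearly satisfies.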
34. Finding a strong local minimum
- For the non-binary case, we can find the optimum wrt some large space of moves by iteratively solving binary subproblems.
- alpha-expansion: any pixel can change its label to alpha
- alpha-beta swap: any alpha can switch to beta and vice versa
[Picture from Zabih]
35. Finding a strong local minimum
- Start with an arbitrary assignment f
- done := false
- While not done:
  - done := true
  - For each label alpha:
    - Find the best alpha-expansion f' of f -- a binary subproblem!
    - If E(f') < E(f) then done := false; f := f'
36. Properties of the 2 algorithms
- alpha-expansion
  - Requires V to be submodular (e.g. a metric)
  - O(L) binary subproblems per cycle, for L labels
  - Within a factor of 2c(V) of the optimal energy
  - c = 1 for the Potts model
- alpha-beta swap
  - Requires V to be a semi-metric
  - O(L^2) binary subproblems per cycle
  - No comparable theoretical guarantee, but works well in practice
37. Summary of inference methods for pairwise MRFs
- Marginals
  - Mean field
  - Loopy/generalized BP (sum-product)
  - EP
  - Gibbs sampling
  - Swendsen-Wang
- MPE/Viterbi
  - Iterative conditional modes (ICM)
  - Loopy/generalized BP (max-product)
  - Graph cuts
  - Simulated annealing
- See [Boykov01], [Weiss01] and [Tappen03] for some empirical comparisons
38. Outline
- Introduction
- Exact inference
- Approximate inference
  - Deterministic
  - Stochastic (sampling)
  - Hybrid deterministic/stochastic
39. Monte Carlo (sampling) methods
- Goal: estimate an expectation E_P[f(X)], e.g., a posterior marginal
- Draw N independent samples x^r ~ P and average f over them
- Accuracy is independent of the dimensionality of X
- But it is hard to draw (independent) samples from P
40. Importance Sampling
- We sample from a proposal Q(x) and reweight by P(x)/Q(x)
- Requires Q(x) > 0 wherever P(x) > 0
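The reweighting rests on the standard importance sampling identity (with an unnormalized P, one uses the self-normalized form on the right):

```latex
\mathbb{E}_P[f] \;=\; \sum_x f(x)\,\frac{P(x)}{Q(x)}\,Q(x)
\;\approx\; \frac{1}{N}\sum_{r=1}^N w^r f(x^r),
\quad w^r = \frac{P(x^r)}{Q(x^r)},
\qquad \text{or}\qquad
\frac{\sum_r w^r f(x^r)}{\sum_r w^r}
```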
41. Importance Sampling for BNs (likelihood weighting)
- Input: CPDs P(Xi | X_pa(i)), evidence x_E
- Output: weighted samples (x^r, w^r)
- For each sample r:
  - w^r := 1
  - For each node i in topological order:
    - If Xi is observed, then x_i^r := x_i^E and w^r := w^r * P(Xi = x_i^E | X_pa(i) = x_pa(i)^r)
    - Else sample x_i^r ~ P(Xi | X_pa(i) = x_pa(i)^r)
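A minimal runnable sketch of this procedure on a hypothetical three-node chain C -> R -> W with W = 1 observed; the network structure and CPT values are invented for illustration, not from the slides:

```python
import random

def likelihood_weighting(n_samples=20000, seed=0):
    """Likelihood weighting in a toy BN C -> R -> W, with W = 1 observed.
    Unobserved nodes are sampled from their CPDs in topological order;
    the observed node contributes its CPD value to the sample weight.
    Returns an estimate of P(R = 1 | W = 1)."""
    rng = random.Random(seed)
    p_c = 0.5                        # P(C = 1)   (hypothetical CPTs)
    p_r = {0: 0.2, 1: 0.8}           # P(R = 1 | C)
    p_w = {0: 0.1, 1: 0.9}           # P(W = 1 | R)
    num = den = 0.0
    for _ in range(n_samples):
        c = 1 if rng.random() < p_c else 0        # sample C ~ P(C)
        r = 1 if rng.random() < p_r[c] else 0     # sample R ~ P(R | C)
        w = p_w[r]                                # weight from evidence W = 1
        num += w * r
        den += w
    return num / den
```

For these CPTs, exact enumeration gives P(R=1 | W=1) = 0.45 / 0.50 = 0.9, and the weighted estimate converges to that value.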
42. Drawbacks of importance sampling
- Samples are drawn given upstream evidence only, and weighted by downstream evidence.
- Evidence reversal (modify the model so that all observed nodes become parents) helps, but can be expensive.
- Does not scale to high-dimensional spaces, even if Q is similar to P, since the variance of the weights becomes too high.
43. Sequential importance sampling (particle filtering) [Arulampalam02, Doucet01]
- Apply importance sampling to a (non-linear, non-Gaussian) dynamical system.
- Resample particles with probability proportional to their weights w_t
- Unlikely hypotheses get replaced
44. Markov Chain Monte Carlo (MCMC) [Neal93, Mackay98]
- Draw dependent samples x^t from a chain with transition kernel T(x'|x), such that:
  - P(x) is the stationary distribution
  - The chain is ergodic (any state can eventually be reached from any other)
- If T satisfies detailed balance, then P is a stationary distribution of the chain
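The detailed balance condition, and why it implies stationarity (a reconstruction of the missing formula):

```latex
P(x)\,T(x' \mid x) \;=\; P(x')\,T(x \mid x')\quad \forall x, x'
\;\;\Longrightarrow\;\;
\sum_x P(x)\,T(x' \mid x) \;=\; P(x')\sum_x T(x \mid x') \;=\; P(x')
```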
45. Metropolis-Hastings
- Sample a proposal x' ~ Q(x'|x^{t-1})
- Accept the new state with probability min(1, [P(x') Q(x^{t-1}|x')] / [P(x^{t-1}) Q(x'|x^{t-1})])
- This acceptance rule satisfies detailed balance
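A minimal random-walk Metropolis sketch; with a symmetric Gaussian proposal the Hastings ratio Q(x|x')/Q(x'|x) cancels, leaving the P(x')/P(x) acceptance test. The function name, target, and step size are illustrative:

```python
import math
import random

def metropolis_hastings(log_p, x0, n_steps, step=1.0, seed=0):
    """Random-walk Metropolis: propose x' = x + step * eps with symmetric
    Gaussian noise eps, accept with probability min(1, P(x')/P(x)).
    Returns the list of (dependent) samples, repeating the current state
    on rejection."""
    rng = random.Random(seed)
    x, samples = x0, []
    for _ in range(n_steps):
        x_new = x + step * rng.gauss(0.0, 1.0)
        # Accept/reject in log space to avoid overflow
        if math.log(rng.random()) < log_p(x_new) - log_p(x):
            x = x_new
        samples.append(x)
    return samples
```

Targeting a standard normal via log_p = lambda x: -0.5 * x * x, the long-run sample mean approaches 0 and the sample variance approaches 1.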
46. Gibbs sampling
- A Metropolis method where Q is defined in terms of the full conditionals P(Xi | X_{-i}).
- The acceptance rate is 1.
- For a graphical model, we only need to condition on the Markov blanket
- See the BUGS software
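A minimal Gibbs sampler for a hypothetical two-node binary MRF: each full conditional depends only on the Markov blanket, which here is just the other node. The parameter values are illustrative:

```python
import math
import random

def gibbs_ising_pair(b, J, n_sweeps=50000, seed=0):
    """Gibbs sampling for two binary variables with
    P(x1, x2) proportional to exp(b[0]*x1 + b[1]*x2 + J*x1*x2).
    Each full conditional P(x_i = 1 | x_{-i}) is a logistic of the
    local field from the Markov blanket.  Returns an estimate of
    P(x1 = 1)."""
    rng = random.Random(seed)
    x = [0, 0]
    count1 = 0
    for _ in range(n_sweeps):
        for i in (0, 1):
            field = b[i] + J * x[1 - i]
            x[i] = 1 if rng.random() < 1.0 / (1.0 + math.exp(-field)) else 0
        count1 += x[0]
    return count1 / n_sweeps
```

For b = [0.5, -0.5] and J = 1.0, enumerating the four joint states gives P(x1 = 1) ~ 0.731, and the Gibbs estimate converges to it.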
47. Difficulties with MCMC
- May take a long time to mix (converge to the stationary distribution).
- Hard to know when the chain has mixed.
- Simple proposals exhibit random-walk behavior. Remedies:
  - Hybrid Monte Carlo (use gradient information)
  - Swendsen-Wang (large moves for the Ising model)
  - Heuristic proposals
48. Outline
- Introduction
- Exact inference
- Approximate inference
  - Deterministic
  - Stochastic (sampling)
  - Hybrid deterministic/stochastic
49. Comparison of deterministic and stochastic methods
- Deterministic: fast but inaccurate
- Stochastic: slow but accurate; can handle an arbitrary hypothesis space
- Combine the best of both worlds (hybrid):
  - Use smart deterministic proposals
  - Integrate out some of the states, sample the rest (Rao-Blackwellization)
  - Non-parametric BP (particle filtering for graphs)
50. Examples of deterministic proposals
- State estimation: unscented particle filter [Merwe00]
- Machine learning: variational MCMC [deFreitas01]
- Computer vision: data-driven MCMC [Tu02]
51. Example of Rao-Blackwellized particle filtering
- Conditioned on the discrete switching nodes, the remaining system is linear-Gaussian and can be integrated out using the Kalman filter.
- Each particle contains a sampled value s_t^r and the mean/covariance of P(X_t | y_{1:t}, s_{1:t}^r)
52. Outline
- Introduction
- Exact inference
- Approximate inference
  - Deterministic
  - Stochastic (sampling)
  - Hybrid deterministic/stochastic
- Summary
53. Summary of inference methods

| Node type | Chain (online) | Low treewidth | High treewidth |
|-----------|----------------|---------------|----------------|
| Discrete | BP = forwards algorithm, Boyen-Koller (ADF), beam search | VarElim, Jtree, recursive conditioning | Loopy BP, mean field, structured variational, EP, graph cuts, Gibbs |
| Gaussian | BP = Kalman filter | Jtree = sparse linear algebra | Loopy BP, Gibbs |
| Other | EKF, UKF, moment matching (ADF), particle filter | EP, EM, VB, NBP, Gibbs | EP, variational EM, VB, NBP, Gibbs |

BP = belief propagation, EP = expectation propagation, ADF = assumed density filtering, EKF = extended Kalman filter, UKF = unscented Kalman filter, VarElim = variable elimination, Jtree = junction tree, EM = expectation maximization, VB = variational Bayes, NBP = non-parametric BP