
Exact and approximate inference in probabilistic graphical models

Kevin Murphy, MIT CSAIL / UBC CS & Stats

www.ai.mit.edu/murphyk/AAAI04

AAAI 2004 tutorial

Recommended reading

- Cowell, Dawid, Lauritzen, and Spiegelhalter, Probabilistic Networks and Expert Systems, 1999
- Jensen, Bayesian Networks and Decision Graphs, 2001
- Jordan, Probabilistic Graphical Models (due 2005)
- Koller and Friedman, Bayes Nets and Beyond (due 2005)
- Learning in Graphical Models, edited by M. Jordan

Outline

- Introduction
- Exact inference
- Approximate inference
- Deterministic
- Stochastic (sampling)
- Hybrid deterministic/stochastic

2 reasons for approximate inference

- Reason 1: low treewidth, but non-linear/non-Gaussian:
- Chains, e.g., a non-linear dynamical system
- Trees (no loops), e.g., (Bayesian) parameter estimation
- Reason 2: high treewidth:
- N x N grids
- Loopy graphs

[Figure: a chain X1-X2-X3 with observations Y1, Y2, Y3 (low treewidth), next to a grid-structured loopy graph (high treewidth)]

Complexity of approximate inference

- Approximating P(Xq | Xe) to within a constant factor for all discrete BNs is NP-hard [Dagum93].
- In practice, many models exhibit weak coupling, so we may safely ignore certain dependencies.
- Computing P(Xq | Xe) for all polytrees with discrete and Gaussian nodes is NP-hard [Lerner01].
- In practice, some of the modes of the posterior will have negligible mass.

2 objective functions

- Approximate the true posterior P(h | v) by a simpler Q(h).
- Variational methods: globally optimize all terms w.r.t. the simpler Q, by minimizing D(Q || P); this objective is zero-forcing (Q = 0 wherever P = 0), so Q tends to lock onto one mode of P.
- Expectation propagation (EP): sequentially optimize each term, by minimizing D(P || Q); this objective is zero-avoiding (Q > 0 wherever P > 0), so Q tends to cover all the modes of P.
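The slide's formulas were lost in conversion; written out, the two objectives are the standard KL divergences (Q approximates the posterior P(h | v)):

```latex
D(Q \,\|\, P) = \sum_h Q(h) \log \frac{Q(h)}{P(h \mid v)}
\qquad
D(P \,\|\, Q) = \sum_h P(h \mid v) \log \frac{P(h \mid v)}{Q(h)}
```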

Outline

- Introduction
- Exact inference
- Approximate inference
- Deterministic
- Variational
- Loopy belief propagation
- Expectation propagation
- Graph cuts
- Stochastic (sampling)
- Hybrid deterministic/stochastic

Free energy

- Variational goal: minimize D(Q || P) w.r.t. Q, where Q has a simpler form than P.
- P(h, v) is simpler than P(h | v), so use the free energy F(Q) = Σ_h Q(h) log [Q(h) / P(h, v)].
- The free energy is an upper bound on the negative log-likelihood: F(Q) = D(Q(h) || P(h | v)) − log P(v) ≥ −log P(v).

Point estimation

- Use a point-mass posterior Q(h) = δ(h − ĥ).
- Minimize F(Q, P) w.r.t. ĥ.
- Iterative Conditional Modes (ICM): for each iteration, for each h_i, set h_i to its best value given the factors in the Markov blanket of h_i (a sketch follows).
- Example: K-means clustering.
- Ignores uncertainty in P(h | v) and P(θ | v).
- Tends to get stuck in local minima.
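A minimal runnable ICM sketch (my illustration, not from the tutorial), assuming a grid MRF with Potts pairwise terms and 4-connected neighborhoods:

```python
import numpy as np

def icm(unary, beta, n_iters=10):
    """Iterative Conditional Modes on a grid MRF with Potts pairwise terms.

    unary: (H, W, K) array of unary energies E_i(x_i).
    beta:  Potts penalty for disagreeing neighbors.
    Each sweep sets every x_i to its lowest-energy label given its
    Markov blanket (here, the 4-connected neighbors)."""
    H, W, K = unary.shape
    x = unary.argmin(axis=2)               # initialize at the unary minimum
    for _ in range(n_iters):
        for i in range(H):
            for j in range(W):
                e = unary[i, j].copy()     # energy of each candidate label
                for di, dj in [(-1, 0), (1, 0), (0, -1), (0, 1)]:
                    ni, nj = i + di, j + dj
                    if 0 <= ni < H and 0 <= nj < W:
                        # Potts: pay beta for each disagreeing neighbor
                        e += beta * (np.arange(K) != x[ni, nj])
                x[i, j] = e.argmin()       # coordinate-wise greedy update
    return x

# Tiny usage example: denoise a noisy binary image
rng = np.random.default_rng(0)
noisy = (rng.random((8, 8)) < 0.5).astype(float)
unary = np.stack([noisy, 1 - noisy], axis=2)   # prefer the observed label
labels = icm(unary, beta=0.7)
```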

Expectation Maximization (EM)

- Point estimates for the parameters (ML or MAP), full posterior for the hidden variables.
- E-step: minimize F(Q, P) w.r.t. Q(h), using exact inference to set Q(h) = P(h | v, θ).
- M-step: minimize F(Q, P) w.r.t. Q(θ), i.e., maximize the expected complete-data log-likelihood (plus the log parameter prior, for MAP).
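As a concrete instance (my example, not from the slides), EM for a 1D mixture of Gaussians; the E-step computes Q(h) exactly, and the M-step maximizes the expected complete-data log-likelihood:

```python
import numpy as np

def em_gmm(X, K, n_iters=50, seed=0):
    """EM for a 1D Gaussian mixture: E-step computes Q(h) = P(h | x, theta);
    M-step re-estimates theta from expected sufficient statistics."""
    rng = np.random.default_rng(seed)
    N = len(X)
    mu = rng.choice(X, K, replace=False)       # init means from the data
    var = np.full(K, X.var())
    pi = np.full(K, 1.0 / K)
    for _ in range(n_iters):
        # E-step: responsibilities r[n, k] = Q(h_n = k)
        logp = (np.log(pi) - 0.5 * np.log(2 * np.pi * var)
                - 0.5 * (X[:, None] - mu) ** 2 / var)
        logp -= logp.max(axis=1, keepdims=True)  # numerical stability
        r = np.exp(logp)
        r /= r.sum(axis=1, keepdims=True)
        # M-step: maximize the expected complete-data log-likelihood
        Nk = r.sum(axis=0)
        pi = Nk / N
        mu = (r * X[:, None]).sum(axis=0) / Nk
        var = (r * (X[:, None] - mu) ** 2).sum(axis=0) / Nk
    return pi, mu, var

X = np.concatenate([np.random.normal(-2, 1, 200), np.random.normal(3, 0.5, 200)])
print(em_gmm(X, K=2))
```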

EM tricks of the trade

- Generalized EM [Neal98]:
- Partial M-step: reduce F(Q, P) w.r.t. Q(θ), e.g., by a gradient method
- Partial E-step: reduce F(Q, P) w.r.t. Q(h), e.g., by approximate inference
- Avoiding local optima:
- Deterministic annealing [Rose98]
- Data resampling [Elidan02]
- Speedup tricks:
- Combine with conjugate gradient [Salakhutdinov03]
- Online/incremental updates [Bauer97, Neal98]

Variational Bayes (VB)

[Ghahramani00, Beal02]

- Use a factorized posterior over hidden variables and parameters, Q(h, θ) = Q(h) Q(θ).
- For exponential-family models with conjugate priors, this results in a generalized version of EM:
- E-step: modified inference that takes the uncertainty in the parameters into account
- M-step: optimize Q(θ) using expected sufficient statistics
- Variational Message Passing automates this, assuming a fully factorized (mean field) Q [Winn04]; see variational-Bayes.org.

Variational inference for discrete-state models with high treewidth

- We assume the parameters are fixed.
- We assume Q(h) has a simple form, so we can easily find the best Q:
- Mean field (fully factorized Q)
- Structured variational, e.g., approximating a grid MRF by a product of chains [Xing04]

[Figure: a grid MRF approximated by a fully factorized Q (mean field) and by a product of chains (structured variational)]

Variational inference for MRFs

- Probability is exp(−energy): P(x) = exp(−E(x)) / Z.
- Free energy = average energy − entropy: F(Q) = E_Q[E(x)] − H(Q).

Mean field for MRFs

- Fully factorized approximation: Q(x) = Π_i Q_i(x_i).
- Normalization constraint: Σ_{x_i} Q_i(x_i) = 1.
- Average energy: Σ_{(ij)} Σ_{x_i, x_j} Q_i(x_i) Q_j(x_j) E_ij(x_i, x_j) + Σ_i Σ_{x_i} Q_i(x_i) E_i(x_i).
- Entropy: −Σ_i Σ_{x_i} Q_i(x_i) log Q_i(x_i).
- Local minima satisfy the fixed-point equations Q_i(x_i) ∝ exp(−E_i(x_i) − Σ_{j ∈ nbr(i)} Σ_{x_j} Q_j(x_j) E_ij(x_i, x_j)), updated one node at a time (see the sketch below).
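A minimal mean-field sketch for an Ising-style grid MRF (my illustration; the coupling J and the parallel update schedule are assumptions):

```python
import numpy as np

def mean_field_ising(h, J, n_iters=50):
    """Mean field for an Ising grid: P(x) ∝ exp(sum_i h_i x_i + J sum_(ij) x_i x_j),
    x_i in {-1, +1}. Q is fully factorized; each sweep applies the local
    stationarity condition m_i = tanh(h_i + J * sum of neighbor means)."""
    m = np.tanh(h)                       # initial means E_Q[x_i]
    for _ in range(n_iters):
        # sum of neighbor means via shifted copies (4-connected grid)
        s = np.zeros_like(m)
        s[1:, :] += m[:-1, :]; s[:-1, :] += m[1:, :]
        s[:, 1:] += m[:, :-1]; s[:, :-1] += m[:, 1:]
        m = np.tanh(h + J * s)           # fixed-point update
    return m   # approximate marginals: Q(x_i = +1) = (1 + m_i) / 2

# Usage: random local fields with a positive (smoothing) coupling
rng = np.random.default_rng(1)
h = 0.5 * rng.standard_normal((16, 16))
marginals = (1 + mean_field_ising(h, J=0.4)) / 2
```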

Outline

- Introduction
- Exact inference
- Approximate inference
- Deterministic
- Variational
- Loopy belief propagation
- Expectation propagation
- Graph cuts
- Stochastic (sampling)
- Hybrid deterministic/stochastic

BP vs mean field for MRFs

- Mean field updates: each node is updated from the mean values of its neighbors, and sends the same quantity to all of them.
- BP updates: every node i sends a different message to each neighbor j.
- Empirically, BP is much better than MF (e.g., MF is not exact even for trees) [Weiss01].
- BP is (attempting to) minimize the Bethe free energy [Yedidia01]; a sum-product sketch follows.
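For concreteness, a sum-product loopy BP sketch for a pairwise MRF (my illustration; the potentials, message normalization, and parallel schedule are assumptions):

```python
import numpy as np

def loopy_bp(psi_i, psi_ij, edges, n_iters=30):
    """Sum-product loopy BP on a pairwise MRF.

    psi_i:  list of (K,) unary potentials.
    psi_ij: dict mapping an edge (i, j) -> (K, K) pairwise potential.
    m[(i, j)] is the message from i to j:
      m_ij(x_j) ∝ sum_{x_i} psi_i(x_i) psi_ij(x_i, x_j) prod_{k != j} m_ki(x_i)."""
    nbrs = {i: [] for i in range(len(psi_i))}
    for i, j in edges:
        nbrs[i].append(j); nbrs[j].append(i)
    m = {}
    for i, j in edges:                       # uniform initial messages
        m[(i, j)] = np.ones(len(psi_i[j]))
        m[(j, i)] = np.ones(len(psi_i[i]))
    for _ in range(n_iters):
        new = {}
        for (i, j) in m:
            pre = psi_i[i].copy()
            for k in nbrs[i]:
                if k != j:
                    pre *= m[(k, i)]         # absorb the other incoming messages
            pot = psi_ij[(i, j)] if (i, j) in psi_ij else psi_ij[(j, i)].T
            msg = pre @ pot                  # sum over x_i
            new[(i, j)] = msg / msg.sum()    # normalize for stability
        m = new
    # beliefs: b_i(x_i) ∝ psi_i(x_i) * prod of incoming messages
    beliefs = []
    for i in range(len(psi_i)):
        b = psi_i[i].copy()
        for k in nbrs[i]:
            b *= m[(k, i)]
        beliefs.append(b / b.sum())
    return beliefs

# Tiny usage: a 2x2 grid (a single loop), binary nodes
psi = [np.array([1.0, 2.0]), np.array([2.0, 1.0]),
       np.array([1.5, 1.0]), np.array([1.0, 1.5])]
attract = np.array([[2.0, 1.0], [1.0, 2.0]])      # favors equal neighbors
edges = [(0, 1), (1, 3), (2, 3), (0, 2)]
print(loopy_bp(psi, {e: attract for e in edges}, edges))
```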

Bethe free energy

- We assume the graph is a tree, in which case the following free energy is exact.
- Constraints on the beliefs:
- Normalization: Σ_{x_i} b_i(x_i) = 1
- Marginalization: Σ_{x_i} b_ij(x_i, x_j) = b_j(x_j)
- The average energy and entropy are written in terms of the single-node beliefs b_i and pairwise beliefs b_ij, where d_i = number of neighbors of node i.
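The Bethe free energy itself was a lost formula on this slide; its standard form [Yedidia01], in terms of the beliefs and degrees above, is:

```latex
F_{\mathrm{Bethe}} = \sum_{(ij)} \sum_{x_i, x_j} b_{ij}(x_i, x_j)
    \bigl[ E_{ij}(x_i, x_j) + \log b_{ij}(x_i, x_j) \bigr]
  \;-\; \sum_i (d_i - 1) \sum_{x_i} b_i(x_i)
    \bigl[ E_i(x_i) + \log b_i(x_i) \bigr]
```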

BP minimizes Bethe free energy

- Theorem (Yedidia, Freeman, Weiss) [Yedidia01]: fixed points of BP are stationary points of the Bethe free energy.
- BP may not converge; other algorithms can directly minimize F_Bethe, but are slower.
- If BP does not converge, it often means F_Bethe is a poor approximation.

Kikuchi free energy

- Cluster groups of nodes together.
- Define an energy per region and a free energy per region.
- Kikuchi free energy: a sum of regional free energies, weighted by counting numbers chosen so that each node and edge is counted exactly once.
- Example: a 2x3 grid with nodes 1..6.
- Bethe regions: the 7 edges {12, 23, 14, 25, 36, 45, 56} plus the single nodes 1..6; the single-node counting numbers are c_i = 1 − d_i = (−1, −2, −1, −1, −2, −1).
- Kikuchi regions: e.g., {1245, 2356}; a region graph lists the large regions and their intersections, so the intersection {25} gets counting number c_25 = 1 − (1 + 1) = −1.
- F_Kikuchi is exact if the region graph contains 2 levels (regions and intersections) and has no cycles; this is equivalent to the junction tree!

Generalized BP

- Example: a 3x3 grid with nodes 1..9; choose overlapping regions {1245, 2356, 4578, 5689}, whose pairwise intersections are {25, 45, 56, 58}.
- F_Kikuchi is no longer exact, but it is more accurate than F_Bethe.
- Generalized BP can be used to minimize F_Kikuchi.
- This method of choosing regions is called the cluster variational method [Welling04].
- In the limit (of ever larger regions), we recover the junction tree algorithm.

Outline

- Introduction
- Exact inference
- Approximate inference
- Deterministic
- Variational
- Loopy belief propagation
- Expectation propagation
- Graph cuts
- Stochastic (sampling)
- Hybrid deterministic/stochastic

Expectation Propagation (EP)

[Minka01]

- EP = iterated assumed density filtering (ADF).
- ADF: recursive Bayesian estimation interleaved with a projection step.
- Examples of ADF:
- Extended Kalman filtering
- Moment matching (weak marginalization)
- Boyen-Koller algorithm
- Some online learning algorithms

Assumed Density Filtering (ADF)

[Figure: recursive Bayesian estimation (sequential updating of the posterior), with hidden state x and observations Y1..Yn]

- If p(y_i | x) is not conjugate to p(x), then p(x | y_{1:i}) may not be tractably representable.
- So we project the posterior back to a representable family, and repeat: update, then project.
- For exponential-family approximations, the projection becomes moment matching.

Expectation Propagation

- ADF is sensitive to the order of updates.
- ADF approximates each posterior myopically.
- EP iteratively re-approximates each term.

[Diagram, after Ghahramani: the exact posterior is intractable; ADF is simple, non-iterative, and inaccurate; EP is simple, iterative, and accurate.]

Expectation Propagation

- Input: the factors f_i of the target distribution
- Initialize: the approximate factors f̃_i, and q(x) ∝ Π_i f̃_i(x)
- Repeat, for i = 0..N:
- Deletion: remove f̃_i from q to get the cavity distribution
- Projection: moment-match f_i times the cavity back into the approximating family
- Inclusion: set f̃_i to the ratio of the new q and the cavity
- Until convergence
- Output: q(x)

After Ghahramani
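The three steps were formulas in the original slide; the standard generic updates [Minka01], with F the approximating exponential family, read:

```latex
\text{Deletion:}\quad q^{\setminus i}(x) \propto \frac{q(x)}{\tilde f_i(x)}
\qquad
\text{Projection:}\quad q^{\mathrm{new}}(x) = \arg\min_{q' \in \mathcal{F}}
  D\!\left( f_i(x)\, q^{\setminus i}(x) \,\Big\|\, q'(x) \right)
\qquad
\text{Inclusion:}\quad \tilde f_i(x) \leftarrow \frac{q^{\mathrm{new}}(x)}{q^{\setminus i}(x)},
  \quad q \leftarrow q^{\mathrm{new}}
```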

BP is a special case of EP

- BP assumes q(x) is fully factorized.
- At each iteration, for each factor f_i and each node x_k, the KL projection matches moments (i.e., computes marginals by absorbing messages from neighbors).

[Figure: factor graph fragment with factors f_i, f_j connected to nodes x_k, x_n1, x_n2]

TreeEP

[Minka03]

- TreeEP assumes q(x) is represented by a tree (regardless of the true model's topology).
- We can use the junction tree algorithm to do the moment matching at each iteration.
- Faster and more accurate than loopy BP.
- Faster than, and comparably accurate to, generalized BP.

Outline

- Introduction
- Exact inference
- Approximate inference
- Deterministic
- Variational
- Loopy belief propagation
- Expectation propagation
- Graph cuts
- Stochastic (sampling)
- Hybrid deterministic/stochastic

MPE in MRFs

- MAP estimation = energy minimization.
- Simplifications:
- Only pairwise potentials (E_ijk = 0, etc.)
- Special form for the potentials
- Binary variables: x_i ∈ {0, 1}

Kinds of potential

- Metric: satisfies identity of indiscernibles, symmetry, and the triangle inequality.
- Semi-metric: satisfies the first two, but not necessarily the triangle inequality.
- Piecewise constant, e.g., the Potts model V(α, β) = λ·[α ≠ β] (a metric).
- Piecewise smooth, e.g., the truncated quadratic min(|α − β|², K) (a semi-metric) or the truncated absolute distance min(|α − β|, K) (a metric).
- Discontinuity-preserving potentials avoid oversmoothing.

GraphCuts

[Kolmogorov04]

- Theorem: we can find argmin_x E(x) for binary variables and pairwise potentials in at most O(N³) time, using a max-flow/min-cut algorithm on a suitably constructed graph, iff the potentials are submodular, i.e., E_ij(0,0) + E_ij(1,1) ≤ E_ij(0,1) + E_ij(1,0) for every edge (i, j).
- Metric potentials (e.g., Potts) are always submodular.
- Theorem: the general case (e.g., non-binary variables or non-submodular potentials) is NP-hard.

Finding strong local minimum

- For the non-binary case, we can find the optimum over a large space of moves by iteratively solving binary subproblems.
- α-expansion: any pixel can change its label to α.
- α-β swap: any α can switch to β and vice versa.

[Picture from Zabih]

Finding strong local minimum

- Start with an arbitrary assignment f
- done := false
- While not done:
- done := true
- For each label α:
- Find f* = argmin E(f') over all f' within one α-expansion of f (a binary subproblem!)
- If E(f*) < E(f), then done := false and f := f*
- (A runnable sketch of this loop follows.)
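A toy sketch of this outer loop (my illustration). Real implementations solve each binary move exactly with max-flow/min-cut; to stay self-contained, this version brute-forces the move, which is only feasible for tiny problems:

```python
import itertools
import numpy as np

def energy(x, unary, edges, V):
    """E(x) = sum_i unary[i][x_i] + sum_(i,j) V(x_i, x_j)."""
    return (sum(unary[i][x[i]] for i in range(len(x)))
            + sum(V(x[i], x[j]) for i, j in edges))

def alpha_expansion(unary, edges, n_labels):
    """Outer loop of alpha-expansion. Each move is a *binary* subproblem:
    every node either keeps its label or switches to alpha."""
    V = lambda a, b: float(a != b)       # Potts pairwise potential (a metric)
    N = len(unary)
    f = [int(np.argmin(unary[i])) for i in range(N)]   # initial labeling
    done = False
    while not done:
        done = True
        for alpha in range(n_labels):
            best, best_e = f, energy(f, unary, edges, V)
            # binary subproblem: bit b_i says "switch node i to alpha?"
            # (brute force here; max-flow in a real implementation)
            for bits in itertools.product([0, 1], repeat=N):
                g = [alpha if b else fi for fi, b in zip(f, bits)]
                e = energy(g, unary, edges, V)
                if e < best_e:
                    best, best_e = g, e
            if best != f:
                f, done = best, False    # found an improving expansion move
    return f

# Tiny usage: a 4-node chain with 3 labels
unary = [np.array([0.0, 1.0, 2.0]), np.array([2.0, 0.1, 2.0]),
         np.array([2.0, 2.0, 0.0]), np.array([0.5, 2.0, 1.0])]
print(alpha_expansion(unary, edges=[(0, 1), (1, 2), (2, 3)], n_labels=3))
```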

Properties of the 2 algorithms

- α-expansion:
- Requires V to be submodular (e.g., a metric)
- O(L) binary subproblems per cycle, for L labels
- Within a factor 2·c(V) of the optimum (c = 1 for the Potts model)
- α-β swap:
- Requires V to be only a semi-metric
- O(L²) binary subproblems per cycle
- No comparable theoretical guarantee, but works well in practice

Summary of inference methods for pairwise MRFs

- Marginals
- Mean field
- Loopy/ generalized BP (sum-product)
- EP
- Gibbs sampling
- Swendsen-Wang
- MPE/ Viterbi
- Iterative conditional modes (ICM)
- Loopy/generalized BP (max-product)
- Graph cuts
- Simulated annealing

See Boykov01, Weiss01, and Tappen03 for some empirical comparisons.

Outline

- Introduction
- Exact inference
- Approximate inference
- Deterministic
- Stochastic (sampling)
- Hybrid deterministic/stochastic

Monte Carlo (sampling) methods

- Goal: estimate expectations of the form E_P[f(X)] = Σ_x f(x) P(x), e.g., posterior marginals or means.
- Draw N independent samples x^r ~ P; the estimate is the sample average (1/N) Σ_r f(x^r).
- Accuracy is independent of the dimensionality of X.
- But it is hard to draw (independent) samples from P.

Importance Sampling

- We sample from Q(x) and reweight: w^r = P(x^r) / Q(x^r), giving the estimate Ê[f] = Σ_r w^r f(x^r) / Σ_r w^r.
- Requires Q(x) > 0 wherever P(x) > 0.

[Figure: a broad proposal Q covering the support of the target P]

Importance Sampling for BNs (likelihood weighting)

- Input: CPDs P(X_i | X_{π_i}), evidence x_E
- Output: weighted samples (x^r, w^r)
- For each sample r:
- w^r := 1
- For each node i in topological order:
- If X_i is observed, set x_i^r := x_i^E and w^r := w^r · P(X_i = x_i^E | X_{π_i} = x_{π_i}^r)
- Else sample x_i^r ~ P(X_i | x_{π_i}^r)
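A minimal runnable version on a toy 3-node chain A → B → C (my example; the CPD tables are made up):

```python
import numpy as np

# CPDs for a binary chain A -> B -> C
P_A = np.array([0.6, 0.4])                  # P(A)
P_B = np.array([[0.9, 0.1], [0.2, 0.8]])    # P(B | A)
P_C = np.array([[0.7, 0.3], [0.1, 0.9]])    # P(C | B)

def likelihood_weighting(n_samples, c_obs, rng):
    """Sample unobserved nodes in topological order; weight each sample
    by the probability of the evidence (here C = c_obs) given its parents."""
    samples, weights = [], []
    for _ in range(n_samples):
        a = rng.choice(2, p=P_A)            # sample A from its CPD
        b = rng.choice(2, p=P_B[a])         # sample B given A
        w = P_C[b, c_obs]                   # weight by the evidence likelihood
        samples.append((a, b))
        weights.append(w)
    return np.array(samples), np.array(weights)

rng = np.random.default_rng(0)
samples, w = likelihood_weighting(10000, c_obs=1, rng=rng)
# Weighted estimate of P(A = 1 | C = 1)
print((w * (samples[:, 0] == 1)).sum() / w.sum())
```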

Drawbacks of importance sampling

- We sample given upstream evidence only, and weight by downstream evidence.
- Evidence reversal (modifying the model to make all observed nodes parents) helps, but can be expensive.
- Does not scale to high-dimensional spaces, even if Q is similar to P, since the variance of the weights is too high.

Sequential importance sampling (particle filtering)

[Arulampalam02, Doucet01]

- Apply importance sampling sequentially to a (nonlinear, non-Gaussian) dynamical system.
- Resample particles with probability proportional to their weights w_t.
- Unlikely hypotheses get replaced (a bootstrap-filter sketch follows).
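A bootstrap particle filter sketch (my illustration; the nonlinear benchmark model and noise levels are assumptions, and the observations are stand-in noise rather than a simulated trajectory):

```python
import numpy as np

def bootstrap_pf(ys, n_particles, rng):
    """Bootstrap particle filter for a toy nonlinear model:
      x_t = 0.5 x_{t-1} + 25 x_{t-1} / (1 + x_{t-1}^2) + v_t,  v_t ~ N(0, 10)
      y_t = x_t^2 / 20 + w_t,                                  w_t ~ N(0, 1)
    Propose from the dynamics, weight by the likelihood, then resample."""
    x = rng.standard_normal(n_particles)
    means = []
    for y in ys:
        # propagate particles through the (nonlinear) dynamics
        x = (0.5 * x + 25 * x / (1 + x**2)
             + np.sqrt(10) * rng.standard_normal(n_particles))
        # weight by the likelihood of the observation
        w = np.exp(-0.5 * (y - x**2 / 20) ** 2)
        w /= w.sum()
        means.append(w @ x)                 # posterior-mean estimate
        # resample: unlikely hypotheses get replaced by copies of likely ones
        x = x[rng.choice(n_particles, n_particles, p=w)]
    return np.array(means)

rng = np.random.default_rng(0)
ys = rng.standard_normal(20)                # stand-in observation sequence
print(bootstrap_pf(ys, n_particles=500, rng=rng)[:5])
```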

Markov Chain Monte Carlo (MCMC)

[Neal93, Mackay98]

- Draw dependent samples x^t from a chain with transition kernel T(x' | x), such that:
- P(x) is the stationary distribution
- The chain is ergodic (every state is reachable, so the chain converges to the stationary distribution from any start)
- If T satisfies detailed balance, P(x) T(x' | x) = P(x') T(x | x'), then P is a stationary distribution of T.

Metropolis Hastings

- Sample a proposal x' ~ Q(x' | x^{t−1}).
- Accept the new state with probability min(1, [P(x') Q(x^{t−1} | x')] / [P(x^{t−1}) Q(x' | x^{t−1})]).
- This satisfies detailed balance (a random-walk sketch follows).
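A minimal random-walk Metropolis sketch (my illustration; with a symmetric Gaussian proposal the Hastings correction cancels):

```python
import numpy as np

def metropolis_hastings(log_p, x0, n_steps, step, rng):
    """Random-walk Metropolis: propose x' ~ N(x, step^2); accept with
    probability min(1, P(x') / P(x)) (the symmetric-Q special case of MH)."""
    x, lp = x0, log_p(x0)
    chain = []
    for _ in range(n_steps):
        x_new = x + step * rng.standard_normal()
        lp_new = log_p(x_new)
        if np.log(rng.random()) < lp_new - lp:   # accept/reject
            x, lp = x_new, lp_new
        chain.append(x)
    return np.array(chain)

# Target: an unnormalized mixture of two Gaussians
log_p = lambda x: np.logaddexp(-0.5 * (x + 2) ** 2, -0.5 * (x - 2) ** 2)
rng = np.random.default_rng(0)
print(metropolis_hastings(log_p, 0.0, 5000, step=1.0, rng=rng).mean())
```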

Gibbs sampling

- A Metropolis method where Q is defined in terms of the full conditionals P(X_i | X_{−i}).
- Acceptance rate = 1.
- For a graphical model, we only need to condition on the Markov blanket (see the Ising sketch below).
- See the BUGS software.
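A minimal Gibbs sampler for an Ising grid (my illustration; the field h and coupling J are assumptions), where each conditional depends only on the Markov blanket:

```python
import numpy as np

def gibbs_ising(h, J, n_sweeps, rng):
    """Gibbs sampling for an Ising grid, x_i in {-1, +1}. Each update samples
    x_i from P(x_i | Markov blanket) = sigmoid(2 (h_i + J * sum of neighbors));
    every such move is accepted (acceptance rate 1)."""
    H, W = h.shape
    x = rng.choice([-1, 1], size=(H, W))
    total = np.zeros((H, W))
    for _ in range(n_sweeps):
        for i in range(H):
            for j in range(W):
                s = sum(x[ni, nj] for ni, nj in
                        [(i-1, j), (i+1, j), (i, j-1), (i, j+1)]
                        if 0 <= ni < H and 0 <= nj < W)
                p1 = 1 / (1 + np.exp(-2 * (h[i, j] + J * s)))  # P(x_ij = +1)
                x[i, j] = 1 if rng.random() < p1 else -1
        total += x
    return total / n_sweeps     # Monte Carlo estimate of E[x_i]

rng = np.random.default_rng(0)
est = gibbs_ising(h=0.3 * rng.standard_normal((8, 8)), J=0.4, n_sweeps=200, rng=rng)
print(est.mean())
```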

Difficulties with MCMC

- May take a long time to mix (i.e., to converge to the stationary distribution).
- Hard to know when the chain has mixed.
- Simple proposals exhibit random-walk behavior. Remedies include:
- Hybrid Monte Carlo (use gradient information)
- Swendsen-Wang (large moves for the Ising model)
- Heuristic proposals

Outline

- Introduction
- Exact inference
- Approximate inference
- Deterministic
- Stochastic (sampling)
- Hybrid deterministic/stochastic

Comparison of deterministic and stochastic methods

- Deterministic: fast but inaccurate.
- Stochastic: slow but accurate; can handle arbitrary hypothesis spaces.
- Combine the best of both worlds (hybrid):
- Use smart deterministic proposals
- Integrate out some of the states and sample the rest (Rao-Blackwellization)
- Non-parametric BP (particle filtering for graphs)

Examples of deterministic proposals

- State estimation: the unscented particle filter [Merwe00]
- Machine learning: variational MCMC [deFreitas01]
- Computer vision: data-driven MCMC [Tu02]

Example of Rao-Blackwellized particle filters

- Conditioned on the discrete switching nodes, the remaining system is linear-Gaussian and can be integrated out exactly using the Kalman filter.
- Each particle contains a sampled value s_{1:t}^r together with the mean and covariance of P(X_t | y_{1:t}, s_{1:t}^r).

Outline

- Introduction
- Exact inference
- Approximate inference
- Deterministic
- Stochastic (sampling)
- Hybrid deterministic/stochastic
- Summary

Summary of inference methods

- Discrete models:
- Chain (online): BP = forwards algorithm; Boyen-Koller (ADF); beam search
- Low treewidth: VarElim; Jtree; recursive conditioning
- High treewidth: loopy BP; mean field; structured variational; EP; graph cuts; Gibbs
- Gaussian models:
- Chain (online): BP = Kalman filter
- Low treewidth: Jtree = sparse linear algebra
- High treewidth: loopy BP; Gibbs
- Other models:
- Chain (online): EKF; UKF; moment matching (ADF); particle filter
- Low treewidth: EP; EM; VB; NBP; Gibbs
- High treewidth: EP; variational EM; VB; NBP; Gibbs

BP = belief propagation, EP = expectation propagation, ADF = assumed density filtering, EKF = extended Kalman filter, UKF = unscented Kalman filter, VarElim = variable elimination, Jtree = junction tree, EM = expectation maximization, VB = variational Bayes, NBP = non-parametric BP.