Title: Exact and approximate inference in probabilistic graphical models
1. Exact and approximate inference in probabilistic graphical models
- Kevin Murphy, MIT CSAIL / UBC CS/Stats
www.ai.mit.edu/murphyk/AAAI04
AAAI 2004 tutorial
2. Outline
- Introduction
- Exact inference
- Approximate inference
3. Outline
- Introduction
- What are graphical models?
- What is inference?
- Exact inference
- Approximate inference
4. Probabilistic graphical models
Probabilistic models → graphical models:
- Directed (Bayesian networks)
- Undirected (Markov random fields, MRFs)
5. Bayesian networks
- Directed acyclic graph (DAG)
- Nodes = random variables
- Edges = direct influence (causation)
- Xi ⊥ X_ancestors | X_parents
- e.g., C ⊥ R, B, E | A
- Simplifies chain rule by using conditional independencies
(Figure: Earthquake → Radio, Earthquake → Alarm, Burglary → Alarm, Alarm → Call)
Pearl, 1988
6. Conditional probability distributions (CPDs)
- Each node specifies a distribution over its values given its parents' values, P(Xi | X_Pa(i))
- Full table needs 2^5 - 1 = 31 parameters; the BN needs 10
(Figure: Earthquake → Radio, Earthquake → Alarm, Burglary → Alarm, Alarm → Call)
Pearl, 1988
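The parameter count can be checked with a small sketch. The CPT values below are hypothetical (only the graph structure comes from the slide): the BN needs 1 + 1 + 4 + 2 + 2 = 10 parameters, versus 2^5 - 1 = 31 for a full joint table.

```python
# Hypothetical CPTs for the binary alarm network (values illustrative,
# not from the tutorial): P(E), P(B), P(A|E,B), P(R|E), P(C|A).
P_E = {True: 0.02, False: 0.98}
P_B = {True: 0.01, False: 0.99}
P_A = {(True, True): 0.95, (True, False): 0.29,
       (False, True): 0.94, (False, False): 0.001}   # key = (e, b)
P_R = {True: 0.9, False: 0.001}                      # P(R=t | e), key = e
P_C = {True: 0.8, False: 0.05}                       # P(C=t | a), key = a

def joint(e, b, a, r, c):
    """P(E,B,A,R,C) via the chain rule with conditional independencies."""
    pa = P_A[(e, b)]
    return (P_E[e] * P_B[b]
            * (pa if a else 1 - pa)
            * (P_R[e] if r else 1 - P_R[e])
            * (P_C[a] if c else 1 - P_C[a]))

# Sanity check: the 2^5 joint entries sum to 1.
total = sum(joint(e, b, a, r, c)
            for e in (True, False) for b in (True, False)
            for a in (True, False) for r in (True, False)
            for c in (True, False))
```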
7. Example BN: Hidden Markov Model (HMM)
Hidden states, e.g. words
Observations, e.g. sounds
8. CPDs for HMMs
(Figure: state transition diagram over states 1, 2, 3; chain X1 → X2 → X3 with emissions Y1, Y2, Y3)
A = state transition matrix
B = observation matrix
π = initial state distribution
Parameter tying: the same A and B are shared across all time steps
9. Markov Random Fields (MRFs)
- Undirected graph
- Xi ⊥ X_rest | X_nbrs
- Each clique c has a potential function ψc
- Hammersley-Clifford theorem: P(X) = (1/Z) ∏c ψc(Xc)
- The normalization constant (partition function) is Z = ∑X ∏c ψc(Xc)
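As a minimal illustration of the Hammersley-Clifford factorization, this sketch (potential values assumed, not from the slides) computes the partition function Z of a tiny 3-node chain MRF x1 - x2 - x3 by brute force:

```python
import itertools

# One potential per edge; this Ising-like potential favours agreement.
def psi(xi, xj):
    return 2.0 if xi == xj else 1.0

states = [0, 1]

# Partition function: sum over all joint configurations of the
# product of edge potentials.
Z = sum(psi(x1, x2) * psi(x2, x3)
        for x1, x2, x3 in itertools.product(states, repeat=3))

def prob(x1, x2, x3):
    """P(x) = (1/Z) * prod_c psi_c(x_c), per Hammersley-Clifford."""
    return psi(x1, x2) * psi(x2, x3) / Z
```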
10. Potentials for MRFs
One potential per maximal clique, e.g. ψ123, ψ34, ψ35
One potential per edge, e.g. ψ12, ψ13, ψ23, ψ34, ψ35
11. Example MRF: Ising/Potts model
(Figure: grid of hidden nodes x joined by pairwise potentials ψ, each with a local-evidence potential φ to its observation y)
Parameter tying
ψ: compatibility with neighbors
φ: local evidence (compatibility with image)
12. Conditional Random Field (CRF)
Lafferty01, Kumar03, etc.
(Figure: same grid as the Ising/Potts model, but the potentials are conditioned on the image x)
Parameter tying
φ: local evidence (compatibility with image)
ψ: compatibility with neighbors
13. Directed vs undirected models
(Figure: directed graph on X1..X5 converted to an undirected graph by moralization)
Directed: d-separation ⇒ cond. independence; parameter learning easy
Undirected: separation ⇒ cond. independence; parameter learning hard
Inference is the same!
14. Factor graphs
Kschischang01
(Figure: a pairwise Markov net, a Markov net, and a Bayes net over X1..X5, each converted to its factor graph)
Bipartite graph
15. Outline
- Introduction
- What are graphical models?
- What is inference?
- Exact inference
- Approximate inference
16. Inference (state estimation)
(Figure: Earthquake → Radio, Earthquake → Alarm, Burglary → Alarm, Alarm → Call; observed: C = t)
17. Inference
P(B=t | C=t) = 0.7
P(E=t | C=t) = 0.1
(Figure: alarm network with C = t observed)
18. Inference
P(B=t | C=t) = 0.7
P(E=t | C=t) = 0.1
(Figure: alarm network with C = t and R = t observed)
19. Inference
P(B=t | C=t) = 0.7, P(E=t | C=t) = 0.1
P(B=t | C=t, R=t) = 0.1, P(E=t | C=t, R=t) = 0.97
(Figure: alarm network with C = t and R = t observed)
20. Inference
P(B=t | C=t) = 0.7, P(E=t | C=t) = 0.1
P(B=t | C=t, R=t) = 0.1, P(E=t | C=t, R=t) = 0.97
(Figure: alarm network with C = t and R = t observed)
Explaining away effect
21. Inference
P(B=t | C=t) = 0.7, P(E=t | C=t) = 0.1
P(B=t | C=t, R=t) = 0.1, P(E=t | C=t, R=t) = 0.97
(Figure: alarm network with C = t and R = t observed)
"Probability theory is nothing but common sense reduced to calculation." -- Pierre Simon Laplace
22. Inference tasks
- Posterior probabilities of Query given Evidence
- Marginalize out Nuisance variables
- Sum-product
- Most Probable Explanation (MPE) / Viterbi
- Max-product
- Marginal Maximum A Posteriori (MAP)
- Max-sum-product
23. Causal vs diagnostic reasoning
- Sometimes easier to specify P(effect | cause) than P(cause | effect): stable mechanism
- Use Bayes' rule to invert the causal model
Diseases, H
Symptoms, v
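A one-line sketch of this inversion (the disease/symptom numbers below are hypothetical):

```python
# Causal direction is easy to specify: prior P(disease) and
# likelihoods P(symptom | disease). Bayes' rule gives the
# diagnostic direction P(disease | symptom).
p_disease = 0.01
p_symptom_given_disease = 0.9
p_symptom_given_healthy = 0.05

# P(symptom) by marginalizing over the cause
p_symptom = (p_symptom_given_disease * p_disease
             + p_symptom_given_healthy * (1 - p_disease))

# P(cause | effect) = P(effect | cause) P(cause) / P(effect)
p_disease_given_symptom = p_symptom_given_disease * p_disease / p_symptom
```

Even with a 90%-sensitive test, the posterior is only about 15% because the prior is so low.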
24. Applications of Bayesian inference
25. Decision theory
- Decision theory = probability theory + utility theory
- Bayesian networks + actions/utilities = influence/decision diagrams
- Maximize expected utility
26. Outline
- Introduction
- Exact inference
- Brute force enumeration
- Variable elimination algorithm
- Complexity of exact inference
- Belief propagation algorithm
- Junction tree algorithm
- Linear Gaussian models
- Approximate inference
27. Brute force enumeration
- We can compute any query in O(K^N) time, where K = |Xi|
- By using a BN, we can represent the joint in O(N) space
(Figure: alarm network on B, E, A, J, M)
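A brute-force sketch on a hypothetical 3-node chain BN (not the alarm network; all CPT values assumed), enumerating all K^N joint entries to answer a query:

```python
# Toy chain BN A -> B -> C with binary variables.
P_A = {1: 0.3, 0: 0.7}
P_B_given_A = {1: 0.8, 0: 0.1}   # P(B=1 | a)
P_C_given_B = {1: 0.9, 0: 0.2}   # P(C=1 | b)

def joint(a, b, c):
    """Joint from the BN factorization P(A) P(B|A) P(C|B)."""
    pb = P_B_given_A[a]
    pc = P_C_given_B[b]
    return P_A[a] * (pb if b else 1 - pb) * (pc if c else 1 - pc)

# P(A=1 | C=1): sum the joint over the nuisance variable B,
# then normalize by P(C=1). Cost is O(K^N) in general.
num = sum(joint(1, b, 1) for b in (0, 1))
den = sum(joint(a, b, 1) for a in (0, 1) for b in (0, 1))
posterior = num / den
```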
28. Brute force enumeration
Russell & Norvig
29. Enumeration tree
Russell & Norvig
30. Enumeration tree contains repeated sub-expressions
31. Variable/bucket elimination
Kschischang01, Dechter96
- Push sums inside products (generalized distributive law)
- Carry out summations right to left, storing intermediate results (factors) to avoid recomputation (dynamic programming)
32. Variable elimination
33. VarElim basic operations
- Pointwise product
- Summing out
Only multiply factors which contain the summand (lazy evaluation)
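These two primitives can be sketched on table factors as follows (binary variables for brevity; the factor representation is an assumption, not from the slides):

```python
from itertools import product

# A factor is (variables, table), with the table keyed by
# assignments (tuples of 0/1) to its variables, in order.
def pointwise_product(f, g):
    """Multiply two factors over the union of their variables."""
    fvars, ftab = f
    gvars, gtab = g
    out_vars = fvars + [v for v in gvars if v not in fvars]
    out = {}
    for assign in product([0, 1], repeat=len(out_vars)):
        a = dict(zip(out_vars, assign))
        out[assign] = (ftab[tuple(a[v] for v in fvars)]
                       * gtab[tuple(a[v] for v in gvars)])
    return (out_vars, out)

def sum_out(f, var):
    """Marginalize one variable out of a factor."""
    fvars, ftab = f
    idx = fvars.index(var)
    out_vars = [v for v in fvars if v != var]
    out = {}
    for assign, val in ftab.items():
        key = assign[:idx] + assign[idx + 1:]
        out[key] = out.get(key, 0.0) + val
    return (out_vars, out)
```

VarElim interleaves these: multiply only the factors mentioning the current variable, then sum it out.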
34. Variable elimination
Russell & Norvig
35. Outline
- Introduction
- Exact inference
- Brute force enumeration
- Variable elimination algorithm
- Complexity of exact inference
- Belief propagation algorithm
- Junction tree algorithm
- Linear Gaussian models
- Approximate inference
36. VarElim on loopy graphs
Let us work right-to-left, eliminating variables, and adding arcs to ensure that any two terms that co-occur in a factor are connected in the graph.
(Figure: graph on nodes 1-6, showing the fill-in edges added as nodes 6, 5, and 4 are eliminated)
37. Complexity of VarElim
- Time/space for a single query: O(N K^{w+1}) for N nodes of K states, where w = w(G, π) = width of the graph induced by elimination order π
- w* = argmin_π w(G, π) = treewidth of G
- Thm: finding an order to minimize treewidth is NP-complete [Yannakakis81]
- Does there exist a more efficient exact inference algorithm?
38. Exact inference is #P-complete
Dagum93
- Can reduce 3SAT to exact inference ⇒ NP-hard
- Equivalent to counting the number of satisfying assignments ⇒ #P-complete
Literals: A, B, C, D, with P(A) = P(B) = P(C) = P(D) = 0.5
Clauses: C1 = A ∨ B ∨ C, C2 = C ∨ D ∨ A, C3 = B ∨ C ∨ D
Sentence: S = C1 ∧ C2 ∧ C3
39. Summary so far
- Brute force enumeration: O(K^N) time, O(N K^C) space (where C = max clique size)
- VarElim: O(N K^{w+1}) time/space, where w = w(G, π) = induced treewidth
- Exact inference is #P-complete
- Motivates need for approximate inference
40. Treewidth
Low treewidth: chains (w = 1); trees with no loops (w = # parents)
High treewidth: n×n grids (N = n^2 nodes, w = O(n) = O(√N)) [Arnborg85]; general loopy graphs (w NP-hard to find)
41. Graph triangulation
Golumbic80
- A graph is triangulated (chordal, perfect) if it has no chordless cycles of length > 3.
- To triangulate a graph, for each node Xi in order π, ensure all neighbors of Xi form a clique by adding fill-in edges; then remove Xi.
(Figure: graph on nodes 1-6, showing the fill-in edges added as nodes 6, 5, and 4 are eliminated)
42. Graph triangulation
- A graph is triangulated (chordal) if it has no chordless cycles of length > 3
(Figure: 3×3 grid on nodes 1-9; not triangulated! It contains a chordless 6-cycle)
43. Graph triangulation
- Triangulation is not just adding triangles
(Figure: the same grid with some edges added is still not triangulated; it contains a chordless 4-cycle)
44. Graph triangulation
- Triangulation creates large cliques
(Figure: the grid, triangulated at last, at the cost of larger cliques)
45. Finding an elimination order
- The size of the induced clique depends on the elimination order.
- Since this is NP-hard to optimize, it is common to apply greedy search techniques [Kjaerulff90]: at each iteration, eliminate the node that would result in the smallest
- Number of fill-in edges: min-fill
- Resulting clique weight: min-weight (weight of a clique = product of the number of states per node in the clique)
- There are also some approximation algorithms [Amir01]
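The min-fill heuristic can be sketched as follows (graph represented as adjacency sets; `min_fill_order` is a hypothetical helper name, not from the tutorial):

```python
def min_fill_order(adj):
    """Greedy min-fill elimination ordering.

    adj: dict mapping node -> set of neighbor nodes (undirected).
    At each step, eliminate the node whose neighbors need the fewest
    fill-in edges to become a clique.
    """
    adj = {v: set(ns) for v, ns in adj.items()}  # work on a copy
    order = []
    while adj:
        def fill_cost(v):
            nbrs = list(adj[v])
            return sum(1 for i in range(len(nbrs))
                       for j in range(i + 1, len(nbrs))
                       if nbrs[j] not in adj[nbrs[i]])
        v = min(adj, key=fill_cost)
        # Connect v's neighbors (add fill-in edges), then remove v.
        for a in adj[v]:
            for b in adj[v]:
                if a != b:
                    adj[a].add(b)
        for n in adj[v]:
            adj[n].discard(v)
        order.append(v)
        del adj[v]
    return order
```

Min-weight is the same loop with a different cost function (product of neighbor state counts).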
46. Speedup tricks for VarElim
- Remove nodes that are irrelevant to the query
- Exploit special forms of P(Xi | X_Pa(i)) to sum out variables efficiently
47. Irrelevant variables
(Figure: alarm network on B, E, A, J, M)
- M is irrelevant to computing P(j | b)
- Thm: Xi is irrelevant unless Xi ∈ Ancestors({XQ} ∪ XE)
- Here, Ancestors({J} ∪ {B}) = {A, E}
- ⇒ hidden leaves (barren nodes) can always be removed
48. Irrelevant variables
(Figure: alarm network on B, E, A, J, M)
- M, B and E are irrelevant to computing P(j | a)
- All variables relevant to a query can be found in O(N) time
- Variable elimination supports query-specific optimizations
49. Structured CPDs
- Sometimes P(Xi | X_Pa(i)) has special structure, which we can exploit computationally:
- Context-specific independence (e.g. CPD = decision tree) [Boutilier96b, Zhang99]
- Causal independence (e.g. CPD = noisy-OR) [Rish98, Zhang96b]
- Determinism [Zweig98, Bartels04]
- Such non-graphical structure complicates the search for the optimal triangulation [Bartels04]
50. Outline
- Introduction
- Exact inference
- Brute force enumeration
- Variable elimination algorithm
- Complexity of exact inference
- Belief propagation algorithm
- Junction tree algorithm
- Linear Gaussian models
- Approximate inference
51. What's wrong with VarElim?
- Often we want to query all hidden nodes.
- VarElim takes O(N^2 K^{w+1}) time to compute P(Xi | x_e) for all (hidden) nodes i.
- There exist message-passing algorithms that can do this in O(N K^{w+1}) time.
- Later, we will use these to do approximate inference in O(N K^2) time, independent of w.
(Figure: HMM with hidden chain X1, X2, X3 and observations Y1, Y2, Y3)
52. Repeated variable elimination leads to redundant calculations
(Figure: HMM with hidden chain X1, X2, X3 and observations Y1, Y2, Y3)
O(N^2 K^2) time to compute all N marginals
53. Forwards-backwards algorithm
Rabiner89, etc.
The posterior P(X_t | y_{1:N}) combines three terms (use dynamic programming to compute these):
- Forwards prediction: P(X_t | y_{1:t-1})
- Backwards prediction: P(y_{t+1:N} | X_t)
- Local evidence: P(y_t | X_t)
54. Forwards algorithm (filtering)
(Figure: recursion combining the prediction P(X_t | y_{1:t-1}) with the local evidence y_t)
55. Backwards algorithm
(Figure: recursion computing P(y_{t+1:N} | X_t) from P(y_{t+2:N} | X_{t+1}) and the evidence y_{t+1})
56. Forwards-backwards algorithm
- Forwards: compute the α messages left to right
- Backwards: compute the β messages right to left
- Combine
(Figure: chain with forwards messages α and backwards messages b)
Backwards messages are independent of forwards messages
O(N K^2) time to compute all N marginals, not O(N^2 K^2)
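The whole forwards-backwards recursion fits in a few lines for a binary HMM (all parameters below are hypothetical):

```python
A = [[0.7, 0.3], [0.4, 0.6]]   # transition matrix, A[i][j] = P(x'=j | x=i)
B = [[0.9, 0.1], [0.2, 0.8]]   # observation matrix, B[i][y] = P(y | x=i)
pi = [0.5, 0.5]                # initial state distribution
obs = [0, 1, 1]

K, T = 2, len(obs)

# Forwards: alpha_t(i) = P(x_t=i, y_{1:t})
alpha = [[pi[i] * B[i][obs[0]] for i in range(K)]]
for t in range(1, T):
    alpha.append([B[i][obs[t]] * sum(alpha[-1][j] * A[j][i] for j in range(K))
                  for i in range(K)])

# Backwards: beta_t(i) = P(y_{t+1:T} | x_t=i), independent of forwards
beta = [[1.0] * K for _ in range(T)]
for t in range(T - 2, -1, -1):
    beta[t] = [sum(A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j] for j in range(K))
               for i in range(K)]

# Combine: smoothed marginals gamma_t(i) = P(x_t=i | y_{1:T})
gamma = []
for t in range(T):
    w = [alpha[t][i] * beta[t][i] for i in range(K)]
    s = sum(w)
    gamma.append([x / s for x in w])
```

Each pass is O(T K^2), giving all T marginals in one forwards and one backwards sweep.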
57. Belief propagation
Pearl88, Shafer90, Yedidia01, etc.
- The forwards-backwards algorithm can be generalized to apply to any tree-like graph (one with no loops).
- For now, we assume pairwise potentials.
58. Absorbing messages
(Figure: node X_t absorbs messages from X_{t-1}, X_{t+1}, and Y_t)
59. Sending messages
(Figure: node X_t sends messages to X_{t-1}, X_{t+1}, and Y_t)
60. Centralized protocol
Collect to root (post-order), then distribute from root (pre-order)
(Figure: tree rooted at R, with the message-passing order numbered for each pass)
Computes all N marginals in 2 passes over the graph
61. Distributed protocol
Computes all N marginals in O(N) parallel updates
62. Loopy belief propagation
- Applying BP to graphs with loops (cycles) can give the wrong answer, because it overcounts evidence
- In practice, it often works well (e.g., error-correcting codes)
(Figure: Cloudy → Sprinkler, Cloudy → Rain, Sprinkler → WetGrass, Rain → WetGrass)
63. Why Loopy BP?
- We can compute exact answers by converting a loopy graph to a junction tree and running BP (see later).
- However, the resulting Jtree has nodes with O(K^{w+1}) states, so inference takes O(N K^{w+1}) time, where w = clique size of the triangulated graph.
- We can apply BP to the original graph in O(N K^C) time, where C = clique size of the original graph.
- To apply BP to a graph with non-pairwise potentials, it is simpler to use factor graphs.
64. Factor graphs
Kschischang01
(Figure: a pairwise Markov net, a Markov net, and a Bayes net over X1..X5, each converted to its factor graph)
Bipartite graph
65. BP for factor graphs
Kschischang01
- Beliefs
- Message: variable to factor
- Message: factor to variable
(Figure: factor f(x, y, z) connected to variables x, y, z)
66. Sum-product vs max-product
- Sum-product computes marginals, using the distributive rule ∑_x a·b = a·∑_x b
- Max-product computes max-marginals, using the rule max_x a·b = a·max_x b (for a ≥ 0)
- Same algorithm on different semirings: (+, ×, 0, 1) and (max, ×, 0, 1)
Shafer90, Bistarelli97, Goodman99, Aji00
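The semiring view means the same message-update code serves both tasks, with only the "summation" operator swapped (the potentials below are hypothetical):

```python
# Two-node model: local evidence phi1 on x1, pairwise potential psi(x1, x2).
phi1 = [0.6, 0.4]                 # local evidence on x1
psi = [[0.9, 0.1], [0.2, 0.8]]    # pairwise potential psi[x1][x2]

def message(op):
    """m_{1->2}(x2) = op over x1 of phi1(x1) * psi(x1, x2)."""
    return [op(phi1[x1] * psi[x1][x2] for x1 in range(2))
            for x2 in range(2)]

sum_product_msg = message(sum)   # unnormalized marginals of x2
max_product_msg = message(max)   # max-marginals of x2
```

Swapping `sum` for `max` is the entire difference between the two algorithms.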
67. Viterbi decoding
Compute the most probable explanation (MPE) of the observed data
(Figure: HMM with hidden X1, X2, X3 and observed Y1, Y2, Y3; example utterance "Tomato")
68. Viterbi algorithm for HMMs
- Run the max-product forwards algorithm, keeping track of the most probable predecessor of each state
- Pointer traceback
- Can produce an N-best list (most probable configurations) in O(N T K^2) time [Forney73, Nilsson01]
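A sketch of the max-forwards pass with pointer traceback for a binary HMM (parameters hypothetical):

```python
A = [[0.7, 0.3], [0.4, 0.6]]   # A[i][j] = P(x'=j | x=i)
B = [[0.9, 0.1], [0.2, 0.8]]   # B[i][y] = P(y | x=i)
pi = [0.5, 0.5]
obs = [0, 0, 1]

K, T = 2, len(obs)
delta = [[pi[i] * B[i][obs[0]] for i in range(K)]]  # max-forwards values
back = []                                           # best-predecessor pointers
for t in range(1, T):
    row, ptr = [], []
    for j in range(K):
        scores = [delta[-1][i] * A[i][j] for i in range(K)]
        best = max(range(K), key=lambda i: scores[i])
        ptr.append(best)
        row.append(scores[best] * B[j][obs[t]])
    delta.append(row)
    back.append(ptr)

# Traceback from the best final state to recover the MPE path
path = [max(range(K), key=lambda j: delta[-1][j])]
for ptr in reversed(back):
    path.append(ptr[path[-1]])
path.reverse()
```

With state 0 favouring observation 0 and state 1 favouring observation 1, the decoded path for `obs = [0, 0, 1]` is `[0, 0, 1]`.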
69. Loopy Viterbi
- Use max-product to compute/approximate the max-marginals
- If there are no ties and the max-marginals are exact, then the MPE can be read off locally
- This method does not use traceback, so it can be used with distributed/loopy BP
- We can break ties, and produce the N most probable configurations, by asserting that certain assignments are disallowed and rerunning [Yanover04]
70. BP speedup tricks
- Sometimes we can reduce the time to compute a message from O(K^2) to O(K)
- If ψ(xi, xj) = exp(-(f(xi) - f(xj))^2), then sum-product takes O(K log K) time exactly (FFT) or O(K) time approximately, and max-product takes O(K) time (distance transform) [Felzenszwalb03/04, Movellan04, deFreitas04]
- For general (discrete) potentials, we can dynamically add/delete states to reduce K [Coughlan04]
- Sometimes we can speed up convergence by
- Using a better message-passing schedule (e.g., along embedded spanning trees) [Wainwright01]
- Using a multiscale method [Felzenszwalb04]
71. Outline
- Introduction
- Exact inference
- Variable elimination algorithm
- Complexity of exact inference
- Belief propagation algorithm
- Junction tree algorithm
- Linear Gaussian models
- Approximate inference
72. Junction/join/clique trees
- To perform exact inference in an arbitrary graph, convert it to a junction tree, and then perform belief propagation.
- A Jtree is a tree whose nodes are sets, and which has the Jtree property: all sets which contain any given variable form a connected graph (a variable cannot appear in 2 disjoint places)
(Figure: graph on C, S, R, W; moralize, then make the Jtree)
Maximal cliques: {C,S,R}, {S,R,W}
Separators: {C,S,R} ∩ {S,R,W} = {S,R}
73. Making a junction tree
Jensen94
(Figure: graph GM on A..F; moralize; triangulate using the order f,d,e,c,b,a to get GT; find the max cliques {a,b,c}, {b,c,e}, {b,d}, {b,e,f}; build the Jgraph with separator weights Wij = |Ci ∩ Cj|; take the max spanning tree to get the Jtree)
74. Clique potentials
- Each model clique potential gets assigned to one Jtree clique potential
- Each observed variable assigns a delta function to one Jtree clique potential
- If we observe W = w*, set E(w) = δ(w, w*); else E(w) = 1
(Figure: Jtree CSR - SR - SRW; square nodes are factors)
75. Separator potentials
Separator potentials enforce consistency between neighboring cliques on common variables.
(Figure: Jtree CSR - SR - SRW; square nodes are factors)
76. BP on a Jtree
- A Jtree is an MRF with pairwise potentials.
- Each (clique) node potential contains CPDs and local evidence.
- Each edge potential acts like a projection function.
- We do a forwards (collect) pass, then a backwards (distribute) pass.
- The result is the Hugin/Shafer-Shenoy algorithm.
(Figure: Jtree CSR - SR - SRW with the four messages numbered 1-4)
77. BP on a Jtree (collect)
Initial clique potentials contain CPDs and evidence
(Figure: Jtree CSR - SR - SRW)
78. BP on a Jtree (collect)
Message from clique to separator marginalizes the belief (projects onto the intersection): remove c
(Figure: Jtree CSR - SR - SRW)
79. BP on a Jtree (collect)
Separator potentials get their marginal belief from their parent clique.
(Figure: Jtree CSR - SR - SRW)
80. BP on a Jtree (collect)
Message from separator to clique expands the marginal: add w
(Figure: Jtree CSR - SR - SRW)
81. BP on a Jtree (collect)
Root clique has seen all the evidence
(Figure: Jtree CSR - SR - SRW)
82. BP on a Jtree (distribute)
83. BP on a Jtree (distribute)
Marginalize out w and exclude old evidence (e_c, e_r)
(Figure: Jtree CSR - SR - SRW, before and after)
84. BP on a Jtree (distribute)
Combine upstream and downstream evidence
(Figure: Jtree CSR - SR - SRW, before and after)
85. BP on a Jtree (distribute)
Add c and exclude old evidence (e_c, e_r)
(Figure: Jtree CSR - SR - SRW, before and after)
86. BP on a Jtree (distribute)
Combine upstream and downstream evidence
(Figure: Jtree CSR - SR - SRW, before and after)
87. Partial beliefs
(Figure: Jtree CSR - SR - SRW; evidence on R now added here)
- The beliefs/messages at intermediate stages (before finishing both passes) may not be meaningful, because any given clique may not have seen all the model potentials/evidence (and hence may not be normalizable).
- This can cause problems when messages may fail (e.g. sensor nets).
- One must reparameterize using the decomposable model to ensure meaningful partial beliefs [Paskin04]
88. Hugin algorithm
Hugin = BP applied to a Jtree using a serial protocol
(Figure: collect and distribute passes between cliques Ci and Cj via separator Sij; square nodes are separators)
89. Shafer-Shenoy algorithm
- SS = BP on a Jtree, but without separator nodes.
- Multiplies by all-but-one messages instead of dividing out by old beliefs.
- Uses less space but more time than Hugin. [Lepar98]
90. Other Jtree variants
- Strong Jtree: created using a constrained elimination order [Cowell99]. Useful for
- Decision (influence) diagrams
- Hybrid (conditional Gaussian) networks
- MAP (max-sum-product) inference
- Lazy Jtree [Madsen99]
- Multiplies model potentials only when needed
- Can exploit special structure in CPDs
91. Outline
- Introduction
- Exact inference
- Variable elimination algorithm
- Complexity of exact inference
- Belief propagation algorithm
- Junction tree algorithm
- Linear Gaussian models
- Approximate inference
92. Gaussian MRFs
- Absent arcs correspond to 0s in the precision matrix V^{-1} (structural zeros)
93. Gaussian Bayes nets
Shachter89
(Figure: chain 1 → 2 → 3 → 4)
If each CPD is linear-Gaussian, P(Xi | X_Pa(i)) = N(Xi; μi + ∑_{j ∈ Pa(i)} Bij Xj, σi^2), then the joint is a multivariate Gaussian.
Absent arcs correspond to 0s in the regression matrix B
94. Gaussian marginalization/conditioning
If (x1, x2) is jointly Gaussian, then the marginal of x1 and the conditional P(x1 | x2) are also Gaussian, with closed-form means and covariances.
- Hence exact inference can be done in O(N^3) time (matrix inversion).
- Methods which exploit the sparsity structure of Σ are equivalent to the junction tree algorithm and have complexity O(N w^3), where w is the treewidth. [Paskin03b]
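In the two-variable case the conditioning formulas reduce to scalars; a sketch with assumed numbers:

```python
# Joint Gaussian over (x1, x2): mean vector and covariance matrix.
mu = [1.0, 2.0]
Sigma = [[2.0, 0.8],
         [0.8, 1.0]]

def condition_on_x2(x2):
    """Mean and variance of P(x1 | x2):
    m = mu1 + S12/S22 (x2 - mu2),  v = S11 - S12^2/S22."""
    m = mu[0] + Sigma[0][1] / Sigma[1][1] * (x2 - mu[1])
    v = Sigma[0][0] - Sigma[0][1] ** 2 / Sigma[1][1]
    return m, v
```

Note the conditional variance does not depend on the observed value x2, a special property of Gaussians.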
95. Gaussian BP
Weiss01
96. Kalman filtering
Linear Dynamical System (LDS) / State Space Model (SSM)
Hidden state; noisy observations
BP on an LDS model = Kalman filtering/smoothing
See my matlab toolbox
97. Example LDS for 2D tracking
Constant velocity model
Sparse linear Gaussian system ⇒ sparse graphs
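A sketch of one predict/update cycle of the Kalman filter for a 1D constant-velocity model (all noise parameters hypothetical; pure-Python 2×2 matrices for self-containment):

```python
# State s = (position, velocity); we observe a noisy position.
dt, q, r = 1.0, 0.1, 0.5
F = [[1.0, dt], [0.0, 1.0]]   # dynamics: pos += vel * dt

def predict(mu, P):
    """Time update: mu' = F mu, P' = F P F^T + Q (Q = q*I here)."""
    mu2 = [F[0][0] * mu[0] + F[0][1] * mu[1],
           F[1][0] * mu[0] + F[1][1] * mu[1]]
    FP = [[sum(F[i][k] * P[k][j] for k in range(2)) for j in range(2)]
          for i in range(2)]
    P2 = [[sum(FP[i][k] * F[j][k] for k in range(2)) + (q if i == j else 0.0)
           for j in range(2)] for i in range(2)]
    return mu2, P2

def update(mu, P, y):
    """Measurement update with H = [1, 0] (observe position only)."""
    s = P[0][0] + r                    # scalar innovation variance
    K = [P[0][0] / s, P[1][0] / s]     # Kalman gain
    innov = y - mu[0]
    mu2 = [mu[0] + K[0] * innov, mu[1] + K[1] * innov]
    P2 = [[P[i][j] - K[i] * P[0][j] for j in range(2)] for i in range(2)]
    return mu2, P2
```

The 2D tracking model on the slide is the same idea with a 4D state (x, y, vx, vy) and a 2D observation.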
98. Non-linear, Gaussian models
(Figure: state space model with hidden states X1, X2, X3 and noisy observations Y1, Y2, Y3)
- Extended Kalman filter (EKF): linearize f/g around the current state estimate
- Unscented Kalman filter (UKF): more accurate than the EKF, and avoids the need to compute gradients
- Both EKF and UKF assume P(X_t | y_{1:t}) is unimodal
See the ReBEL matlab toolbox (OGI)
99. Non-Gaussian models
Minka02
- If P(X_t | y_{1:t}) is multi-modal, we can approximately represent it using
- Mixtures of Gaussians (assumed density filtering (ADF))
- Samples (particle filtering)
- Batch (offline) versions:
- ADF → expectation propagation (EP)
- Particle filtering → particle smoothing or Gibbs sampling