Title: Exact and approximate inference in probabilistic graphical models
1. Exact and approximate inference in probabilistic graphical models
- Kevin Murphy, MIT CSAIL / UBC CS/Stats
www.ai.mit.edu/murphyk/AAAI04
AAAI 2004 tutorial
2. Outline
- Introduction
- Exact inference
- Approximate inference
3. Outline
- Introduction
- What are graphical models?
- What is inference?
- Exact inference
- Approximate inference
4. Probabilistic graphical models
Probabilistic models → graphical models:
- Directed (Bayesian networks)
- Undirected (Markov random fields, MRFs)
5. Bayesian networks
- Directed acyclic graph (DAG)
- Nodes = random variables
- Edges = direct influence (causation)
- Xi ⊥ X_ancestors | X_parents
- e.g., C ⊥ R, B, E | A
- Simplifies chain rule by using conditional independencies
(Figure: Earthquake → Radio, Earthquake → Alarm, Burglary → Alarm, Alarm → Call)
Pearl, 1988
6. Conditional probability distributions (CPDs)
- Each node specifies a distribution over its values given its parents' values, P(Xi | X_Pa(i))
- Full table needs 2^5 - 1 = 31 parameters; the BN needs 10
(Figure: Earthquake → Radio, Earthquake → Alarm, Burglary → Alarm, Alarm → Call)
Pearl, 1988
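The parameter count can be checked with a small sketch. The CPT values below are hypothetical (only the graph structure comes from the slide): the BN needs 1 + 1 + 4 + 2 + 2 = 10 parameters, versus 2^5 - 1 = 31 for a full joint table.

```python
# Hypothetical CPTs for the binary alarm network (values illustrative,
# not from the tutorial): P(E), P(B), P(A|E,B), P(R|E), P(C|A).
P_E = {True: 0.02, False: 0.98}
P_B = {True: 0.01, False: 0.99}
P_A = {(True, True): 0.95, (True, False): 0.29,
       (False, True): 0.94, (False, False): 0.001}   # key = (e, b)
P_R = {True: 0.9, False: 0.001}                      # P(R=t | e), key = e
P_C = {True: 0.8, False: 0.05}                       # P(C=t | a), key = a

def joint(e, b, a, r, c):
    """P(E,B,A,R,C) via the chain rule with conditional independencies."""
    pa = P_A[(e, b)]
    return (P_E[e] * P_B[b]
            * (pa if a else 1 - pa)
            * (P_R[e] if r else 1 - P_R[e])
            * (P_C[a] if c else 1 - P_C[a]))

# Sanity check: the 2^5 joint entries sum to 1.
total = sum(joint(e, b, a, r, c)
            for e in (True, False) for b in (True, False)
            for a in (True, False) for r in (True, False)
            for c in (True, False))
```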
7. Example BN: Hidden Markov Model (HMM)
Hidden states, e.g. words
Observations, e.g. sounds
8. CPDs for HMMs
(Figure: state transition diagram over states 1, 2, 3; chain X1 → X2 → X3 with emissions Y1, Y2, Y3)
A = state transition matrix
B = observation matrix
π = initial state distribution
Parameter tying: the same A and B are shared across all time steps
9. Markov Random Fields (MRFs)
- Undirected graph
- Xi ⊥ X_rest | X_nbrs
- Each clique c has a potential function ψc
- Hammersley-Clifford theorem: P(X) = (1/Z) ∏c ψc(Xc)
- The normalization constant (partition function) is Z = ∑X ∏c ψc(Xc)
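As a minimal illustration of the Hammersley-Clifford factorization, this sketch (potential values assumed, not from the slides) computes the partition function Z of a tiny 3-node chain MRF x1 - x2 - x3 by brute force:

```python
import itertools

# One potential per edge; this Ising-like potential favours agreement.
def psi(xi, xj):
    return 2.0 if xi == xj else 1.0

states = [0, 1]

# Partition function: sum over all joint configurations of the
# product of edge potentials.
Z = sum(psi(x1, x2) * psi(x2, x3)
        for x1, x2, x3 in itertools.product(states, repeat=3))

def prob(x1, x2, x3):
    """P(x) = (1/Z) * prod_c psi_c(x_c), per Hammersley-Clifford."""
    return psi(x1, x2) * psi(x2, x3) / Z
```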
10. Potentials for MRFs
One potential per maximal clique, e.g. ψ123, ψ34, ψ35
One potential per edge, e.g. ψ12, ψ13, ψ23, ψ34, ψ35
11. Example MRF: Ising/Potts model
(Figure: grid of hidden nodes x joined by pairwise potentials ψ, each with a local-evidence potential φ to its observation y)
Parameter tying
ψ: compatibility with neighbors
φ: local evidence (compatibility with image)
12. Conditional Random Field (CRF)
Lafferty01, Kumar03, etc.
(Figure: same grid as the Ising/Potts model, but the potentials are conditioned on the image x)
Parameter tying
φ: local evidence (compatibility with image)
ψ: compatibility with neighbors
13. Directed vs undirected models
(Figure: directed graph on X1..X5 converted to an undirected graph by moralization)
Directed: d-separation ⇒ cond. independence; parameter learning easy
Undirected: separation ⇒ cond. independence; parameter learning hard
Inference is the same!
14. Factor graphs
Kschischang01
(Figure: a pairwise Markov net, a Markov net, and a Bayes net over X1..X5, each converted to its factor graph)
Bipartite graph
15. Outline
- Introduction
- What are graphical models?
- What is inference?
- Exact inference
- Approximate inference
16. Inference (state estimation)
(Figure: Earthquake → Radio, Earthquake → Alarm, Burglary → Alarm, Alarm → Call; observed: C = t)
17. Inference
P(B=t | C=t) = 0.7
P(E=t | C=t) = 0.1
(Figure: alarm network with C = t observed)
18. Inference
P(B=t | C=t) = 0.7
P(E=t | C=t) = 0.1
(Figure: alarm network with C = t and R = t observed)
19. Inference
P(B=t | C=t) = 0.7, P(E=t | C=t) = 0.1
P(B=t | C=t, R=t) = 0.1, P(E=t | C=t, R=t) = 0.97
(Figure: alarm network with C = t and R = t observed)
20. Inference
P(B=t | C=t) = 0.7, P(E=t | C=t) = 0.1
P(B=t | C=t, R=t) = 0.1, P(E=t | C=t, R=t) = 0.97
(Figure: alarm network with C = t and R = t observed)
Explaining away effect
21. Inference
P(B=t | C=t) = 0.7, P(E=t | C=t) = 0.1
P(B=t | C=t, R=t) = 0.1, P(E=t | C=t, R=t) = 0.97
(Figure: alarm network with C = t and R = t observed)
"Probability theory is nothing but common sense reduced to calculation." -- Pierre Simon Laplace
22. Inference tasks
- Posterior probabilities of Query given Evidence
- Marginalize out Nuisance variables
- Sum-product
- Most Probable Explanation (MPE) / Viterbi
- Max-product
- Marginal Maximum A Posteriori (MAP)
- Max-sum-product
23. Causal vs diagnostic reasoning
- Sometimes easier to specify P(effect | cause) than P(cause | effect): stable mechanism
- Use Bayes' rule to invert the causal model
Diseases, H
Symptoms, v
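A one-line sketch of this inversion (the disease/symptom numbers below are hypothetical):

```python
# Causal direction is easy to specify: prior P(disease) and
# likelihoods P(symptom | disease). Bayes' rule gives the
# diagnostic direction P(disease | symptom).
p_disease = 0.01
p_symptom_given_disease = 0.9
p_symptom_given_healthy = 0.05

# P(symptom) by marginalizing over the cause
p_symptom = (p_symptom_given_disease * p_disease
             + p_symptom_given_healthy * (1 - p_disease))

# P(cause | effect) = P(effect | cause) P(cause) / P(effect)
p_disease_given_symptom = p_symptom_given_disease * p_disease / p_symptom
```

Even with a 90%-sensitive test, the posterior is only about 15% because the prior is so low.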
24. Applications of Bayesian inference
25. Decision theory
- Decision theory = probability theory + utility theory
- Bayesian networks + actions/utilities = influence/decision diagrams
- Maximize expected utility
26. Outline
- Introduction
- Exact inference
- Brute force enumeration
- Variable elimination algorithm
- Complexity of exact inference
- Belief propagation algorithm
- Junction tree algorithm
- Linear Gaussian models
- Approximate inference
27. Brute force enumeration
- We can compute any query in O(K^N) time, where K = |Xi|
- By using a BN, we can represent the joint in O(N) space
(Figure: alarm network on B, E, A, J, M)
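A brute-force sketch on a hypothetical 3-node chain BN (not the alarm network; all CPT values assumed), enumerating all K^N joint entries to answer a query:

```python
# Toy chain BN A -> B -> C with binary variables.
P_A = {1: 0.3, 0: 0.7}
P_B_given_A = {1: 0.8, 0: 0.1}   # P(B=1 | a)
P_C_given_B = {1: 0.9, 0: 0.2}   # P(C=1 | b)

def joint(a, b, c):
    """Joint from the BN factorization P(A) P(B|A) P(C|B)."""
    pb = P_B_given_A[a]
    pc = P_C_given_B[b]
    return P_A[a] * (pb if b else 1 - pb) * (pc if c else 1 - pc)

# P(A=1 | C=1): sum the joint over the nuisance variable B,
# then normalize by P(C=1). Cost is O(K^N) in general.
num = sum(joint(1, b, 1) for b in (0, 1))
den = sum(joint(a, b, 1) for a in (0, 1) for b in (0, 1))
posterior = num / den
```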
28. Brute force enumeration
Russell & Norvig
29. Enumeration tree
Russell & Norvig
30. Enumeration tree contains repeated sub-expressions
31. Variable/bucket elimination
Kschischang01, Dechter96
- Push sums inside products (generalized distributive law)
- Carry out summations right to left, storing intermediate results (factors) to avoid recomputation (dynamic programming)
32. Variable elimination
33. VarElim basic operations
- Pointwise product
- Summing out
Only multiply factors which contain the summand (lazy evaluation)
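These two primitives can be sketched on table factors as follows (binary variables for brevity; the factor representation is an assumption, not from the slides):

```python
from itertools import product

# A factor is (variables, table), with the table keyed by
# assignments (tuples of 0/1) to its variables, in order.
def pointwise_product(f, g):
    """Multiply two factors over the union of their variables."""
    fvars, ftab = f
    gvars, gtab = g
    out_vars = fvars + [v for v in gvars if v not in fvars]
    out = {}
    for assign in product([0, 1], repeat=len(out_vars)):
        a = dict(zip(out_vars, assign))
        out[assign] = (ftab[tuple(a[v] for v in fvars)]
                       * gtab[tuple(a[v] for v in gvars)])
    return (out_vars, out)

def sum_out(f, var):
    """Marginalize one variable out of a factor."""
    fvars, ftab = f
    idx = fvars.index(var)
    out_vars = [v for v in fvars if v != var]
    out = {}
    for assign, val in ftab.items():
        key = assign[:idx] + assign[idx + 1:]
        out[key] = out.get(key, 0.0) + val
    return (out_vars, out)
```

VarElim interleaves these: multiply only the factors mentioning the current variable, then sum it out.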
34. Variable elimination
Russell & Norvig
35. Outline
- Introduction
- Exact inference
- Brute force enumeration
- Variable elimination algorithm
- Complexity of exact inference
- Belief propagation algorithm
- Junction tree algorithm
- Linear Gaussian models
- Approximate inference
36. VarElim on loopy graphs
Let us work right-to-left, eliminating variables, and adding arcs to ensure that any two terms that co-occur in a factor are connected in the graph.
(Figure: graph on nodes 1-6, showing the fill-in edges added as nodes 6, 5, and 4 are eliminated)
37. Complexity of VarElim
- Time/space for a single query: O(N K^{w+1}) for N nodes of K states, where w = w(G, π) = width of the graph induced by elimination order π
- w* = argmin_π w(G, π) = treewidth of G
- Thm: finding an order to minimize treewidth is NP-complete [Yannakakis81]
- Does there exist a more efficient exact inference algorithm?
38. Exact inference is #P-complete
Dagum93
- Can reduce 3SAT to exact inference ⇒ NP-hard
- Equivalent to counting the number of satisfying assignments ⇒ #P-complete
Literals: A, B, C, D, with P(A) = P(B) = P(C) = P(D) = 0.5
Clauses: C1 = A ∨ B ∨ C, C2 = C ∨ D ∨ A, C3 = B ∨ C ∨ D
Sentence: S = C1 ∧ C2 ∧ C3
39. Summary so far
- Brute force enumeration: O(K^N) time, O(N K^C) space (where C = max clique size)
- VarElim: O(N K^{w+1}) time/space, where w = w(G, π) = induced treewidth
- Exact inference is #P-complete
- Motivates need for approximate inference
40. Treewidth
Low treewidth: chains (w = 1); trees with no loops (w = # parents)
High treewidth: n×n grids (N = n^2 nodes, w = O(n) = O(√N)) [Arnborg85]; general loopy graphs (w NP-hard to find)
41. Graph triangulation
Golumbic80
- A graph is triangulated (chordal, perfect) if it has no chordless cycles of length > 3.
- To triangulate a graph, for each node Xi in order π, ensure all neighbors of Xi form a clique by adding fill-in edges; then remove Xi.
(Figure: graph on nodes 1-6, showing the fill-in edges added as nodes 6, 5, and 4 are eliminated)
42. Graph triangulation
- A graph is triangulated (chordal) if it has no chordless cycles of length > 3
(Figure: 3×3 grid on nodes 1-9; not triangulated! It contains a chordless 6-cycle)
43. Graph triangulation
- Triangulation is not just adding triangles
(Figure: the same grid with some edges added is still not triangulated; it contains a chordless 4-cycle)
44. Graph triangulation
- Triangulation creates large cliques
(Figure: the grid, triangulated at last, at the cost of larger cliques)
45. Finding an elimination order
- The size of the induced clique depends on the elimination order.
- Since this is NP-hard to optimize, it is common to apply greedy search techniques [Kjaerulff90]: at each iteration, eliminate the node that would result in the smallest
- Number of fill-in edges: min-fill
- Resulting clique weight: min-weight (weight of a clique = product of the number of states per node in the clique)
- There are also some approximation algorithms [Amir01]
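The min-fill heuristic can be sketched as follows (graph represented as adjacency sets; `min_fill_order` is a hypothetical helper name, not from the tutorial):

```python
def min_fill_order(adj):
    """Greedy min-fill elimination ordering.

    adj: dict mapping node -> set of neighbor nodes (undirected).
    At each step, eliminate the node whose neighbors need the fewest
    fill-in edges to become a clique.
    """
    adj = {v: set(ns) for v, ns in adj.items()}  # work on a copy
    order = []
    while adj:
        def fill_cost(v):
            nbrs = list(adj[v])
            return sum(1 for i in range(len(nbrs))
                       for j in range(i + 1, len(nbrs))
                       if nbrs[j] not in adj[nbrs[i]])
        v = min(adj, key=fill_cost)
        # Connect v's neighbors (add fill-in edges), then remove v.
        for a in adj[v]:
            for b in adj[v]:
                if a != b:
                    adj[a].add(b)
        for n in adj[v]:
            adj[n].discard(v)
        order.append(v)
        del adj[v]
    return order
```

Min-weight is the same loop with a different cost function (product of neighbor state counts).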
46. Speedup tricks for VarElim
- Remove nodes that are irrelevant to the query
- Exploit special forms of P(Xi | X_Pa(i)) to sum out variables efficiently
47. Irrelevant variables
(Figure: alarm network on B, E, A, J, M)
- M is irrelevant to computing P(j | b)
- Thm: Xi is irrelevant unless Xi ∈ Ancestors({XQ} ∪ XE)
- Here, Ancestors({J} ∪ {B}) = {A, E}
- ⇒ hidden leaves (barren nodes) can always be removed
48. Irrelevant variables
(Figure: alarm network on B, E, A, J, M)
- M, B and E are irrelevant to computing P(j | a)
- All variables relevant to a query can be found in O(N) time
- Variable elimination supports query-specific optimizations
49. Structured CPDs
- Sometimes P(Xi | X_Pa(i)) has special structure, which we can exploit computationally:
- Context-specific independence (e.g. CPD = decision tree) [Boutilier96b, Zhang99]
- Causal independence (e.g. CPD = noisy-OR) [Rish98, Zhang96b]
- Determinism [Zweig98, Bartels04]
- Such non-graphical structure complicates the search for the optimal triangulation [Bartels04]
50. Outline
- Introduction
- Exact inference
- Brute force enumeration
- Variable elimination algorithm
- Complexity of exact inference
- Belief propagation algorithm
- Junction tree algorithm
- Linear Gaussian models
- Approximate inference
51. What's wrong with VarElim?
- Often we want to query all hidden nodes.
- VarElim takes O(N^2 K^{w+1}) time to compute P(Xi | x_e) for all (hidden) nodes i.
- There exist message-passing algorithms that can do this in O(N K^{w+1}) time.
- Later, we will use these to do approximate inference in O(N K^2) time, independent of w.
(Figure: HMM with hidden chain X1, X2, X3 and observations Y1, Y2, Y3)
52. Repeated variable elimination leads to redundant calculations
(Figure: HMM with hidden chain X1, X2, X3 and observations Y1, Y2, Y3)
O(N^2 K^2) time to compute all N marginals
53. Forwards-backwards algorithm
Rabiner89, etc.
The posterior P(X_t | y_{1:N}) combines three terms (use dynamic programming to compute these):
- Forwards prediction: P(X_t | y_{1:t-1})
- Backwards prediction: P(y_{t+1:N} | X_t)
- Local evidence: P(y_t | X_t)
54. Forwards algorithm (filtering)
(Figure: recursion combining the prediction P(X_t | y_{1:t-1}) with the local evidence y_t)
55. Backwards algorithm
(Figure: recursion computing P(y_{t+1:N} | X_t) from P(y_{t+2:N} | X_{t+1}) and the evidence y_{t+1})
56. Forwards-backwards algorithm
- Forwards: compute the α messages left to right
- Backwards: compute the β messages right to left
- Combine
(Figure: chain with forwards messages α and backwards messages b)
Backwards messages are independent of forwards messages
O(N K^2) time to compute all N marginals, not O(N^2 K^2)
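The whole forwards-backwards recursion fits in a few lines for a binary HMM (all parameters below are hypothetical):

```python
A = [[0.7, 0.3], [0.4, 0.6]]   # transition matrix, A[i][j] = P(x'=j | x=i)
B = [[0.9, 0.1], [0.2, 0.8]]   # observation matrix, B[i][y] = P(y | x=i)
pi = [0.5, 0.5]                # initial state distribution
obs = [0, 1, 1]

K, T = 2, len(obs)

# Forwards: alpha_t(i) = P(x_t=i, y_{1:t})
alpha = [[pi[i] * B[i][obs[0]] for i in range(K)]]
for t in range(1, T):
    alpha.append([B[i][obs[t]] * sum(alpha[-1][j] * A[j][i] for j in range(K))
                  for i in range(K)])

# Backwards: beta_t(i) = P(y_{t+1:T} | x_t=i), independent of forwards
beta = [[1.0] * K for _ in range(T)]
for t in range(T - 2, -1, -1):
    beta[t] = [sum(A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j] for j in range(K))
               for i in range(K)]

# Combine: smoothed marginals gamma_t(i) = P(x_t=i | y_{1:T})
gamma = []
for t in range(T):
    w = [alpha[t][i] * beta[t][i] for i in range(K)]
    s = sum(w)
    gamma.append([x / s for x in w])
```

Each pass is O(T K^2), giving all T marginals in one forwards and one backwards sweep.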
57. Belief propagation
Pearl88, Shafer90, Yedidia01, etc.
- The forwards-backwards algorithm can be generalized to apply to any tree-like graph (one with no loops).
- For now, we assume pairwise potentials.
58. Absorbing messages
(Figure: node X_t absorbs messages from X_{t-1}, X_{t+1}, and Y_t)
59. Sending messages
(Figure: node X_t sends messages to X_{t-1}, X_{t+1}, and Y_t)
60. Centralized protocol
Collect to root (post-order), then distribute from root (pre-order)
(Figure: tree rooted at R, with the message-passing order numbered for each pass)
Computes all N marginals in 2 passes over the graph
61. Distributed protocol
Computes all N marginals in O(N) parallel updates
62. Loopy belief propagation
- Applying BP to graphs with loops (cycles) can give the wrong answer, because it overcounts evidence
- In practice, it often works well (e.g., error-correcting codes)
(Figure: Cloudy → Sprinkler, Cloudy → Rain, Sprinkler → WetGrass, Rain → WetGrass)
63. Why Loopy BP?
- We can compute exact answers by converting a loopy graph to a junction tree and running BP (see later).
- However, the resulting Jtree has nodes with O(K^{w+1}) states, so inference takes O(N K^{w+1}) time, where w = clique size of the triangulated graph.
- We can apply BP to the original graph in O(N K^C) time, where C = clique size of the original graph.
- To apply BP to a graph with non-pairwise potentials, it is simpler to use factor graphs.
64. Factor graphs
Kschischang01
(Figure: a pairwise Markov net, a Markov net, and a Bayes net over X1..X5, each converted to its factor graph)
Bipartite graph
65. BP for factor graphs
Kschischang01
- Beliefs
- Message: variable to factor
- Message: factor to variable
(Figure: factor f(x, y, z) connected to variables x, y, z)
66. Sum-product vs max-product
- Sum-product computes marginals, using the distributive rule ∑_x a·b = a·∑_x b
- Max-product computes max-marginals, using the rule max_x a·b = a·max_x b (for a ≥ 0)
- Same algorithm on different semirings: (+, ×, 0, 1) and (max, ×, 0, 1)
Shafer90, Bistarelli97, Goodman99, Aji00
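The semiring view means the same message-update code serves both tasks, with only the "summation" operator swapped (the potentials below are hypothetical):

```python
# Two-node model: local evidence phi1 on x1, pairwise potential psi(x1, x2).
phi1 = [0.6, 0.4]                 # local evidence on x1
psi = [[0.9, 0.1], [0.2, 0.8]]    # pairwise potential psi[x1][x2]

def message(op):
    """m_{1->2}(x2) = op over x1 of phi1(x1) * psi(x1, x2)."""
    return [op(phi1[x1] * psi[x1][x2] for x1 in range(2))
            for x2 in range(2)]

sum_product_msg = message(sum)   # unnormalized marginals of x2
max_product_msg = message(max)   # max-marginals of x2
```

Swapping `sum` for `max` is the entire difference between the two algorithms.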
67. Viterbi decoding
Compute the most probable explanation (MPE) of the observed data
(Figure: HMM with hidden X1, X2, X3 and observed Y1, Y2, Y3; example utterance "Tomato")
68. Viterbi algorithm for HMMs
- Run the max-product forwards algorithm, keeping track of the most probable predecessor of each state
- Pointer traceback
- Can produce an N-best list (most probable configurations) in O(N T K^2) time [Forney73, Nilsson01]
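A sketch of the max-forwards pass with pointer traceback for a binary HMM (parameters hypothetical):

```python
A = [[0.7, 0.3], [0.4, 0.6]]   # A[i][j] = P(x'=j | x=i)
B = [[0.9, 0.1], [0.2, 0.8]]   # B[i][y] = P(y | x=i)
pi = [0.5, 0.5]
obs = [0, 0, 1]

K, T = 2, len(obs)
delta = [[pi[i] * B[i][obs[0]] for i in range(K)]]  # max-forwards values
back = []                                           # best-predecessor pointers
for t in range(1, T):
    row, ptr = [], []
    for j in range(K):
        scores = [delta[-1][i] * A[i][j] for i in range(K)]
        best = max(range(K), key=lambda i: scores[i])
        ptr.append(best)
        row.append(scores[best] * B[j][obs[t]])
    delta.append(row)
    back.append(ptr)

# Traceback from the best final state to recover the MPE path
path = [max(range(K), key=lambda j: delta[-1][j])]
for ptr in reversed(back):
    path.append(ptr[path[-1]])
path.reverse()
```

With state 0 favouring observation 0 and state 1 favouring observation 1, the decoded path for `obs = [0, 0, 1]` is `[0, 0, 1]`.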
69. Loopy Viterbi
- Use max-product to compute/approximate the max-marginals
- If there are no ties and the max-marginals are exact, then the MPE can be read off locally
- This method does not use traceback, so it can be used with distributed/loopy BP
- We can break ties, and produce the N most probable configurations, by asserting that certain assignments are disallowed and rerunning [Yanover04]
70. BP speedup tricks
- Sometimes we can reduce the time to compute a message from O(K^2) to O(K)
- If ψ(xi, xj) = exp(-(f(xi) - f(xj))^2), then sum-product takes O(K log K) time exactly (FFT) or O(K) time approximately, and max-product takes O(K) time (distance transform) [Felzenszwalb03/04, Movellan04, deFreitas04]
- For general (discrete) potentials, we can dynamically add/delete states to reduce K [Coughlan04]
- Sometimes we can speed up convergence by
- Using a better message-passing schedule (e.g., along embedded spanning trees) [Wainwright01]
- Using a multiscale method [Felzenszwalb04]
71. Outline
- Introduction
- Exact inference
- Variable elimination algorithm
- Complexity of exact inference
- Belief propagation algorithm
- Junction tree algorithm
- Linear Gaussian models
- Approximate inference
72. Junction/join/clique trees
- To perform exact inference in an arbitrary graph, convert it to a junction tree, and then perform belief propagation.
- A Jtree is a tree whose nodes are sets, and which has the Jtree property: all sets which contain any given variable form a connected graph (a variable cannot appear in 2 disjoint places)
(Figure: graph on C, S, R, W; moralize, then make the Jtree)
Maximal cliques: {C,S,R}, {S,R,W}
Separators: {C,S,R} ∩ {S,R,W} = {S,R}
73. Making a junction tree
Jensen94
(Figure: graph GM on A..F; moralize; triangulate using the order f,d,e,c,b,a to get GT; find the max cliques {a,b,c}, {b,c,e}, {b,d}, {b,e,f}; build the Jgraph with separator weights Wij = |Ci ∩ Cj|; take the max spanning tree to get the Jtree)
74. Clique potentials
- Each model clique potential gets assigned to one Jtree clique potential
- Each observed variable assigns a delta function to one Jtree clique potential
- If we observe W = w*, set E(w) = δ(w, w*); else E(w) = 1
(Figure: Jtree CSR - SR - SRW; square nodes are factors)
75. Separator potentials
Separator potentials enforce consistency between neighboring cliques on common variables.
(Figure: Jtree CSR - SR - SRW; square nodes are factors)
76. BP on a Jtree
- A Jtree is an MRF with pairwise potentials.
- Each (clique) node potential contains CPDs and local evidence.
- Each edge potential acts like a projection function.
- We do a forwards (collect) pass, then a backwards (distribute) pass.
- The result is the Hugin/Shafer-Shenoy algorithm.
(Figure: Jtree CSR - SR - SRW with the four messages numbered 1-4)
77. BP on a Jtree (collect)
Initial clique potentials contain CPDs and evidence
(Figure: Jtree CSR - SR - SRW)
78. BP on a Jtree (collect)
Message from clique to separator marginalizes the belief (projects onto the intersection): remove c
(Figure: Jtree CSR - SR - SRW)
79. BP on a Jtree (collect)
Separator potentials get their marginal belief from their parent clique.
(Figure: Jtree CSR - SR - SRW)
80. BP on a Jtree (collect)
Message from separator to clique expands the marginal: add w
(Figure: Jtree CSR - SR - SRW)
81. BP on a Jtree (collect)
Root clique has seen all the evidence
(Figure: Jtree CSR - SR - SRW)
82. BP on a Jtree (distribute)
83. BP on a Jtree (distribute)
Marginalize out w and exclude old evidence (e_c, e_r)
(Figure: Jtree CSR - SR - SRW, before and after)
84. BP on a Jtree (distribute)
Combine upstream and downstream evidence
(Figure: Jtree CSR - SR - SRW, before and after)
85. BP on a Jtree (distribute)
Add c and exclude old evidence (e_c, e_r)
(Figure: Jtree CSR - SR - SRW, before and after)
86. BP on a Jtree (distribute)
Combine upstream and downstream evidence
(Figure: Jtree CSR - SR - SRW, before and after)
87. Partial beliefs
(Figure: Jtree CSR - SR - SRW; evidence on R now added here)
- The beliefs/messages at intermediate stages (before finishing both passes) may not be meaningful, because any given clique may not have seen all the model potentials/evidence (and hence may not be normalizable).
- This can cause problems when messages may fail (e.g. sensor nets).
- One must reparameterize using the decomposable model to ensure meaningful partial beliefs [Paskin04]
88. Hugin algorithm
Hugin = BP applied to a Jtree using a serial protocol
(Figure: collect and distribute passes between cliques Ci and Cj via separator Sij; square nodes are separators)
89. Shafer-Shenoy algorithm
- SS = BP on a Jtree, but without separator nodes.
- Multiplies by all-but-one messages instead of dividing out by old beliefs.
- Uses less space but more time than Hugin. [Lepar98]
90. Other Jtree variants
- Strong Jtree: created using a constrained elimination order [Cowell99]. Useful for
- Decision (influence) diagrams
- Hybrid (conditional Gaussian) networks
- MAP (max-sum-product) inference
- Lazy Jtree [Madsen99]
- Multiplies model potentials only when needed
- Can exploit special structure in CPDs
91. Outline
- Introduction
- Exact inference
- Variable elimination algorithm
- Complexity of exact inference
- Belief propagation algorithm
- Junction tree algorithm
- Linear Gaussian models
- Approximate inference
92. Gaussian MRFs
- Absent arcs correspond to 0s in the precision matrix V^{-1} (structural zeros)
93. Gaussian Bayes nets
Shachter89
(Figure: chain 1 → 2 → 3 → 4)
If each CPD is linear-Gaussian, P(Xi | X_Pa(i)) = N(Xi; μi + ∑_{j ∈ Pa(i)} Bij Xj, σi^2), then the joint is a multivariate Gaussian.
Absent arcs correspond to 0s in the regression matrix B
94. Gaussian marginalization/conditioning
If (x1, x2) is jointly Gaussian, then the marginal of x1 and the conditional P(x1 | x2) are also Gaussian, with closed-form means and covariances.
- Hence exact inference can be done in O(N^3) time (matrix inversion).
- Methods which exploit the sparsity structure of Σ are equivalent to the junction tree algorithm and have complexity O(N w^3), where w is the treewidth. [Paskin03b]
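In the two-variable case the conditioning formulas reduce to scalars; a sketch with assumed numbers:

```python
# Joint Gaussian over (x1, x2): mean vector and covariance matrix.
mu = [1.0, 2.0]
Sigma = [[2.0, 0.8],
         [0.8, 1.0]]

def condition_on_x2(x2):
    """Mean and variance of P(x1 | x2):
    m = mu1 + S12/S22 (x2 - mu2),  v = S11 - S12^2/S22."""
    m = mu[0] + Sigma[0][1] / Sigma[1][1] * (x2 - mu[1])
    v = Sigma[0][0] - Sigma[0][1] ** 2 / Sigma[1][1]
    return m, v
```

Note the conditional variance does not depend on the observed value x2, a special property of Gaussians.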
95. Gaussian BP
Weiss01
96. Kalman filtering
Linear Dynamical System (LDS) / State Space Model (SSM)
Hidden state; noisy observations
BP on an LDS model = Kalman filtering/smoothing
See my matlab toolbox
97. Example LDS for 2D tracking
Constant velocity model
Sparse linear Gaussian system ⇒ sparse graphs
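A sketch of one predict/update cycle of the Kalman filter for a 1D constant-velocity model (all noise parameters hypothetical; pure-Python 2×2 matrices for self-containment):

```python
# State s = (position, velocity); we observe a noisy position.
dt, q, r = 1.0, 0.1, 0.5
F = [[1.0, dt], [0.0, 1.0]]   # dynamics: pos += vel * dt

def predict(mu, P):
    """Time update: mu' = F mu, P' = F P F^T + Q (Q = q*I here)."""
    mu2 = [F[0][0] * mu[0] + F[0][1] * mu[1],
           F[1][0] * mu[0] + F[1][1] * mu[1]]
    FP = [[sum(F[i][k] * P[k][j] for k in range(2)) for j in range(2)]
          for i in range(2)]
    P2 = [[sum(FP[i][k] * F[j][k] for k in range(2)) + (q if i == j else 0.0)
           for j in range(2)] for i in range(2)]
    return mu2, P2

def update(mu, P, y):
    """Measurement update with H = [1, 0] (observe position only)."""
    s = P[0][0] + r                    # scalar innovation variance
    K = [P[0][0] / s, P[1][0] / s]     # Kalman gain
    innov = y - mu[0]
    mu2 = [mu[0] + K[0] * innov, mu[1] + K[1] * innov]
    P2 = [[P[i][j] - K[i] * P[0][j] for j in range(2)] for i in range(2)]
    return mu2, P2
```

The 2D tracking model on the slide is the same idea with a 4D state (x, y, vx, vy) and a 2D observation.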
98. Non-linear, Gaussian models
(Figure: state space model with hidden states X1, X2, X3 and noisy observations Y1, Y2, Y3)
- Extended Kalman filter (EKF): linearize f/g around the current state estimate
- Unscented Kalman filter (UKF): more accurate than the EKF, and avoids the need to compute gradients
- Both EKF and UKF assume P(X_t | y_{1:t}) is unimodal
See the ReBEL matlab toolbox (OGI)
99. Non-Gaussian models
Minka02
- If P(X_t | y_{1:t}) is multi-modal, we can approximately represent it using
- Mixtures of Gaussians (assumed density filtering (ADF))
- Samples (particle filtering)
- Batch (offline) versions:
- ADF → expectation propagation (EP)
- Particle filtering → particle smoothing or Gibbs sampling