Transcript and Presenter's Notes

Title: Exact and approximate inference in probabilistic graphical models


1
Exact and approximate inference in probabilistic graphical models
  • Kevin Murphy, MIT CSAIL / UBC CS & Stats

www.ai.mit.edu/murphyk/AAAI04
AAAI 2004 tutorial
2
Outline
  • Introduction
  • Exact inference
  • Approximate inference

3
Outline
  • Introduction
  • What are graphical models?
  • What is inference?
  • Exact inference
  • Approximate inference

4
Probabilistic graphical models
[Figure: taxonomy; probabilistic models include graphical models, which are either directed (Bayesian networks) or undirected (Markov random fields, MRFs)]
5
Bayesian networks
  • Directed acyclic graph (DAG)
  • Nodes = random variables
  • Edges = direct influence (causation)
  • Xi ⊥ X_ancestors | X_parents
  • e.g., C ⊥ {R, B, E} | A
  • Simplifies the chain rule by using conditional
    independencies

[Figure: alarm network; Burglary → Alarm, Earthquake → Alarm, Earthquake → Radio, Alarm → Call]
Pearl, 1988
6
Conditional probability distributions (CPDs)
  • Each node specifies a distribution over its
    values given its parents' values, P(Xi | X_Pa(i))
  • The full joint table needs 2^5 - 1 = 31 parameters; the BN needs 10

[Figure: alarm network, as on the previous slide]
Pearl, 1988
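To make the parameter counting concrete, here is a minimal Python sketch (not from the tutorial) of this network with invented CPT values; only the structure follows the slide.

```python
# Minimal sketch of the alarm network (illustrative CPT values, not from the slides).
# Structure: B -> A <- E, E -> R, A -> C.  All variables are binary.

P_B = {True: 0.01, False: 0.99}                      # P(Burglary)
P_E = {True: 0.02, False: 0.98}                      # P(Earthquake)
P_A = {(True, True): 0.95, (True, False): 0.94,      # P(Alarm=True | B, E)
       (False, True): 0.29, (False, False): 0.001}
P_R = {True: 0.9, False: 0.001}                      # P(Radio=True | E)
P_C = {True: 0.7, False: 0.05}                       # P(Call=True | A)

def bern(p, value):
    """Probability that a Bernoulli event with P(True)=p takes the given value."""
    return p if value else 1.0 - p

def joint(b, e, a, r, c):
    """Chain rule with the BN's conditional independencies:
       P(b,e,a,r,c) = P(b) P(e) P(a|b,e) P(r|e) P(c|a)."""
    return (P_B[b] * P_E[e]
            * bern(P_A[(b, e)], a)
            * bern(P_R[e], r)
            * bern(P_C[a], c))

# The BN needs 1 + 1 + 4 + 2 + 2 = 10 parameters,
# versus 2**5 - 1 = 31 for the full joint table.
print(joint(True, False, True, False, True))
```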
7
Example BN Hidden Markov Model (HMM)
Hidden states, e.g., words
Observations, e.g., sounds
8
CPDs for HMMs
[Figure: state-transition diagram over states 1, 2, 3 and the HMM graphical model X1 → X2 → X3 with Xt → Yt; parameters are tied across time slices]
  • A = state transition matrix
  • B = observation matrix
  • π = initial state distribution
9
Markov Random Fields (MRFs)
  • Undirected graph
  • Xi ⊥ X_rest | X_nbrs
  • Each clique c has a potential function ψc

Hammersley-Clifford theorem: P(X) = (1/Z) ∏_c ψc(Xc)
The normalization constant (partition function) is Z = ∑_x ∏_c ψc(xc)
10
Potentials for MRFs
One potential per maximal clique: ψ123, ψ34, ψ35
One potential per edge: ψ12, ψ13, ψ23, ψ34, ψ35
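As an illustration of this factorization and the partition function, here is a rough brute-force sketch in Python (assuming NumPy); the potentials are invented, and exhaustive enumeration is only feasible for toy models.

```python
import itertools
import numpy as np

# Toy pairwise MRF on 3 binary nodes with edges (0,1), (1,2), (0,2).
# psi[e][xi, xj] is the edge potential; values are arbitrary positive numbers.
edges = [(0, 1), (1, 2), (0, 2)]
psi = {e: np.array([[2.0, 1.0], [1.0, 3.0]]) for e in edges}

def unnormalized(x):
    """Product of edge potentials for a full assignment x (a tuple of 0/1)."""
    p = 1.0
    for (i, j) in edges:
        p *= psi[(i, j)][x[i], x[j]]
    return p

# Partition function: Z = sum over all K^N joint states (here 2^3 = 8).
Z = sum(unnormalized(x) for x in itertools.product([0, 1], repeat=3))

def prob(x):
    return unnormalized(x) / Z

print("Z =", Z)
print("P(0,1,1) =", prob((0, 1, 1)))
```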
11
Example MRF Ising/ Potts model
[Figure: grid-structured model; hidden label nodes connected in a grid with tied pairwise potentials ψ (compatibility with neighbors), each also connected to an observed pixel via a local evidence potential (compatibility with image)]
12
Conditional Random Field (CRF)
Lafferty01,Kumar03,etc
[Figure: same grid structure as the Ising/Potts model, with tied pairwise potentials (compatibility with neighbors) and local evidence potentials (compatibility with image), now defining a conditional model of the labels given the image]
13
Directed vs undirected models
[Figure: a directed graph over X1..X5 and the undirected graph obtained by moralization]
  • Directed: d-separation ⇒ conditional independence; parameter learning is easy
  • Undirected: separation ⇒ conditional independence; parameter learning is hard
  • Inference is the same!
14
Factor graphs
Kschischang01
[Figure: a Bayes net, a Markov net, and a pairwise Markov net over X1..X5, each converted to its factor graph]
Bipartite graph (variable nodes and factor nodes)
15
Outline
  • Introduction
  • What are graphical models?
  • What is inference?
  • Exact inference
  • Approximate inference

16
Inference (state estimation)
[Figure: alarm network; Call is observed, C = t]
17
Inference
P(B=t | C=t) = 0.7
P(E=t | C=t) = 0.1
[Figure: alarm network with Call observed, C = t]
18
Inference
P(B=t | C=t) = 0.7
P(E=t | C=t) = 0.1
[Figure: alarm network with Call observed, C = t, and Radio now observed, R = t]
19
Inference
P(B=t | C=t) = 0.7
P(E=t | C=t) = 0.1
P(B=t | C=t, R=t) = 0.1
P(E=t | C=t, R=t) = 0.97
[Figure: alarm network with Call and Radio observed]
20
Inference
P(B=t | C=t) = 0.7
P(E=t | C=t) = 0.1
P(B=t | C=t, R=t) = 0.1
P(E=t | C=t, R=t) = 0.97
[Figure: alarm network with Call and Radio observed]
Explaining away effect
21
Inference
P(B=t | C=t) = 0.7
P(E=t | C=t) = 0.1
P(B=t | C=t, R=t) = 0.1
P(E=t | C=t, R=t) = 0.97
[Figure: alarm network with Call and Radio observed]
"Probability theory is nothing but common sense reduced to calculation." (Pierre Simon Laplace)
22
Inference tasks
  • Posterior probabilities of Query variables given Evidence,
    marginalizing out Nuisance variables: sum-product
  • Most Probable Explanation (MPE) / Viterbi: max-product
  • Marginal Maximum A Posteriori (MAP): max-sum-product

23
Causal vs diagnostic reasoning
  • Sometimes it is easier to specify P(effect | cause) than
    P(cause | effect): the causal direction is a stable mechanism
  • Use Bayes' rule to invert the causal model

[Figure: bipartite network of hidden Diseases H causing visible Symptoms v]
24
Applications of Bayesian inference
25
Decision theory
  • Decision theory = probability theory + utility
    theory
  • Bayesian networks + actions/utilities =
    influence / decision diagrams
  • Maximize expected utility

26
Outline
  • Introduction
  • Exact inference
  • Brute force enumeration
  • Variable elimination algorithm
  • Complexity of exact inference
  • Belief propagation algorithm
  • Junction tree algorithm
  • Linear Gaussian models
  • Approximate inference

27
Brute force enumeration
  • We can compute any posterior query by summing the full joint,
    in O(K^N) time, where K = |Xi|
  • By using the BN, we can represent the joint in O(N) space

[Figure: burglary network; B, E → A → J, M]
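A sketch of the brute-force computation in Python, using the standard textbook CPT values for this network (which may differ from the values used in the tutorial):

```python
import itertools

# Burglary network with the usual textbook CPT values (illustrative).
P_B = 0.001
P_E = 0.002
P_A = {(True, True): 0.95, (True, False): 0.94,
       (False, True): 0.29, (False, False): 0.001}   # P(A=True | B, E)
P_J = {True: 0.90, False: 0.05}                      # P(J=True | A)
P_M = {True: 0.70, False: 0.01}                      # P(M=True | A)

def bern(p, v):
    return p if v else 1.0 - p

def joint(b, e, a, j, m):
    return (bern(P_B, b) * bern(P_E, e) * bern(P_A[(b, e)], a)
            * bern(P_J[a], j) * bern(P_M[a], m))

def query_B(j, m):
    """P(B | J=j, M=m) by summing the joint over the hidden variables E and A:
       this is the O(K^N) computation the slide refers to."""
    scores = {}
    for b in (True, False):
        scores[b] = sum(joint(b, e, a, j, m)
                        for e, a in itertools.product((True, False), repeat=2))
    total = sum(scores.values())
    return {b: s / total for b, s in scores.items()}

print(query_B(j=True, m=True))   # P(B=True | j, m) is roughly 0.284 with these numbers
```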
28
Brute force enumeration
Russell Norvig
29
Enumeration tree
Russell Norvig
30
Enumeration tree contains repeated sub-expressions
31
Variable/bucket elimination
Kschischang01,Dechter96
  • Push sums inside products (generalized
    distributive law)
  • Carry out summations right to left, storing
    intermediate results (factors) to avoid
    recomputation (dynamic programming)

32
Variable elimination
33
VarElim basic operations
  • Pointwise product
  • Summing out

Only multiply factors which contain summand (lazy
evaluation)
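The two VarElim operations named above (pointwise product and summing out) can be sketched with a simple dictionary-based factor representation; this is my own illustration, not the tutorial's code.

```python
import itertools

# A factor is (variables, table): variables is a tuple of names, and
# table maps each full assignment (a tuple of values, here 0/1) to a number.

def make_factor(variables, fn):
    table = {vals: fn(*vals)
             for vals in itertools.product([0, 1], repeat=len(variables))}
    return (tuple(variables), table)

def product(f, g):
    """Pointwise product: the result is over the union of the two scopes."""
    fvars, ftab = f
    gvars, gtab = g
    out_vars = tuple(dict.fromkeys(fvars + gvars))       # union, order preserved
    out_tab = {}
    for vals in itertools.product([0, 1], repeat=len(out_vars)):
        assign = dict(zip(out_vars, vals))
        out_tab[vals] = (ftab[tuple(assign[v] for v in fvars)]
                         * gtab[tuple(assign[v] for v in gvars)])
    return (out_vars, out_tab)

def sum_out(f, var):
    """Marginalize var out of factor f."""
    fvars, ftab = f
    idx = fvars.index(var)
    out_vars = fvars[:idx] + fvars[idx + 1:]
    out_tab = {}
    for vals, p in ftab.items():
        key = vals[:idx] + vals[idx + 1:]
        out_tab[key] = out_tab.get(key, 0.0) + p
    return (out_vars, out_tab)

# Example: eliminate B from psi(A,B) * psi(B,C), producing a new factor over (A, C).
f1 = make_factor(("A", "B"), lambda a, b: 1.0 + a + b)
f2 = make_factor(("B", "C"), lambda b, c: 2.0 if b == c else 0.5)
print(sum_out(product(f1, f2), "B"))
```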
34
Variable elimination
Russell Norvig
35
Outline
  • Introduction
  • Exact inference
  • Brute force enumeration
  • Variable elimination algorithm
  • Complexity of exact inference
  • Belief propagation algorithm
  • Junction tree algorithm
  • Linear Gaussian models
  • Approximate inference

36
VarElim on loopy graphs
Let us work right-to-left, eliminating variables, and adding arcs to ensure that any two terms that co-occur in a factor are connected in the graph.

[Figure: a 6-node loopy graph; successive panels show the fill-in arcs added as nodes are eliminated]
37
Complexity of VarElim
  • Time/space for a single query: O(N K^(w+1)) for N
    nodes of K states, where w = w(G, π) = width of the
    graph induced by elimination order π
  • w* = min_π w(G, π) = treewidth of G
  • Thm: finding an order that minimizes the treewidth is
    NP-complete
  • Does there exist a more efficient exact inference
    algorithm?

Yannakakis81
38
Exact inference is #P-complete
Dagum93
  • Can reduce 3SAT to exact inference ⇒ NP-hard
  • Equivalent to counting the number of satisfying
    assignments ⇒ #P-complete

[Figure: a Bayes net encoding a 3SAT instance; literal nodes A, B, C, D with P(A) = P(B) = P(C) = P(D) = 0.5, clause nodes C1 = A ∨ B ∨ C, C2 = C ∨ D ∨ A, C3 = B ∨ C ∨ D, and sentence node S = C1 ∧ C2 ∧ C3]
39
Summary so far
  • Brute force enumeration: O(K^N) time, O(N K^C) space
    (where C = max clique size)
  • VarElim: O(N K^(w+1)) time/space
  • w = w(G, π) = induced treewidth
  • Exact inference is #P-complete
  • Motivates the need for approximate inference

40
Treewidth
Low treewidth:
  • Chains: w = 1
  • Trees (no loops): w = number of parents
High treewidth:
  • n x n grid (N = n^2 nodes): w = O(n) = O(√N)
  • Loopy graphs: w is NP-hard to find (Arnborg85)
41
Graph triangulation
Golumbic80
  • A graph is triangulated (chordal, perfect) if it
    has no chordless cycles of length > 3.
  • To triangulate a graph, for each node Xi in order
    π, ensure all neighbors of Xi form a clique by
    adding fill-in edges; then remove Xi

[Figure: the 6-node loopy graph again; eliminating nodes in order adds fill-in edges, producing a triangulated graph]
42
Graph triangulation
  • A graph is triangulated (chordal) if it has no
    chordless cycles of length > 3

[Figure: a 3x3 grid over nodes 1-9 with some added edges; not triangulated, since it contains a chordless 6-cycle]
43
Graph triangulation
  • Triangulation is not just adding triangles

[Figure: the grid with some triangles added is still not triangulated; it contains a chordless 4-cycle]
44
Graph triangulation
  • Triangulation creates large cliques

[Figure: the grid triangulated at last, at the cost of larger cliques]
45
Finding an elimination order
  • The size of the induced cliques depends on the
    elimination order.
  • Since this is NP-hard to optimize, it is common
    to apply greedy search techniques (Kjaerulff90).
  • At each iteration, eliminate the node that would
    result in the smallest:
  • number of fill-in edges (min-fill), or
  • resulting clique weight (min-weight), where the weight of
    a clique is the product of the number of states per node in the
    clique
  • There are also approximation algorithms

Amir01
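A rough Python sketch of the greedy min-fill heuristic (my own illustration; the example graph's edges are assumed, since the slide's figure is not fully recoverable):

```python
import itertools

def min_fill_order(adj):
    """Greedy min-fill: repeatedly eliminate the node whose elimination would
    add the fewest fill-in edges.  adj maps node -> set of neighbours and is
    modified as nodes are eliminated (simulating the induced graph)."""
    adj = {v: set(nbrs) for v, nbrs in adj.items()}   # work on a copy
    order, width = [], 0
    while adj:
        def fill_cost(v):
            nbrs = adj[v]
            return sum(1 for a, b in itertools.combinations(nbrs, 2)
                       if b not in adj[a])
        v = min(adj, key=fill_cost)
        nbrs = adj[v]
        width = max(width, len(nbrs))                 # induced clique size - 1
        for a, b in itertools.combinations(nbrs, 2):  # add fill-in edges
            adj[a].add(b)
            adj[b].add(a)
        for n in nbrs:                                # remove v from the graph
            adj[n].discard(v)
        del adj[v]
        order.append(v)
    return order, width

# A 6-node loopy graph (edges are illustrative).
graph = {1: {2, 3}, 2: {1, 4}, 3: {1, 5}, 4: {2, 6}, 5: {3, 6}, 6: {4, 5}}
print(min_fill_order(graph))   # elimination order and induced width
```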
46
Speedup tricks for VarElim
  • Remove nodes that are irrelevant to the query
  • Exploit special forms of P(Xi | X_Pa(i)) to sum out
    variables efficiently

47
Irrelevant variables
[Figure: burglary network; B, E → A → J, M]
  • M is irrelevant to computing P(j | b)
  • Thm: Xi is irrelevant unless Xi ∈ Ancestors({XQ} ∪ XE)
  • Here, Ancestors({J} ∪ {B}) = {A, E}
  • ⇒ hidden leaves (barren nodes) can always be
    removed
48
Irrelevant variables
[Figure: burglary network; B, E → A → J, M]
  • M, B and E are irrelevant to computing P(j | a)
  • All variables relevant to a query can be found in
    O(N) time
  • Variable elimination supports query-specific
    optimizations (see the sketch below)
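A minimal sketch of the ancestor-based pruning rule (my own illustration, representing the network as a parents dictionary):

```python
def relevant_nodes(parents, query, evidence):
    """Nodes possibly relevant to P(query | evidence) under the ancestor rule:
    everything in query/evidence plus its ancestors; hidden barren leaves drop out."""
    keep = set(query) | set(evidence)
    frontier = list(keep)
    while frontier:
        node = frontier.pop()
        for p in parents.get(node, []):
            if p not in keep:
                keep.add(p)
                frontier.append(p)
    return keep

# Burglary network: B, E -> A; A -> J; A -> M.
parents = {"A": ["B", "E"], "J": ["A"], "M": ["A"], "B": [], "E": []}
print(relevant_nodes(parents, query=["J"], evidence=["B"]))   # M is pruned
```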

49
Structured CPDs
  • Sometimes P(Xi | X_Pa(i)) has special structure, which
    we can exploit computationally
  • Context-specific independence (e.g., CPD = decision
    tree)
  • Causal independence (e.g., CPD = noisy-OR)
  • Determinism
  • Such non-graphical structure complicates the
    search for the optimal triangulation

Boutilier96b,Zhang99
Rish98,Zhang96b
Zweig98,Bartels04
Bartels04
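For example, a noisy-OR CPD (one common form of causal independence) needs only one parameter per parent; a minimal sketch with invented activation probabilities:

```python
def noisy_or(q, parent_states, leak=0.0):
    """P(child = 1 | parents), where q[i] is the probability that parent i
    alone turns the child on, and leak covers causes outside the model."""
    p_off = 1.0 - leak
    for qi, on in zip(q, parent_states):
        if on:
            p_off *= 1.0 - qi
    return 1.0 - p_off

# Three parents with illustrative activation probabilities.
print(noisy_or(q=[0.9, 0.8, 0.5], parent_states=[1, 0, 1], leak=0.01))
```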
50
Outline
  • Introduction
  • Exact inference
  • Brute force enumeration
  • Variable elimination algorithm
  • Complexity of exact inference
  • Belief propagation algorithm
  • Junction tree algorithm
  • Linear Gaussian models
  • Approximate inference

51
What's wrong with VarElim?
  • Often we want to query all hidden nodes.
  • VarElim takes O(N^2 K^(w+1)) time to compute P(Xi | x_e)
    for all (hidden) nodes i.
  • There exist message passing algorithms that can
    do this in O(N K^(w+1)) time.
  • Later, we will use these to do approximate
    inference in O(N K^2) time, independent of w.

[Figure: HMM X1 → X2 → X3 with observations Y1, Y2, Y3]
52
Repeated variable elimination leads to redundant
calculations
[Figure: HMM X1 → X2 → X3 with observations Y1, Y2, Y3]
O(N^2 K^2) time to compute all N marginals
53
Forwards-backwards algorithm
Rabiner89,etc
[Figure: P(Xt | y_1:N) is proportional to the product of a forwards prediction term P(Xt | y_1:t-1), a local evidence term P(yt | Xt), and a backwards prediction term P(y_t+1:N | Xt); dynamic programming is used to compute these]
54
Forwards algorithm (filtering)
[Figure: the forwards recursion for the filtered distribution P(Xt | y_1:t)]
55
Backwards algorithm
[Figure: the backwards recursion for P(y_t+1:N | Xt)]
56
Forwards-backwards algorithm
  • Forwards: compute the α messages
  • Backwards: compute the β messages
  • Combine them to get each marginal

Backwards messages are independent of forwards messages.
O(N K^2) time to compute all N marginals, not O(N^2 K^2)
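A compact sketch of forwards-backwards for a discrete HMM (assuming NumPy; the parameters A, B, π are illustrative):

```python
import numpy as np

def forwards_backwards(pi, A, B, obs):
    """Smoothed marginals P(X_t | y_1:T) for a discrete HMM.
    pi: initial distribution (K,), A: transition matrix (K,K),
    B: observation matrix (K, num_symbols), obs: list of observed symbols."""
    T, K = len(obs), len(pi)
    alpha = np.zeros((T, K))
    beta = np.ones((T, K))
    alpha[0] = pi * B[:, obs[0]]
    alpha[0] /= alpha[0].sum()                      # normalize for stability
    for t in range(1, T):                           # forwards pass
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
        alpha[t] /= alpha[t].sum()
    for t in range(T - 2, -1, -1):                  # backwards pass
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
        beta[t] /= beta[t].sum()
    gamma = alpha * beta                            # combine
    return gamma / gamma.sum(axis=1, keepdims=True)

# Tiny 2-state example with illustrative parameters.
pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.9, 0.1], [0.2, 0.8]])
print(forwards_backwards(pi, A, B, obs=[0, 0, 1, 0]))
```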
57
Belief propagation
Pearl88,Shafer90,Yedidia01,etc
  • Forwards-backwards algorithm can be generalized
    to apply to any tree-like graph (ones with no
    loops).
  • For now, we assume pairwise potentials.

58
Absorbing messages
[Figure: node Xt absorbs messages from its neighbors Xt-1, Xt+1, and Yt]
59
Sending messages
[Figure: node Xt sends a message to one neighbor by combining the messages from its other neighbors]
60
Centralized protocol
Collect to root (post-order)
Distribute from root (pre-order)
[Figure: a tree with root R; messages are collected to the root in post-order, then distributed from the root in pre-order]
Computes all N marginals in 2 passes over graph
61
Distributed protocol
Computes all N marginals in O(N) parallel updates
62
Loopy belief propagation
  • Applying BP to graphs with loops (cycles) can
    give the wrong answer, because it overcounts
    evidence
  • In practice, often works well (e.g., error
    correcting codes)

[Figure: loopy graph Cloudy → Sprinkler, Cloudy → Rain, Sprinkler → WetGrass, Rain → WetGrass]
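A rough sketch of synchronous sum-product BP on a pairwise model (my own illustration, assuming NumPy); on a tree it is exact, on a loopy graph it is only an approximation:

```python
import numpy as np

def loopy_bp(local, pairwise, n_iters=50):
    """Sum-product BP with pairwise potentials.
    local[i]: local evidence vector for node i (length K).
    pairwise[(i, j)]: K x K potential for edge i-j (stored once per edge).
    Returns approximate marginals (exact if the graph is a tree)."""
    edges = list(pairwise)
    msgs = {}                                   # messages in both directions
    for (i, j) in edges:
        K = len(local[i])
        msgs[(i, j)] = np.ones(K) / K
        msgs[(j, i)] = np.ones(K) / K
    nbrs = {i: set() for i in local}
    for (i, j) in edges:
        nbrs[i].add(j)
        nbrs[j].add(i)
    for _ in range(n_iters):
        new = {}
        for (i, j) in msgs:                     # message i -> j
            psi = pairwise[(i, j)] if (i, j) in pairwise else pairwise[(j, i)].T
            prod = local[i].copy()
            for k in nbrs[i] - {j}:
                prod = prod * msgs[(k, i)]
            m = prod @ psi                      # sum over x_i
            new[(i, j)] = m / m.sum()
        msgs = new
    beliefs = {}
    for i in local:
        b = local[i].copy()
        for k in nbrs[i]:
            b = b * msgs[(k, i)]
        beliefs[i] = b / b.sum()
    return beliefs

# A single loop (4-cycle) of binary nodes with attractive potentials.
psi = np.array([[2.0, 1.0], [1.0, 2.0]])
local = {i: np.array([0.5, 0.5]) for i in range(4)}
local[0] = np.array([0.9, 0.1])                 # evidence pulling node 0 towards state 0
pairwise = {(0, 1): psi, (1, 2): psi, (2, 3): psi, (3, 0): psi}
print(loopy_bp(local, pairwise))
```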
63
Why Loopy BP?
  • We can compute exact answers by converting a
    loopy graph to a junction tree and running BP
    (see later).
  • However, the resulting jtree has nodes with
    O(K^(w+1)) states, so inference takes O(N K^(w+1)) time,
    where w = clique size of the triangulated graph.
  • We can apply BP to the original graph in O(N K^C)
    time, where C = clique size of the original graph.
  • To apply BP to a graph with non-pairwise
    potentials, it is simpler to use factor graphs.

64
Factor graphs
Kschischang01
[Figure: a Bayes net, a Markov net, and a pairwise Markov net over X1..X5, each converted to its factor graph]
Bipartite graph (variable nodes and factor nodes)
65
BP for factor graphs
Kschischang01
  • Beliefs: product of the incoming factor-to-variable messages
  • Message, variable to factor: product of the messages from the
    variable's other factors
  • Message, factor to variable: sum over the factor's other variables
    of the factor times their incoming messages

[Figure: a variable node x connected to several factor nodes, and a factor f(x, y, z) connected to variable nodes y and z]
66
Sum-product vs max-product
  • Sum-product computes marginals using this rule
  • Max-product computes max-marginals using the rule
  • Same algorithm on different semirings: (+, x, 0, 1)
    and (max, x, 0, 1)

Shafer90,Bistarelli97,Goodman99,Aji00
67
Viterbi decoding
Compute most probable explanation (MPE) of
observed data
Hidden Markov Model (HMM)
[Figure: HMM with hidden states X1, X2, X3 and observations Y1, Y2, Y3, e.g., decoding the word "tomato" from sounds]
68
Viterbi algorithm for HMMs
  • Run the max-product forwards algorithm, keeping track of the most
    probable predecessor of each state
  • Pointer traceback
  • Can produce an N-best list (the N most probable
    configurations) in O(N T K^2) time

Forney73,Nilsson01
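A sketch of the Viterbi recursion with pointer traceback (assuming NumPy; the HMM parameters are illustrative):

```python
import numpy as np

def viterbi(pi, A, B, obs):
    """Most probable state sequence for a discrete HMM (log domain)."""
    T, K = len(obs), len(pi)
    logA, logB, logpi = np.log(A), np.log(B), np.log(pi)
    delta = np.zeros((T, K))          # best log-prob of any path ending in each state
    back = np.zeros((T, K), dtype=int)
    delta[0] = logpi + logB[:, obs[0]]
    for t in range(1, T):             # max-product forwards pass
        scores = delta[t - 1][:, None] + logA   # scores[i, j]: come from i, go to j
        back[t] = scores.argmax(axis=0)         # most probable predecessor
        delta[t] = scores.max(axis=0) + logB[:, obs[t]]
    path = [int(delta[T - 1].argmax())]         # pointer traceback
    for t in range(T - 1, 0, -1):
        path.append(int(back[t][path[-1]]))
    return path[::-1]

pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.9, 0.1], [0.2, 0.8]])
print(viterbi(pi, A, B, obs=[0, 0, 1, 1, 0]))
```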
69
Loopy Viterbi
  • Use max-product to compute/approximate the max-marginals
  • If there are no ties and the max-marginals are
    exact, then picking the argmax of each max-marginal recovers the MPE
  • This method does not use traceback, so it can be
    used with distributed/loopy BP
  • We can break ties, and produce the N most-probable
    configurations, by asserting that certain
    assignments are disallowed and rerunning
Yanover04
70
BP speedup tricks
  • Sometimes we can reduce the time to compute a
    message from O(K^2) to O(K)
  • If ψ(xi, xj) = exp(-(f(xi) - f(xj))^2), then
  • sum-product takes O(K log K) time exactly (FFT), or
    O(K) time approximately
  • max-product takes O(K) time (distance transform)
  • For general (discrete) potentials, we can
    dynamically add/delete states to reduce K
  • Sometimes we can speed up convergence by
  • using a better message-passing schedule (e.g.,
    along embedded spanning trees)
  • using a multiscale method

Felzenszwalb03/04,Movellan04,deFreitas04
Coughlan04
Wainwright01
Felzenszwalb04
71
Outline
  • Introduction
  • Exact inference
  • Variable elimination algorithm
  • Complexity of exact inference
  • Belief propagation algorithm
  • Junction tree algorithm
  • Linear Gaussian models
  • Approximate inference

72
Junction/ join/ clique trees
  • To perform exact inference in an arbitrary graph,
    convert it to a junction tree, and then perform
    belief propagation.
  • A jtree is a tree whose nodes are sets, and which
    has the jtree property: all sets which contain
    any given variable form a connected subgraph
    (a variable cannot appear in 2 disjoint places)

[Figure: the sprinkler network C → S, C → R, S → W, R → W is moralized and converted to a jtree with clique nodes CSR and SRW joined by separator SR]
Maximal cliques: {C,S,R}, {S,R,W}
Separator: {C,S,R} ∩ {S,R,W} = {S,R}
73
Making a junction tree
[Figure: pipeline for building a junction tree from a Bayes net G over nodes A..F (Jensen94)]
  • Moralize G to get GM
  • Triangulate (elimination order f, d, e, c, b, a) to get GT
  • Find the maximal cliques: {a,b,c}, {b,c,e}, {b,e,f}, {b,d}
  • Build the junction graph, weighting each edge by Wij = |Ci ∩ Cj|
  • Take a maximum-weight spanning tree to get the jtree
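A rough Python sketch of this pipeline (my own illustration): triangulate along the given elimination order, collect the maximal cliques, and join them with a maximum-weight spanning tree over separator sizes. The moralized graph below is chosen so that it reproduces the cliques listed on the slide; the slide's exact graph is not fully recoverable from the transcript.

```python
import itertools

def triangulate_cliques(adj, order):
    """Eliminate nodes in the given order, adding fill-in edges, and return
    the maximal elimination cliques (a node plus its current neighbours)."""
    adj = {v: set(n) for v, n in adj.items()}
    cliques = []
    for v in order:
        cliques.append(frozenset({v} | adj[v]))
        for a, b in itertools.combinations(adj[v], 2):   # fill-in edges
            adj[a].add(b)
            adj[b].add(a)
        for n in adj[v]:
            adj[n].discard(v)
        del adj[v]
    return [c for c in cliques if not any(c < d for d in cliques)]

def junction_tree(cliques):
    """Max-weight spanning tree of the clique graph, weight = separator size
    (Kruskal's algorithm with a tiny union-find)."""
    parent = {c: c for c in cliques}
    def find(c):
        while parent[c] != c:
            parent[c] = parent[parent[c]]
            c = parent[c]
        return c
    candidates = [(len(ci & cj), ci, cj)
                  for ci, cj in itertools.combinations(cliques, 2)]
    candidates.sort(key=lambda e: e[0], reverse=True)
    tree = []
    for w, ci, cj in candidates:
        if w > 0 and find(ci) != find(cj):
            parent[find(ci)] = find(cj)
            tree.append((set(ci), set(cj), w))
    return tree

# An assumed moralized graph over a..f, triangulated with order f, d, e, c, b, a.
adj = {"a": {"b", "c"}, "b": {"a", "c", "d", "e", "f"}, "c": {"a", "b", "e"},
       "d": {"b"}, "e": {"b", "c", "f"}, "f": {"b", "e"}}
cliques = triangulate_cliques(adj, ["f", "d", "e", "c", "b", "a"])
print(cliques)               # cliques such as {a,b,c}, {b,c,e}, {b,e,f}, {b,d}
print(junction_tree(cliques))
```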
74
Clique potentials
[Figure: jtree CSR - SR - SRW with attached factor nodes]
Each model clique potential gets assigned to one jtree clique potential.
Each observed variable assigns a delta function to one jtree clique potential:
if we observe W = w*, set E(w) = δ(w, w*), else E(w) = 1.
Square nodes are factors.
75
Separator potentials
[Figure: jtree CSR - SR - SRW]
Separator potentials enforce consistency between neighboring cliques on common variables.
Square nodes are factors.
76
BP on a Jtree
  • A jtree is an MRF with pairwise potentials.
  • Each (clique) node potential contains CPDs and
    local evidence.
  • Each edge potential acts like a projection
    function.
  • We do a forwards (collect) pass, then a backwards
    (distribute) pass.
  • The result is the Hugin / Shafer-Shenoy algorithm.

[Figure: jtree CSR - SR - SRW, with messages numbered 1-2 for the collect pass and 3-4 for the distribute pass]
77
BP on a Jtree (collect)
[Figure: jtree CSR - SR - SRW]
Initial clique potentials contain CPDs and evidence.
78
BP on a Jtree (collect)
[Figure: jtree CSR - SR - SRW]
Message from clique to separator marginalizes the belief (projects onto the intersection): remove c.
79
BP on a Jtree (collect)
[Figure: jtree CSR - SR - SRW]
Separator potentials get the marginal belief from their parent clique.
80
BP on a Jtree (collect)
[Figure: jtree CSR - SR - SRW]
Message from separator to clique expands the marginal: add w.
81
BP on a Jtree (collect)
[Figure: jtree CSR - SR - SRW]
The root clique has now seen all the evidence.
82
BP on a Jtree (distribute)
83
BP on a Jtree (distribute)
[Figure: jtree CSR - SR - SRW, downstream pass]
Marginalize out w and exclude old evidence (e_c, e_r).
84
BP on a Jtree (distribute)
[Figure: jtree CSR - SR - SRW]
Combine upstream and downstream evidence.
85
BP on a Jtree (distribute)
[Figure: jtree CSR - SR - SRW]
Add c and exclude old evidence (e_c, e_r).
86
BP on a Jtree (distribute)
[Figure: jtree CSR - SR - SRW]
Combine upstream and downstream evidence.
87
Partial beliefs
[Figure: jtree CSR - SR - SRW; evidence on R is now added here]
  • The beliefs/messages at intermediate stages
    (before finishing both passes) may not be
    meaningful, because any given clique may not have
    seen all the model potentials/evidence (and
    hence may not be normalizable).
  • This can cause problems when messages may fail
    (e.g., sensor nets).
  • One must reparameterize using the decomposable
    model to ensure meaningful partial beliefs.

Paskin04
88
Hugin algorithm
Hugin = BP applied to a jtree using a serial protocol.

[Figure: collect and distribute passes between cliques Ci and Cj via separator Sij; square nodes are separators]
89
Shafer-Shenoy algorithm
  • SS = BP on a jtree, but without separator nodes.
  • Multiplies by all-but-one messages instead of
    dividing out by old beliefs.
  • Uses less space but more time than Hugin.

Lepar98
90
Other Jtree variants
  • A strong jtree is created using a constrained
    elimination order. Useful for:
  • Decision (influence) diagrams
  • Hybrid (conditional Gaussian) networks
  • MAP (max-sum-product) inference
  • Lazy Jtree
  • Multiplies model potentials only when needed
  • Can exploit special structure in CPDs

Cowell99
Madsen99
91
Outline
  • Introduction
  • Exact inference
  • Variable elimination algorithm
  • Complexity of exact inference
  • Belief propagation algorithm
  • Junction tree algorithm
  • Linear Gaussian models
  • Approximate inference

92
Gaussian MRFs
  • Absent arcs correspond to 0s in the precision
    matrix V^-1 (structural zeros)

93
Gaussian Bayes nets
Shachter89
[Figure: a small Gaussian Bayes net over nodes 1-4]
If each CPD has the linear Gaussian form P(Xi | X_Pa(i)) = N(Xi; mu_i + sum_j B_ij x_j, sigma_i^2),
then the joint is a multivariate Gaussian N(mu, Sigma),
where the precision Sigma^-1 is determined by B and the variances sigma_i^2.
Absent arcs correspond to 0s in the regression
matrix B.
94
Gaussian marginalization/ conditioning
If (x1, x2) is jointly Gaussian with mean (mu1, mu2) and covariance blocks S11, S12, S21, S22,
then the marginal is x1 ~ N(mu1, S11),
and the conditional is x1 | x2 ~ N(mu1 + S12 S22^-1 (x2 - mu2), S11 - S12 S22^-1 S21).
  • Hence exact inference can be done in O(N^3) time
    (matrix inversion).
  • Methods which exploit the sparsity structure of
    the precision matrix are equivalent to the junction tree algorithm
    and have complexity O(N w^3), where w is
    the treewidth.
Paskin03b
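The conditioning formula translates directly into a few lines of NumPy (the example covariance is arbitrary):

```python
import numpy as np

def condition_gaussian(mu, Sigma, idx1, idx2, x2):
    """Parameters of x1 | x2 for a joint Gaussian N(mu, Sigma).
    idx1, idx2: index lists for the two blocks; x2: observed values."""
    mu1, mu2 = mu[idx1], mu[idx2]
    S11 = Sigma[np.ix_(idx1, idx1)]
    S12 = Sigma[np.ix_(idx1, idx2)]
    S22 = Sigma[np.ix_(idx2, idx2)]
    K = S12 @ np.linalg.inv(S22)          # gain
    cond_mean = mu1 + K @ (x2 - mu2)
    cond_cov = S11 - K @ S12.T            # S11 - S12 S22^-1 S21
    return cond_mean, cond_cov

# Example: condition a 3D Gaussian on its last component.
mu = np.array([0.0, 1.0, 2.0])
Sigma = np.array([[2.0, 0.5, 0.3],
                  [0.5, 1.0, 0.2],
                  [0.3, 0.2, 1.5]])
print(condition_gaussian(mu, Sigma, idx1=[0, 1], idx2=[2], x2=np.array([3.0])))
```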
95
Gaussian BP
Weiss01
  • Messages
  • Beliefs

96
Kalman filtering
Linear Dynamical System (LDS) / State Space Model (SSM)
[Figure: LDS with a hidden state sequence and noisy observations]
BP on an LDS model = Kalman filtering/smoothing
See my Matlab toolbox
97
Example LDS for 2D tracking
Constant velocity model
Constant velocity model
Sparse linear Gaussian system ⇒ sparse graphs
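A sketch of the Kalman filter predict/update step for a 2D constant-velocity tracker (assuming NumPy; the noise levels are illustrative):

```python
import numpy as np

def kalman_step(mu, P, y, F, H, Q, R):
    """One predict + update step of the Kalman filter."""
    # Predict: propagate the state estimate through the linear dynamics.
    mu_pred = F @ mu
    P_pred = F @ P @ F.T + Q
    # Update: correct with the new observation y.
    S = H @ P_pred @ H.T + R                 # innovation covariance
    K = P_pred @ H.T @ np.linalg.inv(S)      # Kalman gain
    mu_new = mu_pred + K @ (y - H @ mu_pred)
    P_new = (np.eye(len(mu)) - K @ H) @ P_pred
    return mu_new, P_new

dt = 1.0
# State = (x, y, vx, vy); constant-velocity dynamics, position-only observations.
F = np.array([[1, 0, dt, 0],
              [0, 1, 0, dt],
              [0, 0, 1, 0],
              [0, 0, 0, 1]], dtype=float)
H = np.array([[1, 0, 0, 0],
              [0, 1, 0, 0]], dtype=float)
Q = 0.01 * np.eye(4)                         # process noise (illustrative)
R = 0.5 * np.eye(2)                          # observation noise (illustrative)

mu, P = np.zeros(4), np.eye(4)
for y in [np.array([1.0, 1.1]), np.array([2.1, 1.9]), np.array([2.9, 3.2])]:
    mu, P = kalman_step(mu, P, y, F, H, Q, R)
print(mu)                                    # filtered position and velocity estimate
```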
Non-linear, Gaussian models
[Figure: state space model with hidden states X1, X2, X3 and noisy observations Y1, Y2, Y3]
  • Extended Kalman filter (EKF): linearize f/g
    around the current state estimate
  • Unscented Kalman filter (UKF): more accurate
    than the EKF and avoids the need to compute gradients
  • Both EKF and UKF assume P(Xt | y_1:t) is unimodal

See Rebel matlab toolbox (OGI)
99
Non-Gaussian models
Minka02
  • If P(Xt | y_1:t) is multi-modal, we can
    approximately represent it using
  • mixtures of Gaussians (assumed density filtering
    (ADF))
  • samples (particle filtering)
  • Batch (offline) versions
  • ADF → expectation propagation (EP)
  • Particle filtering → particle smoothing or Gibbs
    sampling