Transcript and Presenter's Notes

Title: Sampling Bayesian Networks


1
Sampling Bayesian Networks
  • ICS 275b
  • 2005

2
Approximation Algorithms
  • Structural Approximations
  • Eliminate some dependencies
  • Remove edges
  • Mini-Bucket Approach
  • Search
  • Approach for optimization tasks (MPE, MAP)
  • Sampling
  • Generate random samples and compute values of
    interest from the samples, not from the original
    network

3
Algorithm Tree
4
Sampling
  • Input: Bayesian network with a set of nodes X
  • Sample a tuple with assigned values
  • s = (X1 = x1, X2 = x2, ..., Xk = xk)
  • The tuple may include all variables (except evidence)
    or a subset
  • Sampling schemas dictate how to generate samples
    (tuples)
  • Ideally, samples are distributed according to
    P(X|E)

5
Sampling
  • Idea: generate a set of T samples
  • Estimate P(Xi|E) from the samples
  • Need to know
  • How to generate a new sample?
  • How many samples T do we need?
  • How to estimate P(Xi|E)?

6
Sampling Algorithms
  • Forward Sampling
  • Likelihood Weighting
  • Gibbs Sampling (MCMC)
  • Blocking
  • Rao-Blackwellised
  • Importance Sampling
  • Sequential Monte-Carlo (Particle Filtering) in
    Dynamic Bayesian Networks

7
Forward Sampling
  • Forward Sampling
  • Case with No evidence
  • Case with Evidence
  • N and Error Bounds

8
Forward Sampling, No Evidence (Henrion, 1988)
  • Input: Bayesian network
  • X = {X1, ..., XN}, N nodes, T samples
  • Output: T samples
  • Process nodes in topological order: first
    process the ancestors of a node, then the node
    itself
  • For t = 1 to T
  • For i = 1 to N
  • Xi ← sample xi^t from P(Xi | pai)
  • (A Python sketch of this loop follows below.)
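Below is a minimal Python sketch of this forward-sampling loop. The network representation (a `cpts` dict mapping each variable to its parent list and CPT rows, a `domains` dict, and a topological `order`) is an assumption for illustration, not part of the slides.

```python
import random

def forward_sample(order, cpts, domains):
    """Draw one sample by processing nodes in topological order."""
    sample = {}
    for x in order:
        parents, table = cpts[x]
        probs = table[tuple(sample[p] for p in parents)]  # P(x | pa(x))
        r, cum = random.random(), 0.0
        for value, p in zip(domains[x], probs):
            cum += p
            if r < cum:
                sample[x] = value
                break
        else:                      # guard against floating-point round-off
            sample[x] = domains[x][-1]
    return sample

# Example: a two-node network A -> B with binary domains.
domains = {"A": [0, 1], "B": [0, 1]}
cpts = {"A": ([], {(): [0.3, 0.7]}),
        "B": (["A"], {(0,): [0.9, 0.1], (1,): [0.2, 0.8]})}
samples = [forward_sample(["A", "B"], cpts, domains) for _ in range(1000)]
```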

9
Sampling a Value
  • What does it mean to sample xi^t from P(Xi | pai)?
  • Assume D(Xi) = {0, 1}
  • Assume P(Xi | pai) = (0.3, 0.7)
  • Draw a random number r from [0, 1]
  • If r falls in [0, 0.3), set Xi = 0
  • If r falls in [0.3, 1], set Xi = 1
  • (See the sketch below.)
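A minimal sketch of this single sampling step for the binary example above (D(Xi) = {0, 1}, P(Xi | pai) = (0.3, 0.7)); the function name is hypothetical.

```python
import random

def sample_value(probs=(0.3, 0.7)):
    """Sample a value of Xi with D(Xi) = {0, 1} and P(Xi | pai) = probs."""
    r = random.random()              # r drawn uniformly from [0, 1)
    return 0 if r < probs[0] else 1  # [0, 0.3) -> Xi = 0, [0.3, 1) -> Xi = 1
```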

10
Sampling a Value
  • When we sample xi^t from P(Xi | pai),
  • most of the time we will pick the most likely
    value of Xi
  • occasionally we will pick an unlikely value of Xi
  • We want to find high-probability tuples
  • But!
  • Choosing an unlikely value allows us to cross
    low-probability tuples to reach high-probability
    tuples!

11
Forward sampling (example)
12
Forward Sampling: Answering Queries
  • Task: given n samples S1, S2, ..., Sn,
  • estimate P(Xi = xi)

Basically, count the proportion of samples in which
Xi = xi (see the sketch below).
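A one-function sketch of this counting estimator, assuming samples in the dict format produced by the forward-sampling sketch above.

```python
def estimate(samples, var, value):
    """Fraction of samples in which `var` takes `value`."""
    return sum(1 for s in samples if s[var] == value) / len(samples)

# e.g. estimate(samples, "B", 1) approximates P(B = 1) for the tiny example above.
```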
13
Forward Sampling w/ Evidence
  • Input: Bayesian network
  • X = {X1, ..., XN}, N nodes
  • E = evidence, T = number of samples
  • Output: T samples consistent with E
  • For t = 1 to T
  • For i = 1 to N
  • Xi ← sample xi^t from P(Xi | pai)
  • If Xi ∈ E and the sampled value differs from
    the observed value, reject the sample:
  • set i = 1 and go to step 2 (restart the sample)
  • (A rejection-sampling sketch follows below.)
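A minimal sketch of forward sampling with evidence, reusing the hypothetical `forward_sample` helper from the earlier sketch. For simplicity it rejects a completed sample that disagrees with the evidence, which yields the same accepted samples as the slide's early rejection, just without the early exit.

```python
def rejection_sample(order, cpts, domains, evidence, T):
    """Return T forward samples that agree with the evidence dict."""
    accepted = []
    while len(accepted) < T:
        s = forward_sample(order, cpts, domains)
        # Keep the sample only if every evidence variable got its observed value;
        # otherwise discard it and start a new sample.
        if all(s[e] == v for e, v in evidence.items()):
            accepted.append(s)
    return accepted
```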

14
Forward Sampling Illustration
Let Y be a subset of evidence nodes s.t. Y = u
15
Forward Sampling: How Many Samples?
  • Theorem: Let the estimate of P(y) obtained from a
    randomly chosen sample set S with T samples be
    P_S(y). Then, to guarantee relative error at most
    ε with probability at least 1-δ, it is enough to
    have

Derived from Chebyshev's bound.
16
Forward Sampling: How Many Samples?
  • Theorem: Let the estimate of P(y) obtained from a
    randomly chosen sample set S with T samples be
    P_S(y). Then, to guarantee relative error at most
    ε with probability at least 1-δ, it is enough to
    have

Derived from Hoeffding's bound (the full proof is
given in Koller).
17
Forward Sampling: Performance
  • Advantages
  • P(xi | pa(xi)) is readily available
  • Samples are independent!
  • Drawbacks
  • If evidence E is rare (P(e) is low), then we will
    reject most of the samples!
  • Since P(y) in the bound on T is unknown, we must
    estimate P(y) from the samples themselves!
  • If P(e) is small, T becomes very large!

18
Problem: Evidence
  • Forward Sampling
  • High Rejection Rate
  • Fix evidence values
  • Gibbs sampling (MCMC)
  • Likelihood Weighting
  • Importance Sampling

19
Forward Sampling Bibliography
  • [Henrion88] M. Henrion, "Propagating uncertainty
    in Bayesian networks by probabilistic logic
    sampling," Uncertainty in AI, pp. 149-163, 1988.

20
Likelihood Weighting (Fung and Chang, 1990;
Shachter and Peot, 1990)
"Clamping" evidence + forward sampling + weighing
samples by the evidence likelihood (see the sketch
below).
Works well for likely evidence!
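A minimal sketch of likelihood weighting under the same assumed `(order, cpts, domains)` network format as the earlier forward-sampling sketch: evidence nodes are clamped, the rest are sampled forward, and each sample is weighted by the likelihood of its evidence values.

```python
import random

def likelihood_weighting(order, cpts, domains, evidence, T):
    """Return T (sample, weight) pairs with evidence clamped."""
    weighted = []
    for _ in range(T):
        sample, w = {}, 1.0
        for x in order:
            parents, table = cpts[x]
            probs = table[tuple(sample[p] for p in parents)]
            if x in evidence:
                sample[x] = evidence[x]                    # clamp the evidence
                w *= probs[domains[x].index(evidence[x])]  # weigh by its likelihood
            else:
                r, cum = random.random(), 0.0
                for value, p in zip(domains[x], probs):
                    cum += p
                    if r < cum:
                        sample[x] = value
                        break
                else:
                    sample[x] = domains[x][-1]
        weighted.append((sample, w))
    return weighted

def lw_estimate(weighted, var, value):
    """Weighted fraction of samples with var = value."""
    num = sum(w for s, w in weighted if s[var] == value)
    return num / sum(w for _, w in weighted)
```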
21
Likelihood Weighting
22
Likelihood Weighting
where
23
Likelihood Weighting Convergence (Chebyshev's Inequality)
  • Assume P(X=x|e) has mean μ and variance σ²
  • Chebyshev:

P(x|e) is unknown ⇒ obtain it from the samples!
24
Error Bound Derivation
K is a Bernoulli random variable
25
Likelihood Weighting Convergence (2)
  • Assume P(X=x|e) has mean μ and variance σ²
  • Zero-One Estimation Theory (Karp et al., 1989)

P(x|e) is unknown ⇒ obtain it from the samples!
26
Local Variance Bound (LVB) (Dagum & Luby, 1994)
  • Let δ denote the LVB of a binary-valued network

27
LVB Estimate (Pradhan & Dagum, 1996)
  • Using the LVB, the zero-one estimator can be
    rewritten

28
Importance Sampling: Idea
  • In general, it is hard to sample from the target
    distribution P(X|E)
  • Generate samples from a sampling (proposal)
    distribution Q(X)
  • Weigh each sample against P(X|E) (see the
    sketch below)
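A minimal sketch of the general self-normalised importance-sampling estimate; `sample_Q`, `q_prob`, `joint_p`, and `indicator` are placeholder functions (assumptions) standing for the proposal sampler, the proposal density Q(x), the unnormalised target P(x, e), and the query indicator.

```python
def importance_estimate(sample_Q, q_prob, joint_p, indicator, T):
    """Self-normalised importance-sampling estimate of a posterior query."""
    num = den = 0.0
    for _ in range(T):
        x = sample_Q()              # draw x from the proposal Q
        w = joint_p(x) / q_prob(x)  # importance weight P(x, e) / Q(x)
        num += w * indicator(x)     # e.g. 1 if Xi = xi in this sample
        den += w
    return num / den
```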

29
Importance Sampling Variants
  • Importance sampling: forward, non-adaptive
  • Nodes sampled in topological order
  • Sampling distribution (for non-instantiated
    nodes) equal to the prior conditionals
  • Importance sampling: forward, adaptive
  • Nodes sampled in topological order
  • Sampling distribution adapted according to
    average importance weights obtained in previous
    samples (Cheng & Druzdzel, 2000)

30
AIS-BN
  • The most efficient variant of importance sampling
    to date is AIS-BN (Adaptive Importance Sampling
    for Bayesian Networks).
  • Jian Cheng and Marek J. Druzdzel. AIS-BN: An
    adaptive importance sampling algorithm for
    evidential reasoning in large Bayesian networks.
    Journal of Artificial Intelligence Research
    (JAIR), 13:155-188, 2000.

31
Gibbs Sampling
  • Markov Chain Monte Carlo method
  • (Gelfand and Smith, 1990; Smith and Roberts,
    1993; Tierney, 1994)
  • Samples are dependent and form a Markov chain
  • Samples directly from P(X|e)
  • Guaranteed to converge when all probabilities are > 0
  • Methods to improve convergence
  • Blocking
  • Rao-Blackwellised
  • Error bounds
  • Lag-k autocovariance
  • Multiple chains, Chebyshev's inequality

32
MCMC Sampling Fundamentals
Given a set of variables X = {X1, X2, ..., Xn} with
joint probability distribution π(X) and some
function g(X), we can compute the expected value of
g(X): E_π[g(X)] = Σ_x g(x) π(x)
33
MCMC: Sampling From π(X)
A sample S^t is an instantiation.
Given independent, identically distributed (iid)
samples S^1, S^2, ..., S^T from π(X), it follows
from the Strong Law of Large Numbers that
(1/T) Σ_t g(S^t) → E_π[g(X)].
34
Gibbs Sampling (Pearl, 1988)
  • A sample t ∈ {1, 2, ...} is an instantiation of all
    variables in the network
  • Sampling process:
  • Fix the values of the observed variables e
  • Instantiate node values in sample x^0 at random
  • Generate samples x^1, x^2, ..., x^T from P(x|e)
  • Compute posteriors from the samples

35
Ordered Gibbs Sampler
  • Generate sample x^(t+1) from x^t
  • In short, for i = 1 to N:

Process All Variables In Some Order
36
Gibbs Sampling (cont'd) (Pearl, 1988)
Markov blanket
37
Ordered Gibbs Sampling Algorithm
  • Input: X, E
  • Output: T samples x^t
  • Fix evidence E
  • Generate samples from P(X | E)
  • For t = 1 to T (compute samples)
  • For i = 1 to N (loop through variables)
  • Xi ← sample xi^t from P(Xi | markov^t \ Xi)
  • (A sketch of this sampler follows below.)
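A minimal sketch of the ordered Gibbs sampler under the same assumed network format as the earlier sketches; `children[x]` (a list of the children of x) is an additional assumption. Each variable is resampled from its distribution given its Markov blanket, computed from the CPT of the variable and of its children.

```python
import random

def gibbs_conditional(x, state, cpts, domains, children):
    """Return P(x | Markov blanket of x) as a list over D(x)."""
    weights = []
    for value in domains[x]:
        state[x] = value
        parents, table = cpts[x]
        w = table[tuple(state[p] for p in parents)][domains[x].index(value)]
        for c in children[x]:                  # times each child's CPT entry
            cparents, ctable = cpts[c]
            w *= ctable[tuple(state[p] for p in cparents)][domains[c].index(state[c])]
        weights.append(w)
    z = sum(weights)
    return [w / z for w in weights]

def ordered_gibbs(variables, cpts, domains, children, evidence, T):
    state = dict(evidence)
    for x in variables:                        # random initial assignment x^0
        if x not in evidence:
            state[x] = random.choice(domains[x])
    samples = []
    for _ in range(T):
        for x in variables:                    # loop through the variables in order
            if x in evidence:
                continue
            probs = gibbs_conditional(x, state, cpts, domains, children)
            r, cum = random.random(), 0.0
            for value, p in zip(domains[x], probs):
                cum += p
                if r < cum:
                    state[x] = value
                    break
        samples.append(dict(state))
    return samples
```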

38
Answering Queries
  • Query: P(xi | e) = ?
  • Method 1: count the fraction of samples where Xi = xi
  • Method 2: average conditional probability (mixture
    estimator); see the sketch below
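A minimal sketch of the two estimators, assuming Gibbs samples and the hypothetical `gibbs_conditional` helper from the sketch above.

```python
def counting_estimator(samples, var, value):
    """Method 1: fraction of Gibbs samples in which var = value."""
    return sum(1 for s in samples if s[var] == value) / len(samples)

def mixture_estimator(samples, var, value, cpts, domains, children):
    """Method 2: average of P(var = value | Markov blanket) over the samples."""
    total = 0.0
    for s in samples:
        probs = gibbs_conditional(var, dict(s), cpts, domains, children)
        total += probs[domains[var].index(value)]
    return total / len(samples)
```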

39
Importance vs. Gibbs
wt
40
Gibbs Sampling Example - BN
  • X = {X1, X2, ..., X9}
  • E = {X9}

(Figure: Bayesian network over X1, ..., X9.)
41
Gibbs Sampling Example - BN
  • X1 = x1^0    X6 = x6^0
  • X2 = x2^0    X7 = x7^0
  • X3 = x3^0    X8 = x8^0
  • X4 = x4^0
  • X5 = x5^0

(Figure: Bayesian network over X1, ..., X9.)
42
Gibbs Sampling Example - BN
  • X1 ← P(X1 | x2^0, ..., x8^0, x9)
  • E = {X9}

(Figure: Bayesian network over X1, ..., X9.)
43
Gibbs Sampling Example - BN
  • X2 ← P(X2 | x1^1, x3^0, ..., x8^0, x9)
  • E = {X9}

(Figure: Bayesian network over X1, ..., X9.)
44
Gibbs Sampling Illustration
45
Gibbs Sampling Example: Initialization
  • Initialize nodes with random values:
  • X1 = x1^0    X6 = x6^0
  • X2 = x2^0    X7 = x7^0
  • X3 = x3^0    X8 = x8^0
  • X4 = x4^0
  • X5 = x5^0
  • Initialize running sums:
  • SUM1 = 0
  • SUM2 = 0
  • SUM3 = 0
  • SUM4 = 0
  • SUM5 = 0
  • SUM6 = 0
  • SUM7 = 0
  • SUM8 = 0

46
Gibbs Sampling Example: Step 1
  • Generate sample 1:
  • compute SUM1 += P(x1 | x2^0, x3^0, x4^0, x5^0, x6^0,
    x7^0, x8^0, x9)
  • select and assign a new value X1 = x1^1
  • compute SUM2 += P(x2 | x1^1, x3^0, x4^0, x5^0, x6^0,
    x7^0, x8^0, x9)
  • select and assign a new value X2 = x2^1
  • compute SUM3 += P(x3 | x1^1, x2^1, x4^0, x5^0, x6^0,
    x7^0, x8^0, x9)
  • select and assign a new value X3 = x3^1
  • ...
  • At the end, we have a new sample:
  • S1 = {x1^1, x2^1, x3^1, x4^1, x5^1, x6^1, x7^1, x8^1, x9}

47
Gibbs Sampling Example: Step 2
  • Generate sample 2:
  • compute P(x1 | x2^1, x3^1, x4^1, x5^1, x6^1, x7^1, x8^1,
    x9)
  • select and assign a new value X1 = x1^2
  • update SUM1 += P(x1 | x2^1, x3^1, x4^1, x5^1, x6^1,
    x7^1, x8^1, x9)
  • compute P(x2 | x1^2, x3^1, x4^1, x5^1, x6^1, x7^1, x8^1,
    x9)
  • select and assign a new value X2 = x2^2
  • update SUM2 += P(x2 | x1^2, x3^1, x4^1, x5^1, x6^1,
    x7^1, x8^1, x9)
  • compute P(x3 | x1^2, x2^2, x4^1, x5^1, x6^1, x7^1, x8^1,
    x9)
  • select and assign a new value X3 = x3^2
  • update SUM3 += P(x3 | x1^2, x2^2, x4^1, x5^1, x6^1,
    x7^1, x8^1, x9)
  • ...
  • New sample: S2 = {x1^2, x2^2, x3^2, x4^2, x5^2, x6^2, x7^2,
    x8^2, x9}

48
Gibbs Sampling Example: Answering Queries
  • P(x1 | x9) = SUM1 / 2
  • P(x2 | x9) = SUM2 / 2
  • P(x3 | x9) = SUM3 / 2
  • P(x4 | x9) = SUM4 / 2
  • P(x5 | x9) = SUM5 / 2
  • P(x6 | x9) = SUM6 / 2
  • P(x7 | x9) = SUM7 / 2
  • P(x8 | x9) = SUM8 / 2

49
Gibbs Sampling: Burn-In
  • We want to sample from P(X | E)
  • But the starting point is random
  • Solution: throw away the first K samples
  • Known as burn-in
  • What is K? Hard to tell. Use intuition.
  • Alternative: sample the first sample's values from
    an approximation of P(x|e) (for example, run IBP first)

50
Gibbs Sampling: Convergence
  • Converges to the stationary distribution π
  • π = π P
  • where P is the transition kernel
  • pij = P(Xi → Xj)
  • Guaranteed to converge iff the chain is
  • irreducible
  • aperiodic
  • ergodic (∀ i,j: pij > 0)

51
Irreducible
  • A Markov chain (or its probability transition
    matrix) is said to be irreducible if it is
    possible to reach every state from every other
    state (not necessarily in one step).
  • In other words, ∀ i,j ∃ k such that P^(k)_ij > 0,
    where k is the number of steps taken to get from
    state i to state j.

52
Aperiodic
  • Define d(i) = g.c.d.{n > 0 : it is possible to go
    from i to i in n steps}. Here, g.c.d. means the
    greatest common divisor of the integers in the
    set. If d(i) = 1 for all i, then the chain is aperiodic.

53
Ergodicity
  • A recurrent state is a state to which the chain
    returns with probability 1:
  • Σ_n P^(n)_ij = ∞
  • Recurrent, aperiodic states are ergodic.
  • Note: an extra condition for ergodicity is that the
    expected recurrence time is finite. This holds
    for recurrent states in a finite-state chain.

54
Gibbs Sampling: Ergodicity
  • Convergence to the correct distribution is
    guaranteed only if the chain is ergodic: the
    transition from any state Si to any state Sj
    has non-zero probability
  • Intuition: if ∃ i,j such that pij = 0, then we
    will not be able to explore the full sampling space!

55
Gibbs Convergence
  • Gibbs convergence is generally guaranteed as long
    as all probabilities are positive!
  • Intuition for the ergodicity requirement: if nodes X
    and Y are correlated s.t. X=0 ⇔ Y=0, then:
  • once we sample and assign X=0, we are forced
    to assign Y=0
  • once we sample and assign Y=0, we are forced
    to assign X=0
  • ⇒ we will never be able to change their values
    again!
  • Another problem: it can take a very long time to
    converge!

56
Gibbs Sampling: Performance
  • + Advantage: guaranteed to converge to P(X|E)
  • - Disadvantage: convergence may be slow
  • Problems:
  • Samples are dependent!
  • Statistical variance is too big in
    high-dimensional problems

57
Gibbs: Speeding Convergence
  • Objectives
  • Reduce dependence between samples
    (autocorrelation)
  • Skip samples
  • Randomize Variable Sampling Order
  • Reduce variance
  • Blocking Gibbs Sampling
  • Rao-Blackwellisation

58
Skipping Samples
  • Pick only every k-th sample (Geyer, 1992)
  • Can reduce dependence between samples!
  • Increases variance! Wastes samples!

59
Randomized Variable Order
  • Random Scan Gibbs Sampler
  • Pick the next variable Xi for update at random
    with probability pi, where Σi pi = 1.
  • (In the simplest case, the pi are uniform.)
  • In some instances, this reduces variance (MacEachern
    and Peruggia, 1999,
  • "Subsampling the Gibbs Sampler: Variance
    Reduction")

60
Blocking
  • Sample several variables together, as a block
  • Example: Given three variables X, Y, Z with
    domains of size 2, group Y and Z together to form
    a variable W = {Y, Z} with domain size 4. Then,
    given sample (x^t, y^t, z^t), compute the next sample:
  • x^(t+1) ← P(X | y^t, z^t) = P(X | w^t)
  • (y^(t+1), z^(t+1)) = w^(t+1) ← P(W | x^(t+1))
  • + Can improve convergence greatly when two
    variables are strongly correlated!
  • - The domain of the block variable grows
    exponentially with the number of variables in a block!

61
Blocking Gibbs Sampling
  • Jensen, Kong & Kjaerulff, 1993,
  • "Blocking Gibbs Sampling in Very Large Probabilistic
    Expert Systems"
  • Select a set of subsets
  • E1, E2, E3, ..., Ek s.t. Ei ⊆ X
  • ∪i Ei = X
  • Ai = X \ Ei
  • Sample from P(Ei | Ai)

62
Rao-Blackwellisation
  • Do not sample all variables!
  • Sample a subset!
  • Example: Given three variables X, Y, Z, sample only
    X and Y, summing out Z. Given sample (x^t, y^t),
    compute the next sample (see the sketch below):
  • x^(t+1) ← P(X | y^t)
  • y^(t+1) ← P(Y | x^(t+1))
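A minimal sketch of this example as collapsed (Rao-Blackwellised) Gibbs over X and Y, with Z summed out of a full joint table; the uniform placeholder joint is purely illustrative.

```python
import random

# Placeholder joint P(X, Y, Z) over binary domains (uniform, purely illustrative).
joint = {(x, y, z): 0.125 for x in (0, 1) for y in (0, 1) for z in (0, 1)}

def sample_given(fix_y, fixed_value):
    """Sample X | Y=y (fix_y=True) or Y | X=x (fix_y=False), summing out Z."""
    if fix_y:    # weights over X: sum_z P(x, y, z)
        weights = [sum(joint[(x, fixed_value, z)] for z in (0, 1)) for x in (0, 1)]
    else:        # weights over Y: sum_z P(x, y, z)
        weights = [sum(joint[(fixed_value, y, z)] for z in (0, 1)) for y in (0, 1)]
    r = random.random() * sum(weights)
    return 0 if r < weights[0] else 1

x, y, samples = 0, 0, []
for _ in range(1000):
    x = sample_given(True, y)    # x^(t+1) ~ P(X | y^t)
    y = sample_given(False, x)   # y^(t+1) ~ P(Y | x^(t+1))
    samples.append((x, y))
```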

63
Rao-Blackwell Theorem
Bottom line: reducing the number of variables in a
sample reduces variance!
64
Blocking vs. Rao-Blackwellisation
  • Standard Gibbs:
  • P(x|y,z), P(y|x,z), P(z|x,y)   (1)
  • Blocking:
  • P(x|y,z), P(y,z|x)   (2)
  • Rao-Blackwellised:
  • P(x|y), P(y|x)   (3)
  • Var3 < Var2 < Var1
  • (Liu, Wong & Kong, 1994,
  • "Covariance Structure of the Gibbs Sampler")

(Figure: three variables X, Y, Z.)
65
Rao-Blackwellised Gibbs: Cutset Sampling
  • Select C ⊆ X (possibly a cycle-cutset), |C| = m
  • Fix evidence E
  • Initialize nodes with random values:
  • For i = 1 to m: assign a random value ci^0 to Ci
  • For t = 1 to n, generate samples:
  • For i = 1 to m:
  • Ci = ci^(t+1) ← P(ci | c1^(t+1), ..., c_(i-1)^(t+1),
    c_(i+1)^t, ..., cm^t, e)

66
Cutset Sampling
  • Select a subset C = {C1, ..., CK} ⊆ X
  • A sample t ∈ {1, 2, ...} is an instantiation of C
  • Sampling process:
  • Fix the values of the observed variables e
  • Generate sample c^0 at random
  • Generate samples c^1, c^2, ..., c^T from P(c|e)
  • Compute posteriors from the samples

67
Cutset Sampling: Generating Samples
  • Generate sample c^(t+1) from c^t
  • In short, for i = 1 to K:

68
Rao-Blackwellised Gibbs: Cutset Sampling
  • How do we compute P(ci | c^t \ ci, e)?
  • Compute the joint P(ci, c^t \ ci, e) for each
    ci ∈ D(Ci)
  • Then normalize:
  • P(ci | c^t \ ci, e) ∝ P(ci, c^t \ ci, e)
  • Computational efficiency depends
  • on the choice of C
  • (A sketch of this step follows below.)
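A minimal sketch of this normalisation step. It assumes a helper `joint_prob(assignment)` that returns P(ci, c \ ci, e) by exact inference over the non-cutset variables (e.g. bucket/tree elimination); that helper is not shown.

```python
import random

def sample_cutset_variable(ci, cutset_state, domain_ci, joint_prob):
    """Resample cutset variable ci given the rest of the cutset and the evidence."""
    weights = []
    for value in domain_ci:
        assignment = dict(cutset_state, **{ci: value})
        weights.append(joint_prob(assignment))  # P(ci, c \ ci, e) via exact inference
    z = sum(weights)
    probs = [w / z for w in weights]            # normalise: P(ci | c \ ci, e)
    r, cum = random.random(), 0.0
    for value, p in zip(domain_ci, probs):
        cum += p
        if r < cum:
            return value, p
    return domain_ci[-1], probs[-1]
```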

69
Rao-Blackwellised Gibbs: Cutset Sampling
  • How to choose C?
  • Special case: C is a cycle-cutset, O(N)
  • General case: apply Bucket Tree Elimination
    (BTE), O(exp(w)), where w is the induced width of
    the network when the nodes in C are observed.
  • Pick C wisely so as to minimize w ⇒ the notion of
    a w-cutset

70
w-cutset Sampling
  • C is a w-cutset of the network: a set of nodes such
    that when C and E are instantiated, the adjusted
    induced width of the network is w
  • The complexity of exact inference is
  • bounded by w!
  • A cycle-cutset is a special case

71
Cutset Sampling: Answering Queries
  • Query: ∀ ci ∈ C, P(ci | e) = ? Same as Gibbs:
  • Special case of w-cutset

computed while generating sample t
  • Query: P(xi | e) = ?

computed after generating sample t
72
Cutset Sampling Example
(Figure: Bayesian network over X1, ..., X9; evidence E = {x9}.)
73
Cutset Sampling Example
Sample a new value for X2
(Figure: Bayesian network over X1, ..., X9.)
74
Cutset Sampling Example
Sample a new value for X5
(Figure: Bayesian network over X1, ..., X9.)
75
Cutset Sampling Example
Query P(x2 | e) for sampled node X2
(Figure: Bayesian network over X1, ..., X9, showing samples 1, 2, and 3.)
76
Cutset Sampling Example
Query P(x3 | e) for non-sampled node X3
(Figure: Bayesian network over X1, ..., X9.)
77
Gibbs Error Bounds
  • Objectives
  • Estimate needed number of samples T
  • Estimate error
  • Methodology
  • 1 chain → use lag-k autocovariance
  • Estimate T
  • M chains → standard sampling variance
  • Estimate error

78
Gibbs lag-k autocovariance
Lag-k autocovariance
79
Gibbs lag-k autocovariance
Estimate Monte Carlo variance
Here, the truncation lag is the smallest positive
integer satisfying
Effective chain size
In the absence of autocovariance
(A sketch of these estimates follows below.)
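A minimal sketch of the standard lag-k autocovariance and effective chain size computation. The slide's rule for choosing the truncation lag is not reproduced; a fixed maximum lag is used instead (an assumption).

```python
def autocovariance(xs, k):
    """Lag-k autocovariance of a chain of scalar values xs."""
    n, mean = len(xs), sum(xs) / len(xs)
    return sum((xs[t] - mean) * (xs[t + k] - mean) for t in range(n - k)) / n

def effective_chain_size(xs, max_lag=50):
    """T_eff = T / (1 + 2 * sum_k gamma_k / gamma_0), truncated at max_lag."""
    gamma0 = autocovariance(xs, 0)
    s = sum(autocovariance(xs, k) for k in range(1, max_lag + 1))
    return len(xs) / (1.0 + 2.0 * s / gamma0)
```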
80
Gibbs: Multiple Chains
  • Generate M chains of size K
  • Each chain produces an independent estimate Pm

Estimate P(xi|e) as the average of the Pm(xi|e).
Treat the Pm as independent random variables.
81
Gibbs: Multiple Chains
  • The Pm are independent random variables
  • Therefore (see the sketch below):
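A minimal sketch of the multiple-chains estimate: average the M per-chain estimates Pm(xi|e) and use their sample variance to get a standard error for the average.

```python
def multi_chain_estimate(pm_values):
    """Combine M independent per-chain estimates Pm(xi | e)."""
    m = len(pm_values)
    p_hat = sum(pm_values) / m                                # average of the Pm
    var = sum((p - p_hat) ** 2 for p in pm_values) / (m - 1)  # sample variance of the Pm
    std_err = (var / m) ** 0.5                                # standard error of the average
    return p_hat, std_err
```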

82
Geman & Geman, 1984
  • Geman, S. & Geman, D., 1984. Stochastic
    relaxation, Gibbs distributions, and the Bayesian
    restoration of images. IEEE Trans. Pattern Anal.
    Mach. Intell. 6, 721-741.
  • Introduces Gibbs sampling.
  • Places the idea of Gibbs sampling in a general
    setting in which the collection of variables is
    structured in a graphical model and each variable
    has a neighborhood corresponding to a local
    region of the graphical structure. Geman and
    Geman use the Gibbs distribution to define the
    joint distribution on this structured set of
    variables.

83
Tanner & Wong, 1987
  • Tanner and Wong (1987)
  • Data-augmentation
  • Convergence Results

84
Pearl, 1988
  • Pearl, J., 1988. Probabilistic Reasoning in
    Intelligent Systems, Morgan Kaufmann.
  • In the case of Bayesian networks, the
    neighborhoods correspond to the Markov blanket of
    a variable and the joint distribution is defined
    by the factorization of the network.

85
Gelfand & Smith, 1990
  • Gelfand, A.E. and Smith, A.F.M., 1990.
    Sampling-based approaches to calculating marginal
    densities. J. Am. Statist. Assoc. 85, 398-409.
  • Shows variance reduction from using the mixture
    estimator for posterior marginals.

86
Neal, 1992
  • R. M. Neal, 1992. Connectionist learning of
    belief networks. Artificial Intelligence, v. 56,
    pp. 71-118.
  • Stochastic simulation in noisy-OR networks.

87
CPCS54 Test Results
MSE vs. samples (left) and time (right)
Ergodic, |X| = 54, |D(Xi)| = 2, |C| = 15, |E| = 4.
Exact time: 30 sec using cutset conditioning.
88
CPCS179 Test Results
MSE vs. samples (left) and time (right)
Non-ergodic (1 deterministic CPT entry), |X| = 179,
|C| = 8, 2 ≤ |D(Xi)| ≤ 4, |E| = 35. Exact time:
122 sec using loop-cutset conditioning.
89
CPCS360b Test Results
MSE vs. samples (left) and time (right)
Ergodic, |X| = 360, |D(Xi)| = 2, |C| = 21, |E| = 36.
Exact time: > 60 min using cutset conditioning.
Exact values obtained via Bucket Elimination.
90
Random Networks
MSE vs. samples (left) and time (right). |X| = 100,
|D(Xi)| = 2, |C| = 13, |E| = 15-20. Exact time:
30 sec using cutset conditioning.
91
Coding Networks
(Figure: coding network with nodes u1-u4, p1-p4,
x1-x4, y1-y4.)
MSE vs. time (right). Non-ergodic, |X| = 100,
|D(Xi)| = 2, |C| = 13-16, |E| = 50. Sample the ergodic
subspace U = {U1, U2, ..., Uk}. Exact time: 50 sec
using cutset conditioning.
92
Non-Ergodic Hailfinder
MSE vs. samples (left) and time (right).
Non-ergodic, |X| = 56, |C| = 5, 2 ≤ |D(Xi)| ≤ 11,
|E| = 0. Exact time: 2 sec using loop-cutset
conditioning.
93
Non-Ergodic CPCS360b - MSE
MSE vs. time. Non-ergodic, |X| = 360, |C| = 26,
|D(Xi)| = 2. Exact time: 50 min using BTE.
94
Non-Ergodic CPCS360b - MaxErr