Title: A Tutorial on Inference and Learning in Bayesian Networks
1 A Tutorial on Inference and Learning in Bayesian Networks
 Irina Rish, Moninder Singh
 IBM T.J. Watson Research Center
 rish,moninder_at_us.ibm.com
2 Road map
 Introduction to Bayesian networks
  What are BNs: representation, types, etc.
  Why use BNs: Applications (classes) of BNs
  Information sources, software, etc.
 Probabilistic inference
 Exact inference
 Approximate inference
 Learning Bayesian Networks
 Learning parameters
 Learning graph structure
 Summary
3 Bayesian Networks
P(A) P(S) P(TA) P(LS) P(BS)
P(CT,L) P(DT,L,B)
P(A, S, T, L, B, C, D)
Lauritzen Spiegelhalter, 95
4 Bayesian Networks
 Structured, graphical representation of probabilistic relationships between several random variables
 Explicit representation of conditional independencies
 Missing arcs encode conditional independence
 Efficient representation of joint pdf
 Allows arbitrary queries to be answered:
 P(lung cancer = yes | smoking = no, dyspnoea = yes) = ?
5 Example: Printer Troubleshooting (Microsoft Windows 95)
 Heckerman, 95
6 Example: Microsoft Pregnancy and Child Care
 Heckerman, 95
7 Example: Microsoft Pregnancy and Child Care
 Heckerman, 95
8 Independence Assumptions
9 Independence Assumptions
 Nodes X and Y are d-connected by nodes in Z along a trail from X to Y if
  every head-to-head node along the trail is in Z or has a descendant in Z
  every other node along the trail is not in Z
 Nodes X and Y are d-separated by nodes in Z if they are not d-connected by Z along any trail from X to Y
 If nodes X and Y are d-separated by Z, then X and Y are conditionally independent given Z
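The definitions above can be checked numerically on the smallest interesting case: a head-to-head (v-structure) network A -> C <- B, with all CPT numbers made up for illustration. With Z = {} the head-to-head node C is not in Z and has no descendant in Z, so A and B are d-separated (hence marginally independent); putting C into Z d-connects them ("explaining away"):

```python
from itertools import product

# V-structure A -> C <- B with made-up CPTs (all numbers illustrative).
P_A = {0: 0.6, 1: 0.4}
P_B = {0: 0.7, 1: 0.3}
P_C1 = {(0, 0): 0.1, (0, 1): 0.8, (1, 0): 0.7, (1, 1): 0.99}  # P(C=1 | a, b)

def joint(a, b, c):
    pc1 = P_C1[(a, b)]
    return P_A[a] * P_B[b] * (pc1 if c == 1 else 1 - pc1)

def p_ab(a, b):  # marginalize out C
    return sum(joint(a, b, c) for c in (0, 1))

# Z = {}: C is head-to-head, not in Z => A, B d-separated => independent.
marg_indep = all(abs(p_ab(a, b) - P_A[a] * P_B[b]) < 1e-12
                 for a, b in product((0, 1), (0, 1)))

# Z = {C}: the head-to-head node is now in Z => A, B are d-connected,
# and indeed P(a, b | c) no longer factorizes.
def p_ab_given_c(a, b, c):
    z = sum(joint(x, y, c) for x in (0, 1) for y in (0, 1))
    return joint(a, b, c) / z

def p_a_given_c(a, c):
    return sum(p_ab_given_c(a, b, c) for b in (0, 1))

def p_b_given_c(b, c):
    return sum(p_ab_given_c(a, b, c) for a in (0, 1))

cond_dep = any(
    abs(p_ab_given_c(a, b, 1) - p_a_given_c(a, 1) * p_b_given_c(b, 1)) > 1e-6
    for a, b in product((0, 1), (0, 1)))
```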
10 Independence Assumptions
 A variable (node) is conditionally independent of its non-descendants given its parents
 (Figure: network with nodes Visit to Asia, Smoking, Tuberculosis, Lung Cancer, Bronchitis, Chest X-ray, Dyspnoea)
11 Independence Assumptions
Cancer is independent of Diet given Exposure to
Toxins and Smoking
Breese Koller, 97
12 Independence Assumptions
 What this means is that the joint pdf can be represented as a product of local distributions:
 P(A,S,T,L,B,C,D) = P(A) . P(S|A) . P(T|A,S) . P(L|A,S,T) . P(B|A,S,T,L) . P(C|A,S,T,L,B) . P(D|A,S,T,L,B,C)
 = P(A) . P(S) . P(T|A) . P(L|S) . P(B|S) . P(C|T,L) . P(D|T,L,B)
13 Independence Assumptions
 Thus, the general product rule for Bayesian networks is
 P(X1, X2, ..., Xn) = prod_{i=1}^{n} P(Xi | Pa(Xi))
 where Pa(Xi) is the set of parents of Xi
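The product rule above can be sketched in a few lines for the slides' seven-variable network. All CPT numbers below are made up for illustration; the only property exercised is that the product of local distributions defines a valid joint (it sums to 1):

```python
from itertools import product

# Made-up CPTs (binary variables) for:
# P(A,S,T,L,B,C,D) = P(A) P(S) P(T|A) P(L|S) P(B|S) P(C|T,L) P(D|T,L,B)
P_A = [0.99, 0.01]                        # P(A = a)
P_S = [0.5, 0.5]                          # P(S = s)
P_T = {0: [0.99, 0.01], 1: [0.95, 0.05]}  # P(T = t | A = a) as P_T[a][t]
P_L = {0: [0.99, 0.01], 1: [0.90, 0.10]}  # P(L = l | S = s)
P_B = {0: [0.70, 0.30], 1: [0.40, 0.60]}  # P(B = b | S = s)

def p_C(c, t, l):  # P(C = c | T, L): positive X-ray likely if T or L present
    p1 = 0.98 if (t or l) else 0.05
    return p1 if c else 1 - p1

def p_D(d, t, l, b):  # P(D = d | T, L, B): dyspnoea
    p1 = 0.90 if (t or l or b) else 0.10
    return p1 if d else 1 - p1

def joint(a, s, t, l, b, c, d):
    """Chain rule for BNs: product of each node's CPT given its parents."""
    return (P_A[a] * P_S[s] * P_T[a][t] * P_L[s][l] * P_B[s][b]
            * p_C(c, t, l) * p_D(d, t, l, b))

# The factorization defines a proper joint distribution: it sums to 1.
total = sum(joint(*v) for v in product((0, 1), repeat=7))
```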
14 The Knowledge Acquisition Task
 Variables
  collectively exhaustive, mutually exclusive values
  clarity test: value should be knowable in principle
 Structure
  if data available, can be learned
  constructed by hand (using expert knowledge)
  variable ordering matters: causal knowledge usually simplifies
 Probabilities
  can be learned from data
  second decimal usually does not matter: relative probs
  sensitivity analysis
15 The Knowledge Acquisition Task
16 The Knowledge Acquisition Task
 Naive Bayesian Classifiers (Duda & Hart; Langley 92); Selective Naive Bayesian Classifiers (Langley & Sage 94); Conditional Trees (Geiger 92; Friedman et al 97)
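Since the slide mentions naive Bayesian classifiers, here is a minimal sketch with a made-up four-example training set and Laplace smoothing; predict() picks the class maximizing the prior times the product of per-feature likelihoods:

```python
from collections import Counter, defaultdict

# Toy training data (all examples made up): (features, class)
train = [
    ({"fever": 1, "cough": 1}, "flu"),
    ({"fever": 1, "cough": 0}, "flu"),
    ({"fever": 0, "cough": 1}, "cold"),
    ({"fever": 0, "cough": 0}, "cold"),
]

class_counts = Counter(c for _, c in train)
feat_counts = defaultdict(Counter)  # feat_counts[(cls, feat)][value] = count
for feats, c in train:
    for f, v in feats.items():
        feat_counts[(c, f)][v] += 1

def predict(feats):
    def score(c):
        p = class_counts[c] / len(train)               # class prior
        for f, v in feats.items():
            # Laplace smoothing over the two binary feature values
            p *= (feat_counts[(c, f)][v] + 1) / (class_counts[c] + 2)
        return p
    return max(class_counts, key=score)

label = predict({"fever": 1, "cough": 1})  # -> "flu" under these toy counts
```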
17 The Knowledge Acquisition Task
 Selective Bayesian Networks (Singh & Provan, 95-96)
18 What are BNs useful for?
 Diagnosis: P(cause | symptom) = ?
 Prediction: P(symptom | cause) = ?
 Decision-making (given a cost function)
 Data mining: induce best model from data
19 What are BNs useful for?
 (Figure: Cause and Effect nodes; Predictive Inference; Decision Making: Max. Expected Utility)
20 What are BNs useful for?
 (Figure: troubleshooting loop -- Salient Observations -> Assignment of Belief over Fault 1, Fault 2, Fault 3, ...; Halt? If yes: Act Now!; if no: pick Next Best Observation (Value of Information), get New Obs., repeat)
21 Why use BNs?
 Explicit management of uncertainty
 Modularity implies maintainability
 Better, flexible and robust decision making: MEU, VOI
 Can be used to answer arbitrary queries: multiple fault problems
 Easy to incorporate prior knowledge
 Easy to understand
22 Application Examples
 Intellipath
  commercial version of Pathfinder
  lymph-node diseases (60), 100 findings
 APRI system developed at AT&T Bell Labs
  learns/uses Bayesian networks from data to identify customers liable to default on bill payments
 NASA Vista system
  predict failures in propulsion systems
  considers time criticality; suggests highest utility action
  dynamically decide what information to show
23 Application Examples
 Answer Wizard in MS Office 95 / MS Project
  Bayesian network based free-text help facility
  uses naive Bayesian classifiers
 Office Assistant in MS Office 97
  Extension of Answer Wizard
  uses naive Bayesian networks
  help based on past experience (keyboard/mouse use) and the task the user is doing currently
  This is the smiley face you get in your MS Office applications
24 Application Examples
 Microsoft Pregnancy and ChildCare
  Available on MSN in Health section
  Frequently occurring children's symptoms are linked to expert modules that repeatedly ask parents relevant questions
  Asks next best question based on provided information
  Presents articles that are deemed relevant based on information provided
25 Application Examples
 Printer troubleshooting
  HP bought a 40% stake in HUGIN; developing printer troubleshooters for HP printers
  Microsoft has 70 online troubleshooters on their web site
   use Bayesian networks: multiple fault models, incorporate utilities
 Fax machine troubleshooting
  Ricoh uses Bayesian network based troubleshooters at call centers
  Enabled Ricoh to answer twice the number of calls in half the time
26 Application Examples
27 Application Examples
28 Application Examples
29 Online/print resources on BNs
 Conferences & Journals
  UAI, ICML, AAAI, AISTAT, KDD
  MLJ, DMKD, JAIR, IEEE KDD, IJAR, IEEE PAMI
 Books and Papers
  Bayesian Networks without Tears by Eugene Charniak. AI Magazine, Winter 1991.
  Probabilistic Reasoning in Intelligent Systems by Judea Pearl. Morgan Kaufmann, 1988.
  Probabilistic Reasoning in Expert Systems by Richard Neapolitan. Wiley, 1990.
  CACM special issue on Real-world applications of BNs, March 1995
30 Online/Print Resources on BNs
 Wealth of online information at www.auai.org. Links to:
  Electronic proceedings for UAI conferences
  Other sites with information on BNs and reasoning under uncertainty
  Several tutorials and important articles
  Research groups & companies working in this area
  Other societies, mailing lists and conferences
31 Publicly available s/w for BNs
 List of BN software maintained by Russell Almond at bayes.stat.washington.edu/almond/belief.html
  several free packages: generally research only
  commercial packages: most powerful (& expensive) is HUGIN; others include Netica and Dxpress
  we are working on developing a Java based BN toolkit here at Watson; will also work within ABLE
32 Road map
 Introduction to Bayesian networks
  What are BNs: representation, types, etc.
  Why use BNs: Applications (classes) of BNs
  Information sources, software, etc.
 Probabilistic inference
 Exact inference
 Approximate inference
 Learning Bayesian Networks
 Learning parameters
 Learning graph structure
 Summary
33 Probabilistic Inference Tasks
 Belief updating
 Finding most probable explanation (MPE)
 Finding maximum a posteriori (MAP) hypothesis
 Finding maximum-expected-utility (MEU) decision
34 Belief Updating
 (Figure: network with nodes Smoking, Lung Cancer, Bronchitis, X-ray, Dyspnoea)
 P(lung cancer = yes | smoking = no, dyspnoea = yes) = ?
35 Belief updating: P(X | evidence) = ?
 (Figure: derivation of P(a | e = 0) from P(a) on a network over A, B, C, D, E)
36 Bucket elimination: Algorithm elim-bel (Dechter 1996)
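A minimal sketch of the elimination idea behind elim-bel on a three-variable chain A -> B -> C: place CPTs in buckets, multiply the functions in a bucket, and sum out the bucket's variable. The factor representation and all CPT numbers are illustrative assumptions, not the algorithm's actual pseudocode:

```python
import itertools

def multiply(f1, vars1, f2, vars2):
    """Pointwise product of two factors over binary variables."""
    vars_out = list(dict.fromkeys(vars1 + vars2))
    out = {}
    for assign in itertools.product((0, 1), repeat=len(vars_out)):
        env = dict(zip(vars_out, assign))
        out[assign] = (f1[tuple(env[v] for v in vars1)]
                       * f2[tuple(env[v] for v in vars2)])
    return out, vars_out

def sum_out(f, vars_, var):
    """Eliminate var from factor f by summation."""
    i = vars_.index(var)
    vars_out = vars_[:i] + vars_[i + 1:]
    out = {}
    for assign, val in f.items():
        key = assign[:i] + assign[i + 1:]
        out[key] = out.get(key, 0.0) + val
    return out, vars_out

# CPTs P(A), P(B|A), P(C|B) -- illustrative numbers.
fA = {(0,): 0.6, (1,): 0.4}
fB = {(0, 0): 0.9, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.8}  # key (a, b) = P(b|a)
fC = {(0, 0): 0.7, (0, 1): 0.3, (1, 0): 0.1, (1, 1): 0.9}  # key (b, c) = P(c|b)

# Compute P(C) by processing buckets in the order A, then B:
f, vs = multiply(fA, ["A"], fB, ["A", "B"])
f, vs = sum_out(f, vs, "A")              # bucket A done -> factor over B
f, vs = multiply(f, vs, fC, ["B", "C"])
f, vs = sum_out(f, vs, "B")              # bucket B done -> factor over C
pC = {c: f[(c,)] for c in (0, 1)}
```

Here P(B=1) = 0.6*0.1 + 0.4*0.8 = 0.38, so P(C=1) = 0.62*0.3 + 0.38*0.9 = 0.528.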
37 Finding MPE: Algorithm elim-mpe (Dechter 1996)
 Elimination operator
38 Generating the MPE tuple
39 Complexity of inference
 The effect of the ordering
40 Other tasks and algorithms
 MAP and MEU tasks
  Similar bucket-elimination algorithms: elim-map, elim-meu (Dechter 1996)
  Elimination operation: either summation or maximization
  Restriction on variable ordering: summation must precede maximization (i.e. hypothesis or decision variables are eliminated last)
 Other inference algorithms
  Join-tree clustering
  Pearl's polytree propagation
  Conditioning, etc.
41 Relationship with join-tree clustering
 (Figure: buckets grouped into clusters over variable subsets, e.g. BCE, ADB, ABC)
 A cluster is a set of buckets (a super-bucket)
42 Relationship with Pearl's belief propagation in polytrees
 Causal support; diagnostic support
 Pearl's belief propagation for a single-root query corresponds to elim-bel using a topological ordering and super-buckets for families
 Elim-bel, elim-mpe, and elim-map are linear for polytrees.
43 Road map
 Introduction to Bayesian networks
 Probabilistic inference
 Exact inference
 Approximate inference
 Learning Bayesian Networks
 Learning parameters
 Learning graph structure
 Summary
44 Inference is NP-hard => approximations
 Approximations
  Local inference
  Stochastic simulations
  Variational approximations
  etc.
45 Local Inference: Idea
46 Bucket-elimination approximation: mini-buckets
 Local inference idea: bound the size of recorded dependencies
 Computation in a bucket is time and space exponential in the number of variables involved
 Therefore, partition the functions in a bucket into mini-buckets on smaller numbers of variables
47 Mini-bucket approximation: MPE task
 Split a bucket into mini-buckets => bound complexity
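The reason splitting a bucket yields an upper bound for the MPE task is the elementary inequality max_x f(x)g(x) <= (max_x f(x)) * (max_x g(x)): maximizing each mini-bucket separately can only increase the value. A one-variable numeric illustration with made-up function values:

```python
# Mini-bucket idea in one line of algebra:
#   max_x f(x) * g(x)  <=  (max_x f(x)) * (max_x g(x))
# Maximizing mini-buckets separately can only raise the value,
# hence an upper bound on the MPE.

f = {0: 0.2, 1: 0.9}   # illustrative functions of the same variable x
g = {0: 0.8, 1: 0.3}

exact = max(f[x] * g[x] for x in (0, 1))    # joint maximization: 0.27 (x = 1)
upper = max(f.values()) * max(g.values())   # mini-bucket bound: 0.72
```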
48 Approx-mpe(i)
 Input: i, the max number of variables allowed in a mini-bucket
 Output: lower bound (P of a suboptimal solution), upper bound
 Example: approx-mpe(3) versus elim-mpe
49 Properties of approx-mpe(i)
 Complexity: O(exp(2i)) time and O(exp(i)) space.
 Accuracy: determined by the upper/lower (U/L) bound.
 As i increases, both accuracy and complexity increase.
 Possible uses of mini-bucket approximations:
  As anytime algorithms (Dechter and Rish, 1997)
  As heuristics in best-first search (Kask and Dechter, 1999)
 Other tasks: similar mini-bucket approximations for belief updating, MAP and MEU (Dechter and Rish, 1997)
50 Anytime Approximation
51 Empirical Evaluation (Dechter and Rish, 1997; Rish, 1999)
 Randomly generated networks
  Uniform random probabilities
  Random noisy-OR
 CPCS networks
 Probabilistic decoding
 Comparing approx-mpe and anytime-mpe versus elim-mpe
52 Random networks
 Uniform random: 60 nodes, 90 edges (200 instances)
  In 80% of cases, 10-100 times speedup while U/L < 2
 Noisy-OR: even better results
  Exact elim-mpe was infeasible; approx-mpe took 0.1 to 80 sec.
53 CPCS networks: medical diagnosis (noisy-OR model)
 Test case: no evidence
54 Effect of evidence
 More likely evidence => higher MPE => higher accuracy (why?)
 Likely evidence versus random (unlikely) evidence
55 Probabilistic decoding
 Error-correcting linear block code
 State-of-the-art approximate algorithm: iterative belief propagation (IBP) (Pearl's polytree algorithm applied to loopy networks)
56 approx-mpe vs. IBP
 Bit error rate (BER) as a function of noise (sigma)
57 Mini-buckets: summary
 Mini-buckets: local inference approximation
 Idea: bound the size of recorded functions
 Approx-mpe(i): mini-bucket algorithm for MPE
  Better results for noisy-OR than for random problems
  Accuracy increases with decreasing noise
  Accuracy increases for likely evidence
  Sparser graphs => higher accuracy
 Coding networks: approx-mpe outperforms IBP on low-induced-width codes
58 Road map
 Introduction to Bayesian networks
 Probabilistic inference
 Exact inference
 Approximate inference
 Local inference
 Stochastic simulations
 Variational approximations
 Learning Bayesian Networks
 Summary
59 Approximation via Sampling
60 Forward Sampling (logic sampling (Henrion, 1988))
61 Forward sampling (example)
 Drawback: high rejection rate!
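Forward (logic) sampling and its rejection drawback can be sketched on a two-node network A -> B with evidence B = 1; ancestors are sampled first, and any sample disagreeing with the evidence is thrown away. All probabilities are made up:

```python
import random

random.seed(0)

P_A1 = 0.3                   # P(A = 1), made up
P_B1 = {0: 0.2, 1: 0.9}      # P(B = 1 | A = a), made up

def estimate_p_a1_given_b1(n):
    """Estimate P(A=1 | B=1) by forward sampling with rejection."""
    kept = hits = 0
    for _ in range(n):
        a = 1 if random.random() < P_A1 else 0     # sample parent first
        b = 1 if random.random() < P_B1[a] else 0  # then its child
        if b != 1:          # evidence B=1 violated -> reject (the drawback!)
            continue
        kept += 1
        hits += a
    return hits / kept, kept

est, kept = estimate_p_a1_given_b1(20000)
# Exact posterior: P(A=1|B=1) = 0.3*0.9 / (0.3*0.9 + 0.7*0.2) = 0.27/0.41
```

Since P(B = 1) = 0.41, roughly 59% of the samples are rejected here; with rarer evidence the waste is far worse.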
62 Likelihood Weighting (Fung and Chang, 1990; Shachter and Peot, 1990)
 Clamping evidence + forward sampling; weighting samples by evidence likelihood
 Works well for likely evidence!
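For comparison with rejection sampling, a likelihood-weighting sketch on the same kind of two-node network A -> B with evidence B = 1 (made-up probabilities): the evidence node is clamped, and each sample is weighted by the likelihood of the clamped value, so no sample is wasted:

```python
import random

random.seed(0)

P_A1 = 0.3                   # P(A = 1), made up
P_B1 = {0: 0.2, 1: 0.9}      # P(B = 1 | A = a), made up

def lw_estimate_p_a1_given_b1(n):
    """Estimate P(A=1 | B=1) by likelihood weighting."""
    num = den = 0.0
    for _ in range(n):
        a = 1 if random.random() < P_A1 else 0
        w = P_B1[a]          # weight = likelihood of the clamped evidence B=1
        num += w * a
        den += w
    return num / den

est = lw_estimate_p_a1_given_b1(20000)
# Exact: 0.27 / 0.41 -- same target as rejection sampling, no samples wasted.
```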
63 Gibbs Sampling (Geman and Geman, 1984)
 Markov Chain Monte Carlo (MCMC): create a Markov chain of samples
 Advantage: guaranteed to converge to P(X)
 Disadvantage: convergence may be slow
64 Gibbs Sampling (cont'd) (Pearl, 1988)
 Markov blanket
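A Gibbs-sampling sketch on a chain A -> B -> C with evidence C = 1: each free variable is repeatedly resampled from its full conditional given its Markov blanket. The CPTs, burn-in, and chain length are made-up choices:

```python
import random

random.seed(0)

P_A1 = 0.5                   # P(A = 1), made up
P_B1 = {0: 0.2, 1: 0.8}      # P(B = 1 | A = a), made up
P_C1 = {0: 0.1, 1: 0.7}      # P(C = 1 | B = b), made up

def cond_a(b):               # P(A=1 | b): Markov blanket of A is {B}
    w1 = P_A1 * (P_B1[1] if b else 1 - P_B1[1])
    w0 = (1 - P_A1) * (P_B1[0] if b else 1 - P_B1[0])
    return w1 / (w0 + w1)

def cond_b(a):               # P(B=1 | a, C=1): Markov blanket of B is {A, C}
    w1 = P_B1[a] * P_C1[1]
    w0 = (1 - P_B1[a]) * P_C1[0]
    return w1 / (w0 + w1)

a, b = 0, 0                  # arbitrary initial state; C is clamped to 1
burn, n, hits = 1000, 20000, 0
for i in range(burn + n):
    a = 1 if random.random() < cond_a(b) else 0
    b = 1 if random.random() < cond_b(a) else 0
    if i >= burn:
        hits += b
est_p_b1_given_c1 = hits / n
# Exact by enumeration: P(B=1 | C=1) = 0.35 / 0.40 = 0.875
```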
65 Road map
 Introduction to Bayesian networks
 Probabilistic inference
 Exact inference
 Approximate inference
 Local inference
 Stochastic simulations
 Variational approximations
 Learning Bayesian Networks
 Summary
66 Variational Approximations
 Idea: variational transformation of CPDs simplifies inference
 Advantages:
  Compute upper and lower bounds on P(Y)
  Usually faster than sampling techniques
 Disadvantages:
  More complex and less general; must be derived for each particular form of CPD functions
67 Variational bounds: example
 log(x) <= lambda*x - log(lambda) - 1 for any lambda > 0 (equality at lambda = 1/x)
 This approach can be generalized for any concave (convex) function in order to compute its upper (lower) bounds: convex duality (Jaakkola and Jordan, 1997)
68 Convex duality (Jaakkola and Jordan, 1997)
69 Example: QMR-DT network (Quick Medical Reference, Decision-Theoretic (Shwe et al., 1991))
 600 diseases, 4000 findings
 Noisy-OR model
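A noisy-OR CPD of the kind used in QMR-DT-style networks can be written in a few lines: a finding stays off only if every present parent disease fails to activate it and the leak does not fire. The leak value, disease names and link probabilities below are made-up illustrations:

```python
# Noisy-OR CPD: P(finding = 0 | parents) = (1 - leak) * prod_d (1 - q_d)
# over the present parent diseases d, where q_d = P(d alone turns f on).

def noisy_or_p_positive(present_parents, link_probs, leak=0.0):
    """P(finding = 1 | present parent diseases) under the noisy-OR model."""
    p_all_fail = 1 - leak
    for d in present_parents:
        p_all_fail *= 1 - link_probs[d]   # each cause independently fails
    return 1 - p_all_fail

# Made-up link probabilities for two hypothetical diseases:
link = {"flu": 0.8, "cold": 0.4}
p = noisy_or_p_positive({"flu", "cold"}, link, leak=0.1)
# p = 1 - 0.9 * 0.2 * 0.6 = 0.892
```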
70 Inference in QMR-DT
 Negative evidence keeps the model factorized; positive evidence couples the disease nodes
 Inference complexity: O(exp(min(p, k))), where p = number of positive findings, k = max family size (Heckerman, 1989 (Quickscore); Rish and Dechter, 1998)
71 Variational approach to QMR-DT (Jaakkola and Jordan, 1997)
 The effect of positive evidence is now factorized (diseases are decoupled)
72 Variational approximations
 Bounds on local CPDs yield a bound on the posterior
 Two approaches: sequential and block
  Sequential: applies the variational transformation to (a subset of) nodes sequentially during inference, using a heuristic node ordering; then optimizes across variational parameters
  Block: selects in advance the nodes to be transformed, then selects variational parameters minimizing the KL-distance between the true and approximate posteriors
73 Block approach
74 Inference in BNs: summary
 Exact inference is often intractable => need approximations
 Approximation principles:
  Approximating elimination: local inference, bounding the size of dependencies among variables (cliques in a problem's graph): mini-buckets, IBP
  Other approximations: stochastic simulations, variational techniques, etc.
 Further research:
  Combining orthogonal approximation approaches
  Better understanding of what works well where: which approximation suits which problem structure
  Other approximation paradigms (e.g., other ways of approximating probabilities, constraints, cost functions)
75 Road map
 Introduction to Bayesian networks
 Probabilistic inference
 Exact inference
 Approximate inference
 Learning Bayesian Networks
 Learning parameters
 Learning graph structure
 Summary
76 Why learn Bayesian networks?
 Efficient representation and inference
 Handling missing data: <1.3  2.8  ??  0  1>
77 Learning Bayesian Networks
78 Learning Parameters: complete data
79 Learning graph structure
 Complete data: local computations
 Incomplete data (score non-decomposable): stochastic methods
 Constraint-based methods
  Data impose independence relations (constraints)
80 Learning BNs: incomplete data
 Learning parameters
  EM algorithm (Lauritzen, 95)
  Gibbs Sampling (Heckerman, 96)
  Gradient Descent (Russell et al., 96)
 Learning both structure and parameters
  Sum over missing values (Cooper & Herskovits, 92; Cooper, 95)
  Monte-Carlo approaches (Heckerman, 96)
  Gaussian approximation (Heckerman, 96)
  Structural EM (Friedman, 98)
  EM and Multiple Imputation (Singh 97, 98, 00)
81 Learning Parameters: incomplete data
 EM algorithm: iterate until convergence
82 Learning Parameters: incomplete data (Lauritzen, 95)
 Complete-data log-likelihood: log L = sum_ijk N_ijk log theta_ijk
 E step:
  Compute E(N_ijk | Y_obs, theta)
 M step:
  Compute theta_ijk = E(N_ijk | Y_obs, theta) / E(N_ij | Y_obs, theta)
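In the simplest possible case, a single binary variable with values missing completely at random, the E/M updates above reduce to a two-line loop: fill in each missing value with its expected value theta, then re-normalize. A minimal sketch with made-up data and starting point:

```python
# EM for one binary variable X with P(X=1) = theta, MCAR-missing values.
# E step: expected count E(N_1 | Y_obs, theta); M step: theta = E(N_1) / N.

data = [1, 1, 0, 1, None, None, 0, 1, None, 1]   # None = missing (made up)

theta = 0.5                  # arbitrary starting point
for _ in range(100):         # fixed iteration count stands in for convergence
    # E step: expected count of X = 1, filling each missing value with theta
    e_n1 = sum(theta if x is None else x for x in data)
    # M step: re-estimate theta from the expected counts
    theta = e_n1 / len(data)

# With MCAR data this converges to the observed-data MLE: 5 ones / 7 observed
```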
83 Learning structure: incomplete data
 Depends on the type of missing data: missing independent of anything else (MCAR) OR missing based on values of other variables (MAR)
 While MCAR can be resolved by decomposable scores, MAR cannot
 For likelihood-based methods, no need to explicitly model the missing-data mechanism
 Very few attempts at MAR: stochastic methods
84 Learning structure: incomplete data
 Approximate EM by using Multiple Imputation to yield an efficient Monte-Carlo method (Singh 97, 98, 00)
  tradeoff between performance & quality
  learned network almost optimal
  approximate the complete-data log-likelihood function using Multiple Imputation
  yields a decomposable score, dependent only on each node & its parents
  converges to local maxima of the observed-data likelihood
85 Learning structure: incomplete data
86 Scoring functions: Minimum Description Length (MDL)
 Learning <=> data compression
 Score = DL(Model) + DL(Data | Model)
 Related: BIC (Bayesian Information Criterion)
 Bayesian score (BDe): asymptotically equivalent to MDL
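The DL(Model) + DL(Data | Model) trade-off can be illustrated by scoring two candidate structures over two binary variables with the BIC form, log-likelihood minus (number of free parameters / 2) * log N. The dataset and both candidate structures are made-up illustrations:

```python
import math
from collections import Counter

# Compare structure "A, B independent" (2 parameters) vs "A -> B" (3
# parameters) on a made-up strongly-dependent binary dataset.
data = [(0, 0)] * 40 + [(0, 1)] * 10 + [(1, 0)] * 10 + [(1, 1)] * 40
N = len(data)
cnt = Counter(data)

def loglik_indep():
    pa = sum(a for a, _ in data) / N
    pb = sum(b for _, b in data) / N
    return sum(n * math.log((pa if a else 1 - pa) * (pb if b else 1 - pb))
               for (a, b), n in cnt.items())

def loglik_edge():
    pa = sum(a for a, _ in data) / N
    ll = 0.0
    for (a, b), n in cnt.items():
        na = sum(m for (x, _), m in cnt.items() if x == a)
        pb_a = sum(m for (x, y), m in cnt.items() if x == a and y == 1) / na
        ll += n * math.log((pa if a else 1 - pa) * (pb_a if b else 1 - pb_a))
    return ll

bic_indep = loglik_indep() - (2 / 2) * math.log(N)   # 2 free parameters
bic_edge = loglik_edge() - (3 / 2) * math.log(N)     # 3 free parameters
# Strong A-B dependence here, so the extra parameter pays for itself.
```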
87 Learning Structure plus Parameters
 No. of models is super-exponential. Alternatives: Model Selection or Model Averaging
88 Model Selection
 Generally, choose a single model M; equivalent to saying P(M | D) = 1
 Task is now to: 1) define a metric to decide which model is best; 2) search for that model through the space of all models
89 One Reasonable Score: Posterior Probability of a Structure
 P(S | D) proportional to P(S) * Integral P(D | theta, S) P(theta | S) d theta
 (structure prior x parameter prior x likelihood)
90 Global and Local Predictive Scores
 Spiegelhalter et al 93
 Bayes factor; sequential (predictive) decomposition of the score over the m cases:
 log p(D | S^h) = sum_{l=1}^{m} log p(x_l | x_1, ..., x_{l-1}, S^h)
 = log p(x_1 | S^h) + log p(x_2 | x_1, S^h) + log p(x_3 | x_1, x_2, S^h) + ...
Local is useful for diagnostic problems
91 Local Predictive Score (Spiegelhalter et al., 1993)
92 Exact computation of p(D | S^h)
 Assumptions (Cooper & Herskovits, 92):
  No missing data
  Cases are independent, given the model
  Uniform priors on parameters
  Discrete variables
93 Bayesian Dirichlet Score (Cooper and Herskovits, 1991)
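Under the previous slide's assumptions (complete data, uniform Dirichlet priors, discrete variables), the Cooper-Herskovits family score multiplies, over each parent configuration j, the term (r-1)! / (N_ij + r - 1)! * prod_k N_ijk!. A sketch in log space using log-gamma for the factorials; the counts below are made up:

```python
import math

def log_k2_family_score(counts_per_parent_config, r):
    """Log of prod_j [ (r-1)! / (N_ij + r - 1)! * prod_k N_ijk! ]
    for one node with r values; counts_per_parent_config[j][k] = N_ijk."""
    score = 0.0
    for nijk in counts_per_parent_config:
        nij = sum(nijk)
        # (r-1)! = Gamma(r); (N_ij + r - 1)! = Gamma(N_ij + r)
        score += math.lgamma(r) - math.lgamma(nij + r)
        # prod_k N_ijk! with N_ijk! = Gamma(N_ijk + 1)
        score += sum(math.lgamma(n + 1) for n in nijk)
    return score

# Binary node (r = 2) with a binary parent: made-up counts N_ijk,
# j in {parent=0, parent=1}, k in {node=0, node=1}.
s = log_k2_family_score([[8, 2], [1, 9]], r=2)
```

Summing such family scores over all nodes gives the log of p(D, S^h) up to the structure prior, which is what the K2 search maximizes.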
94 Learning BNs without specifying an ordering
 n! orderings; the ordering greatly affects the quality of the network learned
 Use conditional independence tests and d-separation to get an ordering (Singh & Valtorta 95)
95 Learning BNs via the MDL principle
 Idea: the best model is the one that gives the most compact representation of the data
 So, encode the data using the model, plus encode the model; minimize this sum (Lam & Bacchus, 93)
96 Learning BNs: summary
 Bayesian Networks: graphical probabilistic models
  Efficient representation and inference
  Expert knowledge + learning from data
 Learning:
  parameters (parameter estimation, EM)
  structure (optimization w/ score functions, e.g., MDL)
 Applications/systems: collaborative filtering (MSBN), fraud detection (AT&T), classification (AutoClass (NASA), TAN-BLT (SRI))
 Future directions: causality, time, model evaluation criteria, approximate inference/learning, online learning, etc.