Title: A Tutorial on Inference and Learning in Bayesian Networks
1 A Tutorial on Inference and Learning in Bayesian Networks
- Irina Rish and Moninder Singh
- IBM T.J. Watson Research Center
- rish,moninder_at_us.ibm.com
2 Road map
- Introduction to Bayesian networks
- What are BNs: representation, types, etc.
- Why use BNs: applications (classes) of BNs
- Information sources, software, etc.
- Probabilistic inference
- Exact inference
- Approximate inference
- Learning Bayesian Networks
- Learning parameters
- Learning graph structure
- Summary
3 Bayesian Networks
P(A), P(S), P(T|A), P(L|S), P(B|S), P(C|T,L), P(D|T,L,B)
P(A, S, T, L, B, C, D) = P(A) P(S) P(T|A) P(L|S) P(B|S) P(C|T,L) P(D|T,L,B)
[Lauritzen & Spiegelhalter, 95]
4 Bayesian Networks
- Structured, graphical representation of probabilistic relationships between several random variables
- Explicit representation of conditional independencies
- Missing arcs encode conditional independence
- Efficient representation of the joint pdf
- Allows arbitrary queries to be answered:
P(lung cancer = yes | smoking = no, dyspnoea = yes) = ?
5 Example: Printer Troubleshooting (Microsoft Windows 95)
[Heckerman, 95]
6 Example: Microsoft Pregnancy and Child Care
[Heckerman, 95]
7 Example: Microsoft Pregnancy and Child Care
[Heckerman, 95]
8 Independence Assumptions
9 Independence Assumptions
- Nodes X and Y are d-connected by nodes in Z along a trail from X to Y if:
- every head-to-head node along the trail is in Z or has a descendant in Z, and
- every other node along the trail is not in Z
- Nodes X and Y are d-separated by nodes in Z if they are not d-connected by Z along any trail from X to Y
- If X and Y are d-separated by Z, then X and Y are conditionally independent given Z
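These rules can be checked by brute force on a small DAG: enumerate every undirected trail between X and Y and test each intermediate node against the head-to-head condition. A minimal sketch (not from the tutorial; the graph is the slides' Asia-style network, and trail enumeration only suits tiny graphs):

```python
def descendants(dag, v):
    """All descendants of v; dag maps each node to the set of its parents."""
    children = {n: set() for n in dag}
    for n, ps in dag.items():
        for p in ps:
            children[p].add(n)
    seen, stack = set(), [v]
    while stack:
        for c in children[stack.pop()]:
            if c not in seen:
                seen.add(c)
                stack.append(c)
    return seen

def d_separated(dag, x, y, z):
    """True iff x and y are d-separated by the set z (brute-force trail check)."""
    adj = {n: set(ps) for n, ps in dag.items()}        # undirected adjacency
    for n, ps in dag.items():
        for p in ps:
            adj[p].add(n)

    def trails(path):                                  # all simple trails x..y
        if path[-1] == y:
            yield path
            return
        for nxt in adj[path[-1]]:
            if nxt not in path:
                yield from trails(path + [nxt])

    for trail in trails([x]):
        blocked = False
        for i in range(1, len(trail) - 1):
            a, v, b = trail[i - 1], trail[i], trail[i + 1]
            head_to_head = a in dag[v] and b in dag[v]
            if head_to_head:
                # a head-to-head node blocks unless it (or a descendant) is in z
                if v not in z and not (descendants(dag, v) & z):
                    blocked = True
                    break
            elif v in z:                               # non-collider in z blocks
                blocked = True
                break
        if not blocked:
            return False                               # active trail: d-connected
    return True

# Asia-style network from the slides: T|A, L|S, B|S, C|T,L, D|T,L,B
asia = {'A': set(), 'S': set(), 'T': {'A'}, 'L': {'S'},
        'B': {'S'}, 'C': {'T', 'L'}, 'D': {'T', 'L', 'B'}}
```

For example, T and S are d-separated given nothing, but conditioning on the collider's descendant D d-connects them (explaining away).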
10 Independence Assumptions
- A variable (node) is conditionally independent of its non-descendants given its parents
(Figure: the Asia network; nodes Visit to Asia, Smoking, Tuberculosis, Lung Cancer, Bronchitis, Chest X-ray, Dyspnoea)
11 Independence Assumptions
Cancer is independent of Diet given Exposure to Toxins and Smoking
[Breese & Koller, 97]
12 Independence Assumptions
This means the joint pdf can be represented as a product of local distributions:
P(A,S,T,L,B,C,D) = P(A) P(S|A) P(T|A,S) P(L|A,S,T) P(B|A,S,T,L) P(C|A,S,T,L,B) P(D|A,S,T,L,B,C)
                 = P(A) P(S) P(T|A) P(L|S) P(B|S) P(C|T,L) P(D|T,L,B)
13 Independence Assumptions
Thus, the general product rule for Bayesian networks is:
P(X1, X2, ..., Xn) = ∏_{i=1}^{n} P(Xi | Pa(Xi))
where Pa(Xi) is the set of parents of Xi
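The product rule can be verified numerically on a small fragment of the network. A sketch over four binary variables (A → T, S → L); all CPT entries below are made up for illustration:

```python
from itertools import product

# Illustrative CPTs for a fragment of the Asia network: A -> T, S -> L.
p_a = {1: 0.01, 0: 0.99}                       # P(A)
p_s = {1: 0.50, 0: 0.50}                       # P(S)
p_t = {(1, 1): 0.05, (0, 1): 0.95,             # P(T=t | A=a), keyed by (t, a)
       (1, 0): 0.01, (0, 0): 0.99}
p_l = {(1, 1): 0.10, (0, 1): 0.90,             # P(L=l | S=s), keyed by (l, s)
       (1, 0): 0.01, (0, 0): 0.99}

def joint(a, s, t, l):
    """General product rule: P(A,S,T,L) = P(A) P(S) P(T|A) P(L|S)."""
    return p_a[a] * p_s[s] * p_t[(t, a)] * p_l[(l, s)]

# A valid joint distribution must sum to 1 over all assignments.
total = sum(joint(a, s, t, l) for a, s, t, l in product([0, 1], repeat=4))
assert abs(total - 1.0) < 1e-12
```

Marginals follow by summing the factored joint, e.g. P(T=1) = 0.01·0.05 + 0.99·0.01 = 0.0104.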
14 The Knowledge Acquisition Task
- Variables
- collectively exhaustive, mutually exclusive values
- clarity test: each value should be knowable in principle
- Structure
- if data is available, it can be learned
- or constructed by hand (using expert knowledge)
- variable ordering matters: causal knowledge usually simplifies it
- Probabilities
- can be learned from data
- the second decimal usually does not matter; relative probabilities do
- sensitivity analysis
15 The Knowledge Acquisition Task
16 The Knowledge Acquisition Task
Naive Bayesian Classifiers [Duda & Hart; Langley 92], Selective Naive Bayesian Classifiers [Langley & Sage 94], Conditional Trees [Geiger 92; Friedman et al 97]
17 The Knowledge Acquisition Task
Selective Bayesian Networks [Singh & Provan, 95, 96]
18 What are BNs useful for?
- Diagnosis: P(cause | symptom) = ?
- Prediction: P(symptom | cause) = ?
- Decision-making (given a cost function)
- Data mining: induce the best model from data
19 What are BNs useful for?
(Figure: Cause and Effect nodes; predictive inference runs from cause to effect; decision making maximizes expected utility)
20 What are BNs useful for?
Value of Information
(Figure: troubleshooting loop. Salient observations yield an assignment of belief over Fault 1, Fault 2, Fault 3, ...; halt? If yes, act now; if no, pick the next best observation by value of information and repeat with the new observation.)
21 Why use BNs?
- Explicit management of uncertainty
- Modularity implies maintainability
- Better, flexible, and robust decision making: MEU, VOI
- Can be used to answer arbitrary queries, e.g., multiple-fault problems
- Easy to incorporate prior knowledge
- Easy to understand
22 Application Examples
- Intellipath
- commercial version of Pathfinder
- 60 lymph-node diseases, 100 findings
- APRI system, developed at AT&T Bell Labs
- learns and uses Bayesian networks from data to identify customers liable to default on bill payments
- NASA Vista system
- predicts failures in propulsion systems
- considers time criticality; suggests the highest-utility action
- dynamically decides what information to show
23 Application Examples
- Answer Wizard in MS Office 95 / MS Project
- Bayesian-network-based free-text help facility
- uses naive Bayesian classifiers
- Office Assistant in MS Office 97
- extension of the Answer Wizard
- uses naive Bayesian networks
- help based on past experience (keyboard/mouse use) and the task the user is currently doing
- this is the smiley face you get in your MS Office applications
24 Application Examples
- Microsoft Pregnancy and Child-Care
- available on MSN in the Health section
- frequently occurring children's symptoms are linked to expert modules that repeatedly ask parents relevant questions
- asks the next best question based on the information provided
- presents articles that are deemed relevant based on the information provided
25 Application Examples
- Printer troubleshooting
- HP bought a 40% stake in HUGIN; developing printer troubleshooters for HP printers
- Microsoft has 70 online troubleshooters on their web site
- use Bayesian networks: multiple-fault models, incorporate utilities
- Fax machine troubleshooting
- Ricoh uses Bayesian-network-based troubleshooters at call centers
- enabled Ricoh to answer twice the number of calls in half the time
26 Application Examples
27 Application Examples
28 Application Examples
29 Online/print resources on BNs
- Conferences & Journals
- UAI, ICML, AAAI, AISTAT, KDD
- MLJ, DMKD, JAIR, IEEE KDD, IJAR, IEEE PAMI
- Books and Papers
- Bayesian Networks without Tears by Eugene Charniak. AI Magazine, Winter 1991.
- Probabilistic Reasoning in Intelligent Systems by Judea Pearl. Morgan Kaufmann, 1988.
- Probabilistic Reasoning in Expert Systems by Richard Neapolitan. Wiley, 1990.
- CACM special issue on real-world applications of BNs, March 1995
30 Online/Print Resources on BNs
- Wealth of online information at www.auai.org. Links to:
- electronic proceedings of the UAI conferences
- other sites with information on BNs and reasoning under uncertainty
- several tutorials and important articles
- research groups and companies working in this area
- other societies, mailing lists, and conferences
31 Publicly available s/w for BNs
- List of BN software maintained by Russell Almond at bayes.stat.washington.edu/almond/belief.html
- several free packages: generally research only
- commercial packages: the most powerful (and expensive) is HUGIN; others include Netica and Dxpress
- we are working on developing a Java-based BN toolkit here at Watson; it will also work within ABLE
32 Road map
- Introduction to Bayesian networks
- What are BNs: representation, types, etc.
- Why use BNs: applications (classes) of BNs
- Information sources, software, etc.
- Probabilistic inference
- Exact inference
- Approximate inference
- Learning Bayesian Networks
- Learning parameters
- Learning graph structure
- Summary
33 Probabilistic Inference Tasks
- Belief updating
- Finding the most probable explanation (MPE)
- Finding the maximum a posteriori (MAP) hypothesis
- Finding the maximum-expected-utility (MEU) decision
34 Belief Updating
(Figure: network over Smoking, Lung Cancer, Bronchitis, X-ray, Dyspnoea)
P(lung cancer = yes | smoking = no, dyspnoea = yes) = ?
35 Belief updating: P(X | evidence) = ?
(Example: network over A, B, C, D, E; query P(a | e = 0) vs. the prior P(a).)
36 Bucket elimination: Algorithm elim-bel (Dechter, 1996)
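The bucket-elimination idea can be sketched with dictionary-based factors: collect the factors mentioning the eliminated variable into its bucket, multiply them, and sum the variable out. This is a generic illustration of variable elimination, not the elim-bel pseudocode from the slides; evidence handling (restricting factors to observed values) is omitted for brevity, and the CPT numbers are made up:

```python
from itertools import product

# A factor is (vars, table): table maps a tuple of 0/1 values,
# ordered as in vars, to a probability.

def multiply(f, g):
    fv, ft = f
    gv, gt = g
    vs = tuple(dict.fromkeys(fv + gv))            # order-preserving union
    table = {}
    for asg in product([0, 1], repeat=len(vs)):
        env = dict(zip(vs, asg))
        table[asg] = ft[tuple(env[v] for v in fv)] * gt[tuple(env[v] for v in gv)]
    return vs, table

def sum_out(f, var):
    fv, ft = f
    vs = tuple(v for v in fv if v != var)
    table = {}
    for asg, val in ft.items():
        env = dict(zip(fv, asg))
        key = tuple(env[v] for v in vs)
        table[key] = table.get(key, 0.0) + val
    return vs, table

def elim_bel(factors, order):
    """Eliminate variables in `order`; return the normalized belief over the rest."""
    for var in order:
        bucket = [f for f in factors if var in f[0]]     # the bucket of var
        factors = [f for f in factors if var not in f[0]]
        prod = bucket[0]
        for f in bucket[1:]:
            prod = multiply(prod, f)
        factors.append(sum_out(prod, var))               # pass result downstream
    result = factors[0]
    for f in factors[1:]:
        result = multiply(result, f)
    total = sum(result[1].values())
    return {asg: v / total for asg, v in result[1].items()}

# Tiny example: A -> B, query P(B).
p_a = (('A',), {(1,): 0.3, (0,): 0.7})
p_b_given_a = (('B', 'A'), {(1, 1): 0.9, (0, 1): 0.1, (1, 0): 0.2, (0, 0): 0.8})
belief_b = elim_bel([p_a, p_b_given_a], order=['A'])     # P(B=1) = 0.41
```

The complexity is driven by the largest intermediate factor created, which is what the induced-width discussion on the later slides is about.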
37 Finding MPE: Algorithm elim-mpe (Dechter, 1996)
Elimination operator: maximization instead of summation
38 Generating the MPE tuple
39 Complexity of inference
The effect of the ordering
40 Other tasks and algorithms
- MAP and MEU tasks
- similar bucket-elimination algorithms: elim-map, elim-meu (Dechter, 1996)
- elimination operation: either summation or maximization
- restriction on variable ordering: summation must precede maximization (i.e., hypothesis or decision variables are eliminated last)
- Other inference algorithms
- join-tree clustering
- Pearl's poly-tree propagation
- conditioning, etc.
41 Relationship with join-tree clustering
A cluster is a set of buckets (a super-bucket)
(Figure: example clusters ABC, BCE, ADB)
42 Relationship with Pearl's belief propagation in poly-trees
(Figure: causal support and diagnostic support messages)
Pearl's belief propagation for a single-root query = elim-bel using a topological ordering and super-buckets for families.
Elim-bel, elim-mpe, and elim-map are linear for poly-trees.
43Road map
- Introduction to Bayesian networks
- Probabilistic inference
- Exact inference
- Approximate inference
- Learning Bayesian Networks
- Learning parameters
- Learning graph structure
- Summary
44 Inference is NP-hard => approximations
- Approximations
- Local inference
- Stochastic simulations
- Variational approximations
- etc.
45 Local Inference: Idea
46 Bucket-elimination approximation: mini-buckets
- Local inference idea: bound the size of recorded dependencies
- Computation in a bucket is time and space exponential in the number of variables involved
- Therefore, partition the functions in a bucket into mini-buckets over smaller numbers of variables
47 Mini-bucket approximation: MPE task
Split a bucket into mini-buckets => bound complexity
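The reason splitting a bucket yields an upper bound for MPE is elementary: maximizing each mini-bucket separately can only overestimate the joint maximum, since max_x f(x)·g(x) <= (max_x f(x))·(max_x g(x)). A toy numeric check (function values are made up):

```python
# Mini-bucket intuition for MPE: maximizing f and g separately (two
# mini-buckets) upper-bounds maximizing their product (the full bucket).
f = {0: 0.2, 1: 0.9}
g = {0: 0.8, 1: 0.3}

exact = max(f[x] * g[x] for x in (0, 1))       # full bucket:   0.27
upper = max(f.values()) * max(g.values())      # mini-buckets:  0.72
assert exact <= upper
```

The gap between `exact` and `upper` is exactly the accuracy loss the U/L ratio on the next slides measures.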
48 Approx-mpe(i)
- Input: i, the maximum number of variables allowed in a mini-bucket
- Output: a lower bound (the probability of a sub-optimal solution) and an upper bound
Example: approx-mpe(3) versus elim-mpe
49 Properties of approx-mpe(i)
- Complexity: O(exp(2i)) time and O(exp(i)) space
- Accuracy: determined by the upper/lower (U/L) bound ratio
- As i increases, both accuracy and complexity increase
- Possible uses of mini-bucket approximations:
- as anytime algorithms (Dechter and Rish, 1997)
- as heuristics in best-first search (Kask and Dechter, 1999)
- Other tasks: similar mini-bucket approximations for belief updating, MAP, and MEU (Dechter and Rish, 1997)
50Anytime Approximation
51 Empirical Evaluation (Dechter and Rish, 1997; Rish, 1999)
- Randomly generated networks
- uniform random probabilities
- random noisy-OR
- CPCS networks
- Probabilistic decoding
- Comparing approx-mpe and anytime-mpe versus elim-mpe
52 Random networks
- Uniform random: 60 nodes, 90 edges (200 instances)
- In 80% of cases, 10-100 times speed-up while U/L < 2
- Noisy-OR: even better results
- Exact elim-mpe was infeasible; approx-mpe took 0.1 to 80 sec.
53 CPCS networks: medical diagnosis (noisy-OR model)
Test case: no evidence
54 Effect of evidence
More likely evidence => higher MPE => higher accuracy (why?)
Likely evidence versus random (unlikely) evidence
55 Probabilistic decoding
Error-correcting linear block code
State-of-the-art approximate algorithm: iterative belief propagation (IBP) (Pearl's poly-tree algorithm applied to loopy networks)
56 approx-mpe vs. IBP
Bit error rate (BER) as a function of noise (sigma)
57 Mini-buckets: summary
- Mini-buckets: a local inference approximation
- Idea: bound the size of recorded functions
- Approx-mpe(i): a mini-bucket algorithm for MPE
- better results for noisy-OR than for random problems
- accuracy increases with decreasing noise
- accuracy increases for likely evidence
- sparser graphs -> higher accuracy
- Coding networks: approx-mpe outperforms IBP on low-induced-width codes
58Road map
- Introduction to Bayesian networks
- Probabilistic inference
- Exact inference
- Approximate inference
- Local inference
- Stochastic simulations
- Variational approximations
- Learning Bayesian Networks
- Summary
59 Approximation via Sampling
60 Forward Sampling (logic sampling; Henrion, 1988)
61 Forward sampling (example)
Drawback: high rejection rate!
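The rejection-rate drawback is easy to see in code: sample every variable in topological order, then throw away samples that disagree with the evidence. A minimal sketch on a two-node network (CPT numbers are illustrative):

```python
import random

random.seed(0)
# Toy network A -> B; evidence B = 1.
P_A1 = 0.3
P_B1 = {1: 0.9, 0: 0.2}      # P(B=1 | A=a)

def rejection_estimate(n=100_000):
    """Estimate P(A=1 | B=1) by logic (forward) sampling with rejection."""
    accepted = hits = 0
    for _ in range(n):
        a = 1 if random.random() < P_A1 else 0       # sample A from P(A)
        b = 1 if random.random() < P_B1[a] else 0    # then B from P(B|A)
        if b != 1:                                   # evidence mismatch: reject
            continue
        accepted += 1
        hits += a
    return hits / accepted, accepted / n

p_est, accept_rate = rejection_estimate()
# exact answer: P(A=1 | B=1) = 0.3*0.9 / (0.3*0.9 + 0.7*0.2) ≈ 0.659
```

Here only about 41% of samples survive (P(B=1) = 0.41); with many evidence variables or unlikely evidence, almost everything is rejected.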
62 Likelihood Weighting (Fung and Chang, 1990; Shachter and Peot, 1990)
Clamping evidence + forward sampling + weighting samples by evidence likelihood
Works well for likely evidence!
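Likelihood weighting fixes the rejection problem: clamp the evidence variables, forward-sample only the rest, and weight each sample by the likelihood of the clamped evidence. A sketch on the same illustrative two-node network:

```python
import random

random.seed(1)
# Toy network A -> B; evidence B = 1. Probabilities are illustrative.
P_A1 = 0.3
P_B1 = {1: 0.9, 0: 0.2}      # P(B=1 | A=a)

def lw_estimate(n=100_000):
    """Estimate P(A=1 | B=1): clamp B=1, weight each sample by P(B=1|a)."""
    num = den = 0.0
    for _ in range(n):
        a = 1 if random.random() < P_A1 else 0   # forward-sample non-evidence
        w = P_B1[a]                              # weight = evidence likelihood
        num += w * a
        den += w
    return num / den

p_est = lw_estimate()        # every sample is used, unlike rejection sampling
```

No sample is ever discarded, but when the evidence is unlikely most weights are tiny, which is why the slide notes it works well for likely evidence.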
63 Gibbs Sampling (Geman and Geman, 1984)
Markov Chain Monte Carlo (MCMC): create a Markov chain of samples
Advantage: guaranteed to converge to P(X)
Disadvantage: convergence may be slow
64 Gibbs Sampling (cont'd) (Pearl, 1988)
Markov blanket
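In a Bayesian network each Gibbs step resamples one variable from its conditional given only its Markov blanket (parents, children, and children's co-parents). A sketch on an illustrative chain A -> B -> C with evidence C = 1 (all probabilities made up):

```python
import random

random.seed(0)
P_A1 = 0.3
P_B1_A = {1: 0.9, 0: 0.2}    # P(B=1 | A=a)
P_C1_B = {1: 0.8, 0: 0.1}    # P(C=1 | B=b)

def gibbs(n=200_000, burn=1_000):
    """Estimate P(A=1 | C=1) by Gibbs sampling over the free variables A, B."""
    a, b = 1, 1              # arbitrary initial state; evidence C=1 stays fixed
    count = total = 0
    for t in range(n):
        # resample A given its Markov blanket {B}: P(a | b) ∝ P(a) P(b | a)
        w1 = P_A1 * (P_B1_A[1] if b else 1 - P_B1_A[1])
        w0 = (1 - P_A1) * (P_B1_A[0] if b else 1 - P_B1_A[0])
        a = 1 if random.random() < w1 / (w1 + w0) else 0
        # resample B given its blanket {A, C=1}: P(b | a, c=1) ∝ P(b|a) P(c=1|b)
        v1 = P_B1_A[a] * P_C1_B[1]
        v0 = (1 - P_B1_A[a]) * P_C1_B[0]
        b = 1 if random.random() < v1 / (v1 + v0) else 0
        if t >= burn:        # discard burn-in before averaging
            count += a
            total += 1
    return count / total

p_gibbs = gibbs()
```

The exact posterior here is P(A=1 | C=1) = 0.219 / 0.387 ≈ 0.566; the chain converges to it, but successive samples are correlated, which is the slow-convergence caveat on the slide.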
65Road map
- Introduction to Bayesian networks
- Probabilistic inference
- Exact inference
- Approximate inference
- Local inference
- Stochastic simulations
- Variational approximations
- Learning Bayesian Networks
- Summary
66 Variational Approximations
- Idea: a variational transformation of the CPDs simplifies inference
- Advantages:
- compute upper and lower bounds on P(Y)
- usually faster than sampling techniques
- Disadvantages:
- more complex and less general: must be derived for each particular form of CPD function
67 Variational bounds: example log(x)
This approach can be generalized to any concave (convex) function in order to compute its upper (lower) bounds: convex duality (Jaakkola and Jordan, 1997)
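The standard conjugate bound for the concave log function is log(x) <= λx - log(λ) - 1 for any λ > 0, tight at λ = 1/x; this is the kind of linear upper bound the convex-duality machinery produces. A quick numeric check:

```python
import math

# Variational upper bound on log(x): for any lam > 0,
#   log(x) <= lam * x - log(lam) - 1,  with equality at lam = 1/x.
def upper(x, lam):
    return lam * x - math.log(lam) - 1.0

for x in (0.1, 1.0, 3.7):
    for lam in (0.05, 0.5, 1.0, 2.0):
        assert math.log(x) <= upper(x, lam) + 1e-12   # valid for every lam
    # optimizing the variational parameter recovers log(x) exactly
    assert abs(upper(x, 1.0 / x) - math.log(x)) < 1e-12
```

Replacing a troublesome log term by this linear-in-x bound, and later optimizing over λ, is exactly the transformation used for QMR-DT on the following slides.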
68 Convex duality (Jaakkola and Jordan, 1997)
69 Example: QMR-DT network (Quick Medical Reference, Decision-Theoretic; Shwe et al., 1991)
600 diseases
4000 findings
Noisy-OR model
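The noisy-OR CPD used in QMR-DT-style networks assumes each present disease independently fails to cause the finding. A minimal sketch (the q values and leak term are made up):

```python
# Noisy-OR CPD: each present parent i independently fails to trigger the
# finding with probability (1 - q_i); an optional leak covers "no cause".
def noisy_or(q, parents_on, leak=0.0):
    """P(finding=1 | parent states); q[i] = P(parent i alone causes the finding)."""
    p_off = 1.0 - leak
    for qi, on in zip(q, parents_on):
        if on:
            p_off *= 1.0 - qi
    return 1.0 - p_off

p = noisy_or([0.8, 0.5], [1, 1])    # 1 - 0.2*0.5 = 0.9
```

The key computational property is that P(finding = 0 | diseases) factorizes over the parents, which is why negative findings are cheap and only positive findings couple the disease nodes (next slide).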
70 Inference in QMR-DT
Negative evidence remains factorized; positive evidence couples the disease nodes
Inference complexity: O(exp(min{p, k})), where p = number of positive findings and k = max family size (Heckerman, 1989 (Quickscore); Rish and Dechter, 1998)
71 Variational approach to QMR-DT (Jaakkola and Jordan, 1997)
The effect of positive evidence is now factorized (diseases are decoupled)
72 Variational approximations
- Bounds on local CPDs yield a bound on the posterior
- Two approaches: sequential and block
- Sequential: applies the variational transformation to (a subset of) nodes sequentially during inference, using a heuristic node ordering, then optimizes over the variational parameters
- Block: selects the nodes to be transformed in advance, then selects variational parameters minimizing the KL-distance between the true and approximate posteriors
73 Block approach
74 Inference in BNs: summary
- Exact inference is often intractable => need approximations
- Approximation principles:
- approximating elimination: local inference, bounding the size of dependencies among variables (cliques in the problem's graph); mini-buckets, IBP
- other approximations: stochastic simulations, variational techniques, etc.
- Further research:
- combining orthogonal approximation approaches
- better understanding of what works well where: which approximation suits which problem structure
- other approximation paradigms (e.g., other ways of approximating probabilities, constraints, cost functions)
75Road map
- Introduction to Bayesian networks
- Probabilistic inference
- Exact inference
- Approximate inference
- Learning Bayesian Networks
- Learning parameters
- Learning graph structure
- Summary
76 Why learn Bayesian networks?
- Efficient representation and inference
- Handling missing data: <1.3, 2.8, ??, 0, 1>
77 Learning Bayesian Networks
78 Learning Parameters: complete data
79 Learning graph structure
- Complete data: local computations
- Incomplete data (score non-decomposable): stochastic methods
- Constraint-based methods: data impose independence relations (constraints)
80 Learning BNs: incomplete data
- Learning parameters
- EM algorithm [Lauritzen, 95]
- Gibbs sampling [Heckerman, 96]
- gradient descent [Russell et al., 96]
- Learning both structure and parameters
- sum over missing values [Cooper & Herskovits, 92; Cooper, 95]
- Monte-Carlo approaches [Heckerman, 96]
- Gaussian approximation [Heckerman, 96]
- Structural EM [Friedman, 98]
- EM and Multiple Imputation [Singh 97, 98, 00]
81 Learning Parameters: incomplete data
EM algorithm: iterate until convergence
82 Learning Parameters: incomplete data (Lauritzen, 95)
- Complete-data log-likelihood: log P(Y | theta) = sum_{ijk} N_ijk log theta_ijk
- E step:
- compute E(N_ijk | Y_obs, theta)
- M step:
- set theta_ijk = E(N_ijk | Y_obs, theta) / E(N_ij | Y_obs, theta)
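The E/M steps on this slide reduce, in the simplest possible case (a single Bernoulli parameter with some values missing completely at random), to "expected count, then normalize". A toy sketch with made-up data:

```python
# EM for a single Bernoulli parameter theta = P(X=1), with missing
# observations marked None. E step: expected count of X=1 under the
# current theta; M step: theta = E(N_1) / N.
data = [1, 1, 0, None, 1, None, 0, 1]

theta = 0.5                        # initial guess
for _ in range(100):               # iterate until convergence
    # E step: missing entries contribute their expectation, theta
    exp_n1 = sum(theta if x is None else x for x in data)
    # M step: re-estimate theta from the expected counts
    new_theta = exp_n1 / len(data)
    if abs(new_theta - theta) < 1e-12:
        break
    theta = new_theta
```

Here EM converges to the observed-data maximum-likelihood estimate 4/6 (four ones and two zeros observed), illustrating that each iteration increases the observed-data likelihood.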
83 Learning structure: incomplete data
- Depends on the type of missing data: missing independent of anything else (MCAR) or missing based on the values of other variables (MAR)
- While MCAR can be handled by decomposable scores, MAR cannot
- For likelihood-based methods, there is no need to explicitly model the missing-data mechanism
- Very few attempts at MAR: stochastic methods
84 Learning structure: incomplete data
- Approximate EM by using Multiple Imputation to yield an efficient Monte-Carlo method [Singh 97, 98, 00]
- trade-off between performance and quality
- learned network is almost optimal
- approximates the complete-data log-likelihood function using Multiple Imputation
- yields a decomposable score, dependent only on each node and its parents
- converges to a local maximum of the observed-data likelihood
85 Learning structure: incomplete data
86 Scoring functions: Minimum Description Length (MDL)
- Learning as data compression: minimize DL(Model) + DL(Data | Model)
- Other scores: MDL = -BIC (Bayesian Information Criterion)
- Bayesian score (BDe): asymptotically equivalent to MDL
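Under the usual BIC approximation, the two description-length terms become (k/2) log N for the model's parameters and the negative log-likelihood for the data. A toy comparison of two candidate structures (the log-likelihood values are made up for illustration):

```python
import math

# MDL score of a candidate structure: DL(Model) + DL(Data | Model),
# approximated as (k/2) log N - log-likelihood (smaller is better).
def mdl(loglik, num_params, n_cases):
    return 0.5 * num_params * math.log(n_cases) - loglik

# Candidate 1: X independent of its potential parent (1 free parameter).
# Candidate 2: X depends on the parent (2 free parameters).
score_indep = mdl(loglik=-69.3, num_params=1, n_cases=100)
score_dep   = mdl(loglik=-60.0, num_params=2, n_cases=100)
```

With these numbers the dependent model wins: its extra parameter costs (1/2) log 100 ≈ 2.3 nats but buys 9.3 nats of likelihood, showing how MDL trades model complexity against fit.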
87 Learning Structure plus Parameters
The number of models is super-exponential. Alternatives: model selection or model averaging.
88 Model Selection
Generally, choose a single model M. This is equivalent to saying P(M | D) = 1.
The task is now to: 1) define a metric to decide which model is best; 2) search for that model through the space of all models.
89 One Reasonable Score: Posterior Probability of a Structure
P(S | D) ∝ P(S) ∫ P(D | θ, S) P(θ | S) dθ
(structure prior × likelihood × parameter prior)
90 Global and Local Predictive Scores [Spiegelhalter et al., 93]
Global predictive score (log Bayes factor, prequential form):
log p(D | S^h) = Σ_{l=1}^{m} log p(x_l | x_1, ..., x_{l-1}, S^h)
Local predictive score:
log p(x_1, x_2, x_3, ... | S^h) = log p(x_1 | S^h) + log p(x_2 | x_1, S^h) + log p(x_3 | x_1, x_2, S^h) + ...
Local is useful for diagnostic problems
91 Local Predictive Score (Spiegelhalter et al., 1993)
92 Exact computation of p(D | S^h)
Assumptions [Cooper & Herskovits, 92]:
- no missing data
- cases are independent, given the model
- uniform priors on the parameters
- discrete variables
93 Bayesian Dirichlet Score (Cooper and Herskovits, 1991)
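Under the assumptions on the previous slide (complete data, independent cases, uniform Dirichlet priors, discrete variables), the per-node contribution to the Cooper-Herskovits (K2) score has a closed form; a sketch using log-gamma for numerical stability (the counts below are made up):

```python
import math

# Log of the per-node Cooper-Herskovits (K2) marginal likelihood with
# uniform Dirichlet priors:
#   log prod_j [ (r-1)! / (N_j + r - 1)!  *  prod_k N_jk! ]
# where r = number of values of the node, j ranges over parent
# configurations, N_jk are counts, and N_j = sum_k N_jk.
def log_k2_node(counts, r):
    """counts: list over parent configs, each a list of r counts N_jk."""
    score = 0.0
    for njk in counts:
        nj = sum(njk)
        score += math.lgamma(r) - math.lgamma(nj + r)    # (r-1)! / (N_j+r-1)!
        score += sum(math.lgamma(n + 1) for n in njk)    # prod_k N_jk!
    return score

# Binary node, single parent configuration, counts N_jk = [3, 1]:
# exact value is 1!/5! * 3! * 1! = 6/120 = 0.05.
s = log_k2_node([[3, 1]], r=2)
```

Summing this quantity over all nodes gives log p(D | S^h); because it decomposes per family, local search over structures only needs to rescore the families that change.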
94 Learning BNs without specifying an ordering
- n! orderings; the ordering greatly affects the quality of the network learned
- use conditional independence tests and d-separation to get an ordering
[Singh & Valtorta, 95]
95 Learning BNs via the MDL principle
- Idea: the best model is the one that gives the most compact representation of the data
- So, encode the data using the model, plus encode the model itself; minimize this total
[Lam & Bacchus, 93]
96 Learning BNs: summary
- Bayesian networks: graphical probabilistic models
- efficient representation and inference
- combine expert knowledge with learning from data
- Learning:
- parameters (parameter estimation, EM)
- structure (optimization with score functions, e.g., MDL)
- Applications/systems: collaborative filtering (MSBN), fraud detection (AT&T), classification (AutoClass (NASA), TAN-BLT (SRI))
- Future directions: causality, time, model-evaluation criteria, approximate inference/learning, online learning, etc.