# A Tutorial on Inference and Learning in Bayesian Networks - PowerPoint PPT Presentation


## A Tutorial on Inference and Learning in Bayesian Networks

Slides: 97

1
A Tutorial on Inference and Learning in
Bayesian Networks
• Irina Rish and Moninder Singh
• IBM T.J. Watson Research Center
• {rish, moninder}@us.ibm.com

2
• Introduction: Bayesian networks
• What are BNs: representation, types, etc.
• Why use BNs: applications (classes) of BNs
• Information sources, software, etc.
• Probabilistic inference
• Exact inference
• Approximate inference
• Learning Bayesian Networks
• Learning parameters
• Learning graph structure
• Summary

3
Bayesian Networks
P(A), P(S), P(T | A), P(L | S), P(B | S),
P(C | T,L), P(D | T,L,B)
P(A, S, T, L, B, C, D)
[Lauritzen & Spiegelhalter, 95]
4
Bayesian Networks
• Structured, graphical representation of
probabilistic relationships between several
random variables
• Explicit representation of conditional
independencies
• Missing arcs encode conditional independence
• Efficient representation of joint pdf
• Allows arbitrary queries to be answered

P(lung cancer = yes | smoking = no, dyspnoea = yes) = ?
5
Example: Printer Troubleshooting (Microsoft
Windows 95)
[Heckerman, 95]
6
Example: Microsoft Pregnancy and Child Care
[Heckerman, 95]
7
Example: Microsoft Pregnancy and Child Care
[Heckerman, 95]
8
Independence Assumptions
9
Independence Assumptions
• Nodes X and Y are d-connected by nodes in Z along
a trail from X to Y if
• every head-to-head node along the trail is in Z
or has a descendant in Z
• every other node along the trail is not in Z
• Nodes X and Y are d-separated by nodes in Z if
they are not d-connected by Z along any trail
from X to Y
• If nodes X and Y are d-separated by Z, then X and
Y are conditionally independent given Z
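
The d-separation test above can be sketched in code. The version below does not enumerate trails; it uses the equivalent moralized-ancestral-graph criterion, which is a substitution made here for brevity. The parent sets are read off the factorization of the lung-disease example used in these slides; the function and variable names are illustrative only.

```python
from itertools import combinations

# Parent sets of the example network (A: visit to Asia, S: smoking, T: tuberculosis,
# L: lung cancer, B: bronchitis, C: chest X-ray, D: dyspnoea).
PARENTS = {"A": [], "S": [], "T": ["A"], "L": ["S"], "B": ["S"],
           "C": ["T", "L"], "D": ["T", "L", "B"]}

def ancestral_set(nodes, parents):
    """Return the given nodes together with all of their ancestors."""
    seen, stack = set(nodes), list(nodes)
    while stack:
        for p in parents[stack.pop()]:
            if p not in seen:
                seen.add(p)
                stack.append(p)
    return seen

def d_separated(x, y, z, parents):
    """True if x and y are d-separated by the set z.

    Criterion: restrict the graph to the ancestors of {x, y} and z, connect
    ("marry") all parents of a common child, drop edge directions, remove z,
    and check that x can no longer reach y.
    """
    keep = ancestral_set({x, y} | set(z), parents)
    adj = {n: set() for n in keep}
    for child in keep:
        for p in parents[child]:
            adj[child].add(p)
            adj[p].add(child)
        for p, q in combinations(parents[child], 2):
            adj[p].add(q)
            adj[q].add(p)
    blocked, seen, stack = set(z), {x}, [x]
    while stack:
        node = stack.pop()
        if node == y:
            return False
        for m in adj[node]:
            if m not in seen and m not in blocked:
                seen.add(m)
                stack.append(m)
    return True
```

For instance, `d_separated("A", "S", [], PARENTS)` holds, while conditioning on the common descendant `D` connects `A` and `S` again.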

10
Independence Assumptions
• A variable (node) is conditionally independent of
its non-descendants given its parents

[Figure: network over the nodes Visit to Asia, Smoking, Tuberculosis, Lung Cancer, Bronchitis, Chest X-ray, Dyspnoea]
11
Independence Assumptions
Cancer is independent of Diet given Exposure to
Toxins and Smoking
[Breese & Koller, 97]
12
Independence Assumptions
What this means is that the joint pdf can be
represented as a product of local distributions:
$$P(A,S,T,L,B,C,D) = P(A) \, P(S \mid A) \, P(T \mid A,S) \, P(L \mid A,S,T) \, P(B \mid A,S,T,L) \, P(C \mid A,S,T,L,B) \, P(D \mid A,S,T,L,B,C)$$
$$= P(A) \, P(S) \, P(T \mid A) \, P(L \mid S) \, P(B \mid S) \, P(C \mid T,L) \, P(D \mid T,L,B)$$
13
Independence Assumptions
Thus, the general product rule for Bayesian
networks is
$$P(X_1, X_2, \ldots, X_n) = \prod_{i=1}^{n} P(X_i \mid Pa(X_i))$$
where $Pa(X_i)$ is the set of parents of $X_i$
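
As a rough illustration of the product rule, the sketch below builds the joint distribution of the seven-variable example from local distributions and compares parameter counts. The parent sets follow the factorization in the slides; every probability value in `p_true` is invented for illustration and is not taken from the tutorial.

```python
from itertools import product

# Parent sets read off the factorization of the example network.
PARENTS = {"A": (), "S": (), "T": ("A",), "L": ("S",), "B": ("S",),
           "C": ("T", "L"), "D": ("T", "L", "B")}

def p_true(var, parent_values):
    """Hypothetical P(var = True | parents); the numbers are made up."""
    base = {"A": 0.01, "S": 0.5, "T": 0.01, "L": 0.01,
            "B": 0.3, "C": 0.05, "D": 0.1}[var]
    return min(0.95, base + 0.4 * sum(parent_values))

def joint(assignment):
    """Product rule: multiply P(X_i | Pa(X_i)) over all variables."""
    prob = 1.0
    for var, pars in PARENTS.items():
        p = p_true(var, [assignment[q] for q in pars])
        prob *= p if assignment[var] else 1.0 - p
    return prob

names = list(PARENTS)
# Summing the factored joint over all 2^7 assignments should give 1.
total = sum(joint(dict(zip(names, vals)))
            for vals in product([False, True], repeat=len(names)))

# Free parameters: one per parent configuration for each binary variable,
# versus 2^n - 1 for an unstructured joint table.
n_factored = sum(2 ** len(pars) for pars in PARENTS.values())
n_full = 2 ** len(names) - 1
```

With binary variables the factored form needs 20 numbers where the full joint table needs 127, which is the sense in which the representation is efficient.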
14
• Variables
• collectively exhaustive, mutually exclusive
values
• clarity test: value should be knowable in
principle
• Structure
• if data available, can be learned
• constructed by hand (using expert knowledge)
• variable ordering matters: causal knowledge
usually simplifies
• Probabilities
• can be learned from data
• second decimal usually does not matter; relative
probs
• sensitivity analysis

15
16
Naive Bayesian Classifiers [Duda & Hart; Langley 92]
Selective Naive Bayesian Classifiers [Langley & Sage 94]
Conditional Trees [Geiger 92; Friedman et al 97]
17
Selective Bayesian Networks [Singh & Provan, 95; 96]
18
What are BNs useful for?
• Diagnosis: P(cause | symptom) = ?
• Prediction: P(symptom | cause) = ?
• Decision-making (given a cost function)
• Data mining: induce best model from data

19
What are BNs useful for?
Cause
Decision Making - Max. Expected Utility
Predictive Inference
Effect
20
What are BNs useful for?
Value of Information
Salient Observations
Fault 1 Fault 2 Fault 3 . . .
Assignment of Belief
New Obs.
Act Now!
Halt?
Yes
No
Next Best Observation (Value of Information)
21
Why use BNs?
• Explicit management of uncertainty
• Modularity implies maintainability
• Better, flexible and robust decision making -
MEU, VOI
• Can be used to answer arbitrary queries -
multiple fault problems
• Easy to incorporate prior knowledge
• Easy to understand

22
Application Examples
• Intellipath
• commercial version of Pathfinder
• lymph-node diseases (60), 100 findings
• APRI system developed at AT&T Bell Labs
• learns & uses Bayesian networks from data to
identify customers liable to default on bill
payments
• NASA Vista system
• predict failures in propulsion systems
• considers time criticality & suggests highest
utility action
• dynamically decide what information to show

23
Application Examples
• Answer Wizard in MS Office 95/ MS Project
• Bayesian network based free-text help facility
• uses naive Bayesian classifiers
• Office Assistant in MS Office 97
• uses naïve Bayesian networks
• help based on past experience (keyboard/mouse
use) and task user is doing currently
• This is the smiley face you get in your MS
Office applications

24
Application Examples
• Microsoft Pregnancy and Child-Care
• Available on MSN in Health section
• Frequently occurring children's symptoms are
parents relevant questions
• Asks next best question based on provided
information
• Presents articles that are deemed relevant based
on information provided

25
Application Examples
• Printer troubleshooting
• HP bought 40% stake in HUGIN. Developing printer
troubleshooters for HP printers
• Microsoft has 70 online troubleshooters on their
web site
• use Bayesian networks - multiple faults models,
incorporate utilities
• Fax machine troubleshooting
• Ricoh uses Bayesian network based troubleshooters
at call centers
• Enabled Ricoh to answer twice the number of calls
in half the time

26
Application Examples
27
Application Examples
28
Application Examples
29
Online/print resources on BNs
• Conferences & Journals
• UAI, ICML, AAAI, AISTAT, KDD
• MLJ, DMKD, JAIR, IEEE KDD, IJAR, IEEE PAMI
• Books and Papers
• "Bayesian Networks without Tears" by Eugene
Charniak. AI Magazine, Winter 1991.
• "Probabilistic Reasoning in Intelligent Systems" by
Judea Pearl. Morgan Kaufmann, 1988.
• "Probabilistic Reasoning in Expert Systems" by
Richard Neapolitan. Wiley, 1990.
• CACM special issue on real-world applications of
BNs, March 1995

30
Online/Print Resources on BNs
• Wealth of online information at www.auai.org
• Electronic proceedings for UAI conferences
• Other sites with information on BNs and reasoning
under uncertainty
• Several tutorials and important articles
• Research groups & companies working in this area
• Other societies, mailing lists and conferences

31
Publicly available s/w for BNs
• List of BN software maintained by Russell Almond
at bayes.stat.washington.edu/almond/belief.html
• several free packages: generally research only
• commercial packages: most powerful (& expensive)
is HUGIN; others include Netica and Dxpress
• we are working on developing a Java based BN
toolkit here at Watson - will also work within
ABLE

32
• Introduction: Bayesian networks
• What are BNs: representation, types, etc.
• Why use BNs: applications (classes) of BNs
• Information sources, software, etc.
• Probabilistic inference
• Exact inference
• Approximate inference
• Learning Bayesian Networks
• Learning parameters
• Learning graph structure
• Summary

33
• Belief updating
• Finding most probable explanation (MPE)
• Finding maximum a-posteriori (MAP) hypothesis
• Finding maximum-expected-utility (MEU) decision

34
Belief Updating
Smoking
lung Cancer
Bronchitis
X-ray
Dyspnoea
P(lung cancer = yes | smoking = no, dyspnoea = yes) = ?
35
Belief updating: P(X | evidence) = ?
P(a | e = 0)
B
C
E
D
P(a)
36
Bucket elimination: Algorithm elim-bel (Dechter
1996)
37
Finding the MPE: Algorithm elim-mpe (Dechter 1996)
Elimination operator
38
Generating the MPE-tuple
39
Complexity of inference
The effect of the ordering
40
• Similar bucket-elimination algorithms - elim-map,
elim-meu (Dechter 1996)
• Elimination operation: either summation or
maximization
• Restriction on variable ordering: summation must
precede maximization (i.e. hypothesis or decision
variables are eliminated last)
• Other inference algorithms
• Join-tree clustering
• Pearl's poly-tree propagation
• Conditioning, etc.
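
A minimal sketch of the bucket-elimination idea, in which the same loop serves belief updating and MPE by swapping the elimination operator between summation and maximization. This is not Dechter's published pseudocode; the factor representation, the function names, and the three-node chain with its probabilities are assumptions made for illustration.

```python
from itertools import product

def multiply(factors):
    """Pointwise product of factors; a factor is (variables, table) over binary variables."""
    scope = sorted({v for vs, _ in factors for v in vs})
    table = {}
    for vals in product([0, 1], repeat=len(scope)):
        env = dict(zip(scope, vals))
        p = 1.0
        for vs, t in factors:
            p *= t[tuple(env[v] for v in vs)]
        table[vals] = p
    return tuple(scope), table

def eliminate(factor, var, op):
    """Remove `var` from a factor using `op`: sum (elim-bel style) or max (elim-mpe style)."""
    vs, t = factor
    keep = tuple(v for v in vs if v != var)
    groups = {}
    for vals, p in t.items():
        key = tuple(x for v, x in zip(vs, vals) if v != var)
        groups.setdefault(key, []).append(p)
    return keep, {k: op(ps) for k, ps in groups.items()}

def bucket_elimination(factors, order, op):
    """For each variable in `order`: collect the factors that mention it (its bucket),
    multiply them, eliminate the variable, and pass the result on."""
    factors = list(factors)
    for var in order:
        bucket = [f for f in factors if var in f[0]]
        factors = [f for f in factors if var not in f[0]]
        if bucket:
            factors.append(eliminate(multiply(bucket), var, op))
    return multiply(factors)

# Hypothetical chain A -> B -> C with evidence C = 1 already clamped.
p_a = (("A",), {(0,): 0.7, (1,): 0.3})
p_b_a = (("A", "B"), {(0, 0): 0.9, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.8})
p_c1_b = (("B",), {(0,): 0.1, (1,): 0.6})  # P(C = 1 | B)

# Belief updating: sum out B, then normalize to get P(A | C = 1).
_, unnorm = bucket_elimination([p_a, p_b_a, p_c1_b], order=["B"], op=sum)
z = sum(unnorm.values())
posterior_a = {a: p / z for (a,), p in unnorm.items()}

# MPE value: maximize out B and A instead of summing.
_, mpe = bucket_elimination([p_a, p_b_a, p_c1_b], order=["B", "A"], op=max)
mpe_value = mpe[()]
```

The size of the largest intermediate table depends on the order in which variables are eliminated, which is why the ordering matters for complexity.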

41
Relationship with join-tree
clustering
BCE
A cluster is a set of buckets (a
super-bucket)
ABC
42
Relationship with Pearl's belief propagation in
poly-trees
Causal support
Diagnostic support
Pearl's belief propagation for
single-root query
elim-bel using topological ordering and
super-buckets for families
Elim-bel, elim-mpe, and elim-map are linear for
poly-trees.
43
• Introduction: Bayesian networks
• Probabilistic inference
• Exact inference
• Approximate inference
• Learning Bayesian Networks
• Learning parameters
• Learning graph structure
• Summary

44
Inference is NP-hard => approximations
• Approximations
• Local inference
• Stochastic simulations
• Variational approximations
• etc.

45
Local Inference Idea
46
Bucket-elimination approximation: mini-buckets
• Local inference idea
• bound the size of recorded dependencies
• Computation in a bucket is time and space
exponential in the number of variables involved
• Therefore, partition functions in a bucket
into mini-buckets on smaller number of variables
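
The effect of splitting a bucket can be checked numerically. The sketch below assumes a bucket for a variable X that holds two functions, f(X, Y) and g(X, Z), filled with arbitrary random values; maximizing each mini-bucket separately never underestimates the exact result, which is where the upper bound in the MPE case comes from.

```python
import random
from itertools import product

rng = random.Random(0)

# Two hypothetical functions in the bucket of X (all variables binary).
f = {(x, y): rng.random() for x, y in product([0, 1], repeat=2)}
g = {(x, z): rng.random() for x, z in product([0, 1], repeat=2)}

# Exact bucket processing: multiply first, then maximize over X.
# The recorded function depends on (Y, Z); three variables are involved.
exact = {(y, z): max(f[x, y] * g[x, z] for x in (0, 1))
         for y, z in product([0, 1], repeat=2)}

# Mini-buckets {f} and {g}: maximize over X separately.
# Each step touches only two variables.
f_max = {y: max(f[x, y] for x in (0, 1)) for y in (0, 1)}
g_max = {z: max(g[x, z] for x in (0, 1)) for z in (0, 1)}
upper = {(y, z): f_max[y] * g_max[z] for y, z in product([0, 1], repeat=2)}

# max_x f*g <= (max_x f) * (max_x g) for non-negative functions.
is_upper_bound = all(upper[k] >= exact[k] for k in exact)
```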

47
Split a bucket into mini-buckets => bound
complexity
48
Approx-mpe(i)
• Input: i, the max number of variables allowed in a
mini-bucket
• Output: lower bound (P of a sub-optimal
solution), upper bound

Example: approx-mpe(3) versus elim-mpe
49
Properties of approx-mpe(i)
• Complexity: O(exp(2i)) time and O(exp(i))
space.
• Accuracy: determined by upper/lower (U/L) bound.
• As i increases, both accuracy and complexity
increase.
• Possible use of mini-bucket approximations
• As anytime algorithms (Dechter and Rish, 1997)
• As heuristics in best-first search (Kask and
Dechter, 1999)
• Other tasks: similar mini-bucket approximations
for belief updating, MAP and MEU (Dechter and
Rish, 1997)

50
Anytime Approximation
51
Empirical Evaluation (Dechter and Rish, 1997;
Rish, 1999)
• Randomly generated networks
• Uniform random probabilities
• Random noisy-OR
• CPCS networks
• Probabilistic decoding
• Comparing approx-mpe and anytime-mpe
versus elim-mpe

52
Random networks
• Uniform random: 60 nodes, 90 edges (200
instances)
• In 80% of cases, 10-100 times speed-up while
U/L < 2
• Noisy-OR: even better results
• Exact elim-mpe was infeasible; approx-mpe took
0.1 to 80 sec.

53
CPCS networks: medical diagnosis (noisy-OR model)
Test case: no evidence
54
Effect of evidence
More likely evidence => higher MPE => higher
accuracy (why?)
Likely evidence versus random (unlikely) evidence
55
Probabilistic decoding
Error-correcting linear block code
State-of-the-art
approximate algorithm: iterative belief
propagation (IBP) (Pearl's poly-tree algorithm
applied to loopy networks)
56
approx-mpe vs. IBP
Bit error rate (BER) as a function of noise
(sigma)
57
Mini-buckets summary
• Mini-buckets: local inference approximation
• Idea: bound size of recorded functions
• Approx-mpe(i) - mini-bucket algorithm for MPE
• Better results for noisy-OR than for random
problems
• Accuracy increases with decreasing noise in
• Accuracy increases for likely evidence
• Sparser graphs => higher accuracy
• Coding networks: approx-mpe outperforms IBP on
low-induced width codes

58
• Introduction: Bayesian networks
• Probabilistic inference
• Exact inference
• Approximate inference
• Local inference
• Stochastic simulations
• Variational approximations
• Learning Bayesian Networks
• Summary

59
Approximation via Sampling
60
Forward Sampling (logic sampling (Henrion, 1988))

61
Forward sampling (example)
Drawback: high rejection rate!
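
A small sketch of forward (logic) sampling with rejection, assuming a four-node fragment of the lung-disease example (Smoking, Lung cancer, Bronchitis, Dyspnoea). All probability values are invented for illustration. Samples that disagree with the evidence are thrown away, which is the source of the rejection-rate drawback.

```python
import random

# Hypothetical CPTs: S -> L, S -> B, (L, B) -> D.
P_S = 0.3                                   # P(S = 1)
P_L = {0: 0.02, 1: 0.2}                     # P(L = 1 | S)
P_B = {0: 0.1, 1: 0.4}                      # P(B = 1 | S)
P_D = {(0, 0): 0.05, (0, 1): 0.6,
       (1, 0): 0.7, (1, 1): 0.9}            # P(D = 1 | L, B)

def forward_sample(rng):
    """Sample every variable in topological order from its CPT."""
    s = int(rng.random() < P_S)
    l = int(rng.random() < P_L[s])
    b = int(rng.random() < P_B[s])
    d = int(rng.random() < P_D[l, b])
    return {"S": s, "L": l, "B": b, "D": d}

def rejection_query(query, evidence, n, rng):
    """Estimate P(query = 1 | evidence), keeping only samples that match the evidence."""
    kept = hits = 0
    for _ in range(n):
        x = forward_sample(rng)
        if all(x[v] == val for v, val in evidence.items()):
            kept += 1
            hits += x[query]
    estimate = hits / kept if kept else float("nan")
    return estimate, kept

rng = random.Random(0)
n_samples = 50_000
estimate, kept = rejection_query("L", {"S": 0, "D": 1}, n_samples, rng)
```

With these made-up numbers the evidence has probability of roughly 0.08, so more than nine samples in ten are discarded.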
62
Likelihood Weighting (Fung and Chang, 1990;
Shachter and Peot, 1990)
Clamping evidence + forward sampling + weighting
samples by evidence likelihood
Works well for likely evidence!
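
A sketch of likelihood weighting under the same kind of assumptions: a hypothetical four-node network (Smoking, Lung cancer, Bronchitis, Dyspnoea) with invented probabilities. Evidence variables are clamped rather than sampled, and each sample is weighted by the probability of the evidence given its sampled parents, so no sample is rejected.

```python
import random

# Hypothetical CPTs: S -> L, S -> B, (L, B) -> D.
P_S = 0.3
P_L = {0: 0.02, 1: 0.2}
P_B = {0: 0.1, 1: 0.4}
P_D = {(0, 0): 0.05, (0, 1): 0.6, (1, 0): 0.7, (1, 1): 0.9}

# Topologically ordered list of (variable, function giving P(variable = 1 | parents set so far)).
NETWORK = [
    ("S", lambda x: P_S),
    ("L", lambda x: P_L[x["S"]]),
    ("B", lambda x: P_B[x["S"]]),
    ("D", lambda x: P_D[x["L"], x["B"]]),
]

def weighted_sample(evidence, rng):
    """Clamp evidence variables, sample the rest, and accumulate the evidence likelihood."""
    x, weight = {}, 1.0
    for var, p_one in NETWORK:
        p = p_one(x)
        if var in evidence:
            x[var] = evidence[var]
            weight *= p if x[var] else 1.0 - p
        else:
            x[var] = int(rng.random() < p)
    return x, weight

def likelihood_weighting(query, evidence, n, rng):
    """Weighted-average estimate of P(query = 1 | evidence)."""
    num = den = 0.0
    for _ in range(n):
        x, w = weighted_sample(evidence, rng)
        num += w * x[query]
        den += w
    return num / den

rng = random.Random(0)
estimate = likelihood_weighting("L", {"S": 0, "D": 1}, 50_000, rng)
```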
63
Gibbs Sampling (Geman and Geman, 1984)
Markov Chain Monte Carlo (MCMC): create a Markov
chain of samples
P(X)
Disadvantage: convergence may be slow
64
Gibbs Sampling (cont'd) (Pearl, 1988)
Markov blanket
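
A sketch of Gibbs sampling on a hypothetical fragment of the lung-disease example, with evidence Smoking = s and Dyspnoea = d held fixed. Each unobserved variable is resampled from its distribution given its Markov blanket, which here reduces to the product of its own CPT entry and the CPT entry of its child. The probabilities, burn-in length, and sample count are illustrative assumptions.

```python
import random

# Hypothetical CPTs: S -> L, S -> B, (L, B) -> D.
P_L = {0: 0.02, 1: 0.2}                     # P(L = 1 | S)
P_B = {0: 0.1, 1: 0.4}                      # P(B = 1 | S)
P_D = {(0, 0): 0.05, (0, 1): 0.6,
       (1, 0): 0.7, (1, 1): 0.9}            # P(D = 1 | L, B)

def bern(p_one, value):
    """Probability of `value` under a Bernoulli with P(1) = p_one."""
    return p_one if value else 1.0 - p_one

def gibbs_estimate(s, d, n, burn_in, rng):
    """Estimate P(L = 1 | S = s, D = d) from a Gibbs chain over (L, B)."""
    l, b, hits = 0, 0, 0
    for t in range(burn_in + n):
        # Resample L given its Markov blanket: parent S, child D, co-parent B.
        w = [bern(P_L[s], v) * bern(P_D[v, b], d) for v in (0, 1)]
        l = int(rng.random() < w[1] / (w[0] + w[1]))
        # Resample B given its Markov blanket: parent S, child D, co-parent L.
        w = [bern(P_B[s], v) * bern(P_D[l, v], d) for v in (0, 1)]
        b = int(rng.random() < w[1] / (w[0] + w[1]))
        if t >= burn_in:
            hits += l
    return hits / n

rng = random.Random(0)
estimate = gibbs_estimate(s=0, d=1, n=50_000, burn_in=1_000, rng=rng)
```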
65
• Introduction: Bayesian networks
• Probabilistic inference
• Exact inference
• Approximate inference
• Local inference
• Stochastic simulations
• Variational approximations
• Learning Bayesian Networks
• Summary

66
Variational Approximations
• Idea
• variational transformation of CPDs simplifies
inference
• Compute upper and lower bounds on P(Y)
• Usually faster than sampling techniques
• More complex and less general: must be derived
for each particular form of CPD functions

67
Variational bounds: example
log(x)
This approach can be generalized for any concave
(convex) function in order to compute its
upper (lower) bounds: convex duality
(Jaakkola and Jordan, 1997)
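
A standard instance of such a bound for the concave function log(x), assumed here to be the kind of example intended, is log(x) <= λx - log(λ) - 1 for every λ > 0, with equality at λ = 1/x. The sketch below checks both facts numerically; the grid of test values is arbitrary.

```python
import math

def log_upper_bound(x, lam):
    """Upper bound on log(x) that is linear in x; lam > 0 is the variational parameter."""
    return lam * x - math.log(lam) - 1.0

xs = [0.1, 0.5, 1.0, 2.0, 10.0]
lams = [0.05, 0.5, 1.0, 3.0]

# The bound holds for every choice of the variational parameter ...
gaps = [log_upper_bound(x, lam) - math.log(x) for x in xs for lam in lams]

# ... and is tight when lam = 1 / x.
tight_gaps = [log_upper_bound(x, 1.0 / x) - math.log(x) for x in xs]
```

Replacing a nonlinear term by a bound that is linear in x is what makes the transformed expressions easier to handle, at the cost of having to choose λ.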
68
Convex duality (Jaakkola and Jordan, 1997)
69
Example: QMR-DT network (Quick Medical Reference,
Decision-Theoretic (Shwe et al., 1991))
600 diseases
4000 findings
Noisy-OR model
70
Inference in QMR-DT
factorized
Positive evidence couples the disease nodes
factorized
Inference complexity: O(exp(min{p, k})), where p =
# of positive findings, k = max family
size (Heckerman, 1989 (Quickscore); Rish and
Dechter, 1998)
71
Variational approach to QMR-DT (Jaakkola and
Jordan, 1997)
The effect of positive evidence is now factorized
(diseases are decoupled)
72
Variational approximations
• Bounds on local CPDs yield a bound on posterior
• Two approaches: sequential and block
• Sequential: applies variational transformation to
(a subset of) nodes sequentially during inference
using a heuristic node ordering; then optimizes
across variational parameters
• Block: selects in advance nodes to be
transformed, then selects variational parameters
minimizing the KL-distance between true and
approximate posteriors

73
Block approach
74
Inference in BNs: summary
• Exact inference is often intractable => need
approximations
• Approximation principles
• Approximating elimination: local inference,
bounding size of dependencies among variables
(cliques in a problem's graph).
• Mini-buckets, IBP
• Other approximations: stochastic simulations,
variational techniques, etc.
• Further research
• Combining orthogonal approximation approaches
• Better understanding of what works well where:
which approximation suits which problem structure
• Other approximation paradigms (e.g., other ways
of approximating probabilities, constraints, cost
functions)

75
• Introduction: Bayesian networks
• Probabilistic inference
• Exact inference
• Approximate inference
• Learning Bayesian Networks
• Learning parameters
• Learning graph structure
• Summary

76
Why learn Bayesian networks?
• Efficient representation and inference
• Handling missing data: <1.3 2.8 ?? 0 1>

77
Learning Bayesian Networks
78
Learning Parameters: complete data
• ML-estimate
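
For discrete variables and complete data, the ML-estimate of a conditional probability is the relative frequency N_ijk / N_ij: the count of a child value together with a parent configuration, divided by the count of that parent configuration. A minimal sketch, with a made-up data set and illustrative names:

```python
from collections import Counter

def ml_estimate(data, child, parents):
    """Return {(parent_values, child_value): N_ijk / N_ij} from complete data.

    `data` is a list of dicts mapping variable names to observed values.
    """
    n_ijk = Counter((tuple(r[p] for p in parents), r[child]) for r in data)
    n_ij = Counter(tuple(r[p] for p in parents) for r in data)
    return {(pa, x): n / n_ij[pa] for (pa, x), n in n_ijk.items()}

# Hypothetical complete data over Smoking (S) and Lung cancer (L).
data = [
    {"S": 1, "L": 1}, {"S": 1, "L": 0}, {"S": 1, "L": 0}, {"S": 1, "L": 0},
    {"S": 0, "L": 0}, {"S": 0, "L": 0},
]
cpt = ml_estimate(data, "L", ["S"])
```

Here `cpt[((1,), 1)]` is the estimate of P(L = 1 | S = 1); combinations that never occur in the data are simply absent from the result.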

79
Learning graph structure
• Heuristic search

Complete data: local computations
Incomplete data (score non-decomposable): stochastic methods
• Constraint-based methods
• Data impose independence
relations (constraints)

80
Learning BNs: incomplete data
• Learning parameters
• EM algorithm [Lauritzen, 95]
• Gibbs Sampling [Heckerman, 96]
• Gradient Descent [Russell et al., 96]
• Learning both structure and parameters
• Sum over missing values [Cooper & Herskovits, 92;
Cooper, 95]
• Monte-Carlo approaches [Heckerman, 96]
• Gaussian approximation [Heckerman, 96]
• Structural EM [Friedman, 98]
• EM and Multiple Imputation [Singh 97, 98, 00]

81
Learning Parameters: incomplete data
EM-algorithm: iterate until convergence
82
Learning Parameters: incomplete data
(Lauritzen, 95)
• Complete-data log-likelihood is $\sum_{i,j,k} N_{ijk} \log \theta_{ijk}$
• E step
• Compute $E(N_{ijk} \mid Y_{obs}, \theta)$
• M step
• Compute
• $\theta_{ijk} = E(N_{ijk} \mid Y_{obs}, \theta) \, / \, E(N_{ij} \mid Y_{obs}, \theta)$
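
The E and M steps can be sketched for the smallest interesting case: a two-node network A -> B with binary variables, where A is sometimes unobserved. The E step accumulates expected counts using the posterior of A under the current parameters, and the M step renormalizes those counts. The network, starting values, iteration count, and data are all assumptions chosen for illustration.

```python
def em(data, iters=50):
    """EM for a network A -> B with binary variables.

    `data` is a list of (a, b) pairs with a in {0, 1, None} (None = missing)
    and b in {0, 1}. Returns (P(A = 1), {a: P(B = 1 | A = a)}).
    """
    p_a, p_b = 0.5, {0: 0.4, 1: 0.6}        # arbitrary starting point
    for _ in range(iters):
        # E step: expected counts of A = v and of (A = v, B = 1).
        n_a = {0: 0.0, 1: 0.0}
        n_ab1 = {0: 0.0, 1: 0.0}
        for a, b in data:
            if a is None:
                w = {}
                for v in (0, 1):
                    prior = p_a if v else 1.0 - p_a
                    lik = p_b[v] if b else 1.0 - p_b[v]
                    w[v] = prior * lik
                z = w[0] + w[1]
                post = {v: w[v] / z for v in (0, 1)}
            else:
                post = {a: 1.0, 1 - a: 0.0}
            for v in (0, 1):
                n_a[v] += post[v]
                n_ab1[v] += post[v] * b
        # M step: ratios of expected counts.
        p_a = n_a[1] / len(data)
        p_b = {v: n_ab1[v] / n_a[v] for v in (0, 1)}
    return p_a, p_b

complete = [(1, 1), (1, 1), (1, 0), (0, 0), (0, 0), (0, 1)]
incomplete = complete + [(None, 1), (None, 0)]

p_a_complete, p_b_complete = em(complete)
p_a_missing, p_b_missing = em(incomplete)
```

On complete data the expected counts are ordinary counts, so the procedure reduces to the ML estimate.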

83
Learning structure: incomplete data
• Depends on the type of missing data - missing
independent of anything else (MCAR) OR missing
based on values of other variables (MAR)
• While MCAR can be resolved by decomposable
scores, MAR cannot
• For likelihood-based methods, no need to
explicitly model missing data mechanism
• Very few attempts at MAR: stochastic methods

84
Learning structure: incomplete data
• Approximate EM by using Multiple Imputation to
yield efficient Monte-Carlo method
• [Singh 97, 98, 00]
• learned network almost optimal
• approximate complete-data log-likelihood function
using Multiple Imputation
• yields decomposable score, dependent only on each
node & its parents
• converges to local maxima of observed-data
likelihood

85
Learning structure: incomplete data
86
Scoring functions: Minimum Description Length
(MDL)
• Learning ⇔ data compression
• Other: MDL = -BIC (Bayesian Information
Criterion)
• Bayesian score (BDe) - asymptotically equivalent
to MDL

DL(Model)
DL(Data | model)
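
A sketch of a two-part MDL score in its common BIC-style form, assumed here: DL(Model) is taken as (log N / 2) times the number of free parameters, and DL(Data | model) as the negative maximized log-likelihood, both in nats. The data sets and candidate structures are invented for illustration.

```python
import math
from collections import Counter

def mdl_score(data, parents, arity=2):
    """MDL = DL(Model) + DL(Data | model); lower is better.

    `parents` maps each variable to its list of parents; all variables are
    assumed to take `arity` values.
    """
    n = len(data)
    loglik, n_params = 0.0, 0
    for child, pars in parents.items():
        n_ijk = Counter((tuple(r[p] for p in pars), r[child]) for r in data)
        n_ij = Counter(tuple(r[p] for p in pars) for r in data)
        loglik += sum(c * math.log(c / n_ij[pa]) for (pa, _), c in n_ijk.items())
        n_params += (arity - 1) * arity ** len(pars)
    return 0.5 * math.log(n) * n_params - loglik

def rows(counts):
    """Expand {(a, b): count} into a list of records."""
    return [{"A": a, "B": b} for (a, b), c in counts.items() for _ in range(c)]

dependent = rows({(0, 0): 18, (0, 1): 2, (1, 0): 2, (1, 1): 18})
independent = rows({(0, 0): 10, (0, 1): 10, (1, 0): 10, (1, 1): 10})

EMPTY = {"A": [], "B": []}          # no edge
EDGE = {"A": [], "B": ["A"]}        # A -> B

scores_dependent = (mdl_score(dependent, EMPTY), mdl_score(dependent, EDGE))
scores_independent = (mdl_score(independent, EMPTY), mdl_score(independent, EDGE))
```

When B closely tracks A, the extra parameters of the edge pay for themselves through a shorter data encoding; when A and B are unrelated, the penalty term favors the empty graph.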
87
Learning Structure plus Parameters
No. of models is super-exponential
Alternatives: Model Selection or Model Averaging
88
Model Selection
Generally, choose a single model M. Equivalent
to saying P(M | D) = 1
Task is now to:
1) define a metric to decide which model is best
2) search for that model through the space of all models
89
One Reasonable Score: Posterior Probability of a
Structure
structure prior
parameter prior
likelihood
90
Global and Local Predictive Scores
[Spiegelhalter et al 93]
Bayes factor
$$\log p(D \mid S^h) = \sum_{l=1}^{m} \log p(x_l \mid x_1, \ldots, x_{l-1}, S^h)$$
$$= \log p(x_1 \mid S^h) + \log p(x_2 \mid x_1, S^h) + \log p(x_3 \mid x_1, x_2, S^h) + \cdots$$
Local is useful for diagnostic problems
91
Local Predictive Score
Spiegelhalter et al. (1993)
92
Exact computation of p(D | S^h)
• No missing data
• Cases are independent, given the model.
• Uniform priors on parameters
• discrete variables

[Cooper & Herskovits, 92]
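
Under the assumptions listed above, the marginal likelihood has a closed form, usually written as a product over variables i and parent configurations j of (r_i - 1)! / (N_ij + r_i - 1)! times the product over values k of N_ijk!. The sketch below evaluates its logarithm with `math.lgamma`; the function name and the tiny data sets are illustrative, and all variables are assumed to share the same number of values.

```python
import math
from collections import Counter

def log_marginal_likelihood(data, parents, arity=2):
    """Log of p(D | structure) for complete discrete data with uniform parameter priors."""
    total = 0.0
    for child, pars in parents.items():
        n_ijk = Counter((tuple(r[p] for p in pars), r[child]) for r in data)
        n_ij = Counter(tuple(r[p] for p in pars) for r in data)
        # (r - 1)! / (N_ij + r - 1)! for every observed parent configuration;
        # unobserved configurations contribute a factor of 1.
        for n in n_ij.values():
            total += math.lgamma(arity) - math.lgamma(n + arity)
        # N_ijk! for every observed (configuration, value) pair.
        for n in n_ijk.values():
            total += math.lgamma(n + 1)
    return total

# One binary variable observed once as 0 and once as 1:
# the integral of theta * (1 - theta) over [0, 1] is 1/6.
tiny = [{"X": 0}, {"X": 1}]
log_p_tiny = log_marginal_likelihood(tiny, {"X": []})

# Two perfectly correlated variables: the structure with an edge scores higher.
paired = [{"A": v, "B": v} for v in (0, 1) for _ in range(10)]
log_p_empty = log_marginal_likelihood(paired, {"A": [], "B": []})
log_p_edge = log_marginal_likelihood(paired, {"A": [], "B": ["A"]})
```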
93
Bayesian Dirichlet Score
Cooper and Herskovits (1991)
94
• Learning BNs without specifying an ordering
• n! orderings: ordering greatly affects the quality
of network learned.
• use conditional independence tests, and
d-separation to get an ordering

[Singh & Valtorta, 95]
95
• Learning BNs via the MDL principle
• Idea: best model is that which gives the most
compact representation of the data
• So, encode the data using the model plus encode
the model. Minimize this.

[Lam & Bacchus, 93]
96
Learning BNs: summary
• Bayesian Networks: graphical probabilistic
models
• Efficient representation and inference
• Expert knowledge + learning from data
• Learning
• parameters (parameter estimation, EM)
• structure (optimization w/ score functions,
e.g., MDL)
• Applications/systems: collaborative filtering
(MSBN), fraud detection (AT&T), classification
(AutoClass (NASA), TAN-BLT (SRI))
• Future directions: causality, time, model
evaluation criteria, approximate
inference/learning, on-line learning, etc.