Title: A Tutorial on Inference and Learning in Bayesian Networks
1 A Tutorial on Inference and Learning in Bayesian Networks
- Irina Rish and Moninder Singh
- IBM T.J. Watson Research Center
- rish,moninder_at_us.ibm.com
2 Road map
- Introduction to Bayesian networks
- What are BNs: representation, types, etc.
- Why use BNs: applications (classes) of BNs
- Information sources, software, etc.
- Probabilistic inference
- Exact inference
- Approximate inference
- Learning Bayesian Networks
- Learning parameters
- Learning graph structure
- Summary
3 Bayesian Networks
P(A), P(S), P(T|A), P(L|S), P(B|S), P(C|T,L), P(D|T,L,B)
P(A, S, T, L, B, C, D) = P(A) P(S) P(T|A) P(L|S) P(B|S) P(C|T,L) P(D|T,L,B)
[Lauritzen & Spiegelhalter, 95]
4 Bayesian Networks
- Structured, graphical representation of probabilistic relationships between several random variables
- Explicit representation of conditional independencies
- Missing arcs encode conditional independence
- Efficient representation of the joint pdf
- Allows arbitrary queries to be answered:
P(lung cancer = yes | smoking = no, dyspnoea = yes) = ?
5 Example: Printer Troubleshooting (Microsoft Windows 95)
[Heckerman, 95]
6 Example: Microsoft Pregnancy and Child Care
[Heckerman, 95]
7 Example: Microsoft Pregnancy and Child Care
[Heckerman, 95]
8 Independence Assumptions
9 Independence Assumptions
- Nodes X and Y are d-connected by nodes in Z along a trail from X to Y if:
- every head-to-head node along the trail is in Z or has a descendant in Z, and
- every other node along the trail is not in Z
- Nodes X and Y are d-separated by nodes in Z if they are not d-connected by Z along any trail from X to Y
- If X and Y are d-separated by Z, then X and Y are conditionally independent given Z
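These rules can be checked by brute force on a small DAG: enumerate every undirected trail between X and Y and test each intermediate node against the head-to-head condition. A minimal sketch (not from the tutorial; the graph is the slides' Asia-style network, and trail enumeration only suits tiny graphs):

```python
def descendants(dag, v):
    """All descendants of v; dag maps each node to the set of its parents."""
    children = {n: set() for n in dag}
    for n, ps in dag.items():
        for p in ps:
            children[p].add(n)
    seen, stack = set(), [v]
    while stack:
        for c in children[stack.pop()]:
            if c not in seen:
                seen.add(c)
                stack.append(c)
    return seen

def d_separated(dag, x, y, z):
    """True iff x and y are d-separated by the set z (brute-force trail check)."""
    adj = {n: set(ps) for n, ps in dag.items()}        # undirected adjacency
    for n, ps in dag.items():
        for p in ps:
            adj[p].add(n)

    def trails(path):                                  # all simple trails x..y
        if path[-1] == y:
            yield path
            return
        for nxt in adj[path[-1]]:
            if nxt not in path:
                yield from trails(path + [nxt])

    for trail in trails([x]):
        blocked = False
        for i in range(1, len(trail) - 1):
            a, v, b = trail[i - 1], trail[i], trail[i + 1]
            head_to_head = a in dag[v] and b in dag[v]
            if head_to_head:
                # a head-to-head node blocks unless it (or a descendant) is in z
                if v not in z and not (descendants(dag, v) & z):
                    blocked = True
                    break
            elif v in z:                               # non-collider in z blocks
                blocked = True
                break
        if not blocked:
            return False                               # active trail: d-connected
    return True

# Asia-style network from the slides: T|A, L|S, B|S, C|T,L, D|T,L,B
asia = {'A': set(), 'S': set(), 'T': {'A'}, 'L': {'S'},
        'B': {'S'}, 'C': {'T', 'L'}, 'D': {'T', 'L', 'B'}}
```

For example, T and S are d-separated given nothing, but conditioning on the collider's descendant D d-connects them (explaining away).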
10 Independence Assumptions
- A variable (node) is conditionally independent of its non-descendants given its parents
(Figure: the Asia network; nodes Visit to Asia, Smoking, Tuberculosis, Lung Cancer, Bronchitis, Chest X-ray, Dyspnoea)
11 Independence Assumptions
Cancer is independent of Diet given Exposure to Toxins and Smoking
[Breese & Koller, 97]
12 Independence Assumptions
This means the joint pdf can be represented as a product of local distributions:
P(A,S,T,L,B,C,D) = P(A) P(S|A) P(T|A,S) P(L|A,S,T) P(B|A,S,T,L) P(C|A,S,T,L,B) P(D|A,S,T,L,B,C)
                 = P(A) P(S) P(T|A) P(L|S) P(B|S) P(C|T,L) P(D|T,L,B)
13 Independence Assumptions
Thus, the general product rule for Bayesian networks is:
P(X1, X2, ..., Xn) = ∏_{i=1}^{n} P(Xi | Pa(Xi))
where Pa(Xi) is the set of parents of Xi
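The product rule can be verified numerically on a small fragment of the network. A sketch over four binary variables (A → T, S → L); all CPT entries below are made up for illustration:

```python
from itertools import product

# Illustrative CPTs for a fragment of the Asia network: A -> T, S -> L.
p_a = {1: 0.01, 0: 0.99}                       # P(A)
p_s = {1: 0.50, 0: 0.50}                       # P(S)
p_t = {(1, 1): 0.05, (0, 1): 0.95,             # P(T=t | A=a), keyed by (t, a)
       (1, 0): 0.01, (0, 0): 0.99}
p_l = {(1, 1): 0.10, (0, 1): 0.90,             # P(L=l | S=s), keyed by (l, s)
       (1, 0): 0.01, (0, 0): 0.99}

def joint(a, s, t, l):
    """General product rule: P(A,S,T,L) = P(A) P(S) P(T|A) P(L|S)."""
    return p_a[a] * p_s[s] * p_t[(t, a)] * p_l[(l, s)]

# A valid joint distribution must sum to 1 over all assignments.
total = sum(joint(a, s, t, l) for a, s, t, l in product([0, 1], repeat=4))
assert abs(total - 1.0) < 1e-12
```

Marginals follow by summing the factored joint, e.g. P(T=1) = 0.01·0.05 + 0.99·0.01 = 0.0104.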
14 The Knowledge Acquisition Task
- Variables
- collectively exhaustive, mutually exclusive values
- clarity test: each value should be knowable in principle
- Structure
- if data is available, it can be learned
- or constructed by hand (using expert knowledge)
- variable ordering matters: causal knowledge usually simplifies it
- Probabilities
- can be learned from data
- the second decimal usually does not matter; relative probabilities do
- sensitivity analysis
15 The Knowledge Acquisition Task
16 The Knowledge Acquisition Task
Naive Bayesian Classifiers [Duda & Hart; Langley 92], Selective Naive Bayesian Classifiers [Langley & Sage 94], Conditional Trees [Geiger 92; Friedman et al 97]
17 The Knowledge Acquisition Task
Selective Bayesian Networks [Singh & Provan, 95, 96]
18 What are BNs useful for?
- Diagnosis: P(cause | symptom) = ?
- Prediction: P(symptom | cause) = ?
- Decision-making (given a cost function)
- Data mining: induce the best model from data
19 What are BNs useful for?
(Figure: Cause and Effect nodes; predictive inference runs from cause to effect; decision making maximizes expected utility)
20 What are BNs useful for?
Value of Information
(Figure: troubleshooting loop. Salient observations yield an assignment of belief over Fault 1, Fault 2, Fault 3, ...; halt? If yes, act now; if no, pick the next best observation by value of information and repeat with the new observation.)
21 Why use BNs?
- Explicit management of uncertainty
- Modularity implies maintainability
- Better, flexible, and robust decision making: MEU, VOI
- Can be used to answer arbitrary queries, e.g., multiple-fault problems
- Easy to incorporate prior knowledge
- Easy to understand
22 Application Examples
- Intellipath
- commercial version of Pathfinder
- 60 lymph-node diseases, 100 findings
- APRI system, developed at AT&T Bell Labs
- learns and uses Bayesian networks from data to identify customers liable to default on bill payments
- NASA Vista system
- predicts failures in propulsion systems
- considers time criticality; suggests the highest-utility action
- dynamically decides what information to show
23 Application Examples
- Answer Wizard in MS Office 95 / MS Project
- Bayesian-network-based free-text help facility
- uses naive Bayesian classifiers
- Office Assistant in MS Office 97
- extension of the Answer Wizard
- uses naive Bayesian networks
- help based on past experience (keyboard/mouse use) and the task the user is currently doing
- this is the smiley face you get in your MS Office applications
24 Application Examples
- Microsoft Pregnancy and Child-Care
- available on MSN in the Health section
- frequently occurring children's symptoms are linked to expert modules that repeatedly ask parents relevant questions
- asks the next best question based on the information provided
- presents articles that are deemed relevant based on the information provided
25 Application Examples
- Printer troubleshooting
- HP bought a 40% stake in HUGIN; developing printer troubleshooters for HP printers
- Microsoft has 70 online troubleshooters on their web site
- use Bayesian networks: multiple-fault models, incorporate utilities
- Fax machine troubleshooting
- Ricoh uses Bayesian-network-based troubleshooters at call centers
- enabled Ricoh to answer twice the number of calls in half the time
26 Application Examples
27 Application Examples
28 Application Examples
29 Online/print resources on BNs
- Conferences & Journals
- UAI, ICML, AAAI, AISTAT, KDD
- MLJ, DMKD, JAIR, IEEE KDD, IJAR, IEEE PAMI
- Books and Papers
- Bayesian Networks without Tears by Eugene Charniak. AI Magazine, Winter 1991.
- Probabilistic Reasoning in Intelligent Systems by Judea Pearl. Morgan Kaufmann, 1988.
- Probabilistic Reasoning in Expert Systems by Richard Neapolitan. Wiley, 1990.
- CACM special issue on real-world applications of BNs, March 1995
30 Online/Print Resources on BNs
- Wealth of online information at www.auai.org. Links to:
- electronic proceedings of the UAI conferences
- other sites with information on BNs and reasoning under uncertainty
- several tutorials and important articles
- research groups and companies working in this area
- other societies, mailing lists, and conferences
31 Publicly available s/w for BNs
- List of BN software maintained by Russell Almond at bayes.stat.washington.edu/almond/belief.html
- several free packages: generally research only
- commercial packages: the most powerful (and expensive) is HUGIN; others include Netica and Dxpress
- we are working on developing a Java-based BN toolkit here at Watson; it will also work within ABLE
32 Road map
- Introduction to Bayesian networks
- What are BNs: representation, types, etc.
- Why use BNs: applications (classes) of BNs
- Information sources, software, etc.
- Probabilistic inference
- Exact inference
- Approximate inference
- Learning Bayesian Networks
- Learning parameters
- Learning graph structure
- Summary
33 Probabilistic Inference Tasks
- Belief updating
- Finding the most probable explanation (MPE)
- Finding the maximum a posteriori (MAP) hypothesis
- Finding the maximum-expected-utility (MEU) decision
34 Belief Updating
(Figure: network over Smoking, Lung Cancer, Bronchitis, X-ray, Dyspnoea)
P(lung cancer = yes | smoking = no, dyspnoea = yes) = ?
35 Belief updating: P(X | evidence) = ?
(Example: network over A, B, C, D, E; query P(a | e = 0) vs. the prior P(a).)
36 Bucket elimination: Algorithm elim-bel (Dechter, 1996)
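The bucket-elimination idea can be sketched with dictionary-based factors: collect the factors mentioning the eliminated variable into its bucket, multiply them, and sum the variable out. This is a generic illustration of variable elimination, not the elim-bel pseudocode from the slides; evidence handling (restricting factors to observed values) is omitted for brevity, and the CPT numbers are made up:

```python
from itertools import product

# A factor is (vars, table): table maps a tuple of 0/1 values,
# ordered as in vars, to a probability.

def multiply(f, g):
    fv, ft = f
    gv, gt = g
    vs = tuple(dict.fromkeys(fv + gv))            # order-preserving union
    table = {}
    for asg in product([0, 1], repeat=len(vs)):
        env = dict(zip(vs, asg))
        table[asg] = ft[tuple(env[v] for v in fv)] * gt[tuple(env[v] for v in gv)]
    return vs, table

def sum_out(f, var):
    fv, ft = f
    vs = tuple(v for v in fv if v != var)
    table = {}
    for asg, val in ft.items():
        env = dict(zip(fv, asg))
        key = tuple(env[v] for v in vs)
        table[key] = table.get(key, 0.0) + val
    return vs, table

def elim_bel(factors, order):
    """Eliminate variables in `order`; return the normalized belief over the rest."""
    for var in order:
        bucket = [f for f in factors if var in f[0]]     # the bucket of var
        factors = [f for f in factors if var not in f[0]]
        prod = bucket[0]
        for f in bucket[1:]:
            prod = multiply(prod, f)
        factors.append(sum_out(prod, var))               # pass result downstream
    result = factors[0]
    for f in factors[1:]:
        result = multiply(result, f)
    total = sum(result[1].values())
    return {asg: v / total for asg, v in result[1].items()}

# Tiny example: A -> B, query P(B).
p_a = (('A',), {(1,): 0.3, (0,): 0.7})
p_b_given_a = (('B', 'A'), {(1, 1): 0.9, (0, 1): 0.1, (1, 0): 0.2, (0, 0): 0.8})
belief_b = elim_bel([p_a, p_b_given_a], order=['A'])     # P(B=1) = 0.41
```

The complexity is driven by the largest intermediate factor created, which is what the induced-width discussion on the later slides is about.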
37 Finding MPE: Algorithm elim-mpe (Dechter, 1996)
Elimination operator: maximization instead of summation
38 Generating the MPE tuple
39 Complexity of inference
The effect of the ordering
40 Other tasks and algorithms
- MAP and MEU tasks
- similar bucket-elimination algorithms: elim-map, elim-meu (Dechter, 1996)
- elimination operation: either summation or maximization
- restriction on variable ordering: summation must precede maximization (i.e., hypothesis or decision variables are eliminated last)
- Other inference algorithms
- join-tree clustering
- Pearl's poly-tree propagation
- conditioning, etc.
41 Relationship with join-tree clustering
A cluster is a set of buckets (a super-bucket)
(Figure: example clusters ABC, BCE, ADB)
42 Relationship with Pearl's belief propagation in poly-trees
(Figure: causal support and diagnostic support messages)
Pearl's belief propagation for a single-root query = elim-bel using a topological ordering and super-buckets for families.
Elim-bel, elim-mpe, and elim-map are linear for poly-trees.
43Road map
- Introduction to Bayesian networks
- Probabilistic inference
- Exact inference
- Approximate inference
- Learning Bayesian Networks
- Learning parameters
- Learning graph structure
- Summary
44 Inference is NP-hard => approximations
- Approximations
- Local inference
- Stochastic simulations
- Variational approximations
- etc.
45 Local Inference: Idea
46 Bucket-elimination approximation: mini-buckets
- Local inference idea: bound the size of recorded dependencies
- Computation in a bucket is time and space exponential in the number of variables involved
- Therefore, partition the functions in a bucket into mini-buckets over smaller numbers of variables
47 Mini-bucket approximation: MPE task
Split a bucket into mini-buckets => bound complexity
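The reason splitting a bucket yields an upper bound for MPE is elementary: maximizing each mini-bucket separately can only overestimate the joint maximum, since max_x f(x)·g(x) <= (max_x f(x))·(max_x g(x)). A toy numeric check (function values are made up):

```python
# Mini-bucket intuition for MPE: maximizing f and g separately (two
# mini-buckets) upper-bounds maximizing their product (the full bucket).
f = {0: 0.2, 1: 0.9}
g = {0: 0.8, 1: 0.3}

exact = max(f[x] * g[x] for x in (0, 1))       # full bucket:   0.27
upper = max(f.values()) * max(g.values())      # mini-buckets:  0.72
assert exact <= upper
```

The gap between `exact` and `upper` is exactly the accuracy loss the U/L ratio on the next slides measures.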
48 Approx-mpe(i)
- Input: i, the maximum number of variables allowed in a mini-bucket
- Output: a lower bound (the probability of a sub-optimal solution) and an upper bound
Example: approx-mpe(3) versus elim-mpe
49 Properties of approx-mpe(i)
- Complexity: O(exp(2i)) time and O(exp(i)) space
- Accuracy: determined by the upper/lower (U/L) bound ratio
- As i increases, both accuracy and complexity increase
- Possible uses of mini-bucket approximations:
- as anytime algorithms (Dechter and Rish, 1997)
- as heuristics in best-first search (Kask and Dechter, 1999)
- Other tasks: similar mini-bucket approximations for belief updating, MAP, and MEU (Dechter and Rish, 1997)
50Anytime Approximation
51 Empirical Evaluation (Dechter and Rish, 1997; Rish, 1999)
- Randomly generated networks
- uniform random probabilities
- random noisy-OR
- CPCS networks
- Probabilistic decoding
- Comparing approx-mpe and anytime-mpe versus elim-mpe
52 Random networks
- Uniform random: 60 nodes, 90 edges (200 instances)
- In 80% of cases, 10-100 times speed-up while U/L < 2
- Noisy-OR: even better results
- Exact elim-mpe was infeasible; approx-mpe took 0.1 to 80 sec.
53 CPCS networks: medical diagnosis (noisy-OR model)
Test case: no evidence
54 Effect of evidence
More likely evidence => higher MPE => higher accuracy (why?)
Likely evidence versus random (unlikely) evidence
55 Probabilistic decoding
Error-correcting linear block code
State-of-the-art approximate algorithm: iterative belief propagation (IBP) (Pearl's poly-tree algorithm applied to loopy networks)
56 approx-mpe vs. IBP
Bit error rate (BER) as a function of noise (sigma)
57 Mini-buckets: summary
- Mini-buckets: a local inference approximation
- Idea: bound the size of recorded functions
- Approx-mpe(i): a mini-bucket algorithm for MPE
- better results for noisy-OR than for random problems
- accuracy increases with decreasing noise
- accuracy increases for likely evidence
- sparser graphs -> higher accuracy
- Coding networks: approx-mpe outperforms IBP on low-induced-width codes
58Road map
- Introduction to Bayesian networks
- Probabilistic inference
- Exact inference
- Approximate inference
- Local inference
- Stochastic simulations
- Variational approximations
- Learning Bayesian Networks
- Summary
59 Approximation via Sampling
60 Forward Sampling (logic sampling; Henrion, 1988)
61 Forward sampling (example)
Drawback: high rejection rate!
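The rejection-rate drawback is easy to see in code: sample every variable in topological order, then throw away samples that disagree with the evidence. A minimal sketch on a two-node network (CPT numbers are illustrative):

```python
import random

random.seed(0)
# Toy network A -> B; evidence B = 1.
P_A1 = 0.3
P_B1 = {1: 0.9, 0: 0.2}      # P(B=1 | A=a)

def rejection_estimate(n=100_000):
    """Estimate P(A=1 | B=1) by logic (forward) sampling with rejection."""
    accepted = hits = 0
    for _ in range(n):
        a = 1 if random.random() < P_A1 else 0       # sample A from P(A)
        b = 1 if random.random() < P_B1[a] else 0    # then B from P(B|A)
        if b != 1:                                   # evidence mismatch: reject
            continue
        accepted += 1
        hits += a
    return hits / accepted, accepted / n

p_est, accept_rate = rejection_estimate()
# exact answer: P(A=1 | B=1) = 0.3*0.9 / (0.3*0.9 + 0.7*0.2) ≈ 0.659
```

Here only about 41% of samples survive (P(B=1) = 0.41); with many evidence variables or unlikely evidence, almost everything is rejected.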
62 Likelihood Weighting (Fung and Chang, 1990; Shachter and Peot, 1990)
Clamping evidence + forward sampling + weighting samples by evidence likelihood
Works well for likely evidence!
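Likelihood weighting fixes the rejection problem: clamp the evidence variables, forward-sample only the rest, and weight each sample by the likelihood of the clamped evidence. A sketch on the same illustrative two-node network:

```python
import random

random.seed(1)
# Toy network A -> B; evidence B = 1. Probabilities are illustrative.
P_A1 = 0.3
P_B1 = {1: 0.9, 0: 0.2}      # P(B=1 | A=a)

def lw_estimate(n=100_000):
    """Estimate P(A=1 | B=1): clamp B=1, weight each sample by P(B=1|a)."""
    num = den = 0.0
    for _ in range(n):
        a = 1 if random.random() < P_A1 else 0   # forward-sample non-evidence
        w = P_B1[a]                              # weight = evidence likelihood
        num += w * a
        den += w
    return num / den

p_est = lw_estimate()        # every sample is used, unlike rejection sampling
```

No sample is ever discarded, but when the evidence is unlikely most weights are tiny, which is why the slide notes it works well for likely evidence.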
63 Gibbs Sampling (Geman and Geman, 1984)
Markov Chain Monte Carlo (MCMC): create a Markov chain of samples
Advantage: guaranteed to converge to P(X)
Disadvantage: convergence may be slow
64 Gibbs Sampling (cont'd) (Pearl, 1988)
Markov blanket
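In a Bayesian network each Gibbs step resamples one variable from its conditional given only its Markov blanket (parents, children, and children's co-parents). A sketch on an illustrative chain A -> B -> C with evidence C = 1 (all probabilities made up):

```python
import random

random.seed(0)
P_A1 = 0.3
P_B1_A = {1: 0.9, 0: 0.2}    # P(B=1 | A=a)
P_C1_B = {1: 0.8, 0: 0.1}    # P(C=1 | B=b)

def gibbs(n=200_000, burn=1_000):
    """Estimate P(A=1 | C=1) by Gibbs sampling over the free variables A, B."""
    a, b = 1, 1              # arbitrary initial state; evidence C=1 stays fixed
    count = total = 0
    for t in range(n):
        # resample A given its Markov blanket {B}: P(a | b) ∝ P(a) P(b | a)
        w1 = P_A1 * (P_B1_A[1] if b else 1 - P_B1_A[1])
        w0 = (1 - P_A1) * (P_B1_A[0] if b else 1 - P_B1_A[0])
        a = 1 if random.random() < w1 / (w1 + w0) else 0
        # resample B given its blanket {A, C=1}: P(b | a, c=1) ∝ P(b|a) P(c=1|b)
        v1 = P_B1_A[a] * P_C1_B[1]
        v0 = (1 - P_B1_A[a]) * P_C1_B[0]
        b = 1 if random.random() < v1 / (v1 + v0) else 0
        if t >= burn:        # discard burn-in before averaging
            count += a
            total += 1
    return count / total

p_gibbs = gibbs()
```

The exact posterior here is P(A=1 | C=1) = 0.219 / 0.387 ≈ 0.566; the chain converges to it, but successive samples are correlated, which is the slow-convergence caveat on the slide.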
65Road map
- Introduction to Bayesian networks
- Probabilistic inference
- Exact inference
- Approximate inference
- Local inference
- Stochastic simulations
- Variational approximations
- Learning Bayesian Networks
- Summary
66 Variational Approximations
- Idea: a variational transformation of the CPDs simplifies inference
- Advantages:
- compute upper and lower bounds on P(Y)
- usually faster than sampling techniques
- Disadvantages:
- more complex and less general: must be derived for each particular form of CPD function
67 Variational bounds: example log(x)
This approach can be generalized to any concave (convex) function in order to compute its upper (lower) bounds: convex duality (Jaakkola and Jordan, 1997)
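The standard conjugate bound for the concave log function is log(x) <= λx - log(λ) - 1 for any λ > 0, tight at λ = 1/x; this is the kind of linear upper bound the convex-duality machinery produces. A quick numeric check:

```python
import math

# Variational upper bound on log(x): for any lam > 0,
#   log(x) <= lam * x - log(lam) - 1,  with equality at lam = 1/x.
def upper(x, lam):
    return lam * x - math.log(lam) - 1.0

for x in (0.1, 1.0, 3.7):
    for lam in (0.05, 0.5, 1.0, 2.0):
        assert math.log(x) <= upper(x, lam) + 1e-12   # valid for every lam
    # optimizing the variational parameter recovers log(x) exactly
    assert abs(upper(x, 1.0 / x) - math.log(x)) < 1e-12
```

Replacing a troublesome log term by this linear-in-x bound, and later optimizing over λ, is exactly the transformation used for QMR-DT on the following slides.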
68 Convex duality (Jaakkola and Jordan, 1997)
69 Example: QMR-DT network (Quick Medical Reference, Decision-Theoretic; Shwe et al., 1991)
600 diseases
4000 findings
Noisy-OR model
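The noisy-OR CPD used in QMR-DT-style networks assumes each present disease independently fails to cause the finding. A minimal sketch (the q values and leak term are made up):

```python
# Noisy-OR CPD: each present parent i independently fails to trigger the
# finding with probability (1 - q_i); an optional leak covers "no cause".
def noisy_or(q, parents_on, leak=0.0):
    """P(finding=1 | parent states); q[i] = P(parent i alone causes the finding)."""
    p_off = 1.0 - leak
    for qi, on in zip(q, parents_on):
        if on:
            p_off *= 1.0 - qi
    return 1.0 - p_off

p = noisy_or([0.8, 0.5], [1, 1])    # 1 - 0.2*0.5 = 0.9
```

The key computational property is that P(finding = 0 | diseases) factorizes over the parents, which is why negative findings are cheap and only positive findings couple the disease nodes (next slide).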
70 Inference in QMR-DT
Negative evidence remains factorized; positive evidence couples the disease nodes
Inference complexity: O(exp(min{p, k})), where p = number of positive findings and k = max family size (Heckerman, 1989 (Quickscore); Rish and Dechter, 1998)
71 Variational approach to QMR-DT (Jaakkola and Jordan, 1997)
The effect of positive evidence is now factorized (diseases are decoupled)
72 Variational approximations
- Bounds on local CPDs yield a bound on the posterior
- Two approaches: sequential and block
- Sequential: applies the variational transformation to (a subset of) nodes sequentially during inference, using a heuristic node ordering, then optimizes over the variational parameters
- Block: selects the nodes to be transformed in advance, then selects variational parameters minimizing the KL-distance between the true and approximate posteriors
73 Block approach
74 Inference in BNs: summary
- Exact inference is often intractable => need approximations
- Approximation principles:
- approximating elimination: local inference, bounding the size of dependencies among variables (cliques in the problem's graph); mini-buckets, IBP
- other approximations: stochastic simulations, variational techniques, etc.
- Further research:
- combining orthogonal approximation approaches
- better understanding of what works well where: which approximation suits which problem structure
- other approximation paradigms (e.g., other ways of approximating probabilities, constraints, cost functions)
75Road map
- Introduction to Bayesian networks
- Probabilistic inference
- Exact inference
- Approximate inference
- Learning Bayesian Networks
- Learning parameters
- Learning graph structure
- Summary
76 Why learn Bayesian networks?
- Efficient representation and inference
- Handling missing data: <1.3, 2.8, ??, 0, 1>
77 Learning Bayesian Networks
78 Learning Parameters: complete data
79 Learning graph structure
- Complete data: local computations
- Incomplete data (score non-decomposable): stochastic methods
- Constraint-based methods: data impose independence relations (constraints)
80 Learning BNs: incomplete data
- Learning parameters
- EM algorithm [Lauritzen, 95]
- Gibbs sampling [Heckerman, 96]
- gradient descent [Russell et al., 96]
- Learning both structure and parameters
- sum over missing values [Cooper & Herskovits, 92; Cooper, 95]
- Monte-Carlo approaches [Heckerman, 96]
- Gaussian approximation [Heckerman, 96]
- Structural EM [Friedman, 98]
- EM and Multiple Imputation [Singh 97, 98, 00]
81 Learning Parameters: incomplete data
EM algorithm: iterate until convergence
82 Learning Parameters: incomplete data (Lauritzen, 95)
- Complete-data log-likelihood: log P(Y | theta) = sum_{ijk} N_ijk log theta_ijk
- E step:
- compute E(N_ijk | Y_obs, theta)
- M step:
- set theta_ijk = E(N_ijk | Y_obs, theta) / E(N_ij | Y_obs, theta)
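The E/M steps on this slide reduce, in the simplest possible case (a single Bernoulli parameter with some values missing completely at random), to "expected count, then normalize". A toy sketch with made-up data:

```python
# EM for a single Bernoulli parameter theta = P(X=1), with missing
# observations marked None. E step: expected count of X=1 under the
# current theta; M step: theta = E(N_1) / N.
data = [1, 1, 0, None, 1, None, 0, 1]

theta = 0.5                        # initial guess
for _ in range(100):               # iterate until convergence
    # E step: missing entries contribute their expectation, theta
    exp_n1 = sum(theta if x is None else x for x in data)
    # M step: re-estimate theta from the expected counts
    new_theta = exp_n1 / len(data)
    if abs(new_theta - theta) < 1e-12:
        break
    theta = new_theta
```

Here EM converges to the observed-data maximum-likelihood estimate 4/6 (four ones and two zeros observed), illustrating that each iteration increases the observed-data likelihood.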
83 Learning structure: incomplete data
- Depends on the type of missing data: missing independent of anything else (MCAR) or missing based on the values of other variables (MAR)
- While MCAR can be handled by decomposable scores, MAR cannot
- For likelihood-based methods, there is no need to explicitly model the missing-data mechanism
- Very few attempts at MAR: stochastic methods
84 Learning structure: incomplete data
- Approximate EM by using Multiple Imputation to yield an efficient Monte-Carlo method [Singh 97, 98, 00]
- trade-off between performance and quality
- learned network is almost optimal
- approximates the complete-data log-likelihood function using Multiple Imputation
- yields a decomposable score, dependent only on each node and its parents
- converges to a local maximum of the observed-data likelihood
85 Learning structure: incomplete data
86 Scoring functions: Minimum Description Length (MDL)
- Learning as data compression: minimize DL(Model) + DL(Data | Model)
- Other scores: MDL = -BIC (Bayesian Information Criterion)
- Bayesian score (BDe): asymptotically equivalent to MDL
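Under the usual BIC approximation, the two description-length terms become (k/2) log N for the model's parameters and the negative log-likelihood for the data. A toy comparison of two candidate structures (the log-likelihood values are made up for illustration):

```python
import math

# MDL score of a candidate structure: DL(Model) + DL(Data | Model),
# approximated as (k/2) log N - log-likelihood (smaller is better).
def mdl(loglik, num_params, n_cases):
    return 0.5 * num_params * math.log(n_cases) - loglik

# Candidate 1: X independent of its potential parent (1 free parameter).
# Candidate 2: X depends on the parent (2 free parameters).
score_indep = mdl(loglik=-69.3, num_params=1, n_cases=100)
score_dep   = mdl(loglik=-60.0, num_params=2, n_cases=100)
```

With these numbers the dependent model wins: its extra parameter costs (1/2) log 100 ≈ 2.3 nats but buys 9.3 nats of likelihood, showing how MDL trades model complexity against fit.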
87 Learning Structure plus Parameters
The number of models is super-exponential. Alternatives: model selection or model averaging.
88 Model Selection
Generally, choose a single model M. This is equivalent to saying P(M | D) = 1.
The task is now to: 1) define a metric to decide which model is best; 2) search for that model through the space of all models.
89 One Reasonable Score: Posterior Probability of a Structure
P(S | D) ∝ P(S) ∫ P(D | θ, S) P(θ | S) dθ
(structure prior × likelihood × parameter prior)
90 Global and Local Predictive Scores [Spiegelhalter et al., 93]
Global predictive score (log Bayes factor, prequential form):
log p(D | S^h) = Σ_{l=1}^{m} log p(x_l | x_1, ..., x_{l-1}, S^h)
Local predictive score:
log p(x_1, x_2, x_3, ... | S^h) = log p(x_1 | S^h) + log p(x_2 | x_1, S^h) + log p(x_3 | x_1, x_2, S^h) + ...
Local is useful for diagnostic problems
91 Local Predictive Score (Spiegelhalter et al., 1993)
92 Exact computation of p(D | S^h)
Assumptions [Cooper & Herskovits, 92]:
- no missing data
- cases are independent, given the model
- uniform priors on the parameters
- discrete variables
93 Bayesian Dirichlet Score (Cooper and Herskovits, 1991)
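Under the assumptions on the previous slide (complete data, independent cases, uniform Dirichlet priors, discrete variables), the per-node contribution to the Cooper-Herskovits (K2) score has a closed form; a sketch using log-gamma for numerical stability (the counts below are made up):

```python
import math

# Log of the per-node Cooper-Herskovits (K2) marginal likelihood with
# uniform Dirichlet priors:
#   log prod_j [ (r-1)! / (N_j + r - 1)!  *  prod_k N_jk! ]
# where r = number of values of the node, j ranges over parent
# configurations, N_jk are counts, and N_j = sum_k N_jk.
def log_k2_node(counts, r):
    """counts: list over parent configs, each a list of r counts N_jk."""
    score = 0.0
    for njk in counts:
        nj = sum(njk)
        score += math.lgamma(r) - math.lgamma(nj + r)    # (r-1)! / (N_j+r-1)!
        score += sum(math.lgamma(n + 1) for n in njk)    # prod_k N_jk!
    return score

# Binary node, single parent configuration, counts N_jk = [3, 1]:
# exact value is 1!/5! * 3! * 1! = 6/120 = 0.05.
s = log_k2_node([[3, 1]], r=2)
```

Summing this quantity over all nodes gives log p(D | S^h); because it decomposes per family, local search over structures only needs to rescore the families that change.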
94 Learning BNs without specifying an ordering
- n! orderings; the ordering greatly affects the quality of the network learned
- use conditional independence tests and d-separation to get an ordering
[Singh & Valtorta, 95]
95 Learning BNs via the MDL principle
- Idea: the best model is the one that gives the most compact representation of the data
- So, encode the data using the model, plus encode the model itself; minimize this total
[Lam & Bacchus, 93]
96 Learning BNs: summary
- Bayesian networks: graphical probabilistic models
- efficient representation and inference
- combine expert knowledge with learning from data
- Learning:
- parameters (parameter estimation, EM)
- structure (optimization with score functions, e.g., MDL)
- Applications/systems: collaborative filtering (MSBN), fraud detection (AT&T), classification (AutoClass (NASA), TAN-BLT (SRI))
- Future directions: causality, time, model-evaluation criteria, approximate inference/learning, online learning, etc.