A Tutorial on Inference and Learning in Bayesian Networks

1
A Tutorial on Inference and Learning in
Bayesian Networks
  • Irina Rish & Moninder Singh
  • IBM T.J. Watson Research Center
  • {rish, moninder}@us.ibm.com

2
Road map
  • Introduction to Bayesian networks
  • What are BNs: representation, types, etc.
  • Why use BNs: applications (classes) of BNs
  • Information sources, software, etc.
  • Probabilistic inference
  • Exact inference
  • Approximate inference
  • Learning Bayesian networks
  • Learning parameters
  • Learning graph structure
  • Summary

3
Bayesian Networks
The network factorizes the joint distribution into local CPTs: P(A), P(S), P(T|A), P(L|S), P(B|S), P(C|T,L), P(D|T,L,B), whose product gives P(A, S, T, L, B, C, D).
[Lauritzen & Spiegelhalter, 95]
4
Bayesian Networks
  • Structured, graphical representation of
    probabilistic relationships between several
    random variables
  • Explicit representation of conditional
    independencies
  • Missing arcs encode conditional independence
  • Efficient representation of joint pdf
  • Allows arbitrary queries to be answered

P(lung cancer = yes | smoking = no, dyspnoea = yes) = ?
5
Example: Printer Troubleshooting (Microsoft Windows 95)
[Heckerman, 95]
6
Example: Microsoft Pregnancy and Child Care
[Heckerman, 95]
7
Example: Microsoft Pregnancy and Child Care
[Heckerman, 95]
8
Independence Assumptions
9
Independence Assumptions
  • Nodes X and Y are d-connected by nodes in Z along a trail from X to Y if:
  • every head-to-head node along the trail is in Z or has a descendant in Z, and
  • every other node along the trail is not in Z.
  • Nodes X and Y are d-separated by nodes in Z if they are not d-connected by Z along any trail from X to Y.
  • If X and Y are d-separated by Z, then X and Y are conditionally independent given Z.
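One standard way to test the definition above is the moralized ancestral graph criterion: X and Y are d-separated by Z iff they are disconnected in the moralized graph of the ancestral set of X ∪ Y ∪ Z after deleting Z. A minimal sketch (the edge list follows the slides' network; the helper names are ours, not from the tutorial):

```python
from itertools import combinations

def ancestors(dag, nodes):
    """All ancestors of `nodes`, including the nodes themselves."""
    result, stack = set(nodes), list(nodes)
    while stack:
        n = stack.pop()
        for parent, child in dag:
            if child == n and parent not in result:
                result.add(parent)
                stack.append(parent)
    return result

def d_separated(dag, x, y, z):
    """True iff x and y are d-separated by the set z in the DAG."""
    keep = ancestors(dag, {x, y} | set(z))
    # Moralize the ancestral graph: undirected skeleton plus edges
    # "marrying" parents that share a child.
    edges = {frozenset(e) for e in dag if e[0] in keep and e[1] in keep}
    for child in keep:
        parents = [p for p, c in dag if c == child and p in keep]
        for p1, p2 in combinations(parents, 2):
            edges.add(frozenset((p1, p2)))
    # Delete z, then test undirected reachability from x to y.
    reachable, frontier = {x}, [x]
    while frontier:
        n = frontier.pop()
        for e in edges:
            if n in e:
                rest = e - {n}
                if not rest:
                    continue
                other = next(iter(rest))
                if other not in reachable and other not in z:
                    reachable.add(other)
                    frontier.append(other)
    return y not in reachable

# The slides' network: A = visit to Asia, S = Smoking, T = Tuberculosis,
# L = Lung Cancer, B = Bronchitis, C = Chest X-ray, D = Dyspnoea.
dag = [("A", "T"), ("S", "L"), ("S", "B"),
       ("T", "C"), ("L", "C"), ("T", "D"), ("L", "D"), ("B", "D")]
```

For instance, `d_separated(dag, "A", "L", {"S"})` holds because every trail from A to L passes a head-to-head node (C or D) outside Z, while conditioning on the collider C opens the trail A-T-C-L.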

10
Independence Assumptions
  • A variable (node) is conditionally independent of
    its
  • non-descendants given its parents

Smoking
Visit to Asia
Lung Cancer
Bronchitis
Tuberculosis
Chest X-ray
Dyspnoea
11
Independence Assumptions
Cancer is independent of Diet given Exposure to Toxins and Smoking.
[Breese & Koller, 97]
12
Independence Assumptions
What this means is that the joint pdf can be represented as a product of local distributions:
P(A,S,T,L,B,C,D) = P(A) · P(S|A) · P(T|A,S) · P(L|A,S,T) · P(B|A,S,T,L) · P(C|A,S,T,L,B) · P(D|A,S,T,L,B,C)
= P(A) · P(S) · P(T|A) · P(L|S) · P(B|S) · P(C|T,L) · P(D|T,L,B)
13
Independence Assumptions
Thus, the general product rule for Bayesian networks is
P(X1, X2, …, Xn) = ∏_{i=1}^{n} P(Xi | Pa(Xi))
where Pa(Xi) is the set of parents of Xi.
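The product rule above can be executed directly on a toy version of the slides' network. All CPT numbers below are invented for illustration; each variable is binary:

```python
from itertools import product

# Parents of each node, following the slides' structure.
parents = {"A": (), "S": (), "T": ("A",), "L": ("S",),
           "B": ("S",), "C": ("T", "L"), "D": ("T", "L", "B")}

# P(node = 1 | parent values); P(node = 0 | ...) is the complement.
# All numbers are hypothetical.
p1 = {
    "A": {(): 0.01}, "S": {(): 0.5},
    "T": {(0,): 0.01, (1,): 0.05},
    "L": {(0,): 0.01, (1,): 0.10},
    "B": {(0,): 0.30, (1,): 0.60},
    "C": {pa: 0.98 if any(pa) else 0.05 for pa in product((0, 1), repeat=2)},
    "D": {pa: 0.90 if any(pa) else 0.10 for pa in product((0, 1), repeat=3)},
}

order = ["A", "S", "T", "L", "B", "C", "D"]

def joint(assign):
    """P(x1, ..., xn) = product over i of P(xi | Pa(xi))."""
    prob = 1.0
    for v in order:
        pa = tuple(assign[p] for p in parents[v])
        p = p1[v][pa]
        prob *= p if assign[v] == 1 else 1.0 - p
    return prob

# Sanity check: the 2^7 joint probabilities must sum to 1.
total = sum(joint(dict(zip(order, vals)))
            for vals in product((0, 1), repeat=7))
```

The efficiency claim from the earlier slide is visible here: the full joint has 2^7 − 1 free numbers, while the factored form needs only 1+1+2+2+2+4+8 = 20 CPT entries.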
14
The Knowledge Acquisition Task
  • Variables
  • collectively exhaustive, mutually exclusive values
  • clarity test: value should be knowable in principle
  • Structure
  • if data available, can be learned
  • constructed by hand (using expert knowledge)
  • variable ordering matters; causal knowledge usually simplifies
  • Probabilities
  • can be learned from data
  • second decimal usually does not matter; relative probs do
  • sensitivity analysis

15
The Knowledge Acquisition Task
16
The Knowledge Acquisition Task
Naive Bayesian Classifiers [Duda & Hart; Langley 92]; Selective Naive Bayesian Classifiers [Langley & Sage 94]; Conditional Trees [Geiger 92; Friedman et al 97]
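The naive Bayesian classifier mentioned above is the simplest Bayesian-network classifier: the class node is the single parent of every feature. A minimal count-based sketch with Laplace smoothing (the toy data set is invented):

```python
import math
from collections import Counter, defaultdict

def train(rows, labels):
    """Count class frequencies and per-class feature-value frequencies."""
    classes = Counter(labels)
    counts = defaultdict(Counter)      # (feature index, class) -> value counts
    for row, y in zip(rows, labels):
        for i, v in enumerate(row):
            counts[(i, y)][v] += 1
    return classes, counts

def predict(row, classes, counts, n_values=2):
    """argmax_y log P(y) + sum_i log P(x_i | y), with Laplace smoothing."""
    best, best_score = None, float("-inf")
    total = sum(classes.values())
    for y, ny in classes.items():
        score = math.log(ny / total)
        for i, v in enumerate(row):
            score += math.log((counts[(i, y)][v] + 1) / (ny + n_values))
        if score > best_score:
            best, best_score = y, score
    return best

rows = [(1, 1), (1, 0), (0, 1), (0, 0)]      # invented binary features
labels = ["pos", "pos", "neg", "neg"]
classes, counts = train(rows, labels)
```

The "selective" variants cited above additionally choose which features to include; this sketch uses them all.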
17
The Knowledge Acquisition Task
Selective Bayesian Networks [Singh & Provan, 95, 96]
18
What are BNs useful for?
  • Diagnosis: P(cause | symptom) = ?
  • Prediction: P(symptom | cause) = ?
  • Decision-making (given a cost function)
  • Data mining: induce the best model from data

19
What are BNs useful for?
Predictive inference runs from cause to effect; decision making maximizes expected utility.
20
What are BNs useful for?
Value of Information: salient observations yield an assignment of belief over faults (Fault 1, Fault 2, Fault 3, …). Halt? If yes, act now; if no, gather the next best observation (chosen by value of information), update belief with the new observation, and repeat.
21
Why use BNs?
  • Explicit management of uncertainty
  • Modularity implies maintainability
  • Better, flexible, and robust decision making (MEU, VOI)
  • Can be used to answer arbitrary queries (multiple-fault problems)
  • Easy to incorporate prior knowledge
  • Easy to understand

22
Application Examples
  • Intellipath
  • commercial version of Pathfinder
  • lymph-node diseases (60), 100 findings
  • APRI system developed at AT&T Bell Labs
  • learns & uses Bayesian networks from data to identify customers liable to default on bill payments
  • NASA Vista system
  • predicts failures in propulsion systems
  • considers time criticality & suggests highest-utility action
  • dynamically decides what information to show

23
Application Examples
  • Answer Wizard in MS Office 95 / MS Project
  • Bayesian-network-based free-text help facility
  • uses naive Bayesian classifiers
  • Office Assistant in MS Office 97
  • extension of Answer Wizard
  • uses naive Bayesian networks
  • help based on past experience (keyboard/mouse use) and the task the user is currently doing
  • this is the smiley face you get in your MS Office applications

24
Application Examples
  • Microsoft Pregnancy and Child-Care
  • available on MSN in the Health section
  • frequently occurring children's symptoms are linked to expert modules that repeatedly ask parents relevant questions
  • asks the next best question based on the information provided
  • presents articles deemed relevant based on the information provided

25
Application Examples
  • Printer troubleshooting
  • HP bought a 40% stake in HUGIN and is developing printer troubleshooters for HP printers
  • Microsoft has 70 online troubleshooters on their web site
  • use Bayesian networks: multiple-fault models, incorporate utilities
  • Fax machine troubleshooting
  • Ricoh uses Bayesian-network-based troubleshooters at call centers
  • enabled Ricoh to answer twice the number of calls in half the time

26
Application Examples
27
Application Examples
28
Application Examples
29
Online/print resources on BNs
  • Conferences & Journals
  • UAI, ICML, AAAI, AISTAT, KDD
  • MLJ, DMKD, JAIR, IEEE KDD, IJAR, IEEE PAMI
  • Books and Papers
  • "Bayesian Networks without Tears" by Eugene Charniak. AI Magazine, Winter 1991.
  • "Probabilistic Reasoning in Intelligent Systems" by Judea Pearl. Morgan Kaufmann, 1988.
  • "Probabilistic Reasoning in Expert Systems" by Richard Neapolitan. Wiley, 1990.
  • CACM special issue on real-world applications of BNs, March 1995

30
Online/Print Resources on BNs
  • Wealth of online information at www.auai.org, with links to:
  • electronic proceedings for UAI conferences
  • other sites with information on BNs and reasoning under uncertainty
  • several tutorials and important articles
  • research groups & companies working in this area
  • other societies, mailing lists, and conferences

31
Publicly available s/w for BNs
  • List of BN software maintained by Russell Almond
    at bayes.stat.washington.edu/almond/belief.html
  • several free packages generally research only
  • commercial packages most powerful ( expensive)
    is HUGIN others include Netica and Dxpress
  • we are working on developing a Java based BN
    toolkit here at Watson - will also work within
    ABLE

32
Road map
  • Introduction to Bayesian networks
  • What are BNs: representation, types, etc.
  • Why use BNs: applications (classes) of BNs
  • Information sources, software, etc.
  • Probabilistic inference
  • Exact inference
  • Approximate inference
  • Learning Bayesian networks
  • Learning parameters
  • Learning graph structure
  • Summary

33
Probabilistic Inference Tasks
  • Belief updating
  • Finding the most probable explanation (MPE)
  • Finding the maximum a-posteriori (MAP) hypothesis
  • Finding the maximum-expected-utility (MEU) decision

34
Belief Updating
Smoking
Lung Cancer
Bronchitis
X-ray
Dyspnoea
P(lung cancer = yes | smoking = no, dyspnoea = yes) = ?
35
Belief updating: P(X | evidence) = ?
Example: compute P(a | e = 0) in a network over variables A, B, C, D, E by summing the joint over B, C, D, E.
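Belief updating by brute-force enumeration makes the task concrete before the bucket-elimination slides: P(x | e) is the joint summed over all completions consistent with x and e, renormalized. This is exponential in the number of variables, which is exactly what elimination avoids. A mini version of the slides' query on S(moking), L(ung cancer), B(ronchitis), D(yspnoea); CPT numbers are invented:

```python
from itertools import product

p1 = {  # P(var = 1 | parents); hypothetical values
    "S": lambda a: 0.5,
    "L": lambda a: 0.1 if a["S"] else 0.01,
    "B": lambda a: 0.6 if a["S"] else 0.3,
    "D": lambda a: 0.9 if (a["L"] or a["B"]) else 0.1,
}
order = ["S", "L", "B", "D"]

def joint(a):
    p = 1.0
    for v in order:
        q = p1[v](a)
        p *= q if a[v] else 1.0 - q
    return p

def query(x, evidence):
    """P(x = 1 | evidence) by summing the joint over hidden variables."""
    num = den = 0.0
    for vals in product((0, 1), repeat=len(order)):
        a = dict(zip(order, vals))
        if any(a[k] != v for k, v in evidence.items()):
            continue           # inconsistent with the evidence
        den += joint(a)
        if a[x] == 1:
            num += joint(a)
    return num / den

p = query("L", {"S": 0, "D": 1})   # P(lung cancer | smoking=no, dyspnoea=yes)
```

With these invented CPTs the answer is 0.009/0.3456 ≈ 0.026; elimination algorithms compute the same quantity without touching all 2^n assignments.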
36
Bucket elimination: algorithm elim-bel (Dechter, 1996)
37
Finding MPE: algorithm elim-mpe (Dechter, 1996)
Elimination operator
38
Generating the MPE-tuple
39
Complexity of inference: the effect of the ordering
40
Other tasks and algorithms
  • MAP and MEU tasks
  • Similar bucket-elimination algorithms: elim-map, elim-meu (Dechter, 1996)
  • Elimination operation: either summation or maximization
  • Restriction on variable ordering: summation must precede maximization (i.e., hypothesis or decision variables are eliminated last)
  • Other inference algorithms
  • Join-tree clustering
  • Pearl's poly-tree propagation
  • Conditioning, etc.
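The core computation that elim-bel organizes into buckets is sum-product variable elimination: multiply the factors that mention a variable, sum the variable out, and repeat along the ordering. A minimal sketch with factors as dictionaries (binary variables assumed; the chain example at the end is invented):

```python
from itertools import product

class Factor:
    def __init__(self, vs, table):
        self.vs, self.table = list(vs), table   # table: assignment tuple -> value

def multiply(f, g):
    vs = f.vs + [v for v in g.vs if v not in f.vs]
    table = {}
    for vals in product((0, 1), repeat=len(vs)):
        a = dict(zip(vs, vals))
        table[vals] = (f.table[tuple(a[v] for v in f.vs)] *
                       g.table[tuple(a[v] for v in g.vs)])
    return Factor(vs, table)

def sum_out(f, var):
    i = f.vs.index(var)
    vs = f.vs[:i] + f.vs[i + 1:]
    table = {}
    for vals, p in f.table.items():
        key = vals[:i] + vals[i + 1:]
        table[key] = table.get(key, 0.0) + p
    return Factor(vs, table)

def eliminate(factors, ordering):
    """Sum out each variable in `ordering`; return the remaining factor."""
    for var in ordering:
        bucket = [f for f in factors if var in f.vs]     # the variable's bucket
        factors = [f for f in factors if var not in f.vs]
        if not bucket:
            continue
        prod = bucket[0]
        for f in bucket[1:]:
            prod = multiply(prod, f)
        factors.append(sum_out(prod, var))               # the bucket's message
    result = factors[0]
    for f in factors[1:]:
        result = multiply(result, f)
    return result

# P(B) on a tiny chain A -> B, with invented CPTs:
f_a = Factor(["A"], {(0,): 0.6, (1,): 0.4})              # P(A)
f_ba = Factor(["A", "B"], {(0, 0): 0.9, (0, 1): 0.1,
                           (1, 0): 0.2, (1, 1): 0.8})    # P(B | A)
marg = eliminate([f_a, f_ba], ["A"])
```

The ordering-dependence discussed on the complexity slide shows up here as the size of the intermediate factor produced in each bucket.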

41
Relationship with join-tree clustering
A cluster is a set of buckets (a "super-bucket"); e.g., clusters BCE, ADB, ABC.
42
Relationship with Pearl's belief propagation in poly-trees
Causal support; diagnostic support.
Pearl's belief propagation for a single-root query = elim-bel using a topological ordering and super-buckets for families.
Elim-bel, elim-mpe, and elim-map are linear for poly-trees.
43
Road map
  • Introduction to Bayesian networks
  • Probabilistic inference
  • Exact inference
  • Approximate inference
  • Learning Bayesian networks
  • Learning parameters
  • Learning graph structure
  • Summary

44
Inference is NP-hard ⇒ approximations
  • Approximations
  • Local inference
  • Stochastic simulations
  • Variational approximations
  • etc.

45
Local Inference Idea
46
Bucket-elimination approximation: mini-buckets
  • Local inference idea: bound the size of recorded dependencies
  • Computation in a bucket is time and space exponential in the number of variables involved
  • Therefore, partition the functions in a bucket into mini-buckets on smaller numbers of variables

47
Mini-bucket approximation MPE task
Split a bucket into mini-buckets ⇒ bound complexity
48
Approx-mpe(i)
  • Input: i, the max number of variables allowed in a mini-bucket
  • Output: lower bound (P of a sub-optimal solution), upper bound

Example approx-mpe(3) versus elim-mpe
49
Properties of approx-mpe(i)
  • Complexity: O(exp(2i)) time and O(exp(i)) space.
  • Accuracy: determined by the upper/lower (U/L) bound ratio.
  • As i increases, both accuracy and complexity increase.
  • Possible uses of mini-bucket approximations:
  • as anytime algorithms (Dechter and Rish, 1997)
  • as heuristics in best-first search (Kask and Dechter, 1999)
  • Other tasks: similar mini-bucket approximations for belief updating, MAP, and MEU (Dechter and Rish, 1997)

50
Anytime Approximation
51
Empirical Evaluation (Dechter and Rish, 1997; Rish, 1999)
  • Randomly generated networks
  • Uniform random probabilities
  • Random noisy-OR
  • CPCS networks
  • Probabilistic decoding
  • Comparing approx-mpe and anytime-mpe versus elim-mpe

52
Random networks
  • Uniform random: 60 nodes, 90 edges (200 instances)
  • In 80% of cases, a 10-100 times speed-up while U/L < 2
  • Noisy-OR: even better results
  • Exact elim-mpe was infeasible; approx-mpe took 0.1 to 80 sec.

53
CPCS networks: medical diagnosis (noisy-OR model)
Test case: no evidence
54
Effect of evidence
More likely evidence ⇒ higher MPE ⇒ higher accuracy (why?)
Likely evidence versus random (unlikely) evidence
55
Probabilistic decoding
Error-correcting linear block code
State-of-the-art approximate algorithm: iterative belief propagation (IBP) (Pearl's poly-tree algorithm applied to loopy networks)
56
approx-mpe vs. IBP
Bit error rate (BER) as a function of noise
(sigma)
57
Mini-buckets: summary
  • Mini-buckets: a local inference approximation
  • Idea: bound the size of recorded functions
  • Approx-mpe(i): the mini-bucket algorithm for MPE
  • Better results for noisy-OR than for random problems
  • Accuracy increases with decreasing noise
  • Accuracy increases for likely evidence
  • Sparser graphs ⇒ higher accuracy
  • Coding networks: approx-mpe outperforms IBP on low-induced-width codes

58
Road map
  • Introduction to Bayesian networks
  • Probabilistic inference
  • Exact inference
  • Approximate inference
  • Local inference
  • Stochastic simulations
  • Variational approximations
  • Learning Bayesian networks
  • Summary

59
Approximation via Sampling
60
Forward Sampling (logic sampling; Henrion, 1988)
61
Forward sampling (example)
Drawback: high rejection rate!
62
Likelihood Weighting (Fung and Chang, 1990; Shachter and Peot, 1990)
Clamping evidence + forward sampling + weighting samples by evidence likelihood.
Works well for likely evidence!
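The recipe above can be sketched directly: sample non-evidence nodes forward in topological order, clamp evidence nodes, and weight each sample by the likelihood of its clamped values; no samples are rejected, unlike logic sampling. This reuses the toy S, L, B, D network style (CPT numbers invented):

```python
import random

p1 = {  # P(var = 1 | parents); hypothetical values
    "S": lambda a: 0.5,
    "L": lambda a: 0.1 if a["S"] else 0.01,
    "B": lambda a: 0.6 if a["S"] else 0.3,
    "D": lambda a: 0.9 if (a["L"] or a["B"]) else 0.1,
}
order = ["S", "L", "B", "D"]   # a topological ordering

def likelihood_weighting(query, evidence, n=200_000, seed=0):
    rng = random.Random(seed)
    num = den = 0.0
    for _ in range(n):
        a, w = {}, 1.0
        for v in order:
            p = p1[v](a)
            if v in evidence:
                a[v] = evidence[v]              # clamp the evidence node...
                w *= p if a[v] else 1.0 - p     # ...and weight by its likelihood
            else:
                a[v] = 1 if rng.random() < p else 0   # forward-sample
        den += w
        if a[query] == 1:
            num += w
    return num / den

est = likelihood_weighting("L", {"S": 0, "D": 1})
```

For this network the exact answer is about 0.026, and the weighted estimate converges to it without discarding any samples.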
63
Gibbs Sampling (Geman and Geman, 1984)
Markov Chain Monte Carlo (MCMC): create a Markov chain of samples.
Advantage: guaranteed to converge to P(X). Disadvantage: convergence may be slow.
64
Gibbs Sampling (cont'd) (Pearl, 1988)
Markov blanket
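A Gibbs sampler repeatedly resamples each non-evidence variable from its conditional given its Markov blanket, P(x | rest) ∝ P(x | pa(x)) · ∏ over children c of P(c | pa(c)). A sketch on the same toy S, L, B, D network (CPT numbers invented); for a network this small we score candidate values with the full joint, which is proportional to the Markov-blanket conditional:

```python
import random

p1 = {  # P(var = 1 | parents); hypothetical values
    "S": lambda a: 0.5,
    "L": lambda a: 0.1 if a["S"] else 0.01,
    "B": lambda a: 0.6 if a["S"] else 0.3,
    "D": lambda a: 0.9 if (a["L"] or a["B"]) else 0.1,
}
order = ["S", "L", "B", "D"]

def local_prob(a, v):
    p = p1[v](a)
    return p if a[v] else 1.0 - p

def gibbs(query, evidence, n=50_000, burn_in=1_000, seed=0):
    rng = random.Random(seed)
    a = {v: evidence.get(v, rng.randint(0, 1)) for v in order}
    hidden = [v for v in order if v not in evidence]
    hits = 0
    for t in range(n + burn_in):
        for v in hidden:
            score = []
            for val in (0, 1):
                a[v] = val
                # Full joint; proportional to P(v | Markov blanket).
                s = 1.0
                for u in order:
                    s *= local_prob(a, u)
                score.append(s)
            a[v] = 1 if rng.random() < score[1] / (score[0] + score[1]) else 0
        if t >= burn_in and a[query] == 1:
            hits += 1
    return hits / n

est = gibbs("L", {"S": 0, "D": 1})
```

The burn-in discards early, initialization-dependent samples; with extreme (near-deterministic) CPTs the chain can mix very slowly, which is the disadvantage noted on the slide.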
65
Road map
  • Introduction to Bayesian networks
  • Probabilistic inference
  • Exact inference
  • Approximate inference
  • Local inference
  • Stochastic simulations
  • Variational approximations
  • Learning Bayesian networks
  • Summary

66
Variational Approximations
  • Idea
  • a variational transformation of the CPDs simplifies inference
  • Advantages
  • compute upper and lower bounds on P(Y)
  • usually faster than sampling techniques
  • Disadvantages
  • more complex and less general: must be derived for each particular form of CPD functions

67
Variational bounds: example
log(x)
This approach can be generalized to any concave (convex) function in order to compute its upper (lower) bounds: convex duality (Jaakkola and Jordan, 1997).
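For the log(x) example, convex duality gives a family of linear upper bounds: for every λ > 0, log(x) ≤ λx − log(λ) − 1, with equality at λ = 1/x. Minimizing the bound over the variational parameter λ recovers log(x) exactly. A numeric check:

```python
import math

def upper_bound(x, lam):
    """Variational upper bound on log(x), valid for any lam > 0."""
    return lam * x - math.log(lam) - 1.0

x = 2.5
# The bound holds for any choice of lam...
assert all(upper_bound(x, lam) >= math.log(x) - 1e-12
           for lam in (0.1, 0.4, 1.0, 3.0))
# ...and is tight at lam = 1/x, where the line touches the curve.
assert abs(upper_bound(x, 1.0 / x) - math.log(x)) < 1e-12
```

Replacing an intractable concave CPD term by such a linear bound, then optimizing over λ, is exactly the move the QMR-DT slides below exploit.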
68
Convex duality (Jaakkola and Jordan, 1997)
69
Example: QMR-DT network (Quick Medical Reference, Decision-Theoretic; Shwe et al., 1991)
600 diseases, 4000 findings, noisy-OR model
70
Inference in QMR-DT
Positive evidence couples the disease nodes; without it the model stays factorized.
Inference complexity: O(exp(min{p, k})), where p = # of positive findings and k = max family size (Heckerman, 1989 (Quickscore); Rish and Dechter, 1998).
71
Variational approach to QMR-DT (Jaakkola and Jordan, 1997)
The effect of positive evidence is now factorized (diseases are decoupled).
72
Variational approximations
  • Bounds on local CPDs yield a bound on the posterior
  • Two approaches: sequential and block
  • Sequential: applies the variational transformation to (a subset of) nodes sequentially during inference, using a heuristic node ordering, then optimizes over the variational parameters
  • Block: selects in advance the nodes to be transformed, then selects variational parameters minimizing the KL-distance between the true and approximate posteriors

73
Block approach
74
Inference in BNs: summary
  • Exact inference is often intractable ⇒ need approximations
  • Approximation principles:
  • approximating elimination: local inference, bounding the size of dependencies among variables (cliques in the problem's graph): mini-buckets, IBP
  • other approximations: stochastic simulations, variational techniques, etc.
  • Further research:
  • combining orthogonal approximation approaches
  • better understanding of what works well where: which approximation suits which problem structure
  • other approximation paradigms (e.g., other ways of approximating probabilities, constraints, cost functions)

75
Road map
  • Introduction to Bayesian networks
  • Probabilistic inference
  • Exact inference
  • Approximate inference
  • Learning Bayesian networks
  • Learning parameters
  • Learning graph structure
  • Summary

76
Why learn Bayesian networks?
  • Efficient representation and inference
  • Handling missing data: <1.3, 2.8, ??, 0, 1>

77
Learning Bayesian Networks
78
Learning Parameters: complete data
  • ML-estimate

79
Learning graph structure
  • Heuristic search
  • Complete data: local computations. Incomplete data (score non-decomposable): stochastic methods
  • Constraint-based methods
  • data impose independence relations (constraints)
80
Learning BNs: incomplete data
  • Learning parameters
  • EM algorithm [Lauritzen, 95]
  • Gibbs sampling [Heckerman, 96]
  • Gradient descent [Russell et al., 96]
  • Learning both structure and parameters
  • Sum over missing values [Cooper & Herskovits, 92; Cooper, 95]
  • Monte-Carlo approaches [Heckerman, 96]
  • Gaussian approximation [Heckerman, 96]
  • Structural EM [Friedman, 98]
  • EM and Multiple Imputation [Singh 97, 98, 00]

81
Learning Parameters: incomplete data
EM algorithm: iterate until convergence
82
Learning Parameters: incomplete data (Lauritzen, 95)
  • Complete-data log-likelihood
  • E step: compute E(N_ijk | Y_obs, θ)
  • M step: compute θ_ijk = E(N_ijk | Y_obs, θ) / E(N_ij | Y_obs, θ)
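The E and M steps can be sketched on the smallest interesting case: a two-node network S → D where S is sometimes missing. The E step fills in expected counts E(N | Y_obs, θ) using P(S | D) under the current parameters; the M step re-estimates parameters from those expected counts, exactly as in the complete-data ML formula. Data and numbers are invented; `None` marks a missing value:

```python
def em(data, iters=100):
    ps, pd = 0.5, [0.5, 0.5]            # initial P(S=1), P(D=1 | S=s)
    for _ in range(iters):
        n_s = 0.0                       # expected count of S=1
        n_d = [0.0, 0.0]                # expected count of (S=s, D=1)
        n_tot = [0.0, 0.0]              # expected count of S=s
        for s, d in data:
            if s is None:               # E step: posterior P(S=1 | D=d)
                like1 = ps * (pd[1] if d else 1 - pd[1])
                like0 = (1 - ps) * (pd[0] if d else 1 - pd[0])
                w1 = like1 / (like1 + like0)
            else:
                w1 = float(s)           # observed: hard count
            n_s += w1
            n_tot[0] += 1 - w1
            n_tot[1] += w1
            if d:
                n_d[0] += 1 - w1
                n_d[1] += w1
        # M step: expected-count ratios, as in the complete-data case.
        ps = n_s / len(data)
        pd = [n_d[s] / n_tot[s] for s in (0, 1)]
    return ps, pd

data = [(1, 1), (1, 1), (0, 0), (0, 1), (None, 1), (None, 0)]
ps, pd = em(data)
```

Each iteration is guaranteed not to decrease the observed-data likelihood, but EM converges only to a local maximum, which is why multiple restarts are common.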

83
Learning structure incomplete data
  • Depends on the type of missing data: missing independently of anything else (MCAR) or missing based on the values of other variables (MAR)
  • While MCAR can be handled by decomposable scores, MAR cannot
  • For likelihood-based methods, there is no need to explicitly model the missing-data mechanism
  • Very few attempts at MAR: stochastic methods

84
Learning structure incomplete data
  • Approximate EM by using Multiple Imputation to yield an efficient Monte-Carlo method [Singh 97, 98, 00]
  • trade-off between performance & quality
  • learned network is almost optimal
  • approximate the complete-data log-likelihood function using Multiple Imputation
  • yields a decomposable score, dependent only on each node & its parents
  • converges to a local maximum of the observed-data likelihood

85
Learning structure incomplete data
86
Scoring functions: Minimum Description Length (MDL)
  • Learning ⇔ data compression
  • Other MDL-like scores: BIC (Bayesian Information Criterion)
  • Bayesian score (BDe) is asymptotically equivalent to MDL
MDL = DL(Model) + DL(Data | model)
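The two description-length terms can be sketched in BIC form: score(S) = log-likelihood of the data under ML parameters minus (log m / 2) times the number of free parameters. The first term plays the role of −DL(Data | model), the second of −DL(Model), and the score decomposes over families. Binary variables assumed; the toy data are invented:

```python
import math
from collections import Counter

def family_score(cases, node, parents):
    """BIC contribution of one family (node given its parents)."""
    m = len(cases)
    n_ijk, n_ij = Counter(), Counter()
    for c in cases:
        j = tuple(c[p] for p in parents)
        n_ijk[(j, c[node])] += 1
        n_ij[j] += 1
    loglik = sum(n * math.log(n / n_ij[j]) for (j, _), n in n_ijk.items())
    n_params = 2 ** len(parents)     # one free parameter per parent config
    return loglik - 0.5 * math.log(m) * n_params

def bic(cases, structure):
    """structure: {node: parent list}. Higher is better."""
    return sum(family_score(cases, v, ps) for v, ps in structure.items())

# Invented data with a strong S-L correlation (20 cases):
cases = ([{"S": 1, "L": 1}] * 8 + [{"S": 1, "L": 0}] * 2 +
         [{"S": 0, "L": 1}] * 2 + [{"S": 0, "L": 0}] * 8)
with_edge = {"S": [], "L": ["S"]}
no_edge = {"S": [], "L": []}
```

On this data the likelihood gain from the S → L edge outweighs its penalty, so `bic(cases, with_edge)` exceeds `bic(cases, no_edge)`; with weak correlation the penalty term would flip the preference.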
87
Learning Structure plus Parameters
The number of models is super-exponential. Alternatives: model selection or model averaging.
88
Model Selection
Generally, choose a single model M; equivalent to saying P(M|D) = 1.
The task is now to 1) define a metric to decide which model is best, and 2) search for that model through the space of all models.
89
One Reasonable Score: Posterior Probability of a Structure
(structure prior × parameter prior × likelihood)
90
Global and Local Predictive Scores [Spiegelhalter et al 93]
Global (Bayes factor) score:
log p(D | S^h) = Σ_{l=1}^{m} log p(x_l | x_1, …, x_{l-1}, S^h)
e.g., log p(x_1, x_2, x_3 | S^h) = log p(x_1 | S^h) + log p(x_2 | x_1, S^h) + log p(x_3 | x_1, x_2, S^h)
Local is useful for diagnostic problems
91
Local Predictive Score (Spiegelhalter et al., 1993)
92
Exact computation of p(D | S^h)
  • No missing data
  • Cases are independent, given the model
  • Uniform priors on parameters
  • Discrete variables
[Cooper & Herskovits, 92]
93
Bayesian Dirichlet Score (Cooper and Herskovits, 1991)
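Under the assumptions listed above, the marginal likelihood has the Cooper-Herskovits (K2) closed form p(D | S) = ∏_i ∏_j [(r_i − 1)! / (N_ij + r_i − 1)!] ∏_k N_ijk!, computed here in log space with `lgamma`. Binary variables (r_i = 2) assumed; the toy data are invented:

```python
import math
from collections import Counter

def log_k2_family(cases, node, parents, r=2):
    """log of one family's factor in the Cooper-Herskovits score."""
    n_ijk, n_ij = Counter(), Counter()
    for c in cases:
        j = tuple(c[p] for p in parents)
        n_ijk[(j, c[node])] += 1
        n_ij[j] += 1
    score = 0.0
    for j, nij in n_ij.items():
        # (r-1)! / (N_ij + r - 1)!  via  Gamma(r) / Gamma(N_ij + r)
        score += math.lgamma(r) - math.lgamma(nij + r)
        for k in range(r):
            score += math.lgamma(n_ijk[(j, k)] + 1)   # N_ijk!
    return score

def log_k2(cases, structure):
    """structure: {node: parent list}. Higher is better."""
    return sum(log_k2_family(cases, v, ps) for v, ps in structure.items())

# Invented data with a strong S-L correlation (20 cases):
cases = ([{"S": 1, "L": 1}] * 8 + [{"S": 1, "L": 0}] * 2 +
         [{"S": 0, "L": 1}] * 2 + [{"S": 0, "L": 0}] * 8)
with_edge = {"S": [], "L": ["S"]}
no_edge = {"S": [], "L": []}
```

Because the score factorizes over families, a greedy search (as in K2) can evaluate single-edge changes locally instead of rescoring the whole network.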
94
  • Learning BNs without specifying an ordering
  • n! orderings; the ordering greatly affects the quality of the network learned
  • use conditional independence tests and d-separation to get an ordering
[Singh & Valtorta 95]
95
  • Learning BNs via the MDL principle
  • Idea: the best model is the one that gives the most compact representation of the data
  • So, encode the data using the model plus encode the model; minimize this
[Lam & Bacchus, 93]
96
Learning BNs: summary
  • Bayesian networks: graphical probabilistic models
  • Efficient representation and inference
  • Expert knowledge + learning from data
  • Learning:
  • parameters (parameter estimation, EM)
  • structure (optimization with score functions, e.g., MDL)
  • Applications/systems: collaborative filtering (MSBN), fraud detection (AT&T), classification (AutoClass (NASA), TAN-BLT (SRI))
  • Future directions: causality, time, model evaluation criteria, approximate inference/learning, on-line learning, etc.