Bayesian Networks and Causal Modelling presentation

About This Presentation

Transcript and Presenter's Notes

Title: Bayesian Networks and Causal Modelling

1
Bayesian Networks and Causal Modelling

Ann Nicholson

School of Computer Science and Software
Engineering Monash University
2
Overview

Introduction to Bayesian Networks (BNs)
Summary of BN research projects
Varieties of Causal intervention
PRICAI2004 K. Korb, L. Hope, A. Nicholson, K.
Axnick
Learning Causal Structure
CaMML software

3
Probability theory for representing uncertainty

Assigns a numerical degree of belief between 0
and 1 to facts
e.g. it will rain today is T/F.
P(it will rain today) 0.2 prior probability
(unconditional)
Posterior probability (conditional)
P(it wil rain today rain is forecast) 0.8
Bayes Rule P(HE) P(EH) x P(H)
P(E)

4
Bayesian networks

A Bayesian Network (BN) represents a probability
distribution graphically (directed acyclic
graphs)
Nodes random variables,
R it is raining, discrete values T/F
T temperature, cts or discrete variable
C colour, discrete values red,blue,green
Arcs indicate conditional dependencies between
variables

P(A,S,T) can be decomposed to P(A)P(SA)P(TA)

5
Bayesian networks (cont.)

There is a conditional probability distribution
(CPD or CPT) associated with each node.
probability of each state given parent states

Jane has the flu
Models causal relationship
Jane has a high temp
Models possible sensor error
Thermometer temp reading
6
BN inference

Evidence observation of specific state
Task compute the posterior probabilities for
query node(s) given evidence.

Te
Te
Diagnostic inference
Predictive inference
Intercausal inference
Mixed inference
7
Causal Networks

Arcs follow the direction of causal process
Causal Networks are always BNs
Bayesian Networks aren't always causal

8
Early BN-related projects

DBNS for discrete monitoring (PhD, 1992)
Approximate BN inference algorithms based on a
mutual information measure for relevance (with
Nathalie Jitnah, 1996-1999)
Plan recognition DBNs for predicting users
actions and goals in an adventure game (with
David Albrecht, Ingrid Zukerman, 1997-2000)
DBNs for ambulation monitoring and fall diagnosis
(with biomedical engineering, 1996-2000)
Bayesian Poker (with Kevin Korb, 1996-2003)

9
Knowledge Engineering with BNs

Seabreeze prediction joint project with Bureau
of Meteorology
Comparison of existing simple rule, expert
elicited BN, and BNs from Tetrad-II and CaMML
ITS for decimal misconceptions
Methodology and tools to support knowledge
engineering process
Matilda visualisation of d-separation
Support for sensitivity analysis
Written a textbook
Bayesian Artificial Intelligence, Kevin B. Korb
and Ann E. Nicholson, Chapman Hall / CRC, 2004.
www.csse.monash.edu.au/bai/book

10
Current BN-related projects

BNs for Epidemiology (with Kevin Korb, Charles
Twardy)
ARC Discovery Grant, 2004
Looking at Coronary Heart Disease data sets
Learning hybrid networks cts and discrete
variables.
BNs for supporting meteorological forecasting
process (DSS2004) (with Ph. D student Tal Boneh,
K. Korb, BoM)
Building domain ontology (in Protege) from expert
elicitation
Automatically generating BN fragments
Case studies Fog, hailstorms, rainfall.
Ecological risk assessment
Goulburn Water, native fish abundance
Sydney Harbour Water Quality

11
Other projects

Autonomous aircraft monitoring and replanning
(with Ph.D. student Tim Wilkin, PRICAI2000,
IAV2004)
Dynamic non-uniform abstraction for approximate
planning with MDPs (with Ph.D. student Jiri Baum)

12
Observation and Intervention

Inference from observations
Predictive reasoning (finding effects)
Diagnostic reasoning (finding causes)
Inference with interventions
Predictive reasoning
Not diagnostic reasoning
Causal reasoning shouldn't go against causality.

Te
Te
Th
Th
Diagnostic inference
Predictive inference
13
Pearlian Determinism

Pearl's reasons for determinism
Determinism is intuitive
Counterfactuals and causal explanation only make
sense with a deterministic interpretation
Any indeterministic model can be transformed into
a deterministic model
We see no reason for assuming determinism

14
Defining Intervention I

Arc cutting
More intuitive
Intervention node
Intervention node
More general interventions
Much easier to implement
To simulate arc cutting P(C ?c, Ic)1
Arc cutting isnt general enough

15
Defining Intervention II

We define an intervention on model M as
M augmented with Ic (M') where
Ic has the purpose of manipulating C
Ic is exogenous (has no parents) in M'
Ic directly causes (is a parent of) C
To preserve the original network
PM'(C ?c, Ic) PM' (C ?c)
where ?c are the original parents of C.
We also define P(C) as the intended distribution.

16
Varieties of Intervention Dependency

The degree of dependency of the effect upon
existing parents.
An independent intervention cuts the child off
from its other parents. Thus,
PM'(C ?c, Ic) P(C)
A dependent intervention allows any parent
interaction.

17
Varieties of Intervention Indeterminism

The degree of indeterminism of the effect.
A deterministic intervention sets the child to
one particular state.
A stochastic intervention sets the child to a
positive distribution.
Dependency and Determinism
characterize any intervention
Pearlian interventions are independent and
deterministic

18
Varieties of Intervention Effectiveness

We've found the idea of effectiveness useful.
If P(C) is what's intended and r is the
effectiveness, then
PM'(C ?c, Ic) r P(C) (1-r) PM'(C
?c)
This is a dependent intervention.

19
Demo of Causal Intervention Software
20
Summary of Causal Intervention

A taxonomy of intervention types
More realistic interventions (e.g., partial
effectiveness)
A GUI which handles some varieties of
intervention
Pearlian
Partially effective
Extensible to deal with other types of
interaction explicitly

21
Learning Causal Structure

This is the real problem parameterizing models
is relatively straightforward estimation problem.
Size of the dag space is superexponential
Number of possible orderings n!
Times number of possible arcs Cn2
Minus number of possible cyclic graphs
More exactly (Robinson, 1977)
f(n) ?(-1)i1 Cni 2i(n-i)f(n-i)
so for
n3, f(n)25
n5, f(n)25,000
n10, f(n) ? 4.2x1018

22
Learning Causal Structure

There are two basic methods
Learning from conditional independencies (CI
learning)
Learning using a scoring metric (Metric learning)
CI learning (Verma and Pearl, 1991)
Suppose you have an Oracle who can answer yes or
no to any question of the type
is X conditional independence Y given S?
Then you can learn the correct causal model, up
to statistical equivalence (patterns).

23
Statistical Equivalence

Two causal models H1 and H2 are statistically
equivalent iff they contain the same variables
and joint samples over them provide no
statistical grounds for preferring one over the
other.
Examples
All fully connected models are equivalent.
A ? B ? C and A ? B ? C.
A ? B ? D ? C and A ? B ? D ? C.

24
Statistical Equivalence (cont.)

(Verma and Pearl, 1991) Any two causal models
over the same variables which have the same
skeleton (undirected arcs) and the same directed
v-structures are statistically equivalent.
Chickering (1995) If H1 and H2 are statistically
equivalent, then they have the same maximum
likelihoods relative to any joint samples
max P(eH1,?1) max P(eH2,?2)
where ?i is a parameterization of Hi

25
Other approaches to structure learning

TETRAD II Spirtes, Glymour and Scheines (1993).
Implemented in their PC algorithm
Doesn't handle well with weak links and small
samples (demonstrated empirically in Dai, Korb,
Wallace Wu (1997)).
Bayesian LBN Cooper Herskovits' K2 (1991,
1992)
Compute P(hie) by brute force, under the various
assumptions which reduce the computation of
PCH(h,e) to a polynomial time counting problem.
But the hypothesis space is exponential they go
for dramatic simplification by assuming we know
the temporal ordering of the variables.

26
Learning Variable Order

Reliance upon a given variable order is a major
drawback to K2
And many other algorithms (Buntine 1991, Bouckert
1994, Suzuki 1996, Madigan Raftery 1994)
What's wrong with that?
We want autonomous AI (data mining). If experts
can order the variables they can likely supply
models.
Determining variable ordering is half the
problem. If we know A comes before B, the only
remaining issue is whether there is a link
between the two.
The number of orderings consistent with dags is
exponential (Brightwell Winkler 1990 number
complete). So iterating over all possible
orderings will not scale up.

27
Statistical Equivalence Learners

Heckerman Geiger (1995) advocate learning only
up to statistical equivalence classes (a la
TETRAD II).
Since observational data cannot distinguish btw
equivalent models, there's no point trying to go
further.
? Madigan, Andersson, Perlman Volinsky (1996)
follow this advice, use uniform prior over
equivalence classes.
? Geiger and Heckerman (1994) define Bayesian
metrics for linear and discrete equivalence
classes of models (BGe and BDe)

28
Statistical Equivalence Learners

Wallace Korb (1999) This is not right!
These are causal models they are distinguishable
on experimental data.
Failure to collect some data is no reason to
change prior probabilities.
E.g., If your thermometer topped out at 35C,
you wouldn't treat ? 35C and 34C as equally
likely.
Not all equivalence classes are created equal
A ? B ? C, A ? B ? C, A ? B ? C
A ? B ? C
Within classes some dags should have greater
priors than others E.g.,
LightsOn ? InOffice ? LoggedOn v.
LightsOn ? InOffice ? LoggedOn

29
Full Causal Learners

So a full causal learner is an algorithm that
Learns causal connectedness.
Learns v-structures. Hence, learns equivalence
classes.
Learns full variable order. Hence, learns full
causal structure (order connectedness).
TETRAD II 1, 2.
Madigan et al. Heckerman Geiger (BGe, BDe) 1,
2.
Cooper Herskovits' K2 1.
Lam and Bacchus MDL 1, 2 (partial), 3 (partial).
Wallace, Neil, Korb MML 1, 2, 3.

30
CaMML

Minimum Message Length (Wallace \ Boulton 1968)
uses Shannon's measure of information
I(m) - log P(m)
Applied in reverse, we can compute P(h,e) from
I(h,e).
Given an efficient joint encoding method for the
hypothesis evidence space (i.e., satisfying
Shannon's law), MML
Searches hi for that hypothesis h that
minimizes I(h) I(eh).
Applies a trade-off between
Model simplicity
Data fit
Equivalent to that h that maximizes P(h)P(eh)
--- i.e., P(he).

31
MML search algorithms

MML metrics need to be combined with search.
This has been done three ways
Wallace, Korb, Dai (1996) greedy search
(linear).
Brute force computation of linear extensions
(small models only)
Neil and Korb (1999) genetic algorithms
(linear).
Asymptotic estimator of linear extensions
GA chromosomes causal models
Genetic operators manipulate them
Selection pressure is based on MML
Wallace and Korb (1999) MML sampling (linear,
discrete).
Stochastic sampling through space of totally
ordered causal models
No counting of linear extensions required

32
Empirical Results

A weakness in this area --- and AI generally.
Papers based upon very small models, loose
comparisons.
ALARM often used --- everything gets it to within
1 or 2 arcs.
Neil and Korb (1999) compared CaMML and BGe
(Heckerman Geiger's Bayesian metric over
equivalence classes), using identical GA search
over linear models
On KL distance and topological distance from the
true model, CaMML and BGe performed nearly the
same.
On test prediction accuracy on strict effect
nodes (those with no children), CaMML clearly
outperformed BGe.

33
Extensions to original CaMML

Allow specification of prior on arc
ODonnell, Korb, Nicholson
Useful for combining expert and automated methods
Learning local structure
Logit models (Neill, Wallace, Korb)
Hybrid networks - CPT or decision trees
(ODonnell, Allison, Korb, Hope) (Uses MCMC
search)

Bayesian Networks and Causal Modelling PowerPoint PPT Presentation