Bayesian Networks and Causal Modelling - PowerPoint PPT Presentation

1
Bayesian Networks and Causal Modelling
  • Ann Nicholson

School of Computer Science and Software
Engineering Monash University
2
Overview
  • Introduction to Bayesian Networks (BNs)
  • Summary of BN research projects
  • Varieties of causal intervention (PRICAI 2004: K. Korb, L. Hope,
    A. Nicholson, K. Axnick)
  • Learning Causal Structure
  • CaMML software

3
Probability theory for representing uncertainty
  • Assigns a numerical degree of belief between 0
    and 1 to facts
  • e.g. "it will rain today" is T/F.
  • P(it will rain today) = 0.2: prior probability
    (unconditional)
  • Posterior probability (conditional)
  • P(it will rain today | rain is forecast) = 0.8
  • Bayes' Rule: P(H|E) = P(E|H) × P(H) / P(E)
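A minimal Python sketch of Bayes' Rule applied to the rain example. The prior of 0.2 is taken from the slide; the two likelihoods, P(forecast | rain) and P(forecast | no rain), are illustrative assumptions chosen so that the posterior comes out near the 0.8 quoted above.

```python
# Worked Bayes' Rule example for the rain / forecast bullets.
# Only the prior comes from the slide; both likelihoods are assumed values.

def posterior(prior_h, p_e_given_h, p_e_given_not_h):
    """Return P(H | E), expanding P(E) by the law of total probability."""
    p_e = p_e_given_h * prior_h + p_e_given_not_h * (1.0 - prior_h)
    return p_e_given_h * prior_h / p_e

p_rain = 0.2                    # prior P(rain), from the slide
p_forecast_given_rain = 0.8     # assumed likelihood
p_forecast_given_dry = 0.05     # assumed false-alarm rate

p_rain_given_forecast = posterior(
    p_rain, p_forecast_given_rain, p_forecast_given_dry
)
print(round(p_rain_given_forecast, 3))
```

With different assumed likelihoods the posterior changes, which is the point of the rule: the forecast only matters to the extent that it is more likely on rainy days than on dry ones.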

4
Bayesian networks
  • A Bayesian Network (BN) represents a probability
    distribution graphically (directed acyclic
    graphs)
  • Nodes: random variables, e.g.
  • R: it is raining, discrete values T/F
  • T: temperature, continuous or discrete variable
  • C: colour, discrete values {red, blue, green}
  • Arcs indicate conditional dependencies between
    variables
  • P(A,S,T) can be decomposed to P(A) P(S|A) P(T|A)
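The decomposition in the last bullet can be sketched in Python as follows. The network shape (A is the parent of both S and T) follows the slide; the Boolean domains and all probability values are hypothetical.

```python
from itertools import product

# Hypothetical CPTs for a three-node network A -> S, A -> T (all Boolean).
p_a = {True: 0.3, False: 0.7}                        # P(A)
p_s_given_a = {True: {True: 0.9, False: 0.1},        # P(S | A), keyed [a][s]
               False: {True: 0.2, False: 0.8}}
p_t_given_a = {True: {True: 0.6, False: 0.4},        # P(T | A), keyed [a][t]
               False: {True: 0.1, False: 0.9}}

def joint(a, s, t):
    """P(A, S, T) = P(A) * P(S | A) * P(T | A)."""
    return p_a[a] * p_s_given_a[a][s] * p_t_given_a[a][t]

# The eight joint entries are recovered from 1 + 2 + 2 free parameters.
total = sum(joint(a, s, t) for a, s, t in product([True, False], repeat=3))
print(total)
```

The factored form needs five free parameters here instead of the seven a full joint table over three Boolean variables would need, and the saving grows quickly with network size.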

5
Bayesian networks (cont.)
  • There is a conditional probability distribution
    (CPD or CPT) associated with each node.
  • probability of each state given parent states

[Figure: Jane has the flu → Jane has a high temp → Thermometer temp reading. The first arc models a causal relationship; the second models possible sensor error.]
6
BN inference
  • Evidence: observation of a specific state
  • Task: compute the posterior probabilities for
    query node(s) given evidence.

[Figure: types of BN inference: diagnostic, predictive, intercausal and mixed.]
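As a rough illustration of the inference task, the sketch below computes posteriors by brute-force enumeration on a small chain Flu → Te → Th (flu, high temperature, high thermometer reading). The CPT numbers are hypothetical, and enumeration is only one simple way to do this; it is not suggested to be the algorithm used in the projects described in the talk. Intercausal and mixed inference need a node with several parents and are not shown.

```python
from itertools import product

# Hypothetical CPTs for the chain Flu -> Te -> Th (all Boolean).
P_FLU = {True: 0.05, False: 0.95}
P_TE = {True: {True: 0.9, False: 0.1},      # P(Te | Flu), keyed [flu][te]
        False: {True: 0.2, False: 0.8}}
P_TH = {True: {True: 0.95, False: 0.05},    # P(Th | Te), keyed [te][th]
        False: {True: 0.15, False: 0.85}}

def joint(flu, te, th):
    return P_FLU[flu] * P_TE[flu][te] * P_TH[te][th]

def query(var, evidence):
    """P(var = True | evidence), summing the joint over consistent worlds."""
    names = ("flu", "te", "th")
    num = den = 0.0
    for values in product([True, False], repeat=3):
        world = dict(zip(names, values))
        if any(world[k] != v for k, v in evidence.items()):
            continue  # world contradicts the evidence
        p = joint(**world)
        den += p
        if world[var]:
            num += p
    return num / den

diagnostic = query("flu", {"th": True})   # effect observed, reason back to cause
predictive = query("th", {"flu": True})   # cause observed, reason forward to effect
print(diagnostic, predictive)
```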
7
Causal Networks
  • Arcs follow the direction of causal process
  • Causal Networks are always BNs
  • Bayesian Networks aren't always causal

8
Early BN-related projects
  • DBNs for discrete monitoring (PhD, 1992)
  • Approximate BN inference algorithms based on a
    mutual information measure for relevance (with
    Nathalie Jitnah, 1996-1999)
  • Plan recognition: DBNs for predicting users'
    actions and goals in an adventure game (with
    David Albrecht, Ingrid Zukerman, 1997-2000)
  • DBNs for ambulation monitoring and fall diagnosis
    (with biomedical engineering, 1996-2000)
  • Bayesian Poker (with Kevin Korb, 1996-2003)

9
Knowledge Engineering with BNs
  • Seabreeze prediction: joint project with the
    Bureau of Meteorology
  • Comparison of existing simple rule, expert
    elicited BN, and BNs from Tetrad-II and CaMML
  • Intelligent tutoring system (ITS) for decimal
    misconceptions
  • Methodology and tools to support the knowledge
    engineering process
  • Matilda: visualisation of d-separation
  • Support for sensitivity analysis
  • Written a textbook:
  • Bayesian Artificial Intelligence, Kevin B. Korb
    and Ann E. Nicholson, Chapman & Hall/CRC, 2004.
  • www.csse.monash.edu.au/bai/book

10
Current BN-related projects
  • BNs for Epidemiology (with Kevin Korb, Charles
    Twardy)
  • ARC Discovery Grant, 2004
  • Looking at Coronary Heart Disease data sets
  • Learning hybrid networks: continuous and discrete
    variables.
  • BNs for supporting the meteorological forecasting
    process (DSS2004) (with Ph.D. student Tal Boneh,
    K. Korb, BoM)
  • Building domain ontology (in Protege) from expert
    elicitation
  • Automatically generating BN fragments
  • Case studies: fog, hailstorms, rainfall.
  • Ecological risk assessment
  • Goulburn Water, native fish abundance
  • Sydney Harbour Water Quality

11
Other projects
  • Autonomous aircraft monitoring and replanning
    (with Ph.D. student Tim Wilkin, PRICAI2000,
    IAV2004)
  • Dynamic non-uniform abstraction for approximate
    planning with MDPs (with Ph.D. student Jiri Baum)

12
Observation and Intervention
  • Inference from observations
  • Predictive reasoning (finding effects)
  • Diagnostic reasoning (finding causes)
  • Inference with interventions
  • Predictive reasoning
  • Not diagnostic reasoning
  • Causal reasoning shouldn't go against causality.

[Figure: Te → Th networks contrasting diagnostic inference and predictive inference.]
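The contrast between observing and intervening can be sketched on a two-node network Te → Th (high temperature causes a high thermometer reading). The numbers are hypothetical, and the intervention is modelled here in the simplest arc-cutting style: forcing a node to a value removes the influence of its parents.

```python
# Hypothetical parameters for Te -> Th (both Boolean).
P_TE = 0.1                                   # prior P(Te = True)
P_TH_GIVEN_TE = {True: 0.95, False: 0.15}    # P(Th = True | Te)

def p_te_given_observed_th():
    """Diagnostic reasoning: observing Th = True raises belief in Te."""
    num = P_TH_GIVEN_TE[True] * P_TE
    den = num + P_TH_GIVEN_TE[False] * (1.0 - P_TE)
    return num / den

def p_te_given_forced_th():
    """Forcing Th = True cuts the arc Te -> Th, so Te keeps its prior."""
    return P_TE

def p_th_given_forced_te(te_value):
    """Intervening on the cause still propagates forward to the effect."""
    return P_TH_GIVEN_TE[te_value]

print(p_te_given_observed_th(), p_te_given_forced_th(), p_th_given_forced_te(True))
```

Holding the thermometer over a flame makes it read high but tells us nothing about the patient, which is why diagnostic reasoning is not licensed under intervention while predictive reasoning is.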
13
Pearlian Determinism
  • Pearl's reasons for determinism
  • Determinism is intuitive
  • Counterfactuals and causal explanation only make
    sense with a deterministic interpretation
  • Any indeterministic model can be transformed into
    a deterministic model
  • We see no reason for assuming determinism

14
Defining Intervention I
  • Arc cutting
  • More intuitive
  • Intervention node
  • More general interventions
  • Much easier to implement
  • To simulate arc cutting: P(C | π_c, I_c) = 1
  • Arc cutting isn't general enough

15
Defining Intervention II
  • We define an intervention on model M as
  • M augmented with I_c (giving M'), where
  • I_c has the purpose of manipulating C
  • I_c is exogenous (has no parents) in M'
  • I_c directly causes (is a parent of) C
  • To preserve the original network when no
    intervention occurs:
  • P_M'(C | π_c, ¬I_c) = P_M(C | π_c)
  • where π_c are the original parents of C.
  • We also define P(C) as the intended distribution.

16
Varieties of Intervention: Dependency
  • The degree of dependency of the effect upon
    existing parents.
  • An independent intervention cuts the child off
    from its other parents. Thus,
  • P_M'(C | π_c, I_c) = P(C)
  • A dependent intervention allows any parent
    interaction.

17
Varieties of Intervention: Indeterminism
  • The degree of indeterminism of the effect.
  • A deterministic intervention sets the child to
    one particular state.
  • A stochastic intervention sets the child to a
    positive distribution.
  • Dependency and determinism together characterize
    any intervention
  • Pearlian interventions are independent and
    deterministic

18
Varieties of Intervention: Effectiveness
  • We've found the idea of effectiveness useful.
  • If P(C) is what's intended and r is the
    effectiveness, then
  • P_M'(C | π_c, I_c) = r × P(C) + (1 − r) × P_M(C | π_c)
  • This is a dependent intervention.
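A hedged sketch of how a partially effective intervention could be turned into a conditional probability table for C when the intervention node is active. It reads the effectiveness formula as a mixture of the intended distribution and C's original CPT; the function name, the data layout and all numbers are assumptions for illustration, not the API of the software demonstrated in the talk.

```python
def intervened_cpt(original_cpt, intended, r):
    """Mix the intended distribution with the original CPT of C.

    original_cpt: {parent_state: {c_state: prob}}, C's CPT before intervening
    intended:     {c_state: prob}, the intended distribution over C
    r:            effectiveness in [0, 1]
    Returns the CPT of C given each parent state when the intervention is active.
    """
    if not 0.0 <= r <= 1.0:
        raise ValueError("effectiveness r must lie in [0, 1]")
    return {
        pa: {c: r * intended[c] + (1.0 - r) * dist[c] for c in dist}
        for pa, dist in original_cpt.items()
    }

# Hypothetical CPT for C with a single Boolean parent.
cpt = {True: {"hi": 0.8, "lo": 0.2}, False: {"hi": 0.3, "lo": 0.7}}
target = {"hi": 0.0, "lo": 1.0}                  # deterministic intended outcome

pearlian = intervened_cpt(cpt, target, r=1.0)    # fully effective: parents ignored
partial = intervened_cpt(cpt, target, r=0.5)     # parents still influence C
print(pearlian, partial)
```

With r = 1 and a deterministic target this reduces to the independent, deterministic (Pearlian) case; with r < 1 the rows still differ by parent state, which is what makes the intervention dependent.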

19
Demo of Causal Intervention Software
20
Summary of Causal Intervention
  • A taxonomy of intervention types
  • More realistic interventions (e.g., partial
    effectiveness)
  • A GUI which handles some varieties of
    intervention
  • Pearlian
  • Partially effective
  • Extensible to deal with other types of
    interaction explicitly

21
Learning Causal Structure
  • This is the real problem; parameterizing models
    is a relatively straightforward estimation problem.
  • Size of the dag space is superexponential
  • Number of possible orderings: n!
  • Times number of possible arcs: C(n, 2)
  • Minus number of possible cyclic graphs
  • More exactly (Robinson, 1977)
  • $f(n) = \sum_{i=1}^{n} (-1)^{i+1} \binom{n}{i} 2^{i(n-i)} f(n-i)$, with $f(0) = 1$
  • so for
  • n = 3, f(n) = 25
  • n = 5, f(n) = 29,281
  • n = 10, f(n) ≈ 4.2 × 10^18
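Robinson's recurrence for the number of labelled DAGs on n nodes is easy to evaluate directly. The sketch below assumes the usual base case f(0) = 1, which the slide does not state.

```python
from functools import lru_cache
from math import comb

@lru_cache(maxsize=None)
def num_dags(n):
    """Number of labelled DAGs on n nodes, via Robinson's recurrence."""
    if n == 0:
        return 1  # base case (assumed): one empty graph
    return sum(
        (-1) ** (i + 1) * comb(n, i) * 2 ** (i * (n - i)) * num_dags(n - i)
        for i in range(1, n + 1)
    )

for n in (3, 5, 10):
    print(n, num_dags(n))
```

Evaluating it gives 25 for n = 3, 29,281 for n = 5, and roughly 4.2 × 10^18 for n = 10, which is why exhaustive search over structures is hopeless beyond a handful of variables.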

22
Learning Causal Structure
  • There are two basic methods
  • Learning from conditional independencies (CI
    learning)
  • Learning using a scoring metric (Metric learning)
  • CI learning (Verma and Pearl, 1991)
  • Suppose you have an Oracle who can answer yes or
    no to any question of the type
  • Is X conditionally independent of Y given S?
  • Then you can learn the correct causal model, up
    to statistical equivalence (patterns).
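To make the idea of an oracle concrete, the sketch below answers conditional independence questions exactly from a known joint distribution over Boolean variables. This is an idealisation: a real CI learner has only data and must use statistical tests instead. The chain A → B → C and its numbers are hypothetical.

```python
from itertools import product

def is_cond_indep(joint, names, x, y, s=(), tol=1e-9):
    """Exact test of 'X independent of Y given S' on a full joint table.

    joint: {tuple of Boolean values, ordered as in names: probability}
    Checks P(x, y, s) * P(s) == P(x, s) * P(y, s) for every assignment.
    """
    def marginal(assignment):
        return sum(
            p for world, p in joint.items()
            if all(world[names.index(v)] == val for v, val in assignment.items())
        )

    for xv, yv in product([True, False], repeat=2):
        for svals in product([True, False], repeat=len(s)):
            given = dict(zip(s, svals))
            lhs = marginal({x: xv, y: yv, **given}) * marginal(given)
            rhs = marginal({x: xv, **given}) * marginal({y: yv, **given})
            if abs(lhs - rhs) > tol:
                return False
    return True

# Hypothetical chain A -> B -> C used to exercise the oracle.
P_A = {True: 0.3, False: 0.7}
P_B = {True: {True: 0.9, False: 0.1}, False: {True: 0.2, False: 0.8}}  # P(B | A)
P_C = {True: {True: 0.7, False: 0.3}, False: {True: 0.1, False: 0.9}}  # P(C | B)
NAMES = ("A", "B", "C")
JOINT = {
    (a, b, c): P_A[a] * P_B[a][b] * P_C[b][c]
    for a, b, c in product([True, False], repeat=3)
}

print(is_cond_indep(JOINT, NAMES, "A", "C", ("B",)))  # A and C given B
print(is_cond_indep(JOINT, NAMES, "A", "C"))          # A and C marginally
```

In the chain, A and C are dependent until B is known; a CI learner uses exactly this pattern of answers to decide which variables are adjacent.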

23
Statistical Equivalence
  • Two causal models H1 and H2 are statistically
    equivalent iff they contain the same variables
    and joint samples over them provide no
    statistical grounds for preferring one over the
    other.
  • Examples
  • All fully connected models are equivalent.
  • A → B → C and A ← B ← C.
  • A → B → D ← C and A ← B → D ← C.

24
Statistical Equivalence (cont.)
  • Verma and Pearl (1991): Any two causal models
    over the same variables which have the same
    skeleton (undirected arcs) and the same directed
    v-structures are statistically equivalent.
  • Chickering (1995): If H1 and H2 are statistically
    equivalent, then they have the same maximum
    likelihoods relative to any joint samples:
  • max_θ1 P(e | H1, θ1) = max_θ2 P(e | H2, θ2)
  • where θi is a parameterization of Hi
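The Verma and Pearl criterion translates into a short check: compare skeletons, then compare v-structures. In this sketch a DAG is represented simply as a set of (parent, child) arcs, so both graphs are assumed to be over the same variables and isolated nodes are not represented; the function names are illustrative.

```python
from itertools import combinations

def skeleton(arcs):
    """Undirected version of a DAG given as a set of (parent, child) arcs."""
    return {frozenset(arc) for arc in arcs}

def v_structures(arcs):
    """Colliders a -> c <- b whose two parents a and b are not adjacent."""
    skel = skeleton(arcs)
    parents = {}
    for parent, child in arcs:
        parents.setdefault(child, set()).add(parent)
    found = set()
    for child, pas in parents.items():
        for a, b in combinations(sorted(pas), 2):
            if frozenset((a, b)) not in skel:
                found.add((a, child, b))
    return found

def statistically_equivalent(arcs1, arcs2):
    """Same skeleton and same v-structures (Verma and Pearl criterion)."""
    return (skeleton(arcs1) == skeleton(arcs2)
            and v_structures(arcs1) == v_structures(arcs2))

chain = {("A", "B"), ("B", "C")}            # A -> B -> C
reverse_chain = {("B", "A"), ("C", "B")}    # A <- B <- C
collider = {("A", "B"), ("C", "B")}         # A -> B <- C

print(statistically_equivalent(chain, reverse_chain))
print(statistically_equivalent(chain, collider))
```

The two chains share a skeleton and have no v-structures, so they are equivalent; the collider has the same skeleton but introduces a v-structure at B, so it is not.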

25
Other approaches to structure learning
  • TETRAD II: Spirtes, Glymour and Scheines (1993).
    Implemented in their PC algorithm
  • Doesn't cope well with weak links and small
    samples (demonstrated empirically in Dai, Korb,
    Wallace & Wu (1997)).
  • Bayesian LBN: Cooper & Herskovits' K2 (1991,
    1992)
  • Compute P(h_i | e) by brute force, under various
    assumptions which reduce the computation of
    P_CH(h, e) to a polynomial-time counting problem.
  • But the hypothesis space is exponential; they go
    for dramatic simplification by assuming we know
    the temporal ordering of the variables.
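The "counting problem" can be illustrated with the per-node score commonly given for the Cooper and Herskovits metric: for each parent configuration j of a node with r states, multiply (r − 1)! / (N_j + r − 1)! by the product of N_jk! over the node's states k. The sketch below computes its logarithm; the data layout (a list of dicts) and the function name are assumptions, and this is only the scoring term, not the K2 search procedure.

```python
from collections import Counter
from math import lgamma

def log_k2_score(data, child, parents, arity):
    """Log Cooper-Herskovits score of one node given a candidate parent set.

    data:    list of dicts mapping variable name -> discrete value
    arity:   dict mapping variable name -> number of states
    """
    counts = Counter()
    for row in data:
        counts[(tuple(row[p] for p in parents), row[child])] += 1
    by_config = {}
    for (config, value), n in counts.items():
        by_config.setdefault(config, {})[value] = n
    r = arity[child]
    score = 0.0
    # Parent configurations never seen in the data contribute a factor of 1.
    for value_counts in by_config.values():
        n_j = sum(value_counts.values())
        score += lgamma(r) - lgamma(n_j + r)                        # (r-1)! / (N_j + r - 1)!
        score += sum(lgamma(n + 1) for n in value_counts.values())  # product of N_jk!
    return score

# Hypothetical data in which X simply copies Y.
rows = [{"X": 0, "Y": 0}] * 10 + [{"X": 1, "Y": 1}] * 10
sizes = {"X": 2, "Y": 2}
print(log_k2_score(rows, "X", [], sizes), log_k2_score(rows, "X", ["Y"], sizes))
```

On this toy data the score with Y as a parent of X is much higher than the score with no parents, so a score-based search would add the arc.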

26
Learning Variable Order
  • Reliance upon a given variable order is a major
    drawback to K2
  • And many other algorithms (Buntine 1991, Bouckaert
    1994, Suzuki 1996, Madigan & Raftery 1994)
  • What's wrong with that?
  • We want autonomous AI (data mining). If experts
    can order the variables they can likely supply
    models.
  • Determining variable ordering is half the
    problem. If we know A comes before B, the only
    remaining issue is whether there is a link
    between the two.
  • The number of orderings consistent with dags is
    exponential (Brightwell & Winkler 1990: counting
    them is #P-complete). So iterating over all
    possible orderings will not scale up.
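A brute-force count of the total orderings (linear extensions) consistent with a DAG shows why this does not scale: the loop below visits all n! permutations. The function name and the arc representation are illustrative assumptions.

```python
from itertools import permutations

def count_linear_extensions(nodes, arcs):
    """Count orderings of nodes in which every parent precedes its child."""
    count = 0
    for order in permutations(nodes):
        position = {node: i for i, node in enumerate(order)}
        if all(position[p] < position[c] for p, c in arcs):
            count += 1
    return count

print(count_linear_extensions("ABC", {("A", "B"), ("B", "C")}))  # chain
print(count_linear_extensions("ABC", {("A", "C"), ("B", "C")}))  # collider
print(count_linear_extensions("ABCD", set()))                    # no arcs
```

A chain admits a single ordering, while an arc-free graph on four nodes already admits all 24, so sparse models are the expensive case.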

27
Statistical Equivalence Learners
  • Heckerman & Geiger (1995) advocate learning only
    up to statistical equivalence classes (a la
    TETRAD II).
  • Since observational data cannot distinguish
    between equivalent models, there's no point
    trying to go further.
  • ⇒ Madigan, Andersson, Perlman & Volinsky (1996)
    follow this advice, use uniform prior over
    equivalence classes.
  • ⇒ Geiger and Heckerman (1994) define Bayesian
    metrics for linear and discrete equivalence
    classes of models (BGe and BDe)

28
Statistical Equivalence Learners
  • Wallace & Korb (1999): This is not right!
  • These are causal models; they are distinguishable
    on experimental data.
  • Failure to collect some data is no reason to
    change prior probabilities.
  • E.g., if your thermometer topped out at 35°C,
    you wouldn't treat ≥ 35°C and 34°C as equally
    likely.
  • Not all equivalence classes are created equal:
  • {A ← B → C, A → B → C, A ← B ← C}
  • {A → B ← C}
  • Within classes some dags should have greater
    priors than others. E.g.,
  • LightsOn ← InOffice → LoggedOn vs.
  • LightsOn → InOffice → LoggedOn

29
Full Causal Learners
  • So a full causal learner is an algorithm that
  • 1. Learns causal connectedness.
  • 2. Learns v-structures. Hence, learns equivalence
    classes.
  • 3. Learns full variable order. Hence, learns full
    causal structure (order + connectedness).
  • TETRAD II: 1, 2.
  • Madigan et al.; Heckerman & Geiger (BGe, BDe): 1,
    2.
  • Cooper & Herskovits' K2: 1.
  • Lam and Bacchus MDL: 1, 2 (partial), 3 (partial).
  • Wallace, Neil, Korb MML: 1, 2, 3.

30
CaMML
  • Minimum Message Length (Wallace & Boulton 1968)
    uses Shannon's measure of information
  • I(m) = −log P(m)
  • Applied in reverse, we can compute P(h,e) from
    I(h,e).
  • Given an efficient joint encoding method for the
    hypothesis & evidence space (i.e., satisfying
    Shannon's law), MML
  • Searches {h_i} for that hypothesis h that
    minimizes I(h) + I(e|h).
  • Applies a trade-off between
  • Model simplicity
  • Data fit
  • Equivalent to that h that maximizes P(h) P(e|h),
    i.e., P(h|e).
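The equivalence between minimizing the two-part message length and maximizing the posterior can be checked numerically. The three hypotheses, their priors and their likelihoods below are made-up numbers; real MML work lies in constructing the efficient encoding, which this sketch takes as given.

```python
from math import log2

def message_length(p):
    """Shannon information I(m) = -log2 P(m), in bits."""
    return -log2(p)

# Hypothetical priors P(h) and likelihoods P(e | h) for three candidate models.
hypotheses = {
    "h1": {"prior": 0.6, "likelihood": 0.01},   # simple model, poor fit
    "h2": {"prior": 0.3, "likelihood": 0.05},
    "h3": {"prior": 0.1, "likelihood": 0.08},   # complex model, good fit
}

def two_part_length(h):
    """I(h) + I(e | h): cost of stating the model, then the data given it."""
    return message_length(h["prior"]) + message_length(h["likelihood"])

best_by_mml = min(hypotheses, key=lambda name: two_part_length(hypotheses[name]))
best_by_map = max(
    hypotheses,
    key=lambda name: hypotheses[name]["prior"] * hypotheses[name]["likelihood"],
)
print(best_by_mml, best_by_map)
```

Both criteria pick the same hypothesis because −log is monotone decreasing: the shortest total message corresponds to the largest product P(h) P(e|h).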

31
MML search algorithms
  • MML metrics need to be combined with search.
    This has been done three ways:
  • Wallace, Korb, Dai (1996): greedy search
    (linear).
  • Brute force computation of linear extensions
    (small models only)
  • Neil and Korb (1999): genetic algorithms
    (linear).
  • Asymptotic estimator of linear extensions
  • GA chromosomes = causal models
  • Genetic operators manipulate them
  • Selection pressure is based on MML
  • Wallace and Korb (1999): MML sampling (linear,
    discrete).
  • Stochastic sampling through space of totally
    ordered causal models
  • No counting of linear extensions required

32
Empirical Results
  • A weakness in this area, and in AI generally.
  • Papers based upon very small models, loose
    comparisons.
  • ALARM often used; everything gets it to within
    1 or 2 arcs.
  • Neil and Korb (1999) compared CaMML and BGe
    (Heckerman & Geiger's Bayesian metric over
    equivalence classes), using identical GA search
    over linear models:
  • On KL distance and topological distance from the
    true model, CaMML and BGe performed nearly the
    same.
  • On test prediction accuracy on strict effect
    nodes (those with no children), CaMML clearly
    outperformed BGe.
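For reference, the KL distance used in such comparisons can be sketched for discrete distributions as below. This is the generic definition applied to two made-up distributions over the same outcomes; it is not the evaluation code from the study, which worked with linear models.

```python
from math import log

def kl_divergence(p, q):
    """KL(p || q) in nats; p is the true distribution, q the learned one.

    Assumes q assigns positive probability wherever p does.
    """
    return sum(pv * log(pv / q[k]) for k, pv in p.items() if pv > 0.0)

true_dist = {"a": 0.5, "b": 0.5}       # hypothetical true model
learned = {"a": 0.9, "b": 0.1}         # hypothetical learned model

print(kl_divergence(true_dist, learned))
print(kl_divergence(true_dist, true_dist))
```

The measure is zero only when the learned distribution matches the true one, and it is not symmetric, so the direction (true model first) matters when reporting results.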

33
Extensions to original CaMML
  • Allow specification of prior on arc
  • O'Donnell, Korb, Nicholson
  • Useful for combining expert and automated methods
  • Learning local structure
  • Logit models (Neil, Wallace, Korb)
  • Hybrid networks: CPT or decision trees
    (O'Donnell, Allison, Korb, Hope) (uses MCMC
    search)

34
CaMML
  • Information and executables available at
  • www.datamining.monash.edu.au/software/camml
  • Linear and Discrete versions
  • Weka wrapper available