1
Causal Inference and Graphical Models
  • Peter Spirtes
  • Carnegie Mellon University

2
Overview
  • Manipulations
  • Assuming no Hidden Common Causes:
    • From DAGs to Effects of Manipulation
    • From Data to Sets of DAGs
    • From Sets of DAGs to Effects of Manipulation
  • May Be Hidden Common Causes:
    • From Data to Sets of DAGs
    • From Sets of DAGs to Effects of Manipulations

3
If I were to force a group of people to smoke one
pack a day, what percentage would develop
lung cancer?
The Evidence
4
P(Lung cancer = yes) = 1/2
5
Conditioning on Teeth white = yes
P(Lung cancer = yes | Teeth white = yes) = 1/4
6
Manipulating Teeth white = yes
7
Manipulating Teeth white = yes - After Waiting
P(Lung cancer = yes || Teeth white = yes) = 1/2
≠
P(Lung cancer = yes | Teeth white = yes) = 1/4
8
Smoking Decision
  • Setting insurance rates for smokers -
    conditioning
  • Suppose the Surgeon General is considering
    banning smoking.
  • Will this decrease smoking?
  • Will decreasing smoking decrease cancer?
  • Will it have negative side-effects, e.g., more
    obesity?
  • How is greater life expectancy valued against
    decrease in pleasure from smoking?

9
Manipulations and Distributions
  • Since Smoking determines Teeth white, P(T,L,R,W) and P(S,L,R,W) carry the same information
  • But the manipulation of Teeth white leads to
    different results than the manipulation of
    Smoking
  • Hence the distribution does not always uniquely
    determine the results of a manipulation

10
Causation
  • We will infer average causal effects.
  • We will not consider quantities such as
    probability of necessity, probability of
    sufficiency, or the counterfactual probability
    that I would get a headache conditional on taking
    an aspirin, given that I did not take an aspirin
  • The causal relations are between properties of a
    unit at a time, not between events.
  • Each unit is assumed to be causally isolated.
  • The causal relations may be genuinely
    indeterministic, or only apparently
    indeterministic.

11
Causal DAGs
  • Probabilistic Interpretation of DAGs
  • A DAG represents a distribution P when each
    variable is independent of its non-descendants
    conditional on its parents in the DAG
  • Causal Interpretation of DAGs
  • There is a directed edge from A to B (relative to
    V) when A is a direct cause of B.
  • An acyclic graph is not a representation of
    reversible or feedback processes

12
Conditioning
  • Conditioning maps a probability distribution and
    an event into a new probability distribution
  • f(P(V), e) → P′(V), where P′(V = v) = P(V = v)/P(e) for values v consistent with e, and 0 otherwise (sketched below)
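A minimal sketch of this map on a finite discrete joint distribution; the variable names and probabilities are invented for illustration:

```python
# Conditioning: map a joint distribution P(V) and an event e to P(V|e).
def condition(joint, event):
    """joint: {assignment tuple of (var, value) pairs: probability}.
    event: {var: value}. Returns P(V=v)/P(e) over assignments consistent with e."""
    consistent = {a: p for a, p in joint.items()
                  if all(dict(a).get(v) == val for v, val in event.items())}
    p_e = sum(consistent.values())                      # P(e)
    return {a: p / p_e for a, p in consistent.items()}  # renormalize

# Made-up joint over Smoking (S) and Lung cancer (L).
joint = {(("S", 1), ("L", 1)): 0.3, (("S", 1), ("L", 0)): 0.2,
         (("S", 0), ("L", 1)): 0.1, (("S", 0), ("L", 0)): 0.4}
print(condition(joint, {"S": 1}))   # L=1 gets 0.6, L=0 gets 0.4
```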

13
Manipulating
  • A manipulation maps a population joint
    probability distribution, a causal DAG, and a set
    of new probability distributions for a set of
    variables, into a new joint distribution
  • Manipulating for X1, …, Xn ∈ V:
  • f: ( P(V), the population distribution;
  •      G, the causal DAG;
  •      P′(X1 | Non-Descendants(G,X1)), …, P′(Xn | Non-Descendants(G,Xn)), the manipulated variables )
  • →
  • P′(V), the manipulated distribution
  • (assuming the manipulations are independent)

14
Manipulation Notation - Adapting Lauritzen
  • The distribution of Lung Cancer given the manipulated distribution of Smoking:
  • P(Lung Cancer || P′(Smoking))
  • The distribution of Lung Cancer conditional on Radon, given the manipulated distribution of Smoking:
  • P(Lung Cancer | Radon || P′(Smoking)) =
  • P(Lung Cancer, Radon || P′(Smoking)) / P(Radon || P′(Smoking))
  • First manipulate, then condition

15
Ideal Manipulations
  • No fat hand
  • Effectiveness
  • Whether or not any actual action is an ideal
    manipulation of a variable Z is not part of the
    theory - it is input to the theory.
  • With respect to a system of variables containing
    murder rates, outlawing cocaine is not an ideal
    manipulation of cocaine usage
  • It is not entirely effective - people still use
    cocaine
  • It affects murder rates directly, not via its
    effect on cocaine usage, because of increased
    gang warfare

16
3 Representations of Manipulations
  • Structural Equation
  • Policy Variable
  • Potential Outcomes

17
College Plans
  • Sewell and Shah (1968) studied five variables
    from a sample of 10,318 Wisconsin high school
    seniors.
  • SEX: male = 0, female = 1
  • IQ: Intelligence Quotient, lowest = 0, highest = 3
  • CP: college plans, yes = 0, no = 1
  • PE: parental encouragement, low = 0, high = 1
  • SES: socioeconomic status, lowest = 0, highest = 3

18
College Plans - A Hypothesis
[Figure: hypothesized causal DAG over SES, SEX, PE, IQ, CP]
19
Equational Representation
  • x_i = f_i(pa_i(G), e_i)
  • If the e_i are causes of two or more variables, they must be included in the analysis
  • There is a distribution over the e_i
  • The equations and the distribution over the e_i determine a distribution over the x_i
  • When manipulating a variable x_i to a value c, replace its equation with x_i = c (see the sketch below)
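A sketch of the equational representation and of manipulation by equation replacement; the linear functional forms, coefficients, and Gaussian noise are assumptions made purely for illustration:

```python
import random

def sample(manipulate_pe=None):
    """Draw one unit from the structural equations x_i = f_i(pa_i, e_i).
    Functional forms, coefficients, and noise are illustrative assumptions."""
    e = {v: random.gauss(0, 1) for v in ("SES", "SEX", "IQ", "PE", "CP")}
    ses = e["SES"]
    sex = e["SEX"]
    iq = 0.8 * ses + e["IQ"]
    # Manipulating PE to a value c means replacing its equation with PE = c.
    pe = manipulate_pe if manipulate_pe is not None else \
        0.5 * ses + 0.3 * sex + 0.4 * iq + e["PE"]
    cp = 0.7 * pe + e["CP"]
    return {"SES": ses, "SEX": sex, "IQ": iq, "PE": pe, "CP": cp}

pre = [sample() for _ in range(10000)]                    # observational
post = [sample(manipulate_pe=1.0) for _ in range(10000)]  # manipulated: PE := 1
print(sum(r["CP"] for r in pre) / 1e4, sum(r["CP"] for r in post) / 1e4)
```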

20
Policy Variable Representation
  • Policy variable notation:
  • Pre-manipulation: P(PE,SES,SEX,IQ,CP | policy = off)
  • P(PE = 1 | policy = on) = 1
  • Post-manipulation: P(SES,SEX,IQ,CP,PE = 1 | policy = on)
  • P(CP | PE, policy = on)
  • Manipulation notation:
  • Pre-manipulation: P(PE,SES,SEX,IQ,CP)
  • Suppose P′(PE = 1) = 1
  • Post-manipulation: P(SES,SEX,IQ,CP,PE = 1 || P′(PE))
  • P(CP | PE || P′(PE))
21
From DAG to Effects of Manipulation
[Diagram: Sample → (Sampling and Distributional Assumptions, Prior) → Population Distribution → (Causal Axioms, Prior) → Causal DAGs + Background Knowledge → Effect of Manipulation]

22
Causal Sufficiency
  • A set of variables is causally sufficient if
    every cause of two variables in the set is also
    in the set.
  • {PE, CP, SES} is causally sufficient
  • {IQ, CP, SES} is not causally sufficient.

23
Causal Markov Assumption
  • For a causally sufficient set of variables, the
    joint distribution is the product of each
    variable conditional on its parents in the causal
    DAG.
  • P(SES,SEX,PE,CP,IQ) = P(SES) P(SEX) P(IQ | SES) P(PE | SES,SEX,IQ) P(CP | PE)

24
Equivalent Forms of Causal Markov Assumption
  • In the population distribution, each variable is
    independent of its non-descendants in the causal
    DAG (non-effects) conditional on its parents
    (immediate causes).
  • If X is d-separated from Y conditional on Z (written <X,Y | Z>) in the causal graph, then X is independent of Y conditional on Z in the population distribution (denoted I(X,Y | Z)). A test is sketched below.
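A minimal sketch of testing d-separation via the standard ancestral-moralization criterion; the child-to-parents dict encoding of the DAG is an assumption of the sketch:

```python
def ancestors(dag, nodes):
    """dag: {child: [parents]}. Returns the given nodes plus all their ancestors."""
    result, frontier = set(nodes), list(nodes)
    while frontier:
        for p in dag.get(frontier.pop(), []):
            if p not in result:
                result.add(p)
                frontier.append(p)
    return result

def d_separated(dag, xs, ys, zs):
    """Restrict to ancestors of xs, ys, zs; moralize (marry co-parents, drop
    directions); delete zs; xs and ys are d-separated iff now disconnected."""
    keep = ancestors(dag, set(xs) | set(ys) | set(zs))
    adj = {v: set() for v in keep}
    for child in keep:
        ps = [p for p in dag.get(child, []) if p in keep]
        for p in ps:
            adj[p].add(child)
            adj[child].add(p)
        for i, p in enumerate(ps):          # marry the parents
            for q in ps[i + 1:]:
                adj[p].add(q)
                adj[q].add(p)
    blocked, seen = set(zs), set()
    frontier = [x for x in xs if x not in blocked]
    while frontier:                         # reachability avoiding zs
        v = frontier.pop()
        if v in ys:
            return False
        seen.add(v)
        frontier += [w for w in adj[v] - seen if w not in blocked]
    return True

# The college-plans DAG (CP's parents as in the factorization of slide 27).
dag = {"IQ": ["SES"], "PE": ["SES", "SEX", "IQ"], "CP": ["PE", "SES", "IQ"]}
print(d_separated(dag, ["SEX"], ["CP"], ["PE", "SES", "IQ"]))  # True
```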

25
Causal Markov Assumption
  • Causal Markov implies that if X is d-separated
    from Y conditional on Z in the causal DAG, then X
    is independent of Y conditional on Z.
  • Causal Markov is equivalent to assuming that the
    causal DAG represents the population
    distribution.
  • What would a failure of Causal Markov look like?
    If X and Y are dependent, but X does not cause Y,
    Y does not cause X, and no variable Z causes both
    X and Y.

26
Causal Markov Assumption
  • Assumes that no unit in the population affects
    other units in the population
  • If the natural units do affect each other, the
    units should be re-defined to be aggregations of
    units that don't affect each other
  • For example, individual people might be
    aggregated into families
  • Assumes variables are not logically related, e.g.
    x and x²
  • Assumes no feedback

27
Manipulation Theorem - No Hidden Variables
  • P(PE,SES,SEX,CP,IQ || P′(PE)) =
  • P(SES) P(SEX) P(CP | PE,SES,IQ) P(IQ | SES) P(PE | policy = on) =
  • P(SES) P(SEX) P(CP | PE,SES,IQ) P(IQ | SES) P′(PE)

[Figure: the pre-manipulation DAG and the manipulated DAG with Policy → PE; the computation is sketched below]
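A sketch of the computation above (the truncated factorization): the manipulated joint is the pre-manipulation factorization with the factor for PE replaced by the new P′(PE). The CPT numbers are invented for illustration:

```python
from itertools import product

p_ses = {0: 0.6, 1: 0.4}                                     # toy P(SES)
p_sex = {0: 0.5, 1: 0.5}                                     # toy P(SEX)
p_iq = {(0, 0): 0.7, (1, 0): 0.3, (0, 1): 0.4, (1, 1): 0.6}  # toy P(IQ|SES), key (iq, ses)

def p_pe(pe, ses, sex, iq):                                  # toy P(PE|SES,SEX,IQ)
    return 0.8 if pe == (ses | iq) else 0.2

def p_cp(cp, pe, ses, iq):                                   # toy P(CP|PE,SES,IQ)
    return 0.9 if cp == pe else 0.1

def joint(new_p_pe=None):
    """P(SES,SEX,IQ,PE,CP), with P(PE|SES,SEX,IQ) swapped for P'(PE) if given."""
    dist = {}
    for ses, sex, iq, pe, cp in product((0, 1), repeat=5):
        f = new_p_pe[pe] if new_p_pe is not None else p_pe(pe, ses, sex, iq)
        dist[(ses, sex, iq, pe, cp)] = (p_ses[ses] * p_sex[sex]
                                        * p_iq[(iq, ses)] * f
                                        * p_cp(cp, pe, ses, iq))
    return dist

pre = joint()                              # P(PE,SES,SEX,CP,IQ)
post = joint(new_p_pe={0: 0.0, 1: 1.0})    # P(PE,SES,SEX,CP,IQ || P′(PE)), P′(PE=1)=1
```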
28
Invariance
  • Note that P(CP | PE,SES,IQ, policy = on) = P(CP | PE,SES,IQ, policy = off), because the policy variable is d-separated from CP conditional on {PE,SES,IQ}
  • We say that P(CP | PE,SES,IQ) is invariant
  • An invariant quantity can be estimated from the pre-manipulation distribution
  • This is equivalent to one of the rules of the Do Calculus and can also be applied to latent variable models (checked graphically below)
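The invariance argument can be checked graphically: append the policy variable as an extra parent of PE and test d-separation. This reuses the d_separated helper from the earlier sketch:

```python
# Policy node added as a parent of PE (assumes d_separated from the sketch above).
dag = {"IQ": ["SES"], "PE": ["SES", "SEX", "IQ", "policy"],
       "CP": ["PE", "SES", "IQ"]}
print(d_separated(dag, ["policy"], ["CP"], ["PE", "SES", "IQ"]))  # True -> invariant
print(d_separated(dag, ["policy"], ["CP"], []))                   # False -> not invariant
```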

29
Calculating Effects
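The calculation itself is not preserved in the transcript; combining the invariant factors from the previous slides presumably gives the usual adjustment. For a manipulation with P′(PE = pe) = 1:

P(cp || P′(PE)) = Σ_{ses, iq} P(cp | pe, ses, iq) · P(iq | ses) · P(ses)

(SEX sums out because CP is independent of SEX given PE, SES, and IQ.)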
30
From Sample to Sets of DAGs
[Diagram: Sample → (Sampling and Distributional Assumptions, Prior) → Population Distribution → (Causal Axioms, Prior) → Causal DAGs + Background Knowledge → Effect of Manipulation]

31
From Sample to Population to DAGs
  • Constraint-based:
  • Uses tests of conditional independence
  • Goal: find the set of DAGs whose d-separation relations match the results of conditional independence tests most closely
  • Score-based:
  • Uses scores such as the Bayesian Information Criterion or the Bayesian posterior
  • Goal: maximize the score

32
Two Kinds Of Search
                                             Constraint   Score
  Uses non-conditional-independence info         No        Yes
  Quantitative comparison of models              No        Yes
  Single test result can lead astray             Yes       No
  Easy to apply to latent variable models        Yes       No
33
Bayesian Information Criterion
  • BIC(G, D) = log P(D | θ̂, G) − (d/2) log N
  • D is the sample data
  • G is a DAG
  • θ̂ is the vector of maximum likelihood estimates of the parameters for DAG G
  • N is the sample size
  • d is the dimensionality of the model, which in DAGs without latent variables is simply the number of free parameters in the model (computation sketched below)
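A sketch of computing this score for a discrete DAG from maximum-likelihood counts; the encodings of the data and the graph are assumptions of the sketch:

```python
import math
from collections import Counter

def bic_score(data, dag, card):
    """data: list of {var: value} rows; dag: {child: [parents]};
    card: {var: number of values}. Returns the log-likelihood at the
    maximum likelihood estimates minus (d/2) log N."""
    n, loglik, dim = len(data), 0.0, 0
    for v in card:
        pa = dag.get(v, [])
        joint = Counter((tuple(r[p] for p in pa), r[v]) for r in data)
        marg = Counter(tuple(r[p] for p in pa) for r in data)
        loglik += sum(c * math.log(c / marg[pc]) for (pc, _), c in joint.items())
        dim += math.prod(card[p] for p in pa) * (card[v] - 1)  # free parameters
    return loglik - (dim / 2) * math.log(n)
```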

34
3 Kinds of Alternative Causal Models
[Figure: the True Model and Alternatives 1-3, four DAGs over SES, SEX, PE, CP, IQ]
35
Alternative Causal Models
[Figure: True Model and Alternative 1]
  • Constraint-based: Alternative 1 violates the Causal Markov Assumption by entailing that SES and IQ are independent
  • Score-based: use a score that prefers a model that contains the true distribution over one that does not.

36
Alternative Causal Models
[Figure: True Model and Alternative 2]
  • Constraint-based: assume that if SEX and CP are independent (conditional on some subset of variables such as PE, SES, and IQ), then SEX and CP are not adjacent - the Causal Adjacency Faithfulness Assumption.
  • Score-based: use a score such that if two models contain the true distribution, the one with fewer parameters is chosen. The True Model has fewer parameters.

37
Both Assumptions Can Be False
In the True Model, the independence holds for all values of the parameters; in Alternative 2, it holds only for parameters on a lower-dimensional surface, of Lebesgue measure 0 (illustrated below).
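A linear-Gaussian illustration of the picture (SES and IQ suppressed, and the coefficients invented): take PE = c·SEX + e1 and CP = a·PE + b·SEX + e2, with var(SEX) = 1. Then the partial covariance is

cov(SEX, CP | PE) = b · var(e1) / (c² + var(e1))

In the True Model, b is structurally zero, so I(SEX, CP | PE) holds for all values of a and c; in Alternative 2 it holds only on the surface b = 0, which has Lebesgue measure 0 in the parameter space.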
38
When Not to Assume Faithfulness
  • Deterministic relationships between variables entail extra conditional independence relations, in addition to those entailed by the global directed Markov condition.
  • If A → B → C, B = A, and C = B, then not only I(A,C | B), which is entailed by the global directed Markov condition, but also I(B,C | A), which is not.
  • The deterministic relations are theoretically detectable, and when present, faithfulness should not be assumed.
  • Do not assume faithfulness in feedback systems in equilibrium.

39
Alternative Causal Models
[Figure: True Model and Alternative 3]
  • Constraint-based: Alternative 3 entails the same set of conditional independence relations as the True Model - there is no principled way to choose.

40
Alternative Causal Models
[Figure: True Model and Alternative 3]
  • Score-based: whether or not one can choose depends upon the parametric family.
  • For unrestricted discrete or linear Gaussian families, there is no way to choose - the BIC scores will be the same.
  • For linear non-Gaussian families, the True Model will be preferred (while the two models entail the same second-order moments, they entail different fourth-order moments).

41
Patterns
  • A pattern (or p-dag) represents a set of DAGs that all have the same d-separation relations, i.e. a d-separation equivalence class of DAGs.
  • The adjacencies in a pattern are the same as the adjacencies in each DAG in the d-separation equivalence class.
  • An edge is oriented as A → B in the pattern if it is oriented as A → B in every DAG in the equivalence class.
  • An edge is left unoriented as A - B in the pattern if it is oriented as A → B in some DAGs in the equivalence class, and as A ← B in other DAGs in the equivalence class.

42
Patterns to Graphs
  • All of the DAGs in a d-separation equivalence class can be derived from the pattern that represents the class by orienting the unoriented edges in the pattern.
  • Every orientation of the unoriented edges is acceptable as long as it creates no new unshielded colliders.
  • That is, A - B - C can be oriented as A → B → C, A ← B ← C, or A ← B → C, but not as A → B ← C (enumeration sketched below).
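A sketch of enumerating the acceptable orientations; the edge encodings are assumptions of the sketch, and acyclicity (which the full characterization also requires) is noted but not checked here:

```python
from itertools import product

def unshielded_colliders(edges, adjacent):
    """edges: set of directed (tail, head) pairs; adjacent: set of frozensets
    of adjacent pairs. Returns triples a -> b <- c with a, c non-adjacent."""
    return {(frozenset((a, c)), b)
            for (a, b) in edges for (c, b2) in edges
            if b == b2 and a != c and frozenset((a, c)) not in adjacent}

def extensions(directed, undirected, adjacent):
    """Yield orientations of the pattern's undirected edges that create no
    new unshielded collider (acyclicity should be checked separately)."""
    base = unshielded_colliders(directed, adjacent)
    for bits in product((0, 1), repeat=len(undirected)):
        extra = {(u, v) if bit else (v, u)
                 for (u, v), bit in zip(undirected, bits)}
        if unshielded_colliders(directed | extra, adjacent) == base:
            yield directed | extra

# A - B - C with no other adjacencies: three of four orientations survive.
adjacent = {frozenset(("A", "B")), frozenset(("B", "C"))}
print(list(extensions(set(), [("A", "B"), ("B", "C")], adjacent)))
```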

43
Patterns
[Figure: the two DAGs in the d-separation equivalence class and the pattern representing them, over SES, SEX, PE, CP, IQ]
44
Search Methods
  • Constraint Based
  • PC (correct in the limit; adjacency phase sketched below)
  • Variants of PC (correct in limit, better on small
    sample sizes)
  • Score - Based
  • Greedy hill climbing
  • Simulated annealing
  • Genetic algorithms
  • Greedy Equivalence Search (correct in limit)
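A sketch of the adjacency phase of PC; indep(x, y, z) stands in for a conditional independence test (an assumed oracle), and the orientation phase is omitted:

```python
from itertools import combinations

def pc_adjacency_search(variables, indep):
    """Start from the complete graph; remove edge x - y whenever some subset z
    of x's current neighbors (of growing size) makes x and y independent."""
    adj = {v: set(variables) - {v} for v in variables}
    depth = 0
    while any(len(adj[x] - {y}) >= depth for x in variables for y in adj[x]):
        for x in variables:
            for y in list(adj[x]):
                if y not in adj[x]:
                    continue                    # already removed symmetrically
                for z in combinations(sorted(adj[x] - {y}), depth):
                    if indep(x, y, set(z)):
                        adj[x].discard(y)
                        adj[y].discard(x)
                        break
        depth += 1
    return adj
```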

45
From Sets of DAGs to Effects of Manipulation
[Diagram: Sample → (Sampling and Distributional Assumptions, Prior) → Population Distribution → (Causal Axioms, Prior) → Causal DAGs + Background Knowledge → Effect of Manipulation]

46
Causal Inference in Patterns
  • Is P(IQ) invariant when SES is manipulated to a constant? Can't tell.
  • If SES → IQ, then policy is d-connected to IQ given the empty set - no invariance.
  • If SES ← IQ, then policy is not d-connected to IQ given the empty set - invariance.

[Figure: pattern over SES, SEX, PE, CP, IQ with the SES - IQ edge unoriented and a policy variable pointing into SES]
47
Causal Inference in Patterns
  • Different DAGs represented by the pattern give different answers as to the effect of manipulating SES on IQ - the effect is not identifiable.
  • In these cases, the output should be "can't tell".
  • Note the difference from using Bayesian networks for classification - we can use either DAG equally well for correct classification, but we have to know which one is true for correct inference about the effect of a manipulation.

[Figure: pattern over SES, SEX, PE, CP, IQ with the SES - IQ edge unoriented and a policy variable pointing into SES]
48
Causal Inference in Patterns
  • Is P(CP | PE,SES,IQ) invariant when PE is manipulated to a constant? Can tell.
  • The policy variable is d-separated from CP given {PE, SES, IQ} regardless of which way the unoriented edge points - invariance in every DAG represented by the pattern.

[Figure: pattern over SES, SEX, PE, CP, IQ with the SES - IQ edge unoriented and a policy variable pointing into PE]
49
College Plans
[Figure: the college-plans example; P(CP || P′(PE)) is not invariant, but is identifiable; P(CP | PE,SES,IQ) is invariant]
50
Good News
In the large sample limit, there are algorithms
(PC, Greedy Equivalence Search) that are
arbitrarily close to correct (or output "can't
tell") with probability 1 (pointwise consistency).

51
Bad News
At every finite sample size, every method will be
far from the truth with high probability for some
values of the truth (no uniform consistency).
(This is typically not true of classification problems.)

52
Why Bad News?
The problem: small differences in the population
distribution can lead to big changes in the
inferred causal DAGs.

53
Strengthening Faithfulness Assumption
  • Strong versus weak:
  • Weak adjacency faithfulness assumes that a zero conditional dependence between X and Y entails a zero-strength edge between X and Y
  • Strong adjacency faithfulness assumes in addition that a weak conditional dependence between X and Y entails a weak-strength edge between X and Y
  • Under this assumption, there are uniformly consistent estimators of the effects of manipulations.

54
Obstacles to Causal Inference from
Non-experimental Data
  • unmeasured confounders
  • measurement error, or discretization of data
  • mixtures of different causal structures in the
    sample
  • feedback
  • reversibility
  • the existence of a number of models that fit the
    data equally well
  • an enormous search space
  • low power of tests of independence conditional on
    large sets of variables
  • selection bias
  • missing values
  • sampling error
  • complicated and dense causal relations among sets
    of variables
  • complicated probability distributions

55
From Data to Sets of DAGs - Possible Hidden
Variables
[Diagram: Sample → (Sampling and Distributional Assumptions, Prior) → Population Distribution → (Causal Axioms, Prior) → Causal DAGs + Background Knowledge → Effect of Manipulation]

56
Why Latent Variable Models?
  • For classification problems, introducing latent variables can help get closer to the right answer at smaller sample sizes - but they are not needed to get the right answer in the limit.
  • For causal inference problems, introducing latent variables is needed to get the right answer in the limit.

57
Score-Based Search Over Latent Models
  • Structural EM interleaves estimation of
    parameters with structural search
  • Can also search over latent variable models by
    calculating posteriors
  • But there are substantial computational and
    statistical problems with latent variable models

58
DAG Models with Latent Variables
  • Facilitates construction of causal models
  • Provides a finite search space
  • Nice statistical properties
  • Always identified
  • Correspond to a set of distributions
    characterized by independence relations
  • Have a well-defined dimension
  • Asymptotic existence of ML estimates

59
Solution
  • Embed each latent variable model in a larger
    model without latent variables that is easier to
    characterize.
  • Disadvantage - uses only conditional independence
    information in the distribution.

[Figure: nested sets of distributions - the latent variable model sits inside the model imposing only its independence constraints on the observed variables]
60
Alternative Hypothesis and Some D-separations
[Figure: alternative hypothesis DAG over SES, SEX, PE, CP, IQ with latent variables L1 and L2]
<L2, {SES, L1, SEX, PE} | ∅>  <SEX, {L1, SES, L2, IQ} | ∅>  <L1, {SES, L2, SEX} | ∅>  <SEX, CP | {PE, SES}> - these entail conditional independence relations in the population.
<CP, {IQ, L1, SEX} | {L2, PE, SES}>  <PE, {IQ, L2} | {L1, SEX, SES}>  <IQ, {SEX, PE, CP} | {L1, L2, SES}>  <SES, {SEX, IQ, L1, L2} | ∅>
61
D-separations Among Observed
[Figure: the same alternative hypothesis DAG with latents L1 and L2]
<L2, {SES, L1, SEX, PE} | ∅>  <SEX, {L1, SES, L2, IQ} | ∅>  <L1, {SES, L2, SEX} | ∅>  <SEX, CP | {PE, SES}>
<CP, {IQ, L1, SEX} | {L2, PE, SES}>  <PE, {IQ, L2} | {L1, SEX, SES}>  <IQ, {SEX, PE, CP} | {L1, L2, SES}>  <SES, {SEX, IQ, L1, L2} | ∅>
62
D-separations Among Observed
[Figure: the same alternative hypothesis DAG with latents L1 and L2]
It can be shown that no DAG with just the
measured variables has exactly the set of
d-separation relations among the observed
variables. In this sense, DAGs are not closed
under marginalization.
63
Mixed Ancestral Graphs
  • Under a natural extension of the concept of
    d-separation to graphs with ↔, MAG(G) is a
    graphical object that contains only the observed
    variables, and has exactly the d-separations
    among the observed variables.

[Figure: the latent variable DAG and the corresponding MAG over SES, SEX, PE, CP, IQ]
64
Mixed Ancestral Graph Construction
  • There is an edge between A and B if and only if for every <A,B | C>, there is a latent variable in C.
  • If A and B are adjacent, then A → B if and only if A is an ancestor of B.
  • If A and B are adjacent, then A ↔ B if and only if A is not an ancestor of B and B is not an ancestor of A (construction sketched below).
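A sketch of the construction, reusing the d_separated and ancestors helpers from the earlier d-separation sketch; the toy DAG in the example is invented:

```python
from itertools import chain, combinations

def mag_from_dag(dag, observed):
    """Build the MAG over `observed` from a DAG that may contain latents.
    Assumes d_separated and ancestors (earlier sketch) are in scope."""
    directed, bidirected = set(), set()
    for a, b in combinations(sorted(observed), 2):
        others = [v for v in observed if v not in (a, b)]
        subsets = chain.from_iterable(combinations(others, k)
                                      for k in range(len(others) + 1))
        # Adjacent iff every d-separating set contains a latent, i.e. no
        # subset of the observed variables d-separates a and b.
        if any(d_separated(dag, [a], [b], list(z)) for z in subsets):
            continue
        if a in ancestors(dag, [b]):
            directed.add((a, b))                  # a -> b
        elif b in ancestors(dag, [a]):
            directed.add((b, a))                  # b -> a
        else:
            bidirected.add(frozenset((a, b)))     # a <-> b
    return directed, bidirected

# Toy example: latent L is a hidden common cause of A and B; A causes C.
dag = {"A": ["L"], "B": ["L"], "C": ["A"]}
print(mag_from_dag(dag, ["A", "B", "C"]))
# ({('A', 'C')}, {frozenset({'A', 'B'})})
```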

65
Suppose SES Unmeasured

[Figure: the latent variable DAG with SES unmeasured; the corresponding MAG over SEX, PE, CP, IQ; and another DAG with the same MAG]
66
Mixed Ancestral Models
  • Can score and evaluate in the usual ways
  • Not every parameter is directly interpreted as a
    structural (causal) coefficient
  • Not every part of a marginal manipulated model can be predicted from the mixed ancestral graph
  • Because multiple DAGs can have the same MAG, they might not all agree on the effect of a manipulation.
  • It is possible to tell from the MAG when all of the DAGs with that MAG agree on the effect of a manipulation.

67
Mixed Ancestral Graph
  • Mixed ancestral models are closed under
    marginalization.
  • In the linear normal case, the parameterization
    of a MAG is just a special case of the
    parameterization of a linear structural equation
    model.
  • There is a maximum likelihood estimator of the
    parameters (Drton).
  • The BIC score is easy to calculate.
  • In the discrete case, it is not known how to
    parameterize a MAG - some progress has been made.

68
Some Markov Equivalent Mixed Ancestral Graphs
[Figure: four Markov equivalent MAGs over SEX, PE, CP, IQ]
These different MAGs all have the same
d-separation relations.
69
Partial Ancestral Graphs
[Figure: MAGs from the d-separation equivalence class over SEX, PE, CP, IQ, and the partial ancestral graph representing them; circle (o) marks indicate endpoints that vary across the class]
70
Partial Ancestral Graph represents MAG M
  • A is adjacent to B iff A and B are adjacent in M.
  • A → B iff A is an ancestor of B in every MAG d-separation equivalent to M.
  • A ↔ B iff A and B are not ancestors of each other in every MAG d-separation equivalent to M.
  • A o→ B iff B is not an ancestor of A in every MAG d-separation equivalent to M, and A is an ancestor of B in some MAGs d-separation equivalent to M, but not in others.
  • A o-o B iff A is an ancestor of B in some MAGs d-separation equivalent to M, but not in others, and B is an ancestor of A in some MAGs d-separation equivalent to M, but not in others.

71
Partial Ancestral Graph
  • A partial ancestral graph:
  • represents the ancestor features common to the MAGs that are d-separation equivalent
  • represents the d-separation relations in the d-separation equivalence class of MAGs
  • can be parameterized by turning it into a mixed ancestral graph
  • can be scored and evaluated like a MAG

72
FCI Algorithm
  • In the large sample limit, with probability 1, the output is a PAG that represents the true graph over O
  • If the algorithm needs to test high order conditional independence relations, then it is:
  • Time consuming - in the worst case, the number of conditional independence tests is exponential (when the PAG is complete)
  • Unreliable (low power of tests)
  • Modified versions can halt at any given order of conditional independence test, at the cost of more "can't tell" answers.
  • Not useful information when each pair of variables has a hidden common cause.
  • There is a provably correct score-based search, but it outputs "can't tell" in most cases

73
Output for College Plans
[Figure: the PAG output by the FCI algorithm, and the PAG corresponding to the output of the PC algorithm, over SES, SEX, PE, CP, IQ]
These are different because no DAG can represent the d-separations in the output of the FCI algorithm.
74
From Sets of DAGs to Effects of Manipulations -
May Be Hidden Common Causes
[Diagram: Sample → (Sampling and Distributional Assumptions, Prior) → Population Distribution → (Causal Axioms, Prior) → Causal DAGs + Background Knowledge → Effect of Manipulation]

75
Manipulation Model for PAGs
  • A PAG can be used to calculate the results of manipulations for which every DAG represented by the PAG gives the same answer.
  • It is possible to tell from the PAG that the policy variable for PE is d-separated from CP given PE. Hence P(CP | PE) is invariant.

[Figure: the college-plans PAG over SES, SEX, PE, CP, IQ with circle (o) endpoints]
76
Comparison with non-latent case
  • FCI:
  • P(cp | pe || P′(PE)) = P(cp | pe).
  • P(CP = 0 | PE = 0 || P′(PE)) = .063
  • P(CP = 1 | PE = 0 || P′(PE)) = .937
  • P(CP = 0 | PE = 1 || P′(PE)) = .572
  • P(CP = 1 | PE = 1 || P′(PE)) = .428
  • PC:
  • P(CP = 0 | PE = 0 || P′(PE)) = .095
  • P(CP = 1 | PE = 0 || P′(PE)) = .905
  • P(CP = 0 | PE = 1 || P′(PE)) = .484
  • P(CP = 1 | PE = 1 || P′(PE)) = .516

77
Good News
In the large sample limit, there is an algorithm
(FCI) whose output is arbitrarily close to
correct (or outputs "can't tell") with probability
1 (pointwise consistency).

78
Bad News
At every finite sample size, every method will be
arbitrarily far from the truth with high probability
for some values of the truth (no uniform
consistency).

79
Other Constraints
  • The disadvantage of using MAGs or FCI is that they
    use only conditional independence information
  • In the case of latent variable models, there are
    constraints implied on the observed margin that
    are not conditional independence relations,
    regardless of the family of distributions
  • These can be used to choose between two different
    latent variable models that have the same
    d-separation relations over the observed
    variables
  • In addition, there are constraints implied on the
    observed margin that are particular to a family
    of distributions

80
Examples of Open Questions
  • Complete non-parametric manipulation calculations
    for partially known DAGs with latent variables
  • Define strong faithfulness for the latent case.
  • Calculating constraints (non-parametric or
    parametric) from latent variable DAGs
  • Using constraints (non-parametric or parametric)
    to guide search for latent variable DAGs
  • Latent variable score-based search over PAGs
  • Parameterizations of MAGs for other families of
    distributions
  • Completeness of do-calculus for PAGs
  • Time series inference

81
Introductory Books on Graphical Causal Inference
  • Causation, Prediction, and Search, by P. Spirtes,
    C. Glymour, and R. Scheines, MIT Press, 2000.
  • Causality: Models, Reasoning, and Inference, by J.
    Pearl, Cambridge University Press, 2000.
  • Computation, Causation, and Discovery, ed. by
    C. Glymour and G. Cooper, AAAI Press, 1999.