Propensity Score Matching and Variations on the Balancing Test

1
Propensity Score Matching and Variations on the
Balancing Test
  • Wang-Sheng Lee
  • Melbourne Institute of Applied Economic and
    Social Research
  • The University of Melbourne
  • October 27, 2006

2
Definition of the Problem
  • The most obvious limitation at present is that
    multiple versions of the balancing test exist in
    the literature, with little known about the
    statistical properties of each one, or how they
    compare to one another given particular types of
    data.
  • (Smith and Todd, 2005)

3
Preview of Main Findings
  • There is a difference between a before matching
    balancing test and an after matching balancing
    test.
  • Current balancing tests as implemented in the
    literature have poor size properties.
  • Improved balancing tests using non-parametric
    tests are suggested.

4
Propensity Score Matching Methodology
  • Step 1: Estimate the probability of receiving
    treatment, prob(D = 1 | X) = p(X), using a logit
    or probit model.
  • Step 2: Choose a matching algorithm (e.g.,
    stratification, nearest neighbour, kernel
    matching, caliper matching, etc.) and match on
    p(X).
  • Step 3: Perform matching diagnostics (such as the
    balancing test).
  • Step 4: Compare mean outcomes to get the Average
    Treatment Effect on the Treated (ATT).
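
The four steps above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the procedure used in the presentation: the data are simulated with a single covariate and a known treatment effect of 2.0, the logit is fitted by simple gradient ascent, matching is one-to-one nearest neighbour with replacement, and the helper names `fit_logit` and `att_nearest_neighbour` are hypothetical. Step 3 (diagnostics) is left out here.

```python
import math
import random

def fit_logit(xs, ds, lr=0.1, iters=2000):
    """Step 1: fit prob(D = 1 | x) = 1 / (1 + exp(-(a + b*x))) by gradient ascent."""
    a = b = 0.0
    n = len(xs)
    for _ in range(iters):
        grad_a = grad_b = 0.0
        for x, d in zip(xs, ds):
            p = 1.0 / (1.0 + math.exp(-(a + b * x)))
            grad_a += d - p          # score with respect to the intercept
            grad_b += (d - p) * x    # score with respect to the slope
        a += lr * grad_a / n
        b += lr * grad_b / n
    return a, b

def att_nearest_neighbour(ps, ds, ys):
    """Steps 2 and 4: match each treated unit to the control with the closest
    p(X) (with replacement) and average the outcome differences."""
    controls = [(p, y) for p, d, y in zip(ps, ds, ys) if d == 0]
    diffs = []
    for p, d, y in zip(ps, ds, ys):
        if d == 1:
            _, y_match = min(controls, key=lambda c: abs(c[0] - p))
            diffs.append(y - y_match)
    return sum(diffs) / len(diffs)

# Simulated data (an assumption for illustration): true effect of D on Y is 2.0.
rng = random.Random(0)
xs = [rng.gauss(0, 1) for _ in range(500)]
ds = [1 if rng.random() < 1 / (1 + math.exp(-(-1.0 + x))) else 0 for x in xs]
ys = [1.0 + x + 2.0 * d + rng.gauss(0, 1) for x, d in zip(xs, ds)]

a, b = fit_logit(xs, ds)
ps = [1 / (1 + math.exp(-(a + b * x))) for x in xs]   # estimated p(X)
att = att_nearest_neighbour(ps, ds, ys)
print(f"estimated slope {b:.2f}, estimated ATT {att:.2f}")
```

With real data one would normally use a statistics package for the logit or probit fit; the hand-rolled fit is here only to keep the sketch self-contained.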

5
A Matching Diagnostic: Balance
  • A balancing test checks if the two groups look
    the same in terms of the Xs after matching on
    p(X).
  • The balancing property of propensity scores
    (Theorem 2, Rosenbaum and Rubin, 1983):
  • X ⊥ D | p(X)
  • Given p(X), X carries no further information
    about D.
  • Checking balance does not use the outcome
    variable, so it introduces no bias.
  • Balance does not mean we have the correct Xs in
    the model (i.e., balance is not the same as the
    conditional independence assumption, CIA).
  • No convenient tests for conditional independence
    exist.

6
Varieties of Balancing Tests
  • Test 1: Test for equality of each covariate mean
    between groups, within strata of p(X) (t-test).
  • (Done after Step 1, estimating p(X) on the full
    sample.)
  • Test 2: Standardised test of differences (of
    normalised covariates) between groups.
  • (Done after Step 2, matching on p(X).)
  • Test 3: Test for equality of each covariate mean
    between groups (t-test).
  • (Done after Step 2, matching on p(X).)
  • Test 4: Test for joint equality of all covariate
    means between groups (F-test or Hotelling test).
  • (Done after Step 2, matching on p(X).)
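
As a rough illustration of Tests 2 and 3 for a single covariate, the sketch below computes a percent standardised difference and a two-sample t-statistic. It assumes the common definition 100 × (mean_T − mean_C) / sqrt((var_T + var_C) / 2) for the standardised difference and the unequal-variance (Welch) form of the t-statistic; the slides do not say which variants were used, and the function names and data are hypothetical. Matching weights are ignored.

```python
import math
import statistics

def standardised_difference(x_treated, x_control):
    """Percent standardised difference in means for one covariate."""
    pooled_sd = math.sqrt(
        (statistics.variance(x_treated) + statistics.variance(x_control)) / 2
    )
    return 100 * (statistics.mean(x_treated) - statistics.mean(x_control)) / pooled_sd

def welch_t(x_treated, x_control):
    """Two-sample t-statistic without assuming equal variances."""
    se = math.sqrt(
        statistics.variance(x_treated) / len(x_treated)
        + statistics.variance(x_control) / len(x_control)
    )
    return (statistics.mean(x_treated) - statistics.mean(x_control)) / se

# Hypothetical covariate values for matched treated and control units.
treated = [1.0, 2.0, 3.0, 4.0, 5.0]
control = [2.0, 3.0, 4.0, 5.0, 6.0]

std_diff = standardised_difference(treated, control)
t_stat = welch_t(treated, control)
print(f"standardised difference {std_diff:.1f}%, t-statistic {t_stat:.2f}")
```

In an after-matching setting the inputs would be the matched samples; whether the variances in the denominator should come from the matched or the original samples is a choice this sketch does not settle.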

7
Some Other Alternative Before Matching Balancing
Tests
  • QQ plots.
  • Austin and Mamdani (2006); Imai, King and Stuart
    (2006).
  • Box plots.
  • Austin and Mamdani (2006).
  • Binary response plots (Rubin-Cook scatter plots).
  • Lee (2006a).
  • Undirected graphical models.
  • Lee (2006b).

8
Some Other Alternative After Matching Balancing
Tests
  • Regression test.
  • Smith and Todd (2005).
  • Pseudo-R².
  • Sianesi (2004).

9
Motivating Example: NSW Data
  • This experimental data set was used in several
    studies to perform a recovery exercise.
  • See, for example, Dehejia and Wahba (1999, 2002)
    and Smith and Todd (2005).
  • Dehejia and Wahba (1999) conducted test 1,
    performed stratification and nearest neighbour
    matching, and obtained estimates similar to the
    experimental estimates.
  • Concluded that balancing test 1 is useful.

10
  • But Dehejia and Wahba (1999) did not conduct
    tests 2 to 4. What would have happened if they
    had?
  • After estimating p(X), balance is obtained using
    test 1.
  • After performing kernel matching using the same
    specification of p(X), balance is obtained if we
    use tests 2 to 4.
  • However, after performing nearest neighbour
    matching using the same specification of p(X),
    imbalance is obtained if we use tests 2 to 4.
  • In summary, Dehejia and Wahba's (1999) results
    from nearest neighbour matching that replicated
    the experimental benchmark came from a matched
    sample with imbalanced covariates.

11
  • Which balancing test should be used in practice?
  • Is the within strata t-test (Test 1) useful as a
    specification test for p(X)?
  • The approach of Dehejia and Wahba (using test 1
    together with nearest neighbour matching) was
    still in use as recently as Diaz and Handa
    (2006).
  • Issue of multiple testing (e.g., Westfall and
    Young 1993).

12
Monte Carlo Simulations
  • Generating balanced data
  • If the error term in the treatment assignment
    equation is independent of X, then given X and β:
  • D ⊥ X | Xβ
  • It follows that D ⊥ X | logit(Xβ), or D ⊥ X |
    p(X).
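
One way to generate data of this kind is sketched below. It is a minimal example under assumptions that are not in the slides: standard normal covariates, a logistic error term drawn independently of X, and hypothetical values for the intercept and β. Treatment is assigned as D = 1 when the index plus the error is positive, so the true propensity score is the logistic function of the index.

```python
import math
import random

def simulate_treatment(n, beta, intercept, seed=0):
    """Draw X ~ N(0, I) and set D = 1 when intercept + X.beta + eps > 0,
    where eps is a logistic error drawn independently of X."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x = [rng.gauss(0.0, 1.0) for _ in beta]
        index = intercept + sum(b * xi for b, xi in zip(beta, x))
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)  # keep the log finite
        eps = math.log(u / (1.0 - u))                    # logistic draw
        d = 1 if index + eps > 0 else 0
        p_true = 1.0 / (1.0 + math.exp(-index))          # true p(X)
        rows.append((x, d, p_true))
    return rows

# Hypothetical parameter values, chosen to give roughly one treated unit in five.
rows = simulate_treatment(n=2000, beta=[0.5, 0.5], intercept=-1.5)
share_treated = sum(d for _, d, _ in rows) / len(rows)
mean_p = sum(p for _, _, p in rows) / len(rows)
print(f"share treated {share_treated:.3f}, mean true p(X) {mean_p:.3f}")
```

Because the error is independent of X, any dependence of D on X runs through the index alone, which is the property the balancing tests are meant to check.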

13
Monte Carlo Simulations using Generated Data
  • The simulations
  • assume a treated-to-control (T-C) ratio of 20:80.
  • assume we know which Xs to use to estimate the
    true propensity score (CIA).
  • vary the number and distribution of covariates
    and the sample size.
  • Test 1 performs terribly in terms of test size.
  • But it seems to work well with a Bonferroni
    correction.
  • Tests 2 to 4 appear to have poor test sizes when
    there are more than 2 covariates.
  • In current practice, researchers often look at
    mean or median values (e.g., the mean
    standardised difference) instead of using a "one
    unbalanced covariate and you're out" rule.
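
A short calculation shows why running many t-tests at once inflates the test size and how a Bonferroni correction offsets it. The sketch assumes the individual tests are independent, which will not hold exactly for correlated covariates, and the figures of 8 covariates and 5 strata are hypothetical.

```python
def familywise_error(alpha, n_tests):
    """Probability of at least one false rejection across independent tests."""
    return 1 - (1 - alpha) ** n_tests

def bonferroni_alpha(alpha, n_tests):
    """Per-test significance level under a Bonferroni correction."""
    return alpha / n_tests

n_tests = 8 * 5   # hypothetical: 8 covariates tested within each of 5 strata
alpha = 0.05

uncorrected = familywise_error(alpha, n_tests)
corrected = familywise_error(bonferroni_alpha(alpha, n_tests), n_tests)
print(f"uncorrected {uncorrected:.3f}, with Bonferroni {corrected:.3f}")
```

Under these assumptions the uncorrected procedure rejects balance in the large majority of balanced samples (roughly 0.87), while the corrected one stays just under the nominal 0.05.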

14
Monte Carlo Simulations using NSW Data
  • None of the balancing tests appear to work well.
  • For example, test 1 rejects balance about 20% of
    the time when α = 5%.
  • Considered the issue of outliers but dropping
    these observations did not change the results.
  • The only way to make things work appears to be
    dropping difficult to balance covariates.
  • But this is not a satisfactory solution!

15
Permutation Tests
  • Instead of using the t-distribution for tests 1
    and 3, or the Hotelling distribution for test 4,
    we use permutation distributions.
  • A similar approach is used in Abadie (2002), in
    the context of the Kolmogorov-Smirnov statistic
    performing poorly in the presence of point
    masses.
  • The basic idea is to rearrange the labels on the
    observations, compute the test statistic and
    repeat many times to obtain the permutation
    distribution of the test statistic.
  • Permutation resampling is done without
    replacement.
  • Monte Carlo simulations using the NSW data show
    balancing tests attain approximately the correct
    test sizes.
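
The relabelling idea can be sketched for a single covariate as follows. This is a simplified illustration, not the procedure used in the presentation: the test statistic is the absolute difference in means rather than a t or Hotelling statistic, matching weights are ignored, and the function name and data are hypothetical.

```python
import random

def permutation_p_value(x_treated, x_control, n_perm=2000, seed=0):
    """Two-sided permutation p-value for a difference in means."""
    rng = random.Random(seed)
    n_t = len(x_treated)
    n_c = len(x_control)
    observed = abs(sum(x_treated) / n_t - sum(x_control) / n_c)
    pooled = list(x_treated) + list(x_control)
    extreme = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)   # relabel observations, sampling without replacement
        diff = abs(sum(pooled[:n_t]) / n_t - sum(pooled[n_t:]) / n_c)
        if diff >= observed:
            extreme += 1
    # Count the observed labelling as one of the permutations.
    return (extreme + 1) / (n_perm + 1)

# Hypothetical covariate values.
treated = [5.1, 4.8, 5.5, 5.0, 4.9, 5.3, 5.2, 4.7]
shifted_control = [1.2, 0.9, 1.5, 1.1, 0.8, 1.3, 1.0, 1.4]

p_shifted = permutation_p_value(treated, shifted_control)
p_same = permutation_p_value(treated, treated)
print(f"shifted groups p = {p_shifted:.4f}, identical groups p = {p_same:.4f}")
```

Balance is rejected for a covariate when its p-value falls below the chosen size; identical groups give a p-value of 1 because every relabelling is at least as extreme as the observed difference of zero.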

16
Power of the Tests
  • What happens when there is an omitted variable in
    estimating p(X)? Using the NSW data, consider
    three data generating processes (DGPs).
  • DGP1: p(X) contains RE74 and Y contains RE74.
  • DGP2: p(X) contains RE74 and Y does not contain
    RE74.
  • DGP3: p(X) does not contain RE74 and Y contains
    RE74.
  • Estimate the propensity score using a set of
    variables that excludes RE74 (i.e., omitted
    variable).
  • Under all three DGPs, balance is rejected at
    approximately the chosen size.
  • Balancing tests could not detect the
    misspecification in p(X).
  • Bias on ATT largest for DGP1.
  • Smaller biases on ATT for DGP2 and DGP3.
  • When the CIA is not fulfilled, balancing tests
    with low type 1 error rates are of limited use
    (i.e., Balance ≠ CIA).

17
Conclusions
  • p(X) is a relative measure and not a permanent ID
    tag or permanent summary index score associated
    with each observation.
  • Matching creates weights that effectively change
    the composition of the sample.
  • When the sample changes, it changes the nature of
    the balancing hypothesis X ⊥ D | p(X).

18
  • Important to distinguish between before matching
    and after matching balancing tests.
  • Test 1 is a before matching test and most
    appropriately used with matching by
    stratification (i.e., ATT is computed using the
    exact same strata as test 1).
  • Tests 2 to 4 are after matching balancing tests
    and most appropriately used with matching
    algorithms that match on p(X).

19
  • The Dehejia-Wahba (DW) test, i.e., test 1 as
    described in Dehejia and Wahba (1999, 2002), has
    a poor test size when used as a before matching
    test.
  • Conventional t-tests and Hotelling-tests do not
    appear to work well as tests for after matching
    balance.
  • This is related to the problem of computing
    standard errors for matching estimators, which is
    still an open problem (i.e., no analytic
    solution).
  • Balancing tests based on permutation tests appear
    to provide good test sizes.
  • But without fulfilling the CIA, their role as a
    diagnostic is limited.

20
  • The End