1
Introduction to the Concepts and Methods of
Impact Evaluation
Martin Ravallion
Development Research Group, World Bank
2
  • 1. The type of program considered
  • 2. The evaluation problem
  • 3. Generic issues
  • 4. Single difference: randomization
  • 5. Single difference: controls for observables
  • 6. Single difference: exploiting program design
  • 7. Double difference
  • 8. Higher-order differencing
  • 9. Instrumental variables
  • 10. Making evaluations more useful

3
1. The type of program
  • Assigned programs without spillover effects
  • some units (individuals, households, villages)
    get the program and some do not
  • and the benefits are largely confined to those
    for whom the program is assigned
  • Possible examples
  • Social fund selects from applicants
  • Workfare: gains to workers and benefiting
    communities; others get nothing
  • Cash transfers to eligible households only
  • Ex-post evaluation
  • But ex post does not mean start late!

4
2. The evaluation problem
  • What do we mean by impact?
  • Impact: the difference between the relevant
    outcome indicator with the program and that
    without it.
  • However, we can never observe someone in two
    different states of nature at the same time.
  • While a post-intervention indicator is
    observed, its value in the absence of the program
    is not, i.e., it is a counter-factual.
  • So all evaluation is essentially a problem of
    missing data. Calls for counterfactual analysis.

5
Naïve comparisons can be deceptive
  • Common practices
  • compare outcomes after the intervention to those
    before, or
  • compare units (people, households, villages) that
    receive the program with those that do not.
  • Potential biases from failure to control for
  • Other changes over time under the counterfactual,
    or
  • Unit characteristics that influence program
    placement.

6
Naïve comparison 1: Before vs. after.
We observe an outcome indicator

[Chart: outcome indicator over time; the intervention date is marked]
7
... and its value rises after the program

[Chart: outcome indicator rising after the intervention]
8
However, we need to identify the counterfactual ...

[Chart: observed outcome vs. the counterfactual path]
9
since only then can we determine the impact of
the intervention

10
Naïve comparison 2: With vs. without.
Impacts on poverty?

[Chart: percent not poor, with vs. without the program]
11
Impacts on poverty?

[Chart: percent not poor]
12

13
How can we do better? The missing-data problem
in evaluation
  • For each unit (person, household, village,)
    there are two possible values of the outcome
    variable
  • The value under the treatment
  • The value under the counterfactual
  • However, we cannot observe both for all units
  • We cannot observe the counterfactual outcomes for
    the treated units
  • Or the outcomes under treatment for the untreated
    units
  • So evaluation is essentially a problem of missing
    data ⇒ counterfactual analysis.

14
Archetypal formulation
15
Archetypal formulation
16
The evaluation problem
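The formulas on slides 14-16 are images that did not carry over into this transcript; a standard potential-outcomes statement of the archetypal formulation and the evaluation problem, with notation assumed here rather than taken from the slides, is:

```latex
% Notation assumed (not from the slides): D_i = 1 if unit i is treated, 0 otherwise;
% Y_i^T = outcome under treatment; Y_i^C = outcome under the counterfactual.
\begin{align*}
  G_i &\equiv Y_i^T - Y_i^C && \text{impact (gain) for unit } i \\
  \mathrm{ATE} &= E(G_i), \qquad \mathrm{ATET} = E(G_i \mid D_i = 1) \\
  Y_i &= D_i\, Y_i^T + (1 - D_i)\, Y_i^C && \text{only this is observed, never both potential outcomes}
\end{align*}
```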
17
Alternative solutions 1
  • Experimental evaluation (Social experiment)
  • Program is randomly assigned, so that everyone
    has the same probability of receiving the
    treatment.
  • In theory, this method is assumption free, but in
    practice many assumptions are required.
  • Pure randomization is rare for anti-poverty
    programs in practice, since randomization
    precludes purposive targeting.
  • Although it is sometimes feasible to partially
    randomize.

18
Alternative solutions 2
  • Non-experimental evaluation (Quasi-experimental
    observational studies)
  • One of two (non-nested) conditional independence
    assumptions

1. Placement is independent of outcomes given X
⇒ single-difference methods assuming
conditionally exogenous program placement.
Or placement is independent of outcome changes
⇒ double-difference methods.
2. A correlate of placement is independent of
outcomes given D and X
⇒ instrumental variables estimator.
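Written formally, with notation assumed here (Y^C = counterfactual outcome, D = program placement, X = observed characteristics, Z = the correlate of placement used as an instrument):

```latex
\begin{align*}
  &\text{1. } Y^C \perp D \mid X           && \Rightarrow \text{single-difference methods} \\
  &\phantom{1. } \Delta Y^C \perp D \mid X && \Rightarrow \text{double-difference methods} \\
  &\text{2. } Y \perp Z \mid D, X          && \Rightarrow \text{instrumental variables estimator}
\end{align*}
```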
19
  • 3. Generic issues
  • Selection bias
  • Spillover effects

20
Selection bias in the outcome difference between
participants and non-participants
(selection bias = 0 with exogenous program placement)
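The decomposition this slide refers to can be written as follows (standard form; notation assumed):

```latex
% B = selection bias in the naive comparison of participants and non-participants.
\begin{align*}
  E(Y \mid D=1) - E(Y \mid D=0)
  &= \underbrace{E\bigl(Y^T - Y^C \mid D=1\bigr)}_{\text{mean impact on the treated}}
   + \underbrace{E\bigl(Y^C \mid D=1\bigr) - E\bigl(Y^C \mid D=0\bigr)}_{B\ =\ \text{selection bias}} \\
  B &= 0 \quad \text{with exogenous program placement.}
\end{align*}
```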
21

Two sources of selection bias
  • Selection on observables
  • Data
  • Linearity in controls?
  • Selection on unobservables
  • Participants have latent attributes that yield
  • higher/lower outcomes
  • One cannot judge if exogeneity is plausible
    without knowing whether one has dealt adequately
    with observable heterogeneity.
  • That depends on program, setting and data.

22

Spillover effects
  • Hidden impacts for non-participants?
  • Spillover effects can stem from
  • Markets
  • Non-market behavior of
    participants/non-participants
  • Behavior of intervening agents
    (governmental/NGO)
  • Example 1: Poor-area programs
  • Aid targeted to poor villages ⇒ local govt.
    response
  • Example 2: Employment Guarantee Scheme
  • an assigned program, but no valid comparison group.

23

4. Randomization: the randomized-out group reveals
the counterfactual
  • As long as the assignment is genuinely random,
    mean impact is revealed
  • ATE is consistently estimated
    (nonparametrically) by the difference between
    sample mean outcomes of participants and
    non-participants.
  • Pure randomization is the theoretical ideal for
    ATE, and the benchmark for non-experimental
    methods.
  • More common: randomization conditional on X
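A sketch of the claim, in the notation assumed above:

```latex
% Under genuinely random assignment, D is independent of (Y^T, Y^C).
\begin{align*}
  \widehat{\mathrm{ATE}}
  = \bar{Y}_{\text{participants}} - \bar{Y}_{\text{non-participants}}
  \;\xrightarrow{\,p\,}\; E(Y^T) - E(Y^C) = \mathrm{ATE}
\end{align*}
```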

24
Examples for developing countries
  • PROGRESA in Mexico
  • Conditional cash transfer scheme
  • 1/3 of the original 500 communities selected were
    retained as controls; public access to data
  • Impacts on health, schooling, consumption
  • Proempleo in Argentina
  • Wage subsidy + training
  • Wage subsidy: impacts on employment, but not
    incomes
  • Training: no impacts, though selective compliance

25
Lessons from practice 1
  • Ethical objections and political
    sensitivities
  • Deliberately denying a program to those who need
    it and providing the program to some who do not.
  • Yes, too few resources to go around. But is
    randomization the fairest solution to limited
    resources?
  • What does one condition on in conditional
    randomizations?
  • Intention-to-treat helps alleviate these concerns
  • ⇒ randomize assignment, but free not to
    participate
  • But even then, the randomized-out group may
    include people in great need.
  • ⇒ Implications for design
  • Choice of conditioning variables.
  • Sub-optimal timing of randomization
  • Selective attrition; higher costs

26
Lessons from practice 2
  • Internal validity: selective compliance
  • Some of those assigned the program choose not to
    participate.
  • Impacts may only appear if one corrects for
    selective take-up.
  • Randomized assignment as IV for participation
  • Proempleo example: impacts of training only
    appear if one corrects for selective take-up

27
Lessons from practice 3
  • External validity: inference for scaling up
  • Systematic differences between characteristics of
    people normally attracted to a program and those
    randomly assigned ("randomization bias":
    Heckman-Smith)
  • One ends up evaluating a different program to the
    one actually implemented
  • ⇒ Difficulty in extrapolating results from a
    pilot experiment to the whole population

28

5. Controls: regression controls and matching
  • 5.1 OLS regression
  • Ordinary least squares (OLS) estimator of impact
    with controls for selection on observables.

29
Even with controls
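The formula on this slide is an image not reproduced in the transcript; the point, in a common-impact regression with notation assumed here, is that controls alone do not remove selection on unobservables:

```latex
\begin{align*}
  Y_i &= \alpha + \beta D_i + \gamma X_i + \varepsilon_i \\
  &\text{even with the controls } X_i,\ \hat{\beta}_{OLS} \text{ is biased for the impact} \\
  &\text{whenever } \mathrm{cov}(D_i, \varepsilon_i \mid X_i) \neq 0 \quad \text{(selection on unobservables).}
\end{align*}
```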
30

5.2 Matching: matched comparators identify the
counterfactual
  • Match participants to non-participants from a
    larger survey.
  • The matches are chosen on the basis of
    similarities in observed characteristics.
  • This assumes no selection bias based on
    unobservable heterogeneity.
  • Mean impact (ATE or ATET) is
    nonparametrically identified.

31
Propensity-score matching (PSM): match on the
probability of participation.
  • Ideally we would match on the entire vector X of
    observed characteristics. However, this is
    practically impossible: X could be huge.
  • PSM: match on the basis of the propensity score
    (Rosenbaum and Rubin)
  • This assumes that participation is independent of
    outcomes given X. If no bias given X, then no bias
    given P(X).
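The Rosenbaum-Rubin result referred to here, in standard (assumed) notation:

```latex
\begin{align*}
  P(X) &\equiv \Pr(D = 1 \mid X) \\
  Y^C \perp D \mid X \;&\Rightarrow\; Y^C \perp D \mid P(X)
  \qquad \text{(no bias given } X \Rightarrow \text{no bias given } P(X)\text{).}
\end{align*}
```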

32
Steps in score matching
1. Representative, highly comparable surveys of
the non-participants and participants.
2. Pool the two samples and estimate a logit (or probit)
model of program participation. Predicted values
are the propensity scores.
3. Restrict samples to assure common support.
Failure of common support is an important source of bias in
observational studies (Heckman et al.).
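Steps 2 and 3 can be sketched in a few lines of Python; the file names, covariate list, and the use of pandas/statsmodels are illustrative assumptions, not part of the original presentation.

```python
# Sketch of steps 2-3: pool the samples, fit a logit for participation,
# keep only observations on common support.  File and column names are
# illustrative assumptions.
import pandas as pd
import statsmodels.api as sm

participants = pd.read_csv("participants.csv")        # hypothetical survey file
nonparticipants = pd.read_csv("nonparticipants.csv")  # hypothetical survey file
participants["D"] = 1
nonparticipants["D"] = 0

# Step 2: pool the two samples and estimate a logit model of participation.
pooled = pd.concat([participants, nonparticipants], ignore_index=True)
covariates = ["age", "education", "household_size"]   # hypothetical X variables
X = sm.add_constant(pooled[covariates])
logit = sm.Logit(pooled["D"], X).fit(disp=0)
pooled["pscore"] = logit.predict(X)                   # predicted propensity scores

# Step 3: restrict both samples to the region of common support.
lo = max(pooled.loc[pooled["D"] == 1, "pscore"].min(),
         pooled.loc[pooled["D"] == 0, "pscore"].min())
hi = min(pooled.loc[pooled["D"] == 1, "pscore"].max(),
         pooled.loc[pooled["D"] == 0, "pscore"].max())
on_support = pooled[(pooled["pscore"] >= lo) & (pooled["pscore"] <= hi)]
print(f"dropped {len(pooled) - len(on_support)} observations off common support")
```

Matching or weighting (steps 5-7 on the next slides) would then be carried out within the common-support sample.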
33
Density of scores for participants
34
Density of scores for non-participants
35
Density of scores for non-participants
36
Steps in PSM (cont.)
5. For each participant, find a sample of
non-participants that have similar propensity
scores.
6. Compare the outcome indicators: the difference is
the estimate of the gain due to the program for
that observation.
7. Calculate the mean of these individual gains to
obtain the average overall gain. Various weighting
schemes ⇒
37
The mean impact estimator
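The estimator on this slide is an image not in the transcript; a standard form of the matching estimator of the mean impact on the treated (notation assumed) is:

```latex
% w_ij are the weights given to comparison unit j when matched to treated unit i;
% nearest-neighbour, kernel and radius matching differ only in the choice of w_ij.
\begin{align*}
  \widehat{\mathrm{ATET}}
  = \frac{1}{n_T} \sum_{i \,\in\, \{D=1\}}
    \Bigl( Y_i - \sum_{j \,\in\, \{D=0\}} w_{ij}\, Y_j \Bigr)
\end{align*}
```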
38
Propensity-score weighting
  • PSM removes bias under the conditional exogeneity
    assumption.
  • However, it is not the most efficient estimator.
  • Hirano, Imbens and Ridder show that weighting the
    control observations according to their
    propensity score yields a fully efficient
    estimator.
  • Regression implementation for the common-impact
    model, with weights of unity for the treated units
    and propensity-score-based weights for the
    controls (see the sketch below).
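The weights themselves are not reproduced in the transcript; the usual Hirano-Imbens-Ridder implementation (assumed here) is weighted least squares on the common-impact model with:

```latex
\begin{align*}
  Y_i &= \alpha + \beta D_i + \gamma X_i + \varepsilon_i
  \qquad \text{estimated by WLS with weights} \\
  \omega_i &= 1 \ \text{for treated units}, \qquad
  \omega_i = \frac{\hat{P}(X_i)}{1 - \hat{P}(X_i)} \ \text{for controls.}
\end{align*}
```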

39
How does PSM compare to an experiment?
  • PSM is the observational analogue of an
    experiment in which placement is independent of
    outcomes
  • The difference is that a pure experiment does not
    require the untestable assumption of independence
    conditional on observables.
  • Thus PSM requires good data.
  • Example of Argentina's Trabajar program
  • Plausible estimates using SD matching on good
    data
  • Implausible estimates using weaker data

40
How does PSM differ from OLS?
  • PSM is a non-parametric method (fully
    non-parametric in outcome space; optionally
    non-parametric in assignment space)
  • Restricting the analysis to common support
  • ⇒ PSM weights the data very differently to
    standard OLS regression
  • In practice, the results can look very different!

41
How does PSM perform relative to other methods?
  • In comparisons with results of a randomized
    experiment on a US training program, PSM gave a
    good approximation (Heckman et al.; Dehejia and
    Wahba)
  • Better than the non-experimental regression-based
    methods studied by LaLonde for the same program.
  • However, robustness has been questioned (Smith
    and Todd)

42
Lessons on matching methods
  • When neither randomization nor a baseline survey
    is feasible, careful matching is crucial to
    control for observable heterogeneity.
  • Validity of matching methods depends heavily on
    data quality: highly comparable surveys; similar
    economic environment.
  • Common support can be a problem (esp. if
    treatment units are lost).
  • Look for heterogeneity in impact: average impact
    may hide important differences in the
    characteristics of those who gain or lose from
    the intervention.

43

6. Exploiting program design 1
  • Discontinuity designs
  • Participate if score M < m
  • Impact: the difference in mean outcomes on either
    side of the cutoff (see the sketch below)
  • Key identifying assumption: no discontinuity in
    counterfactual outcomes at m.
  • Strict eligibility rules alone do not make this
    plausible (e.g., geography and local govt.)
  • Fuzzy discontinuities in prob. participation.
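A sketch of the discontinuity-design estimand, assuming participation for scores below the cutoff m as stated above:

```latex
\begin{align*}
  \text{Impact at } m
  \;=\; \lim_{M \uparrow m} E(Y \mid M) \;-\; \lim_{M \downarrow m} E(Y \mid M)
\end{align*}
```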

44

Exploiting program design 2
  • Pipeline comparisons
  • Applicants who have not yet received the program
    form the comparison group
  • Assumes exogenous assignment amongst applicants
  • The comparison group reflects latent selection
    into the program

45
Lessons from practice
  • Know your program well: program design features
    can be very useful for identifying impact.
  • Know your setting well too: is it plausible that
    outcomes are continuous under the counterfactual?
  • But what if you end up changing the program to
    identify impact? You have evaluated something
    else!

46

7. Difference-in-difference
  • Observed changes over time for non-participants
    provide the counterfactual for participants.
  • Steps
  • Collect baseline data on non-participants and
    (probable) participants before the program.
  • Compare with data after the program.
  • Subtract the two differences, or use a regression
    with a dummy variable for participant.
  • This allows for selection bias but it must be
    time-invariant and additive.

47
  • Outcome indicator $Y_{it}$ observed for $t = 0, 1$:
    $Y_{it} = Y^{C}_{it} + G_{it} D_{it}$
  • where
  • $G_{it}$ = impact (gain)
  • $Y^{C}_{it}$ = counterfactual outcome,
  • estimated from the comparison group

48
Difference-in-difference =
(post-intervention difference in outcomes)
minus (baseline difference in outcomes),
or, equivalently,
(gain over time for treatment group)
minus (gain over time for comparison group)
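Putting slides 47-48 together in the assumed notation (group means for the treatment group T and comparison group C, periods t = 0, 1):

```latex
\begin{align*}
  DD &= \bigl(\bar{Y}^{T}_{1} - \bar{Y}^{C}_{1}\bigr) - \bigl(\bar{Y}^{T}_{0} - \bar{Y}^{C}_{0}\bigr)
      && \text{post-intervention difference minus baseline difference} \\
     &= \bigl(\bar{Y}^{T}_{1} - \bar{Y}^{T}_{0}\bigr) - \bigl(\bar{Y}^{C}_{1} - \bar{Y}^{C}_{0}\bigr)
      && \text{gain for the treatment group minus gain for the comparison group} \\
  Y_{it} &= \alpha + \phi\, t + \delta\, D_i + DD\,(D_i \times t) + \varepsilon_{it}
      && \text{equivalent regression with a participant dummy (slide 46)}
\end{align*}
```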
49
  • Diff-in-diff identifies the mean impact on the
    treated
  • if (i) the change over time for the comparison
    group reveals the counterfactual,
  • and (ii) the baseline is uncontaminated by the
    program.

50
Selection bias

51
Diff-in-diff requires that the bias is additive
and time-invariant
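Formally (notation assumed):

```latex
\begin{align*}
  B_t &\equiv E\bigl(Y^{C}_{t} \mid D = 1\bigr) - E\bigl(Y^{C}_{t} \mid D = 0\bigr)
      && \text{selection bias in period } t \\
  &\text{DD removes the bias only if } B_1 = B_0
      && \text{(additive and time-invariant).}
\end{align*}
```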

52
The method fails if the comparison group is on a
different trajectory

⇒ DD overestimates impact
53
Or, if the comparison group is on a faster trajectory:
  • DD underestimates impact
  • Common problem in assessing impacts of
    development projects?

54
Example of poor-area programs: areas not
targeted yield a biased counterfactual

[Chart: income over time for targeted vs. not-targeted areas]
  • The growth process in non-treatment areas is
    not
  • indicative of what would have happened in the
  • targeted areas without the program
  • Example from China (Jalan and Ravallion)

55
  • Matched double difference
  • Matching helps control for time-varying
  • selection bias
  • Score match participants and non-participants
    based on observed characteristics in baseline
  • Initial conditions (incl. outcomes)
  • Prior outcome trajectories
  • Then do a double difference
  • This deals with observable heterogeneity in
    initial conditions that can influence subsequent
    changes over time

56
Propensity-score weighted version of
matched diff-in-diff.
  • Weighting the control observations according to
    their propensity score yields a fully efficient
    estimator (Hirano, Imbens and Ridder).
  • Regression implementation, with weights of unity
    for the treated units and propensity-score-based
    weights for the controls (see the sketch below).
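A sketch of what this amounts to, assuming the same Hirano-Imbens-Ridder weights as in the single-difference case, applied to the change in outcomes:

```latex
\begin{align*}
  Y_{i1} - Y_{i0} &= \alpha + \beta D_i + \varepsilon_i
  \qquad \text{estimated by WLS with weights} \\
  \omega_i &= 1 \ \text{(treated)}, \qquad
  \omega_i = \frac{\hat{P}(X_i)}{1 - \hat{P}(X_i)} \ \text{(controls).}
\end{align*}
```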

57
Fixed effects model
  • Fixed effects model on a balanced panel (see the
    sketch below)
  • Note:
  • adding an individual fixed effect picks up any
    differences in time-mean latent factors.
  • One does not require a balanced panel to estimate
    DD.
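The model on this slide is not in the transcript; a standard fixed-effects specification consistent with the DD set-up (notation assumed) is:

```latex
\begin{align*}
  Y_{it} &= \alpha_i + \gamma_t + \beta D_{it} + \varepsilon_{it}, \qquad t = 0, 1 \\
  &\alpha_i \text{ (the individual effect) absorbs time-mean latent factors;} \\
  &\text{first-differencing or the within transformation removes it, giving the DD estimate of } \beta.
\end{align*}
```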

58
Lessons from practice
  • Single-difference matching can be severely
    contaminated by selection bias
  • Latent heterogeneity in factors relevant to
    participation
  • Tracking individuals over time allows a double
    difference
  • This eliminates all time-invariant additive
    selection bias
  • Combining double difference with matching
  • This allows us to eliminate observable
    heterogeneity in factors relevant to subsequent
    changes over time

59
8. Higher-order differencing
  • Pre-intervention baseline data unavailable
  • e.g., safety net intervention in response to a
    crisis
  • Can impact be inferred by observing participants'
    outcomes in the absence of the program after the
    program?

60
New issues
  • Selection bias from two sources
  • 1. decision to join the program
  • 2. decision to stay or drop out
  • There are observed and unobserved characteristics
    that affect both participation and income in the
    absence of the program
  • Past participation can bring current gains for
    those who leave the program

61
Double-Matched Triple Difference
  • 1. Match participants with a comparison group of
    non-participants
  • 2. Match leavers and stayers
  • 3. Compare gains to continuing participants with
    those who drop out (Ravallion et al.)
  • Triple Difference (DDD)
  • DD for stayers minus DD for leavers

62
  • Outcomes for participants
  • Single difference
  • Double difference
  • Triple difference:
  • (DD for stayers in period 2) minus
    (DD for leavers in period 2)
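The formulas on this slide are images; in the assumed notation (treatment group T, comparison group C, with DD computed over the post-program periods), the three estimators are:

```latex
\begin{align*}
  \text{Single difference:} \quad & \bar{Y}^{T}_{t} - \bar{Y}^{C}_{t} \\
  \text{Double difference:} \quad & \bigl(\bar{Y}^{T}_{2} - \bar{Y}^{T}_{1}\bigr) - \bigl(\bar{Y}^{C}_{2} - \bar{Y}^{C}_{1}\bigr) \\
  \text{Triple difference:} \quad & DD_{\text{stayers in period 2}} - DD_{\text{leavers in period 2}}
\end{align*}
```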

63
  • Joint conditions for DDD to estimate impact
  • no current gain to ex-participants
  • no selection bias in who leaves the program

64
Test for whether DDD identifies gain to current
participants
  • Third round of data allows a test: mean gains
    in round 2 should be the same whether or not one
    drops out in round 3

(Gain in round 2 for stayers in round 3) =
(Gain in round 2 for leavers in round 3)
65
Lessons from practice
  • 1. Tracking individuals over time
  • addresses some of the limitations of
    single-difference on weak data
  • allows us to study the dynamics of recovery
  • 2. Baseline can be after the program, but must
    address the extra sources of selection bias
  • 3. Single difference for leavers vs. stayers can
    work well if there is an exogenous program
    contraction

66
9. Instrumental variables: identifying exogenous
variation using a third variable
  • Outcome regression
  • (D = 0, 1 is our program dummy; it is not random)
  • Instrument (Z) influences participation, but
    does not affect outcomes given participation (the
    exclusion restriction).
  • This identifies the exogenous variation in
    outcomes due to the program.
  • Treatment regression

67
Reduced-form outcome regression: substitute the
treatment regression into the outcome regression.
Instrumental variables (two-stage least squares)
estimator of impact: use predicted D, purged of its
endogenous part (see the sketch below).
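The equations on slides 66-67 are images; a standard statement of the IV set-up they describe (notation assumed) is:

```latex
\begin{align*}
  \text{Outcome regression:}   \quad & Y_i = \alpha + \beta D_i + \varepsilon_i \\
  \text{Treatment regression:} \quad & D_i = \gamma + \delta Z_i + \nu_i
      \qquad \text{with } Z \text{ excluded from the outcome equation} \\
  \text{Reduced form:}         \quad & Y_i = (\alpha + \beta\gamma) + \beta\delta\, Z_i + (\varepsilon_i + \beta \nu_i) \\
  \text{IV (2SLS) estimator:}  \quad & \hat{\beta}_{IV}
      = \frac{\widehat{\mathrm{cov}}(Z, Y)}{\widehat{\mathrm{cov}}(Z, D)}
      \quad \text{or regress } Y \text{ on } \hat{D}\text{, the predicted } D \text{ purged of its endogenous part.}
\end{align*}
```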
68
Problems with IVE
  • 1. Finding valid IVs
  • Usually easy to find a variable that is
    correlated with treatment.
  • However, the validity of the exclusion
    restrictions is often questionable.
  • 2. Impact heterogeneity due to latent factors

69
Sources of instrumental variables
  • Partially randomized designs as a source of IVs
  • Non-experimental sources of IVs
  • Geography of program placement (Attanasio and
    Vera-Hernandez); dams example (Duflo and Pande)
  • Political characteristics (Besley and Case;
    Paxson and Schady)
  • Discontinuities in survey design

70
Endogenous compliance: instrumental variables
estimator
  • D = 1 if treated, 0 if control
  • Z = 1 if assigned to treatment, 0 if not.
  • Compliance regression
  • Outcome regression (intention-to-treat
    effect)
  • 2SLS estimator (ITT deflated by the
    compliance rate)
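In the assumed notation, with randomized assignment Z used as an IV for actual treatment D:

```latex
\begin{align*}
  \text{Compliance (first stage):} \quad & E(D \mid Z=1) - E(D \mid Z=0) \\
  \text{Intention-to-treat (ITT):} \quad & E(Y \mid Z=1) - E(Y \mid Z=0) \\
  \text{2SLS estimator:}           \quad &
    \hat{\beta} = \frac{E(Y \mid Z=1) - E(Y \mid Z=0)}{E(D \mid Z=1) - E(D \mid Z=0)}
    \qquad \text{(ITT deflated by the compliance rate).}
\end{align*}
```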

71
Essential heterogeneity and IVE
  • Common-impact specification is not harmless.
  • Heterogeneity in impact can arise from
    differences between treated units and the
    counterfactual in latent factors relevant to
    outcomes.
  • For consistent estimation of ATE we must assume
    that selection into the program is unaffected by
    latent, idiosyncratic, factors determining the
    impact (Heckman et al).
  • However, likely winners will no doubt be
    attracted to a program, or be favored by the
    implementing agency.
  • ⇒ IVE is biased even with ideal IVs.

72
Stylized example
  • Two types of people (1/2 of each)
  • Type H: high impact; large gains (G) from the program
  • Type L: low impact; no gain
  • Evaluator cannot tell which is which
  • But the people themselves can tell (or have a
    useful clue)
  • Randomized pilot:
  • half goes to each type
  • Impact = G/2
  • Scaled-up program:
  • Type H select into the program; Type L do not
  • Impact = G

73
IVE is only a local effect
  • IVE identifies the effect for those induced to
    switch by the instrument (the local average
    treatment effect)
  • Suppose Z takes 2 values. Then the effect of
    the program is the ratio shown below.
  • Care is needed in extrapolating to the whole
    population when there is latent heterogeneity.
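For a binary instrument this is the standard Wald ratio (the same expression as on slide 70), interpreted as the mean gain for those induced to switch by Z:

```latex
\begin{align*}
  \mathrm{LATE}
  = \frac{E(Y \mid Z=1) - E(Y \mid Z=0)}{E(D \mid Z=1) - E(D \mid Z=0)}
\end{align*}
```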

74
Local instrumental variables
  • LIV directly addresses the latent heterogeneity
    problem.
  • The method entails a nonparametric regression
    of outcomes Y on the propensity score.
  • The slope of the regression function
    gives the marginal impact at the data point.
  • This slope is the marginal treatment effect
    (Björklund and Moffitt),
  • from which any of the standard impact
    parameters can be calculated (Heckman and
    Vytlacil).
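The standard definition behind this slide (notation assumed):

```latex
% Marginal treatment effect as the slope of the nonparametric regression of Y
% on the propensity score P(X).
\begin{align*}
  \mathrm{MTE}(p) = \frac{\partial\, E\bigl(Y \mid P(X) = p\bigr)}{\partial p}
\end{align*}
```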

75
Lessons from practice
  • Partially randomized designs offer a great source
    of IVs.
  • The bar has risen in standards for
    non-experimental IVE:
  • past exclusion restrictions are often questionable
    in developing-country settings
  • However, defensible options remain in practice,
    often motivated by theory and/or other data
    sources
  • Future work is likely to emphasize latent
    heterogeneity of impacts, esp., using LIV.

76
10. Making evaluations more useful
  • Evaluations are often not as relevant for
    practitioners as they could be
  • 10 steps to more policy-relevant evaluations

77
Step 1 Make the policy questions the starting
point
  • Start with the questions and remain eclectic on
    methods of answering them
  • Policy relevant evaluations must start with
    interesting and important questions.
  • But instead many evaluators start with a
    preferred method and look for questions that can
    be addressed with that method.
  • By constraining evaluative research to situations
    in which one favorite method is feasible,
    research may exclude many of the most important
    and pressing development questions.
  • Make sure the evaluation process is linked to the
    project. Common failures:
  • the evaluator did not know enough about the setting
    and the project
  • data collection started too late
  • data collection did not cover the right outcomes
    and did not allow for adequate controls
  • too many monitoring indicators; too few
    outcomes and controls
  • the evaluation did not address, or even ask, the
    right questions!

78
Step 2 Take seriously the ethical objections and
political sensitivities (policy makers do!)
  • Pilots (using NGOs) can often get away with
    methods not acceptable to governments accountable
    to voters.
  • Deliberately denying a program to those who need
    it and providing the program to some who do not.
  • - Is randomization the fairest solution to
    limited resources?
  • - What does one condition on in conditional
    randomizations?
  • Key problem: the information available to the
    evaluator (for conditioning impacts) is a partial
    subset of the information available on the
    ground.

79

Step 3 Take a comprehensive approach to the
sources of bias
  • Two sources of selection bias: observables and
    unobservables (to the evaluator)
  • Some economists have become obsessed with the
    latter bias, while ignoring innumerable other
    biases/problems:
  • less-than-ideal methods of controlling for
    observable heterogeneity, including ad hoc models
    of outcomes.
  • Evidence that we have given too little
    attention to the problem of selection bias based
    on observables.
  • Arbitrary preferences for one conditional
    independence assumption (exclusion restrictions)
    over another (conditional exogeneity of
    placement)
  • One cannot scientifically judge appropriate
    assumptions/methods independently of program,
    setting and data.

80

Step 4 Look for spillover effects
  • Are there hidden impacts for non-participants?
  • Look for signs of spillover effects stemming
    from
  • Markets
  • Behavior of participants/non-participants
  • Behavior of intervening agents
    (governmental/NGO)

81

Step 5 Take a sectoral approach, recognizing
fungibility/flypaper effects
  • Fungibility
  • You are not in fact evaluating what the extra
    public resources (incl. aid) actually financed.
  • So your evaluation may be deceptive about the
    true impact of those resources.
  • Flypaper effects
  • Impacts may well be found largely within the
    sector.
  • Need for a broad sectoral approach

82
Step 6 Look for impact heterogeneity
  • Impacts vary with participant characteristics
    (including those not observed by the evaluator)
    and context.
  • Participant heterogeneity
  • Interaction effects
  • Essential heterogeneity in participant responses
  • Implications for
  • evaluation methods
  • project design
    external validity (generalizability) ⇒
  • Contextual heterogeneity
  • In certain settings anything works, in others
    everything fails
  • Local institutional factors in development impact
  • Example of Bangladesh's Food-for-Education
    program
  • Same program works well in one village, but fails
    hopelessly nearby

83
Step 7 Take scaling up seriously
  • With scaling up
  • Inputs change
  • Entry effects: the nature and composition of those
    who sign up changes with scale.
  • Migration responses.
  • Intervention changes
  • Resources effects on the intervention
  • Outcome changes
  • Lags in outcome responses
  • Market responses (partial equilibrium assumptions
    are fine for a pilot but not when scaled up)
  • Social effects/political economy effects: early
    vs. late capture.
  • But there has been little work on external
    validity and scaling up.

84
Step 8 Understand what determines impact
  • Replication across differing contexts
  • Example of Bangladesh's FFE
  • inequality etc. within the village ⇒ outcomes of
    the program
  • Implications for sample design ⇒ trade-off
    between precision of overall impact estimates and
    ability to explain impact heterogeneity
  • Intermediate indicators
  • Example of China's SWPRP
  • Small impact on consumption poverty
  • But large share of gains were saved
  • Qualitative research/mixed methods
  • Test the assumptions (theory-based evaluation)
  • But a poor substitute for assessing impacts on
    final outcomes

In understanding impact, Step 9 is key ⇒
85
Step 9 Don't reject theory and structural
modeling
  • Standard evaluations are black boxes: they give
    policy effects in specific settings but not
    structural parameters (as relevant to other
    settings).
  • Structural methods allow us to simulate changes
    in program design or setting.
  • However, assumptions are needed. (The same is
    true for black box social experiments.) That is
    the role of theory.
  • PROGRESA example (Attanasio et al.; Todd and
    Wolpin)
  • Modeling schooling choices using randomized
    assignment for identification
  • Budget-neutral switch from primary to secondary
    subsidy would increase impact

86
Step 10 Develop capabilities for evaluation
within countries
  • Strive for a culture of evidence-based evaluation
    practice.
  • China example: "Seeking truth from facts"; the role
    of research
  • Evaluation is a natural addition to the roles of
    the government's sample survey unit.
  • Independence/integrity should already be in
    place.
  • Connectivity to other public agencies may be a
    bigger problem.
  • Sometimes a private evaluation capability will
    still be required.