Towards Best Methodological Practice: Applying CANCEIS for the Imputation of Continuous Data

1
Towards Best Methodological Practice: Applying
CANCEIS for the Imputation of Continuous Data
  • Steve Rogers, Heather Wagstaff and
    Vanessa Fearn, Office for National Statistics

2
Overview
  • Part of an ongoing research programme:
    identifying best practice in the imputation of
    social surveys
  • Focus on imputing earnings from paid employment:
    gross, net, pay period (e.g., wk, mnth, yr,
    etc.)
  • Significant contributor to estimates of
    wealth/poverty: national pop. & sub-pop. (age,
    gender, ethnicity, region, etc.)
  • Bias introduced by a poor imputation strategy can
    have serious consequences
  • Establishing best practice is not straightforward

3
Overview
  • Methodological choices: a flavour
  • Micro-simulation: an experimental platform to aid
    decision making
  • Early results: a benchmark for continuing
    research
  • Where next?

4
Methodological choices
  • First principles: the best statistical framework?
  • Simplest case, cross-sectional data: two primary
    methods
  • Linear regression estimators
  • Modelling approach, particularly suited to
    continuous data
  • Donor-based methods
  • Approach based on borrowing values from donors
    with similar characteristics; usually associated
    with categorical data

6
Methodological choices
  • Key aim 1: preserve relationship between gross,
    net, and period
  • Trade-off 1
  • Linear regression estimators
  • Typically, impute 1 variable at a time
  • Need several models
  • 1. Including gross when net was missing
  • 2. Including net when gross was missing
  • 3. Period, joint patterns of missingness
  • Complex strategy: would have to break up data
    (more resources, risk of contamination)

7
Methodological choices
  • Key aim 1: preserve relationship between gross,
    net, and period
  • Trade-off 1
  • Donor-based methods
  • Strategies to deal with continuous data
  • Impute all variables simultaneously
  • Easy to preserve, at least, the directionality of
    the relationship between gross & net (gross > net)

9
Methodological choices
  • Any method: a problem with imputing period
  • Example
  • Person (a): net 500 per week
  • Person (b): net 500 per year
  • If period is missing, a poor imputation strategy
    can lead to gross under/over-estimates of yearly
    disposable earnings
  • Tests with typical earnings data: a systematic
    bias
  • Overestimates 23k pp/py, underestimates 7k
    pp/py
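
To make the period problem concrete, the sketch below annualises the two example persons. The multipliers (52 weeks, 12 months) and the function name `annualise` are illustrative assumptions, not part of the presentation.

```python
# Hypothetical multipliers for converting a reported amount to a yearly figure.
PERIODS_PER_YEAR = {"week": 52, "month": 12, "year": 1}


def annualise(amount, period):
    """Convert an amount reported per `period` into a yearly amount."""
    return amount * PERIODS_PER_YEAR[period]


person_a = annualise(500, "week")   # 26000
person_b = annualise(500, "year")   # 500

# If person (b)'s missing period were wrongly imputed as "week",
# their yearly net would be overstated by this amount.
error_if_b_imputed_weekly = annualise(500, "week") - person_b

print(person_a, person_b, error_if_b_imputed_weekly)  # 26000 500 25500
```

The same reported value differs by a factor of 52 depending on the period, so a single wrongly imputed period can swamp any error in the imputed amount itself.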

11
Methodological choices
  • Key aim 2: accurate imputation for
    sub-populations in the data
  • Earnings are usually related to age, gender,
    SOC, SIC, education, hours worked, household
    size/structure, tenure, etc.
  • Accuracy of all statistical imputation
    strategies depends on having a sufficiently large
    sample
  • Social surveys: overall, relatively good sample
    size
  • Income section: n = 8,858
  • Missing: gross 6%, net 4%, period <0.5%

13
Methodological choices
  • Trade-off 2
  • The higher we aim for sub-population accuracy,
    the smaller our effective sample
  • Example: if matching variables are
  • Age group (2), Gender (2): for sub-population
    accuracy, overall sample split into 4 cells;
    impute within cell
  • Age group (4), Gender (2), SOC (25),
  • Gross or net, if present, in 10k income bands
    (20)
  • Overall sample divided into 4,000 cells!
  • Where do we draw the line?
  • Which sub-populations are most important?
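
The arithmetic behind the two examples can be checked with a short sketch. It uses the income-section sample size of 8,858 quoted earlier and assumes, optimistically, that records spread evenly across cells; the function name `average_cell_size` is hypothetical.

```python
from math import prod

n_records = 8858  # income-section sample size quoted on the slides


def average_cell_size(n, category_counts):
    """Return (number of cells, average records per cell) for a cross-classification."""
    cells = prod(category_counts)
    return cells, n / cells


coarse = average_cell_size(n_records, [2, 2])         # age group x gender
fine = average_cell_size(n_records, [4, 2, 25, 20])   # age x gender x SOC x income band

print(coarse)  # (4, 2214.5)
print(fine[0], round(fine[1], 1))
```

With 4,000 cells the average is only about two records per cell, and real data are not evenly spread, so many cells would contain no donors at all.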

16
Methodological choices
  • Identifying best practice is a balancing act!

22
Not knowing the truth...
  • Alternative theoretical and practical methods are
    typically tested in a micro-simulation
    environment
  • Generate a synthetic data-set from observed clean
    records in the data in question (truth-set)
  • Introduce similar patterns of missingness as
    observed in the original data (punched-set)
  • Choose a baseline imputation strategy and impute
    the punched data
  • Because the values for missing data in the
    synthetic data-set are known, the performance of
    the baseline strategy can be evaluated
  • Performance of alternative methods can be
    measured against the baseline benchmark
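
The steps above can be sketched as a small evaluation harness. Everything here is a toy assumption for illustration: the truth-set is random numbers, the baseline is simple mean imputation (not CANCEIS), and the names `punch`, `impute_with_mean` and `mean_abs_error` are hypothetical.

```python
import random


def punch(truth, rate, rng):
    """Return a copy of `truth` with roughly `rate` of the values set to None."""
    return [None if rng.random() < rate else v for v in truth]


def impute_with_mean(punched):
    """Toy baseline: fill each gap with the mean of the observed values."""
    observed = [v for v in punched if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in punched]


def mean_abs_error(truth, punched, imputed):
    """Average absolute error over the punched-out positions only."""
    errors = [abs(t - i) for t, p, i in zip(truth, punched, imputed) if p is None]
    return sum(errors) / len(errors) if errors else 0.0


rng = random.Random(0)
truth = [rng.uniform(100, 900) for _ in range(1000)]  # stand-in truth-set
punched = punch(truth, 0.06, rng)                     # punched-set
imputed = impute_with_mean(punched)                   # baseline imputation
baseline_score = mean_abs_error(truth, punched, imputed)
print(round(baseline_score, 1))
```

An alternative method would be dropped in place of `impute_with_mean` and scored on the same punched-set, so that its score can be compared directly with `baseline_score`.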

23
Current Research: Synthetic data
  • Income data from a typical ONS social survey
  • Income block: n = 8,858
  • Missing: gross 6%, net 4%, period <0.5%

24
Current Research: Synthetic data
  • Basic set of matching variables (8 cells:
    2 x 2 x 2)

25
Current Research: Synthetic data
  • Measured item-level missingness within cell
  • With just 8 cells, need to tread cautiously
  • (% missing okay, but total n within some cells
    very low)

26
Current Research: Synthetic data
  • Retained all clean records; randomly punched out
    the observed % missing within each cell
  • To avoid bias, generated 20 synthetic datasets
  • Missing at random
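
A minimal sketch of punching within cells follows. The cell labels, the per-cell missing rates and the function name `punch_within_cells` are made up for illustration; only the idea of applying each cell's own rate, repeated over 20 datasets, comes from the slide.

```python
import random


def punch_within_cells(records, missing_rate_by_cell, variable, rng):
    """Copy `records`, blanking `variable` in each cell at that cell's own rate."""
    by_cell = {}
    for idx, rec in enumerate(records):
        by_cell.setdefault(rec["cell"], []).append(idx)

    punched = [dict(rec) for rec in records]  # leave the clean records untouched
    for cell, indices in by_cell.items():
        k = round(missing_rate_by_cell.get(cell, 0.0) * len(indices))
        for idx in rng.sample(indices, k):    # random choice within the cell
            punched[idx][variable] = None
    return punched


# Hypothetical clean records: a cell label plus a gross value.
clean = [{"cell": c, "gross": 100.0 + i}
         for i, c in enumerate(["A"] * 50 + ["B"] * 50)]
rates = {"A": 0.06, "B": 0.10}  # illustrative rates, not the survey's

# One punched dataset per seed, 20 in total.
datasets = [punch_within_cells(clean, rates, "gross", random.Random(seed))
            for seed in range(20)]
```

Using a different seed per dataset means each of the 20 copies punches out a different random subset, while the share missing in each cell stays fixed.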

28
Baseline
  • CANCEIS (StatCan): donor-based system
    implemented in the imputation of several ONS
    social surveys
  • Modular system with a raft of user-definable
    parameters: ideal for testing alternative
    methods and strategies
  • Matching variables
  • Hard constraint: 5k income brackets (for gross
    where net was missing and vice versa)
  • Soft constraint: basic set (age (2), SOC (2),
    gender (2))
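
The difference between a hard and a soft constraint can be illustrated as below. This is not CANCEIS's actual matching logic or parameter set; the mismatch count, the record layout and the names `income_bracket` and `choose_donor` are assumptions made for the sketch.

```python
def income_bracket(value, width=5000):
    """Index of the 5k bracket that `value` falls in."""
    return int(value // width)


def choose_donor(recipient, donors, known_var, soft_vars):
    """Pick the donor in the same bracket of `known_var` with the fewest soft mismatches."""
    bracket = income_bracket(recipient[known_var])
    # Hard constraint: donors outside the recipient's bracket are never used.
    eligible = [d for d in donors if income_bracket(d[known_var]) == bracket]
    if not eligible:
        return None

    # Soft constraint: prefer, but do not require, agreement on each variable.
    def mismatches(donor):
        return sum(donor[v] != recipient[v] for v in soft_vars)

    return min(eligible, key=mismatches)


recipient = {"gross": None, "net": 17200, "age": 1, "soc": 2, "gender": 1}
donors = [
    {"id": 1, "gross": 21000, "net": 16500, "age": 1, "soc": 2, "gender": 2},
    {"id": 2, "gross": 23000, "net": 18000, "age": 1, "soc": 2, "gender": 1},
    {"id": 3, "gross": 30000, "net": 22000, "age": 1, "soc": 2, "gender": 1},
]

donor = choose_donor(recipient, donors, "net", ["age", "soc", "gender"])
print(donor["id"])  # 2
```

Donor 3 matches on every soft variable but is excluded outright because its net falls in a different 5k bracket; donor 1 is eligible but loses to donor 2 on gender.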

34
Baseline
  • Period: still working on this! Weighted all
    data up to a year; missingness in gross & net
    only
  • Relationship between gross & net
  • Where only 1 variable was missing (>97%)
  • Borrow the difference between gross & net from
    the donor rather than the value
  • In example: max. observed in range could be up to
    34!
  • Imputing the difference seems a better option
  • Expectations: none!
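
The contrast between borrowing a value and borrowing a difference can be sketched as follows. The figures and function names are hypothetical; the donor is assumed to sit in the same 5k net bracket as the recipient, as under the hard constraint above.

```python
def impute_by_value(recipient, donor, missing_var):
    """Copy the donor's value of the missing variable straight across."""
    filled = dict(recipient)
    filled[missing_var] = donor[missing_var]
    return filled


def impute_by_difference(recipient, donor, missing_var):
    """Apply the donor's gross-minus-net difference to the recipient's known value."""
    diff = donor["gross"] - donor["net"]
    filled = dict(recipient)
    if missing_var == "gross":
        filled["gross"] = recipient["net"] + diff
    else:
        filled["net"] = recipient["gross"] - diff
    return filled


recipient = {"gross": None, "net": 19800}
donor = {"gross": 19500, "net": 15200}  # net in the same 5k bracket as the recipient

by_value = impute_by_value(recipient, donor, "gross")
by_diff = impute_by_difference(recipient, donor, "gross")

print(by_value["gross"], by_diff["gross"])  # 19500 24100
```

Borrowing the value gives a gross of 19,500 against a net of 19,800, which breaks gross > net even though the donor is in the right bracket. Borrowing the difference keeps gross above net whenever the donor's own gross exceeds its net.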

36
Baseline Results
  • Analyses
  • Distributional accuracy
  • Predictive accuracy
  • Level of analyses
  • Variable: aggregate (gross, net)
  • Univariate (age, gender, SOC)
  • Joint distributions
  • Top end of user requirements: distributional
    accuracy at aggregate & univariate levels
  • Simple measure: mean deviation of difference
    between imputed and observed data
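
One possible reading of the simple measure is the mean of (imputed minus true) over the punched-out records, overall and within each level of a univariate breakdown. The sketch below follows that reading; the exact definition used in the study is not given on the slide, and the records and function names are invented for illustration.

```python
def mean_deviation(true_values, imputed_values):
    """Mean of (imputed - true); positive values indicate over-estimation."""
    diffs = [i - t for t, i in zip(true_values, imputed_values)]
    return sum(diffs) / len(diffs)


def mean_deviation_by_group(records):
    """Mean deviation within each level of a grouping variable."""
    groups = {}
    for rec in records:
        groups.setdefault(rec["group"], []).append(rec["imputed"] - rec["true"])
    return {g: sum(d) / len(d) for g, d in groups.items()}


# Hypothetical punched-out records with known true values.
records = [
    {"group": "F", "true": 300.0, "imputed": 320.0},
    {"group": "F", "true": 500.0, "imputed": 470.0},
    {"group": "M", "true": 400.0, "imputed": 430.0},
    {"group": "M", "true": 600.0, "imputed": 610.0},
]

overall = mean_deviation([r["true"] for r in records],
                         [r["imputed"] for r in records])
by_group = mean_deviation_by_group(records)

print(overall)   # 7.5
print(by_group)  # {'F': -5.0, 'M': 20.0}
```

A small aggregate deviation can hide opposite-signed deviations within groups, which is why the analysis is run at both the aggregate and univariate levels.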

37
Baseline Results
  • [Slides 37 to 40: results slides with no text captured in the transcript]

41
Where next?
  • A promising start: good performance for net
    (disposable income); good performance for net
    w.r.t. the working-age group (largest proportion
    of data)
  • Could probably do better for gross, gender, SOC
  • Current research: to identify a better set of
    matching variables
  • Other strategies (trimming, donor selection
    methods, regression estimators, multivariate
    modelling)
  • Longitudinal data
  • Longitudinal data

42
  • Questions?