Diagnostics for the evaluation of imputed data

Slides: 16
Provided by: heatherw5
Learn more at: https://unece.org

1
Diagnostics for the evaluation of imputed data
  • Heather Wagstaff and Steve Rogers
  • Methodology Directorate
  • Office for National Statistics, U.K.

2
Overview of the Presentation
  • The presentation is structured as follows
  • Introduction and background
  • Fully controlled simulation environment
  • Baseline analysis
  • Limitations of selected statistical criteria
  • Description of proposed heuristic
  • Example application of the heuristic
  • Conclusions and future work

3
Introduction and Background
  • Evaluating the quality of the imputation of large
    datasets
  • mainly tabular outputs
  • ensure statistical properties maintained
  • emphasis on distributional accuracy
  • hence predictive accuracy is of lesser importance.
  • ONS endorsed CANCEIS as corporate edit and
    imputation tool
  • where the data are mainly nominal
  • implementation on household and individual surveys
  • Registration of Life Events
  • 2011 Census of Population and Households.

4
Introduction and Background
  • Selected Statistical Evaluation Criteria
  • Distributional accuracy
  • Stuart-Maxwell significance test
  • Predictive accuracy
  • 1. Kappa coefficient
  • 2. Large sample variance of Kappa
  • 3. Proportion of true values recovered
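
The criteria listed above can all be computed from a square cross-tabulation of true values (rows) against imputed values (columns). The following is a minimal sketch in plain Python, not the authors' implementation. The slides do not say which large-sample variance of Kappa was used, so Cohen's simple approximation is assumed here; the function names and the example table are hypothetical.

```python
def _totals(table):
    """Row totals, column totals and grand total of a square table."""
    k = len(table)
    rows = [sum(table[i]) for i in range(k)]
    cols = [sum(table[i][j] for i in range(k)) for j in range(k)]
    return rows, cols, sum(rows)


def proportion_recovered(table):
    """Share of records on the leading diagonal (imputed == true)."""
    _, _, n = _totals(table)
    return sum(table[i][i] for i in range(len(table))) / n


def _expected_agreement(table):
    """Chance agreement p_e implied by the marginal totals."""
    rows, cols, n = _totals(table)
    return sum(r * c for r, c in zip(rows, cols)) / n ** 2


def kappa(table):
    """Cohen's kappa: (p_o - p_e) / (1 - p_e)."""
    p_o, p_e = proportion_recovered(table), _expected_agreement(table)
    return (p_o - p_e) / (1 - p_e)


def kappa_variance(table):
    """Assumed form: Cohen's approximate large-sample variance of kappa."""
    _, _, n = _totals(table)
    p_o, p_e = proportion_recovered(table), _expected_agreement(table)
    return p_o * (1 - p_o) / (n * (1 - p_e) ** 2)


def _solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [[float(v) for v in a[i]] + [float(b[i])] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular matrix: Stuart-Maxwell statistic undefined")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


def stuart_maxwell(table):
    """Return (statistic, degrees of freedom) for marginal homogeneity.

    The statistic is d' S^-1 d over the first k-1 categories and is
    compared with a chi-square distribution on k-1 degrees of freedom.
    """
    k = len(table)
    rows, cols, _ = _totals(table)
    d = [rows[i] - cols[i] for i in range(k - 1)]
    s = [[rows[i] + cols[i] - 2 * table[i][i] if i == j
          else -(table[i][j] + table[j][i])
          for j in range(k - 1)] for i in range(k - 1)]
    x = _solve(s, d)
    return sum(di * xi for di, xi in zip(d, x)), k - 1


# Hypothetical 3-category table: rows = true value, columns = imputed value.
example = [[50, 4, 2], [3, 40, 5], [1, 6, 39]]
print(proportion_recovered(example), kappa(example), kappa_variance(example))
print(stuart_maxwell(example))
```

In this sketch the Stuart-Maxwell statistic cannot be computed when the matrix S is singular, for example when one category has no discordant responses at all; the function then raises a ValueError rather than returning a value.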

5
Fully Controlled Simulation Environment
  • Construction of synthetic data
  • 1. identify reference data
  • 2. analyse and record statistical properties
  • 3. identify further similar data
  • 2001 Area Classifications
  • 4. construct truth deck to mirror reference
    data
  • hierarchical records approx. 170K households and
    400K persons
  • 5. introduce levels and patterns of missingness
    observed in reference data.
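
Step 5 above can be sketched as follows, assuming the missingness observed in the reference data has been summarised as a set of patterns (groups of variables missing together) with relative frequencies. The variable names, patterns and weights below are hypothetical illustrations, not values from the presentation.

```python
import random


def introduce_missingness(truth_deck, pattern_weights, seed=0):
    """Return a perturbed copy of the truth deck.

    truth_deck      : list of dict records (left unchanged)
    pattern_weights : maps a tuple of variable names that are missing
                      together to the relative frequency of that pattern
    seed            : fixed seed so a simulation run can be repeated
    """
    rng = random.Random(seed)
    patterns = list(pattern_weights)
    weights = [pattern_weights[p] for p in patterns]
    perturbed = []
    for record in truth_deck:
        # Draw one missingness pattern for this record.
        pattern = rng.choices(patterns, weights=weights, k=1)[0]
        new_record = dict(record)
        for variable in pattern:
            new_record[variable] = None  # None marks a value to be imputed
        perturbed.append(new_record)
    return perturbed


# Hypothetical person-level truth deck and missingness patterns.
truth = [
    {"sex": "F", "age_group": "30-39", "marital_status": "married"},
    {"sex": "M", "age_group": "20-29", "marital_status": "single"},
    {"sex": "F", "age_group": "40-49", "marital_status": "widowed"},
]
patterns = {
    (): 0.90,                               # fully observed record
    ("age_group",): 0.05,
    ("age_group", "marital_status"): 0.03,  # variables missing together
    ("sex",): 0.02,
}
perturbed = introduce_missingness(truth, patterns, seed=1)
```

Sampling whole patterns rather than blanking each variable independently keeps the joint structure of the missingness. The sketch still assumes the pattern does not depend on the record's values, which may not hold for the real reference data.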

6
Baseline Analysis

7
Limitations of the Statistical Tests

8
Limitations of Statistical Tests

9
Proposed Heuristic
  • Calculate the proportion of true values
    recovered in all cases

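
At record level, the heuristic might be computed as below. The slide does not define "in all cases", so this sketch makes an assumption: the proportion is taken per variable over every cell that was set to missing and then imputed. The function name and the three small decks are hypothetical.

```python
def proportion_recovered_by_variable(truth, perturbed, imputed):
    """Per-variable share of imputed cells that match the truth deck.

    truth     : list of dict records with the true values
    perturbed : same records with None where a value was removed
    imputed   : same records after imputation
    """
    counts = {}  # variable -> (values recovered, values imputed)
    for t, p, i in zip(truth, perturbed, imputed):
        for variable, value in p.items():
            if value is None:  # only score cells that were actually imputed
                hits, total = counts.get(variable, (0, 0))
                counts[variable] = (hits + (i[variable] == t[variable]), total + 1)
    return {v: hits / total for v, (hits, total) in counts.items()}


# Hypothetical decks: three person records, two variables.
truth = [
    {"sex": "F", "age_group": "30-39"},
    {"sex": "M", "age_group": "20-29"},
    {"sex": "F", "age_group": "40-49"},
]
perturbed = [
    {"sex": None, "age_group": "30-39"},
    {"sex": None, "age_group": None},
    {"sex": "F", "age_group": None},
]
imputed = [
    {"sex": "F", "age_group": "30-39"},
    {"sex": "F", "age_group": "20-29"},
    {"sex": "F", "age_group": "30-39"},
]
result = proportion_recovered_by_variable(truth, perturbed, imputed)
```

Because the truth deck is known in a fully controlled simulation, the same function can be called after each repetition of the imputation and the results compared with the baseline.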
10
Example Application of Heuristic
  • Aim is to identify the optimal imputation
    strategy
  • search for groupings amongst variables
  • apply logistic regression to all person level
    variables
  • identify 5 key demographic variables
  • self predicting set
  • but predictive power weak for two
  • subject to repeated imputation and compare to
    baseline.

11
Example Application of Heuristic

12
Example Application of Heuristic

13
Example Application of Heuristic

14
Conclusions and Future Work
  • Evidence of instability in the chosen statistical
    criteria
  • Stuart-Maxwell significance test
  • the threshold at which the test becomes unstable
    depends on the marginal values and the
    distribution of discordant responses.
  • Kappa Coefficient
  • is based on observed and expected values of
    concordant responses (leading diagonal)
  • hence differing values of Kappa when same
    proportion of records recovered in differing
    tables.
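
The Kappa point above can be illustrated with two hypothetical 2x2 tables (rows = true value, columns = imputed value). Both recover 90 of 100 records, but their marginal distributions differ, so the expected agreement and therefore Kappa differ. The tables are invented for illustration and are not taken from the presentation.

```python
def kappa(table):
    """Cohen's kappa from a square table of true (rows) vs imputed (columns)."""
    k = len(table)
    n = sum(sum(row) for row in table)
    rows = [sum(table[i]) for i in range(k)]
    cols = [sum(table[i][j] for i in range(k)) for j in range(k)]
    p_o = sum(table[i][i] for i in range(k)) / n              # observed agreement
    p_e = sum(r * c for r, c in zip(rows, cols)) / n ** 2     # expected agreement
    return (p_o - p_e) / (1 - p_e)


balanced = [[45, 5], [5, 45]]   # margins 50/50, 90% on the leading diagonal
skewed = [[85, 5], [5, 5]]      # margins 90/10, 90% on the leading diagonal

kappa_balanced = kappa(balanced)  # expected agreement 0.50
kappa_skewed = kappa(skewed)      # expected agreement 0.82
```

With these tables Kappa is 0.8 for the balanced case and about 0.44 for the skewed case, although the proportion of true values recovered is 0.9 in both.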

15
Conclusions and Future Work
  • Main aim to construct set of diagnostics to
    facilitate 1000 repetitions of imputation process
    in fully controlled simulation environment.
  • Evidence that the heuristic is the right
    approach
  • Stuart-Maxwell - understand when unstable
  • supported by some categorisation of the
    proportion of true values recovered.
  • Main outcome - need to do more work!!