Methods and software for editing and imputation: recent advancements at Istat

1
Methods and software for editing and imputation
recent advancements at Istat
  • M. Di Zio, U. Guarnera, O. Luzi, A. Manzari
  • ISTAT Italian Statistical Institute

UN/ECE Work Session on Statistical Data Editing,
Ottawa, 16-18 May 2005
2
Outline
  • Introduction
  • Editing: Finite Mixture Models for continuous
    data
  • Imputation: Bayesian Networks for categorical
    data
  • Imputation: Quis system for continuous data
  • E&I: Data Clustering for improving the search for
    donors in the Diesis system

3
Recent advancements at Istat
  • To reduce waste of resources and to
    disseminate best practices, efforts were
    directed along two lines:
  • identifying methodological solutions for some
    common types of errors
  • providing survey practitioners with generalized
    tools to facilitate the adoption of new
    methods and increase process standardization

4
Editing: Identifying systematic unity measure
errors (UME)
  • A UME occurs when the true value of a variable
    $X_j$ is reported on the wrong scale (e.g. $X_j \cdot C$
    with $C = 100$, $C = 1{,}000$, and so on)

5
  • Finite Mixture Models of Normal Distributions
  • Probabilistic clustering based on the assumption
    that observations come from a mixture of a finite
    number of populations or groups $G_g$ in various
    proportions $\pi_g$
  • Given some parametric form for the density
    function in each group, maximum likelihood
    estimates can be obtained for the unknown
    parameters

6
  • Finite Mixture Models for UME
  • Given $q$ variables $X_1, \dots, X_q$, the $h = 2^q$ possible
    clusters (mixture components) correspond to
    groups of units with different subsets of items
    affected by UME (error patterns)
  • Assuming that valid data are normally distributed
    and using a log scale, each cluster is
    characterized by a p.d.f. $f_g(y; \theta) \sim MN(\mu_g, \Sigma)$,
    where $\mu_g$ is translated by a known vector and $\Sigma$
    is constant for all clusters
  • Units are assigned to clusters based on their
    posterior probability $\tau_g(y_i; \theta, \pi)$
7
Model diagnostics used to prioritise units for
manual check
  • The atypicality index identifies outliers
    with respect to the defined model (e.g. units possibly
    affected by errors other than the UME)
  • The classification probabilities $\tau_g(y_i; \theta, \pi)$
    identify possibly misclassified units. They
    can be used directly to identify
    misclassifications that may be influential
    on target estimates (significance editing)

8
Main findings
  • Finite Mixture Modelling allows multivariate,
    non-hierarchical data analysis. The costs of
    developing ad hoc procedures are saved
  • Finite Mixture Modelling produces highly reliable
    automatic data clustering/error localization
  • Model diagnostics can be used to reduce
    the cost of manual editing
  • The approach is robust to moderate departures
    from normality
  • The number of model parameters is limited by the
    model constraints on $\mu$ and $\Sigma$

9
Imputation: Bayesian Networks for categorical
variables
  • The idea of using BNs for imputation was first
    proposed by Thibaudeau and Winkler (2002)
  • Let $C_1, \dots, C_n$ be a set of categorical variables,
    each having a finite set of mutually exclusive
    states
  • BNs represent the joint distribution of the
    variables both graphically and numerically
  • A BN can be viewed as a Directed Acyclic Graph
    together with an inferential engine that allows
    inferences on the distribution parameters

10
Graphical representation of BNs
  • Each variable $C_j$ with parents $Pa(C_j)$ has an
    attached conditional probability $P(C_j \mid Pa(C_j))$
  • BNs factorize the joint probability
    distribution $P(C_1, \dots, C_n)$ so that
  • $P(C_1, \dots, C_n) = \prod_{j=1}^{n} P(C_j \mid Pa(C_j))$
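
One consequence of this factorization is a smaller number of free parameters than in the unrestricted joint table. The sketch below counts both for a small network. The variable names, the number of states and the chain structure are made up for illustration, and the helper names are hypothetical.

```python
from math import prod

def full_joint_params(cards):
    """Free parameters of an unrestricted joint table over all variables."""
    return prod(cards.values()) - 1

def bn_params(cards, parents):
    """Free parameters of the factorized form: for each node,
    (number of states - 1) for every configuration of its parents."""
    return sum(
        (cards[v] - 1) * prod(cards[p] for p in parents[v]) for v in cards
    )

# Illustrative chain A -> B -> C with 2, 3 and 2 states.
cards = {"A": 2, "B": 3, "C": 2}
parents = {"A": [], "B": ["A"], "C": ["B"]}
n_full = full_joint_params(cards)   # 2 * 3 * 2 - 1
n_bn = bn_params(cards, parents)    # 1 + 2 * 2 + 1 * 3
```

The gap between the two counts widens quickly as variables are added, provided each node keeps only a few parents.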

11
BNs and imputation method 1
  • Order variables according to their reliability
  • Estimate the network conditioned on this order
  • Estimate the conditional probabilities for each
    node according to (2)
  • Impute each missing item by a random draw from
    its conditional probability distribution
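
The last two steps can be sketched as below, assuming the network structure (the parents of each node) has already been chosen. Conditional probabilities are estimated here by simple relative frequencies over complete records, which is an assumption of this sketch and not necessarily the estimator used by the authors. The function names and the toy variables `area` and `job` are hypothetical.

```python
import random
from collections import Counter, defaultdict

def fit_cpt(records, child, parents):
    """Count child states for each parent configuration, using only
    records where the child and all its parents are observed."""
    counts = defaultdict(Counter)
    for r in records:
        if r[child] is not None and all(r[p] is not None for p in parents):
            counts[tuple(r[p] for p in parents)][r[child]] += 1
    return counts

def impute_from_parents(record, child, parents, cpt, rng):
    """Draw a value for `child` from its estimated conditional
    distribution given the observed parent values."""
    key = tuple(record[p] for p in parents)
    if key not in cpt:
        return None  # parent configuration never observed: leave missing
    states, freqs = zip(*cpt[key].items())
    return rng.choices(states, weights=freqs, k=1)[0]

records = [
    {"area": "N", "job": "x"},
    {"area": "N", "job": "x"},
    {"area": "S", "job": "y"},
    {"area": "N", "job": None},
]
cpt = fit_cpt(records, "job", ["area"])
rng = random.Random(0)
imputed = impute_from_parents(records[3], "job", ["area"], cpt, rng)
```

Variables would be processed in the reliability order chosen in the first step, so that a node's parents are already observed or imputed when the node itself is drawn.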

12
BNs and imputation methods 2/3
  • In a multivariate context it is more convenient to
    use information coming not only from the parents but
    also from the children. This can be done by
    using the Markov blanket (Mb)
  • $Mb(X) = Pa(X) \cup Ch(X) \cup Pa(Ch(X))$
  • In this case, the conditional probabilities for
    each node are estimated w.r.t. its Mb
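
For a network stored as a mapping from each node to its parents, the Markov blanket can be computed as in this short sketch. The representation and the function name `markov_blanket` are assumptions made for illustration.

```python
def markov_blanket(node, parents):
    """Parents, children and the children's other parents of `node`.

    parents: dict mapping every node to the list of its parents.
    """
    children = {c for c, ps in parents.items() if node in ps}
    co_parents = {p for c in children for p in parents[c]}
    return (set(parents.get(node, ())) | children | co_parents) - {node}

# Illustrative DAG: A -> C, B -> C, C -> D
dag = {"A": [], "B": [], "C": ["A", "B"], "D": ["C"]}
mb_a = markov_blanket("A", dag)  # child C and its other parent B
mb_c = markov_blanket("C", dag)  # parents A, B and child D
```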

13
Main findings
  • BNs express the joint probability
    distribution with a dramatic decrease in the number of
    parameters to be estimated (reduction of
    complexity)
  • BNs can estimate the relationships between
    variables that are really informative for
    predicting values
  • Parametric models like BNs are efficient in terms
    of preservation of joint distributions
  • The graphical representation facilitates
    modelling
  • BNs and hot deck methods behave in the same way
    only when the hot deck is stratified
    according to variables that exactly explain the
    missing-data mechanism

14
Imputation: Quis system for continuous variables
  • Quis (QUick Imputation System) is a generalized SAS
    tool developed at Istat to impute
    continuous survey data in a unified environment
  • Given a set of variables subject to non-response,
    different methods can be used in a completely
    integrated way:
  • Regression Imputation via EM algorithm
  • Nearest Neighbour Donor Imputation (NND)
  • Multivariate Predictive Mean Matching (PMM)
15
Regression imputation via EM
  • In the context of imputation, the EM algorithm is
    used to obtain Maximum Likelihood estimates,
    in the presence of missing data, of the parameters of
    the model assumed for the data
  • Assumptions:
  • MAR (missing at random) mechanism
  • Normality

16
Regression imputation via EM
  • Once ML estimates of parameters have been
    obtained, missing data can be imputed in two
    different ways
  • directly through expectations of missing values
    conditional on observed ones (predictive means)
  • by adding a normal random residual to the
    predictive means (i.e. drawing values from the
    conditional distributions of missing values)
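
Both options can be sketched for a single record under a multivariate normal model. The sketch assumes that the mean vector `mu` and covariance matrix `sigma` have already been estimated (for instance by EM) and does not run EM itself. Missing items are marked with NaN, and the function name `impute_conditional` is hypothetical, not part of Quis.

```python
import numpy as np

def impute_conditional(y, mu, sigma, rng=None):
    """Impute the NaN entries of y under a normal model with known mu, sigma.

    rng=None  -> predictive mean (conditional expectation)
    rng given -> predictive mean plus a normal random residual
    """
    miss = np.isnan(y)
    if not miss.any():
        return y.copy()
    obs = ~miss
    s_oo = sigma[np.ix_(obs, obs)]
    s_mo = sigma[np.ix_(miss, obs)]
    s_mm = sigma[np.ix_(miss, miss)]
    coef = s_mo @ np.linalg.inv(s_oo)        # regression coefficients
    cond_mean = mu[miss] + coef @ (y[obs] - mu[obs])
    cond_cov = s_mm - coef @ s_mo.T          # residual covariance
    out = y.copy()
    if rng is None:
        out[miss] = cond_mean
    else:
        out[miss] = rng.multivariate_normal(cond_mean, cond_cov)
    return out

mu = np.array([0.0, 0.0])
sigma = np.array([[1.0, 0.5], [0.5, 1.0]])
y = np.array([2.0, np.nan])
by_mean = impute_conditional(y, mu, sigma)
by_draw = impute_conditional(y, mu, sigma, rng=np.random.default_rng(0))
```

Imputing predictive means alone understates the variability of the imputed variable, which is the usual reason for preferring the random draw when distributions matter.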

17
Multivariate Predictive Mean Matching (PMM)
  • Let $Y = (Y_1, \dots, Y_q)$ be a set of variables subject
    to non-response
  • ML estimates of the parameters $\theta$ of the joint
    distribution of $Y$ are derived via EM
  • For each pattern of missing data $y_{miss}$, the
    parameters of the corresponding conditional
    distribution are estimated starting from $\theta$ (sweep
    operator)
  • For each unit $u_i$, the predictive mean based on
    the estimated parameters is computed
  • For each unit with missing data, imputation is
    done using the nearest donor w.r.t. the
    predictive mean
  • The Mahalanobis distance is used to find donors
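
A minimal sketch of these steps for one recipient is given below. It assumes `mu` and `sigma` are the ML estimates, that donors are fully observed, and that the Mahalanobis distance between predictive means uses the residual covariance of the missing variables. The slides do not state which covariance matrix is used, so that choice is an assumption. The conditional parameters are obtained by direct matrix algebra in place of the sweep operator, and the function name `pmm_impute` is hypothetical.

```python
import numpy as np

def pmm_impute(recipient, donors, mu, sigma):
    """Impute the NaN entries of `recipient` with the observed values of
    the donor whose predictive mean is closest to the recipient's."""
    miss = np.isnan(recipient)
    obs = ~miss
    s_oo = sigma[np.ix_(obs, obs)]
    s_mo = sigma[np.ix_(miss, obs)]
    s_mm = sigma[np.ix_(miss, miss)]
    coef = s_mo @ np.linalg.inv(s_oo)
    resid_cov = s_mm - coef @ s_mo.T
    # Predictive means of the missing variables, for recipient and donors.
    pred_r = mu[miss] + coef @ (recipient[obs] - mu[obs])
    pred_d = mu[miss] + (donors[:, obs] - mu[obs]) @ coef.T
    diff = pred_d - pred_r
    d2 = np.einsum("ni,ij,nj->n", diff, np.linalg.inv(resid_cov), diff)
    best = int(np.argmin(d2))
    out = recipient.copy()
    out[miss] = donors[best, miss]   # copy the donor's observed values
    return out, best

mu = np.array([0.0, 0.0])
sigma = np.array([[1.0, 0.5], [0.5, 1.0]])
donors = np.array([[0.0, 0.3], [2.1, 1.4], [-1.0, -0.2]])
imputed, donor_index = pmm_impute(np.array([2.0, np.nan]), donors, mu, sigma)
```

Unlike regression imputation, the imputed value is always one actually observed on a donor, so it cannot fall outside the range of real data.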

18
Data clustering for improving the search for
donors in the Diesis system
  • The DIESIS system has been developed at ISTAT for
    treating the demographic variables of the 2001
    Population Census
  • Diesis uses both the data-driven and the minimum
    change approach for editing and imputation
  • For each failed household, the set of potential
    donors contains only the nearest passed
    households
  • The adopted distance function is a weighted sum
    of the distances for each demographic variable
    over all the individuals within the household

19
The approach in use for donor search
  • For each failed household $e$, potential donors
    should be identified by searching
    the set $D$ of all passed households
  • When $D$ is very large, as in the case of a Census,
    computing the distance between each $e$
    and all $d \in D$ (exhaustive search) could require
    unacceptable computational time
  • The sub-optimal search in use stops before the
    entire set $D$ has been examined, according to
    some stopping criteria. This
    solution does not guarantee the selection of the
    potential donors having the actual minimum distance
    from $e$

20
The new approach for donor search
  • To reduce the number of passed
    households to examine, the set of passed
    households $D$ is first divided into
    smaller homogeneous subsets $D_1, \dots, D_n$
    ($D_1 \cup \dots \cup D_n = D$)
  • This subdivision is obtained by solving an
    unsupervised clustering problem (donor search
    guided by clustering)
  • The search for potential donors is then
    conducted, for each failed household $e$, by
    examining only the households within the
    cluster(s) most similar to $e$
21
Main findings
  • The donor search guided by clustering reduces
    computational time while preserving the E&I quality
    obtained by the exhaustive search
  • Compared with the sub-optimal search, the donor
    search guided by clustering increases
    the proportion of actual minimum distance donors
    selected
    (this is especially useful for households with an
    uncommon structure, for which few passed
    households are generally available)