Title: Methods and software for editing and imputation: recent advancements at Istat
1 Methods and software for editing and imputation: recent advancements at Istat
- M. Di Zio, U. Guarnera, O. Luzi, A. Manzari
- ISTAT Italian Statistical Institute
UN/ECE Work Session on Statistical Data Editing, Ottawa, 16-18 May 2005
2 Outline
- Introduction
- Editing: Finite Mixture Models for continuous data
- Imputation: Bayesian Networks for categorical data
- Imputation: Quis system for continuous data
- E&I: Data Clustering for improving the search for donors in the Diesis system
3 Recent advancements at Istat
- In order to reduce waste of resources and to disseminate best practices, efforts were made in two directions:
  - identifying methodological solutions for some common types of errors
  - providing survey practitioners with generalized tools, in order to facilitate the adoption of new methods and increase the standardization of processes
4 Editing: Identifying systematic unity measure errors (UME)
- A UME occurs when the true value of a variable $X_j$ is reported in a wrong scale (e.g. $X_j \cdot C$, with $C = 100$, $C = 1{,}000$, and so on)
5 Finite Mixture Models of Normal Distributions
- Probabilistic clustering based on the assumption that observations come from a mixture of a finite number of populations or groups $G_g$ in various proportions $\pi_g$
- Given some parametric form for the density function in each group, maximum likelihood estimates can be obtained for the unknown parameters
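Written out, the mixture density and the log-likelihood that is maximised would take the following form. This is a sketch in notation introduced here rather than taken from the slide: $h$ is the number of groups, $\pi_g$ the mixing proportions, $f_g$ the density of group $G_g$ with parameters $\theta_g$, and $n$ the number of observed units.

```latex
\[
f(y;\theta,\pi) = \sum_{g=1}^{h} \pi_g \, f_g(y;\theta_g),
\qquad \pi_g \ge 0, \quad \sum_{g=1}^{h} \pi_g = 1
\]
\[
\ell(\theta,\pi) = \sum_{i=1}^{n} \log \sum_{g=1}^{h} \pi_g \, f_g(y_i;\theta_g)
\]
```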
6 Finite Mixture Models for UME
- Given $q$ variables $X_1, \dots, X_q$, the $h = 2^q$ possible clusters (mixture components) correspond to groups of units with different subsets of items affected by UME (error patterns)
- Assuming that valid data are normally distributed and using a log scale, each cluster is characterized by a p.d.f. $f_g(y; \theta) \sim MN(\mu_g, \Sigma)$, where $\mu_g$ is translated by a known vector and $\Sigma$ is constant for all clusters
- Units are assigned to clusters based on their posterior probability $\tau_g(y_i; \theta, \pi)$
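The constrained mixture can be illustrated with a minimal EM sketch for the simplest case: a single variable (so two clusters, "no error" and "UME"), an assumed error factor of 1,000 (a known shift of 3 on the log10 scale), and a variance shared by both clusters. The function name `fit_ume_mixture`, the starting values and the simulated data are all hypothetical; this is not the procedure used at Istat, only an instance of the same idea.

```python
import math
import random

def fit_ume_mixture(y, shift, n_iter=50):
    """EM for a two-component normal mixture on log10 data.

    Component 0 (no error) has mean mu; component 1 (UME) has mean
    mu + shift, where shift is known. Both share the same variance.
    """
    mu = sorted(y)[len(y) // 2]   # start from the median
    var = 1.0
    pi1 = 0.5                     # mixing proportion of the UME cluster
    tau = [0.0] * len(y)
    n = len(y)
    for _ in range(n_iter):
        # E-step: posterior probability of the UME cluster for each unit.
        # The normalising constant is common to both densities and cancels.
        for i, yi in enumerate(y):
            d0 = (1 - pi1) * math.exp(-(yi - mu) ** 2 / (2 * var))
            d1 = pi1 * math.exp(-(yi - mu - shift) ** 2 / (2 * var))
            tau[i] = d1 / (d0 + d1)
        # M-step under the constraints (known shift, common variance)
        pi1 = sum(tau) / n
        mu = sum(yi - t * shift for yi, t in zip(y, tau)) / n
        var = sum((1 - t) * (yi - mu) ** 2 + t * (yi - mu - shift) ** 2
                  for yi, t in zip(y, tau)) / n
    return mu, var, pi1, tau

# Simulated log10 values: about 20% of units reported in the wrong scale
rng = random.Random(0)
truth = [rng.random() < 0.2 for _ in range(200)]
y = [rng.gauss(2.0, 0.3) + (3.0 if err else 0.0) for err in truth]

mu, var, pi1, tau = fit_ume_mixture(y, shift=3.0)
flagged = [t > 0.5 for t in tau]  # assign each unit to its most probable cluster
print(round(mu, 2), round(pi1, 2), sum(flagged))
```

Because only `mu`, `var` and `pi1` are estimated, the parameter count stays small however many error patterns are considered, which is the point of constraining the means and the covariance.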
7 Model diagnostics used to prioritise units for manual check
- The Atypicality Index identifies outliers w.r.t. the defined model (e.g. units possibly affected by errors other than the UME)
- The classification probabilities $\tau_g(y_i; \theta, \pi)$ identify possibly misclassified units. They can be used directly to identify misclassifications that are possibly influential on target estimates (significance editing)
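One simple way to turn classification probabilities into a review list is sketched below: units whose largest posterior probability falls below a cut-off are returned, most uncertain first. The function name `units_to_review` and the 0.9 threshold are hypothetical choices, and the sketch ignores the influence of each unit on target estimates, which a significance-editing score would also take into account.

```python
def units_to_review(posteriors, threshold=0.9):
    """Return indices of units whose largest posterior probability is
    below `threshold` (a hypothetical cut-off), most uncertain first."""
    uncertain = [(max(p), i) for i, p in enumerate(posteriors) if max(p) < threshold]
    return [i for _, i in sorted(uncertain)]

# One row per unit, one column per cluster (illustrative values)
posteriors = [
    [0.99, 0.01],   # clearly in cluster 0
    [0.55, 0.45],   # ambiguous
    [0.02, 0.98],   # clearly in cluster 1
    [0.80, 0.20],   # somewhat uncertain
]
print(units_to_review(posteriors))  # [1, 3]
```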
8 Main findings
- Finite Mixture Modelling allows multivariate, non-hierarchical data analysis. Costs for developing ad hoc procedures are saved
- Finite Mixture Modelling produces highly reliable automatic data clustering/error localization
- Model diagnostics can be used to reduce the cost of manual editing
- The approach is robust to moderate departures from normality
- The number of model parameters is limited by the model constraints on $\mu$ and $\Sigma$
9 Imputation: Bayesian Networks for categorical variables
- The use of BNs for imputation was first proposed by Thibaudeau and Winkler (2002)
- Let $C_1, \dots, C_n$ be a set of categorical variables, each having a finite set of mutually exclusive states
- BNs represent the joint distribution of the variables both graphically and numerically
- A BN can be viewed as:
  - a Directed Acyclic Graph, and
  - an inferential engine that allows inferences on the distribution parameters
10 Graphical representation of BNs
- To each variable $C_j$ with parents $Pa(C_j)$ there is attached a conditional probability $P(C_j \mid Pa(C_j))$
- BNs allow the joint probability distribution $P(C_1, \dots, C_n)$ to be factorized so that
  $P(C_1, \dots, C_n) = \prod_{j=1}^{n} P(C_j \mid Pa(C_j))$
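The factorization can be checked numerically on a toy network. In the sketch below the three-node graph (C1 is the only parent of C2 and of C3), the state labels and all probabilities are made up for illustration.

```python
from itertools import product

# Hypothetical DAG: C1 -> C2 and C1 -> C3
parents = {"C1": (), "C2": ("C1",), "C3": ("C1",)}

# cpt[var][parent_states][state] = P(var = state | parents = parent_states)
cpt = {
    "C1": {(): {"a": 0.6, "b": 0.4}},
    "C2": {("a",): {"x": 0.7, "y": 0.3}, ("b",): {"x": 0.2, "y": 0.8}},
    "C3": {("a",): {"u": 0.5, "v": 0.5}, ("b",): {"u": 0.9, "v": 0.1}},
}

def joint_probability(assignment):
    """Multiply P(Cj | Pa(Cj)) over all nodes for one full assignment."""
    prob = 1.0
    for var, pa in parents.items():
        pa_states = tuple(assignment[p] for p in pa)
        prob *= cpt[var][pa_states][assignment[var]]
    return prob

# The factorized probabilities sum to 1 over all 8 state combinations
total = sum(
    joint_probability(dict(zip(parents, states)))
    for states in product("ab", "xy", "uv")
)
print(joint_probability({"C1": "a", "C2": "x", "C3": "u"}), total)
```

In this toy case the factorized form needs 5 free parameters (1 for C1, 2 each for C2 and C3) instead of the 7 required by an unrestricted joint table over three binary variables.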
11 BNs and imputation: method 1
- Order variables according to their reliability
- Estimate the network conditioned on this order
- Estimate the conditional probabilities for each node according to (2)
- Impute each missing item by a random draw from its conditional probability distribution
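The last step can be sketched as follows, assuming the network and its conditional probability tables are already available. The two-node network, the probabilities and the function name `impute_record` are hypothetical; in practice the tables would be estimated from the data.

```python
import random

order = ["C1", "C2"]                    # most reliable variable first
parents = {"C1": (), "C2": ("C1",)}     # hypothetical DAG: C1 -> C2
cpt = {
    "C1": {(): {"a": 0.6, "b": 0.4}},
    "C2": {("a",): {"x": 0.7, "y": 0.3}, ("b",): {"x": 0.2, "y": 0.8}},
}

def impute_record(record, rng):
    """Fill None entries by a random draw from P(node | parents),
    visiting nodes in network order so that parents are completed first."""
    completed = dict(record)
    for var in order:
        if completed[var] is None:
            pa_states = tuple(completed[p] for p in parents[var])
            dist = cpt[var][pa_states]
            states = list(dist)
            weights = [dist[s] for s in states]
            completed[var] = rng.choices(states, weights=weights, k=1)[0]
    return completed

rng = random.Random(42)
print(impute_record({"C1": "b", "C2": None}, rng))
```

Observed items are left untouched; only the missing ones are drawn, each conditional on the observed or already imputed states of its parents.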
12 BNs and imputation: methods 2/3
- In a multivariate context it is more convenient to use information coming not only from the parents but also from the children. This can be done by using the Markov Blanket (Mb):
  $Mb(X) = Pa(X) \cup Ch(X) \cup Pa(Ch(X))$
- In this case, for each node the conditional probabilities are estimated w.r.t. its Mb
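Given the parent sets of a DAG, the Markov blanket of a node can be computed directly from its definition (parents, children, and the other parents of the children). The graph below and the function name `markov_blanket` are illustrative only.

```python
def markov_blanket(node, parents):
    """Return the parents, the children and the children's other parents."""
    children = {v for v, pa in parents.items() if node in pa}
    blanket = set(parents[node]) | children
    for child in children:
        blanket |= set(parents[child])
    blanket.discard(node)   # the node itself is not part of its blanket
    return blanket

# Hypothetical DAG: A -> C, B -> C, C -> D, E -> D
parents = {"A": (), "B": (), "C": ("A", "B"), "D": ("C", "E"), "E": ()}
print(sorted(markov_blanket("C", parents)))  # ['A', 'B', 'D', 'E']
```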
13 Main findings
- BNs allow the joint probability distribution to be expressed with a dramatic decrease in the number of parameters to be estimated (reduction of complexity)
- BNs can estimate the relationships between variables that are truly informative for predicting values
- Parametric models like BNs are efficient in terms of preservation of joint distributions
- The graphical representation facilitates modelling
- BNs and hot deck methods behave in the same way only when the hot deck is stratified according to variables that exactly explain the missing-data mechanism
14 Imputation: Quis system for continuous variables
- Quis (QUick Imputation System) is a generalized SAS tool developed at Istat to impute continuous survey data in a unified environment
- Given a set of variables subject to non-response, different methods can be used in a completely integrated way:
  - Regression Imputation via EM algorithm
  - Nearest Neighbour Donor Imputation (NND)
  - Multivariate Predictive Mean Matching (PMM)
15 Regression imputation via EM
- In the context of imputation, the EM algorithm is used to obtain Maximum Likelihood estimates, in the presence of missing data, of the parameters of the model assumed for the data
- Assumptions:
  - MAR mechanism
  - Normality
16 Regression imputation via EM
- Once ML estimates of the parameters have been obtained, missing data can be imputed in two different ways:
  - directly through expectations of missing values conditional on observed ones (predictive means)
  - by adding a normal random residual to the predictive means (i.e. drawing values from the conditional distributions of missing values)
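The two options can be compared on a bivariate normal example in which Y2 is missing and Y1 is observed. The parameter values below merely stand in for ML estimates that EM would provide, and the function names are hypothetical; this is not Quis code.

```python
import math
import random

def conditional_params(mu, sigma, y1):
    """Mean and variance of Y2 given Y1 = y1 for a bivariate normal.

    mu = (mu1, mu2); sigma = ((s11, s12), (s12, s22)).
    """
    beta = sigma[0][1] / sigma[0][0]
    mean = mu[1] + beta * (y1 - mu[0])
    var = sigma[1][1] - beta * sigma[0][1]
    return mean, var

def impute(mu, sigma, y1, rng=None):
    """Return the predictive mean, or a random draw from the conditional
    distribution when a random generator is supplied."""
    mean, var = conditional_params(mu, sigma, y1)
    if rng is None:
        return mean
    return mean + rng.gauss(0.0, math.sqrt(var))

mu = (10.0, 20.0)                   # assumed estimates, for illustration
sigma = ((4.0, 3.0), (3.0, 9.0))
print(impute(mu, sigma, y1=12.0))                     # 21.5 (predictive mean)
print(impute(mu, sigma, y1=12.0, rng=random.Random(0)))  # predictive mean plus residual
```

Imputing predictive means reproduces the regression line exactly and therefore understates variability, whereas adding the residual draws values from the conditional distribution itself.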
17 Multivariate Predictive Mean Matching (PMM)
- Let $Y = (Y_1, \dots, Y_q)$ be a set of variables subject to non-response
- ML estimates of the parameters $\theta$ of the joint distribution of $Y$ are derived via EM
- For each pattern of missing data $y_{miss}$, the parameters of the corresponding conditional distribution are estimated starting from $\theta$ (sweep operator)
- For each unit $u_i$ the predictive mean based on the estimated parameters is computed
- For each unit with missing data, imputation is done using the nearest donor w.r.t. the predictive mean
- The Mahalanobis distance is adopted to find donors
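The matching step can be sketched for two jointly missing variables, assuming the predictive means have already been computed for the recipient and for every donor. The covariance matrix used in the distance, the donor values and the function names are all assumptions made for illustration.

```python
def mahalanobis_sq(a, b, s):
    """Squared Mahalanobis distance between 2-vectors a and b, with a
    symmetric 2x2 covariance matrix s = ((s11, s12), (s12, s22))."""
    det = s[0][0] * s[1][1] - s[0][1] * s[1][0]
    d0, d1 = a[0] - b[0], a[1] - b[1]
    # d' S^{-1} d written out with the closed-form 2x2 inverse
    return (s[1][1] * d0 * d0 - 2 * s[0][1] * d0 * d1 + s[0][0] * d1 * d1) / det

def pmm_impute(recipient_mean, donors, s):
    """donors: list of (predictive_mean, observed_values) pairs.
    Return the observed values of the donor with the closest predictive mean."""
    _, observed = min(donors, key=lambda d: mahalanobis_sq(recipient_mean, d[0], s))
    return observed

s = ((1.0, 0.5), (0.5, 2.0))        # assumed covariance matrix
donors = [
    ((10.0, 5.0), (10.4, 4.1)),
    ((12.0, 7.0), (11.7, 7.9)),
    ((15.0, 9.0), (15.2, 8.8)),
]
print(pmm_impute((12.5, 7.2), donors, s))  # (11.7, 7.9)
```

The imputed values are the donor's observed values, not the predictive mean itself, so every imputed record contains values that actually occur in the data.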
18 Data clustering for improving the search for donors in the Diesis system
- The DIESIS system has been developed at ISTAT for treating the demographic variables of the 2001 Population Census
- Diesis uses both the data-driven and the minimum-change approach for editing and imputation
- For each failed household, the set of potential donors contains only the nearest passed households
- The adopted distance function is a weighted sum of the distances for each demographic variable over all the individuals within the household
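A distance of this kind might look like the sketch below. The per-variable distance is taken here to be a simple 0/1 mismatch, individuals are assumed to be aligned position by position in households of equal size, and the variable names and weights are invented; the actual Diesis distance may differ on all of these points.

```python
def household_distance(e, d, weights):
    """Weighted sum, over individuals and variables, of 0/1 mismatches.

    e, d: lists of individuals (dicts), assumed to have the same length
    and to be aligned position by position (a simplifying assumption).
    """
    return sum(
        w * (pe[var] != pd[var])
        for pe, pd in zip(e, d)
        for var, w in weights.items()
    )

weights = {"sex": 1.0, "marital_status": 2.0, "age_class": 1.5}  # hypothetical
failed = [
    {"sex": "F", "marital_status": "married", "age_class": "30-39"},
    {"sex": "M", "marital_status": "single", "age_class": "30-39"},
]
passed = [
    {"sex": "F", "marital_status": "married", "age_class": "30-39"},
    {"sex": "M", "marital_status": "married", "age_class": "40-49"},
]
print(household_distance(failed, passed, weights))  # 3.5
```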
19 The approach currently in use for donor search
- For each failed household $e$, the identification of potential donors should be made by searching within the set of all passed households $D$
- When $D$ is very large, as in the case of a Census, computing the distance between each $e$ and all $d \in D$ (exhaustive search) could require unacceptable computational time
- The sub-optimal search currently in use stops before the entire set $D$ has been examined, according to some stopping criteria. This solution does not guarantee the selection of the potential donors having the actual minimum distance from $e$
20 The new approach for donor search
- In order to reduce the number of passed households to examine, the set of passed households $D$ is preliminarily divided into smaller homogeneous subsets $D_1, \dots, D_n$ ($D_1 \cup \dots \cup D_n = D$)
- Such a subdivision is obtained by solving an unsupervised clustering problem (donor search guided by clustering)
- The search for the potential donors is then conducted, for each failed household $e$, by examining only the households within the cluster(s) most similar to $e$
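The idea can be sketched by comparing an exhaustive search with a search restricted to the closest cluster. Households are reduced here to numeric vectors, the squared Euclidean distance stands in for the Diesis distance, and the clusters with their centroids are assumed to have been computed beforehand; all names and values are hypothetical.

```python
def sq_dist(a, b):
    """Squared Euclidean distance (a stand-in for the real distance)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def exhaustive_search(e, donors):
    """Examine every passed household."""
    return min(donors, key=lambda d: sq_dist(e, d))

def clustered_search(e, clusters):
    """clusters: list of (centroid, members) pairs. Only the cluster
    whose centroid is closest to e is examined."""
    _, members = min(clusters, key=lambda c: sq_dist(e, c[0]))
    return min(members, key=lambda d: sq_dist(e, d))

clusters = [
    ((1.0, 1.0), [(0.0, 1.0), (1.0, 2.0), (2.0, 0.0)]),
    ((8.0, 8.0), [(7.0, 8.0), (9.0, 9.0), (8.0, 7.0)]),
]
all_donors = [d for _, members in clusters for d in members]

e = (7.2, 7.9)
print(clustered_search(e, clusters))      # (7.0, 8.0)
print(exhaustive_search(e, all_donors))   # (7.0, 8.0)
```

Here the clustered search finds the same donor after examining half of the passed households. For a failed household lying near a cluster boundary, examining a single cluster may miss the true nearest donor, which is presumably why the slide speaks of the most similar cluster(s) in the plural.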
21 Main findings
- The donor search guided by clustering reduces computational times while preserving the E&I quality obtained by the exhaustive search
- The donor search guided by clustering increases the proportion of actual minimum-distance donors selected with respect to the sub-optimal search (this is especially useful for households with an uncommon structure, for which few passed households are generally available)