Methods and software for editing and imputation: recent advancements at Istat

1
Methods and software for editing and imputation:
recent advancements at Istat

M. Di Zio, U. Guarnera, O. Luzi, A. Manzari
ISTAT, Italian Statistical Institute

UN/ECE Work Session on Statistical Data Editing
Ottawa, 16-18 May 2005
2
Outline

Introduction
Editing: Finite Mixture Models for continuous
data
Imputation: Bayesian Networks for categorical
data
Imputation: Quis system for continuous data
E&I: Data Clustering for improving the search of
donors in the Diesis system

3
Recent advancements at Istat

In order to reduce waste of resources and to
disseminate best practices, efforts have been
directed along two lines:
identifying methodological solutions for some
common types of errors
providing survey practitioners with generalized
tools, in order to facilitate the adoption of new
methods and increase process standardization

4
Editing: Identifying systematic unity measure
errors (UME)

A UME occurs when the true value of a variable
$X_j$ is reported in a wrong scale (e.g. $X_j \cdot C$,
with $C = 100$, $C = 1{,}000$, and so on)

5
Finite Mixture Models of Normal Distributions

Probabilistic clustering based on the assumption
that observations come from a mixture of a finite
number of populations or groups $G_g$ in various
proportions $\pi_g$
Given some parametric form for the density
function in each group, maximum likelihood
estimates can be obtained for the unknown
parameters

6
Finite Mixture Models for UME

Given q variables $X_1, \ldots, X_q$, the $h = 2^q$ possible
clusters (mixture components) correspond to
groups of units with different subsets of items
affected by UME (error patterns)
Assuming that valid data are normally distributed
and using a log scale, each cluster is
characterized by a p.d.f. $f_g(y; \theta) \sim MN(\mu_g, \Sigma)$,
where $\mu_g$ is translated by a known vector and $\Sigma$
is constant for all clusters
Units are assigned to clusters based on their
posterior probability $\tau_g(y_i; \theta, \pi)$
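
The cluster assignment described above can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: it assumes the model parameters have already been estimated (for example by EM), that data are on a log10 scale, and that the unit error is a factor of 1,000. The names `ume_posteriors`, `mu`, `sigma`, `weights` and `log_factor` are hypothetical.

```python
import itertools
import numpy as np

def ume_posteriors(y, mu, sigma, weights, log_factor=np.log10(1000.0)):
    """Posterior probability of each error pattern for log-scale data y (n x q).

    mu, sigma: mean vector and covariance of error-free data on the log scale.
    weights: mixing proportions, one per pattern, in itertools.product order.
    log_factor: known shift caused by the unit error (assumed factor of 1,000).
    """
    y = np.atleast_2d(np.asarray(y, dtype=float))
    mu = np.asarray(mu, dtype=float)
    q = y.shape[1]
    # All 2^q error patterns: 1 marks a variable affected by the UME.
    patterns = np.array(list(itertools.product([0, 1], repeat=q)))
    inv = np.linalg.inv(np.asarray(sigma, dtype=float))
    log_dens = np.empty((y.shape[0], len(patterns)))
    for g, pattern in enumerate(patterns):
        # Cluster mean = error-free mean translated by a known vector.
        diff = y - (mu + pattern * log_factor)
        # Gaussian log-density up to a constant shared by all clusters,
        # because the covariance is the same in every cluster.
        log_dens[:, g] = -0.5 * np.einsum("ij,jk,ik->i", diff, inv, diff)
    log_post = np.log(weights) + log_dens
    log_post -= log_post.max(axis=1, keepdims=True)  # numerical stability
    post = np.exp(log_post)
    return patterns, post / post.sum(axis=1, keepdims=True)
```

Each unit would then be assigned to the pattern with the largest posterior probability; rows with no clearly dominant pattern are natural candidates for manual review.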

7
Model diagnostics used to prioritise units for
manual check

The atypicality index identifies outliers
w.r.t. the defined model (e.g. units possibly
affected by errors other than the UME)
The classification probabilities $\tau_g(y_i; \theta, \pi)$
identify possibly misclassified units. They
can be used directly to identify
misclassifications that are possibly influential
on target estimates (significance editing)

8
Main findings

Finite Mixture Modelling allows multivariate,
non-hierarchical data analyses. Costs for
developing ad hoc procedures are saved
Finite Mixture Modelling produces highly reliable
automatic data clustering/error localization
Model diagnostics can be used to reduce
the costs of manual editing
The approach is robust to moderate departures
from normality
The number of model parameters is limited by the
model constraints on $\mu$ and $\Sigma$

9
Imputation: Bayesian Networks for categorical
variables

The idea of using BNs for imputation was first
proposed by Thibaudeau and Winkler (2002)
Let $C_1, \ldots, C_n$ be a set of categorical variables,
each having a finite set of mutually exclusive
states
BNs represent the joint distribution of the
variables both graphically and numerically
A BN can be viewed as a Directed Acyclic Graph
together with an inferential engine that allows
inferences on the distribution parameters

10
Graphical representation of BNs

To each variable $C_j$ with parents $Pa(C_j)$ there is
attached a conditional probability $P(C_j \mid Pa(C_j))$
BNs allow the joint probability distribution
$P(C_1, \ldots, C_n)$ to be factorized so that
$P(C_1, \ldots, C_n) = \prod_{j=1}^{n} P(C_j \mid Pa(C_j))$
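
The factorization can be illustrated with a small Python sketch. The three-variable network and its probability tables below are invented for illustration only; they do not come from the presentation.

```python
# Hypothetical three-variable network: A -> B, A -> C.
parents = {"A": (), "B": ("A",), "C": ("A",)}

# Conditional probability tables keyed by (value, parent values).
cpt = {
    "A": {("a0", ()): 0.6, ("a1", ()): 0.4},
    "B": {("b0", ("a0",)): 0.7, ("b1", ("a0",)): 0.3,
          ("b0", ("a1",)): 0.2, ("b1", ("a1",)): 0.8},
    "C": {("c0", ("a0",)): 0.9, ("c1", ("a0",)): 0.1,
          ("c0", ("a1",)): 0.5, ("c1", ("a1",)): 0.5},
}

def joint_probability(assignment):
    """Joint probability as the product of each node's P(node | parents)."""
    prob = 1.0
    for var, pa in parents.items():
        pa_values = tuple(assignment[p] for p in pa)
        prob *= cpt[var][(assignment[var], pa_values)]
    return prob
```

In this toy network the factorized form needs 1 + 2 + 2 = 5 free parameters, against 7 for an unrestricted joint table over three binary variables; the saving grows quickly with the number of variables.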

11
BNs and imputation method 1

Order variables according to their reliability
Estimate the network conditioned on this order
Estimate the conditional probabilities for each
node according to (2)
Impute each missing item by a random draw from
its conditional prob. distribution
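
The last two steps of this method can be sketched in Python as below. This is only a rough illustration under several assumptions: the network structure (`parents`) and the variable order are taken as given rather than estimated, conditional probabilities are estimated by relative frequencies on records where the variable and its parents are observed, and missing items are coded as `None`. The function names are hypothetical.

```python
import random
from collections import Counter, defaultdict

def estimate_cpt(records, var, parents):
    """Frequency counts of var given its parents, from fully observed cases."""
    counts = defaultdict(Counter)
    for r in records:
        if r[var] is not None and all(r[p] is not None for p in parents):
            counts[tuple(r[p] for p in parents)][r[var]] += 1
    return counts

def impute(records, order, parents, rng=None):
    """Impute missing items (None) in place, variable by variable."""
    rng = rng or random.Random(0)
    # Tables are estimated once, from the observed data only.
    tables = {v: estimate_cpt(records, v, parents[v]) for v in order}
    for r in records:
        for v in order:  # parents precede children, so they are filled first
            if r[v] is None:
                dist = tables[v].get(tuple(r[p] for p in parents[v]))
                if not dist:
                    raise ValueError(f"no observed cases for parents of {v}")
                values, weights = zip(*dist.items())
                # Random draw from the estimated conditional distribution.
                r[v] = rng.choices(values, weights=weights)[0]
    return records
```

The sketch raises an error when a parent configuration was never observed; a production procedure would need a fallback for that case.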

12
BNs and imputation methods 2/3

In a multivariate context it is more convenient to
use information coming not only from the parents,
but also from the children. This can be done by
using the Markov blanket (Mb)
$Mb(X) = Pa(X) \cup Ch(X) \cup Pa(Ch(X))$
In this case, for each node the conditional
probabilities are estimated w.r.t. its Mb
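
A Markov blanket can be read off the graph with a few lines of Python. The sketch below assumes the DAG is stored as a mapping from each node to its parents; the function name and the data structure are illustrative choices, not part of the original method description.

```python
def markov_blanket(node, parents):
    """Parents, children and the children's other parents of `node`.

    parents maps each node to the collection of its parents in the DAG.
    """
    children = {c for c, pa in parents.items() if node in pa}
    blanket = set(parents[node]) | children
    for child in children:
        blanket |= set(parents[child])  # co-parents of each child
    blanket.discard(node)  # the node itself is not part of its blanket
    return blanket
```

For example, in the graph A -> B -> C <- D, C -> E, the blanket of B is {A, C, D}: its parent A, its child C, and C's other parent D.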

13
Main findings

BNs express the joint probability distribution
with a dramatic decrease in the number of
parameters to be estimated (reduction of
complexity)
BNs can estimate the relationships between
variables that are really informative for
predicting values
Parametric models like BNs are efficient in terms
of preservation of joint distributions
The graphical representation facilitates
modelling
BNs and hot deck methods behave in the same way
only when the hot deck is stratified
according to variables that exactly explain the
missing-data mechanism

14
Imputation: Quis system for continuous variables

Quis (QUick Imputation System) is a generalized SAS
tool developed at Istat to impute
continuous survey data in a unified environment
Given a set of variables subject to non-response,
different methods can be used in a completely
integrated way:
Regression Imputation via EM algorithm
Nearest Neighbour Donor Imputation (NND)
Multivariate Predictive Mean Matching (PMM)

15
Regression imputation via EM

In the context of imputation, the EM algorithm is
used to obtain Maximum Likelihood estimates,
in the presence of missing data, for the parameters
of the model assumed for the data
Assumptions:
MAR mechanism
Normality

16
Regression imputation via EM

Once ML estimates of the parameters have been
obtained, missing data can be imputed in two
different ways:
directly, through expectations of missing values
conditional on observed ones (predictive means)
by adding a normal random residual to the
predictive means (i.e. drawing values from the
conditional distributions of missing values)
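
Both options can be sketched in Python for a single record, using the standard conditional-distribution formulas of the multivariate normal. This is an illustration only, not the Quis code (which is written in SAS); it assumes `mu` and `sigma` are ML estimates already obtained via EM, that missing items are coded as NaN, and that at least one item is observed. The function name is hypothetical.

```python
import numpy as np

def impute_normal(y, mu, sigma, add_residual=False, rng=None):
    """Impute the NaN entries of one record under a multivariate normal model."""
    rng = rng or np.random.default_rng(0)
    y = np.asarray(y, dtype=float).copy()
    mu = np.asarray(mu, dtype=float)
    sigma = np.asarray(sigma, dtype=float)
    m = np.isnan(y)  # missing items
    o = ~m           # observed items
    if not m.any():
        return y
    # Regression coefficients of missing on observed: S_mo S_oo^{-1}.
    beta = sigma[np.ix_(m, o)] @ np.linalg.inv(sigma[np.ix_(o, o)])
    # Predictive mean of the missing items given the observed ones.
    cond_mean = mu[m] + beta @ (y[o] - mu[o])
    if add_residual:
        # Residual covariance: S_mm - S_mo S_oo^{-1} S_om.
        cond_cov = sigma[np.ix_(m, m)] - beta @ sigma[np.ix_(o, m)]
        y[m] = rng.multivariate_normal(cond_mean, cond_cov)
    else:
        y[m] = cond_mean
    return y
```

Imputing the predictive mean tends to understate variability, while the random-residual version preserves the spread of the conditional distribution at the cost of added noise in each imputed value.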

17
Multivariate Predictive Mean Matching (PMM)

Let $Y = (Y_1, \ldots, Y_q)$ be a set of variables subject
to non-response
ML estimates of the parameters $\theta$ of the joint
distribution of Y are derived via EM
For each pattern of missing data $y_{miss}$, the
parameters of the corresponding conditional
distribution are estimated starting from $\theta$ (sweep
operator)
For each unit $u_i$ the predictive mean based on
the estimated parameters is computed
For each unit with missing data, imputation is
done using the nearest donor w.r.t. the
predictive mean
The Mahalanobis distance is adopted to find donors
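
The donor selection step can be sketched in Python as follows. The slide does not state which covariance matrix enters the Mahalanobis distance, so `cov` is left as an input here (one plausible choice would be the residual covariance of the missing items given the observed ones); that choice, and the function name, are assumptions of this sketch.

```python
import numpy as np

def nearest_donor(recipient_mean, donor_means, cov):
    """Index of the donor whose predictive mean is closest to the recipient's.

    recipient_mean: predictive mean of the recipient's missing items.
    donor_means: (N x k) predictive means of the donors on the same items.
    cov: covariance matrix defining the Mahalanobis metric (assumed input).
    """
    diff = np.asarray(donor_means, dtype=float) - np.asarray(recipient_mean, dtype=float)
    inv = np.linalg.inv(np.asarray(cov, dtype=float))
    d2 = np.einsum("ij,jk,ik->i", diff, inv, diff)  # squared distances
    return int(np.argmin(d2))
```

The observed values of the selected donor, not its predictive mean, would then be copied into the recipient's missing items, so imputed values are always values actually observed in the data.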

18
Data clustering for improving the search for
donors in the Diesis system

The DIESIS system has been developed at ISTAT for
treating the demographic variables of the 2001
Population Census
Diesis uses both the data-driven and the
minimum-change approach for editing and imputation
For each failed household, the set of potential
donors contains only the nearest passed
households
The adopted distance function is a weighted sum
of the distances for each demographic variable
over all the individuals within the household
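
A distance of this kind can be sketched in Python as below. The slides do not give the per-variable distances or how individuals are matched, so this sketch assumes households of equal size with individuals aligned by position and a simple 0/1 mismatch distance for every variable; the function name, variable names and weights are hypothetical.

```python
def household_distance(failed, donor, weights):
    """Weighted sum of per-variable distances over aligned individuals.

    failed, donor: lists of dicts, one per person, same length and order.
    weights: variable name -> weight. A 0/1 mismatch distance is assumed.
    """
    if len(failed) != len(donor):
        raise ValueError("households must have the same number of individuals")
    return sum(
        w * (person_f[var] != person_d[var])
        for person_f, person_d in zip(failed, donor)
        for var, w in weights.items()
    )
```

For ordered variables such as age, an absolute or scaled difference would be a more natural per-variable distance than a plain mismatch indicator.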

19
The approach in use for donor search

For each failed household e, the identification
of potential donors should be made by searching
within the set D of all passed households
When D is very large, as in the case of a Census,
computing the distance between each e
and all $d \in D$ (exhaustive search) could require
unacceptable computational time
The sub-optimal search in use consists in
stopping the search before examining the entire
set D, according to some stopping criteria. This
solution does not guarantee the selection of the
potential donors having the actual minimum distance
from e
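
Why an early-stopped search can miss the true nearest donor is easy to see in a small Python sketch. The stopping criteria actually used in Diesis are not given in the slides; the two used here (a cap on the number of donors examined and a "good enough" distance threshold) are purely hypothetical, as is the function name.

```python
def early_stop_search(e, donors, distance, max_examined, good_enough):
    """Scan donors in order and stop early.

    Returns (index of best donor found, number of donors examined).
    """
    best_idx, best_dist, examined = None, float("inf"), 0
    for idx, d in enumerate(donors):
        if examined >= max_examined:
            break  # hypothetical criterion 1: budget exhausted
        examined += 1
        dist = distance(e, d)
        if dist < best_dist:
            best_idx, best_dist = idx, dist
        if best_dist <= good_enough:
            break  # hypothetical criterion 2: donor is close enough
    return best_idx, examined
```

With donors `[5, 3, 9, 0]`, recipient `1`, absolute difference as the distance and a budget of two donors, the search returns index 1 (distance 2) and never reaches index 3, which is the actual nearest donor (distance 1).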

20
The new approach for donor search

In order to reduce the number of passed
households to examine, the set of passed
households D is preliminarily divided into
smaller homogeneous subsets $D_1, \ldots, D_n$
($D_1 \cup \ldots \cup D_n = D$)
This subdivision is obtained by solving an
unsupervised clustering problem (donor search
guided by clustering)
The search for the potential donors is then
conducted, for each failed household e, by
examining only the households within the
cluster(s) most similar to e
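
The idea can be sketched in Python as follows. The sketch assumes the clustering has already been computed, that households are encoded as numeric vectors compared with Euclidean distance, that every cluster is non-empty, and that similarity between e and a cluster is measured by distance to the cluster centroid. None of these details are specified in the slides, and the function name is hypothetical.

```python
import numpy as np

def clustered_donor_search(e, donors, labels, centroids, n_clusters=1):
    """Nearest donor to e, examining only the clusters closest to e.

    donors: (N x p) array of passed households (numeric encoding assumed).
    labels: cluster index of each donor; centroids: (K x p) cluster centres.
    """
    e = np.asarray(e, dtype=float)
    donors = np.asarray(donors, dtype=float)
    labels = np.asarray(labels)
    # Rank clusters by the distance of their centroid from e.
    centroid_dist = np.linalg.norm(np.asarray(centroids, dtype=float) - e, axis=1)
    nearest_clusters = np.argsort(centroid_dist)[:n_clusters]
    # Exhaustive search, but only inside the selected cluster(s).
    candidates = np.flatnonzero(np.isin(labels, nearest_clusters))
    dist = np.linalg.norm(donors[candidates] - e, axis=1)
    return int(candidates[np.argmin(dist)])
```

Raising `n_clusters` trades computation time for a higher chance of finding the actual minimum-distance donor; with `n_clusters` equal to the number of clusters the search becomes exhaustive.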

21
Main findings

The donor search guided by clustering reduces
computation time while preserving the E&I quality
obtained by the exhaustive search
The donor search guided by clustering increases
the proportion of actual minimum-distance donors
selected with respect to the sub-optimal search
(this is especially useful for households having
an uncommon structure, for which few passed
households are generally available)
