# Data integration: an overview on statistical methodologies and applications


### Data integration: an overview on statistical methodologies and applications. Mauro Scanu, Istat, Central Unit on User Needs, Integration and Territorial Statistics

Slides: 73
Transcript and Presenter's Notes


1
Data integration: an overview on statistical
methodologies and applications
• Mauro Scanu
• Istat
• Central Unit on User Needs, Integration and
Territorial Statistics
• scanu_at_istat.it

2
Summary
• In what sense are methods for integration statistical?
• Record linkage: definition, examples, methods, objectives and open problems
• Statistical matching: definition, examples, methods, objectives and open problems
• Micro integration processing: definition, examples, methods, objectives and open problems
• Other statistical integration methods?

3
Methods for integration 1
• Generally speaking, integration of two data sets is understood as single-unit integration: the objective is to detect those records in the different data sets that belong to the same statistical unit. This action allows the reconstruction of a unique record that contains all the information collected on that unit in the different data sources.
• Here, instead, let us distinguish two different objectives: micro and macro
• Micro: the objective is the development of a complete data set
• Macro: the objective is the development of an aggregate (for example, a contingency table)

4
Methods for integration 2
• Further, the methods of integration can be split into automatic and statistical methods
• The automatic methods take into account a priori rules for the linkage of the data records
• The statistical methods include a formal estimation or test procedure that should be applied to the available data; this estimation or test procedure
• can be chosen according to optimality criteria,
• and is associated with an estimation error.
• This talk restricts attention to the (micro and macro) statistical methods of integration

5
Statistical methods
• Classical inference
• 1) There exists a data generating model
• 2) The observed sample is an image of the data generating model
• 3) We estimate the model from the observed sample

6
Statistical methods of integration
• If a method of integration is used, it is
necessary to include an intermediate phase.
• The final data set is a blurred image of the data
generating model

7
Statistical methods of integration
• Statistical methods for integration can be
organized according to the available input

| Input | Output | Method |
|---|---|---|
| Two data sets that observe (partially) overlapping groups of units | Micro | Record linkage |
| Two independent samples | Macro/micro | Statistical matching |
| Sets of estimates from different surveys that are not coherent | Macro | Calibration methods, graphical methods |

8
• Input: two data sets on overlapping sets of units.
• Problem: lack of a unique and correct record identifier
• Alternative: sets of variables that (jointly) are able to identify units
• Attention: variables can have problems!
• Objective: the largest number of correct links, the lowest number of wrong links

9
Book of life
• Dunn (1946) describes record linkage in this way:
• "each person in the world creates a book of life. The book starts with the birth and ends with the death. Its pages are made up of all the principal events of life. Record linkage is the name given to the process of assembling the pages of this book into one volume. The person retains the same identity throughout the book. Except for advancing age, he is the same person"
• Dunn (1946) "Record Linkage". American Journal of Public Health 36 (12), 1412-1416.

10
When there is the lack of a unique identifier
• If a record identifier is missing or cannot be
used, it is necessary to use the common variables
in the two files.
• The problem is that these variables can be
unstable
• Time changes (age, address, educational level)
• Errors in data entry and coding
• Correct answers but different codification (e.g.
• Missing items

11
• According to Fellegi (1997), the development of tools for integration is due to the intersection of these facts:
• occasion: construction of large databases
• tool: the computer
• need: new information needs
• Fellegi (1997) Record Linkage and Public Policy: A Dynamic Evolution. In Alvey, Jamerson (eds) Record Linkage Techniques, Proceedings of an international workshop and exposition, Arlington (USA) 20-21 March 1997.

12
• To have joint information on two or more
variables observed in distinct data sources
• To enumerate a population
• To substitute (parts of) surveys with archives
• To create a list of a population
• Other official statistics objectives (imputation and editing; enhancing microdata quality; studying the risk of identification of the released microdata)

13
Example 1: analysis of mortality
• Problem: to analyze the risk factors jointly with the event "death".
• The risk factors are observed in ad hoc surveys (e.g. those on nutrition habits, work conditions, etc.)
• The event "death" (some months after the survey is conducted) can be taken from administrative archives
• These two sources (survey on the risk factors and death archive) should be fused so that each unit observed in the risk factor survey can be associated with a new dichotomous variable (equal to 1 if the person is dead and zero otherwise).
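
The fusion described in this example can be sketched as follows. This is only an illustration under a strong assumption, namely that both sources share a reliable identifier; the field names `person_id` and `smoker` and all records are invented for the example.

```python
# Hypothetical records: a risk-factor survey and a death archive.
survey = [
    {"person_id": 101, "smoker": True},
    {"person_id": 102, "smoker": False},
    {"person_id": 103, "smoker": True},
]
death_archive = [{"person_id": 103}, {"person_id": 250}]


def add_death_indicator(survey_records, archive_records, key="person_id"):
    """Return copies of the survey records with a 0/1 variable 'dead'.

    'dead' is 1 when the record's key value appears in the archive, 0 otherwise.
    """
    dead_ids = {rec[key] for rec in archive_records}
    return [dict(rec, dead=int(rec[key] in dead_ids)) for rec in survey_records]


linked = add_death_indicator(survey, death_archive)
```

When no such identifier exists, the exact lookup on `key` would have to be replaced by a record linkage procedure.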

14
Example 2: to enumerate a population
• Problem: what is the number of residents in Italy?
• Often the number of residents is found in two
steps, by means of a procedure known as
capture-recapture. This method is usually
applied to determine the size of animal
populations.
• Population census
• Post enumeration survey (some months after the
census) to evaluate Census quality and give an
accurate estimate of the population size
• USA - in 1990 Post Enumeration Survey, in 2000
Accuracy and Coverage Evaluation
• Italy - in 2001 Indagine di Copertura del
Censimento

15
Example 2: to enumerate a population
• The result of the comparison between Census and post enumeration survey is a 2×2 table

16
Example 2: to enumerate a population
• In short, for any distinct unit it is necessary to understand whether it was observed
• 1) both in the census and in the PES
• 2) only in the census
• 3) only in the PES
• These three values allow us to estimate (with an appropriate model) the fourth value.
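
As a minimal sketch of how the fourth value can be estimated, the code below uses the simplest capture-recapture model, which assumes that being observed in the census and being observed in the PES are independent events and that all links are correct. The slides do not say which model is used in practice, and the counts are invented.

```python
def estimate_population(n_both, n_census_only, n_pes_only):
    """Estimate the unobserved cell and the population size from three counts.

    Under independence of the two captures, the cross-product ratio of the
    2x2 table equals 1, so n_missed = n_census_only * n_pes_only / n_both.
    """
    if n_both <= 0:
        raise ValueError("at least one unit must be observed in both sources")
    n_missed = n_census_only * n_pes_only / n_both
    n_total = n_both + n_census_only + n_pes_only + n_missed
    return n_missed, n_total


# Invented counts, for illustration only.
missed, total = estimate_population(n_both=900, n_census_only=60, n_pes_only=30)
```

The total coincides with the classical estimate (census count × PES count) / (count in both), here 960 × 930 / 900.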

17
Example 3: surveys and archives
• Problem: is it possible to use jointly
• At the micro level this means modifying the questionnaire of a survey, dropping those
questions that are already available on some
response burden)
• E.g., for enterprises
• Social security archives, chambers of commerce,

18
Example 4: creation of a list
• Problem: what is the set of the active enterprises in Italy?
• In Istat, ASIA (Archivio Statistico delle Imprese Attive) is the most important example of the creation of a list of units (the active enterprises at a given point in time) by fusing different archives.
• It is necessary to pay attention to
• Enterprises which are present in more than one archive (deduplication)
• Non-active enterprises
• Newborn enterprises
• Transformations (that can lead to a new enterprise or to a continuation of the previous one)

19
Example 5: imputation and editing
• Problem: to enhance microdata quality
• Micro Integration in the Netherlands (virtual
census, social statistical data base)
• It will be seen later, when dealing with micro
integration processing

20
Example 6: privacy
• Problem: is there a measure of the degree of identification of the released microdata?
• In order to evaluate if a method for the
protection of data disclosure is good, it is
possible to compare two datasets (the true and
the protected ones) and detect how many modified
records are easily linked to the true ones.

21
The record linkage techniques are a
multidisciplinary set of methods and practices
• DECISION MODEL CHOICE
• Fellegi Sunter
• exact
• Knowledge based
• Mixed
• SEARCH SPACE REDUCTION
• Sorted Neighbourhood Method
• Blocking
• Hierarchical Grouping

• PRE-PROCESSING
• Conversion of upper/lower cases
• Replacement of null strings
• Standardization
• Parsing
• COMPARISON FUNCTION CHOICE
• Edit distance
• Smith-Waterman
• Q-grams
• Jaro string comparator
• Soundex code
• TF-IDF

Tiziana Tuoto, FCSM 2007, Arlington, November 6
2007
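
To make the pre-processing and comparison-function phases listed above more concrete, here is a small sketch of one standardization step and two of the named comparison functions (edit distance and q-grams). The choice of the Jaccard index for q-grams and the function names are assumptions made for this example, not taken from the slides.

```python
def standardize(name):
    """Minimal pre-processing: trim, collapse whitespace, convert to upper case."""
    return " ".join(name.split()).upper()


def edit_distance(s, t):
    """Levenshtein distance: minimum number of single-character edits."""
    previous = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        current = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def qgram_similarity(s, t, q=2):
    """Jaccard similarity between the sets of q-grams of two strings."""
    grams_s = {s[i:i + q] for i in range(len(s) - q + 1)}
    grams_t = {t[i:i + q] for i in range(len(t) - q + 1)}
    if not grams_s and not grams_t:
        return 1.0
    return len(grams_s & grams_t) / len(grams_s | grams_t)
```

For instance, `edit_distance(standardize("  rossi "), "ROSSI")` is 0, while "ROSSI" and "ROSSO" share three of their five distinct bigrams.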
22
Example (Fortini, 2008)
• A census is sometimes associated with a post enumeration survey (PES), in order to detect the actual census coverage.
• To this purpose, a "capture-recapture" approach is generally considered.
• It is necessary to find out how many individuals have been observed
• in both the census and the PES
• only in the census
• only in the PES
• These figures allow us to estimate how many individuals have been observed NEITHER in the census NOR in the PES
• In ESSnet Statistical Methodology Project on
Integration of Survey and Administrative Data
Report of WP2. Recommendations on the use of
methodologies for the integration of surveys and

23
Record linkage workflow for Census - PES
[Figure: workflow diagram with Steps 1, 2, 3.a, 3.b, 4.a, 4.b and 5]

24
Problem: lack of identifiers
• The difference between step 1 and step 2 is that
• Step 1 identifies all those households that coincide on all these variables
• Name, surname and date of birth of the household
• Number of male and female components
• Step 2 uses the same keys, but allows for differences in the variable values due to changes or errors

25
• For every pair of records from the two data sets, it is necessary to estimate
• The probability that the differences between what is observed on the two records are due to chance, because the two records belong to the same unit
• The probability that the two records belong to different units
• These probabilities are compared: this comparison is the basis for the decision whether a pair of records is a match or not
• The estimation of these probabilities is the statistical step in the probabilistic record linkage method

26
Statistical step
• Data set A with na units.
• Data set B with nb units.
• K key variables (they jointly make an identifier)

27
Statistical procedure
• The key variables of the two records in a pair (a,b) are compared:
• $y_{ab} = f(x^A_a, x^B_b)$
• The function f(.) should register how much the key variables observed in the two units differ.
• For instance, y can be a vector with k components, composed of 0s (inequalities) or 1s (equalities)
• The final result is a data set of na x nb comparisons
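
The comparison step above can be sketched as follows, taking f(.) to be the 0/1 equality vector mentioned in the slide. The key variables `surname`, `birth_year` and `sex` and the records are hypothetical.

```python
from itertools import product


def comparison_vector(rec_a, rec_b, keys):
    """y_ab: 1 where a key variable agrees on the two records, 0 where it differs."""
    return tuple(int(rec_a[k] == rec_b[k]) for k in keys)


def compare_all(file_a, file_b, keys):
    """Return {(a, b): y_ab} for every one of the na x nb pairs."""
    return {
        (a, b): comparison_vector(rec_a, rec_b, keys)
        for (a, rec_a), (b, rec_b) in product(enumerate(file_a), enumerate(file_b))
    }


keys = ["surname", "birth_year", "sex"]  # hypothetical key variables
file_a = [
    {"surname": "ROSSI", "birth_year": 1970, "sex": "F"},
    {"surname": "BIANCHI", "birth_year": 1985, "sex": "M"},
]
file_b = [
    {"surname": "ROSSI", "birth_year": 1970, "sex": "F"},
    {"surname": "ROSSO", "birth_year": 1970, "sex": "F"},
    {"surname": "VERDI", "birth_year": 1990, "sex": "M"},
]
comparisons = compare_all(file_a, file_b, keys)
```

With 2 records in A and 3 in B, `comparisons` holds 6 vectors. In real applications the full cross product is usually reduced first, for example by blocking.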

28
Statistical procedure
• The na x nb pairs are split into two sets
• M: the pairs that are a match
• U: the unmatched pairs
• Likely, the comparisons y will follow this pattern
• Low levels of diversity for the pairs that are a match, $(a,b) \in M$
• High levels of diversity for the pairs that are a non-match, $(a,b) \in U$
• For instance, if y = (sum of the equalities for the k key variables), y tends to assume larger values for the pairs in M than for those in U

29
Statistical procedure
If y = (sum of the equalities), the distribution of
y is a mixture of the distribution of y in M
(right) and that in U (left)
30
Statistical procedure
Inclusion of a pair (a,b) in M or U is a missing value (latent variable). Let C denote the status of a pair (C = 1 if (a,b) is in M, C = 0 if (a,b) is in U). The likelihood is the product over the na x nb pairs of

$$P(Y = y, C = c) = \left[p\, m(y)\right]^{c} \left[(1-p)\, u(y)\right]^{1-c}$$

Estimation method: maximum likelihood on a partially observed data set (EM algorithm, Expectation Maximization)

| Parameters | Data |
|---|---|
| p: fraction of matches among the na x nb pairs | Y: observed |
| m(y): distribution of y in M | C: missing (latent) |
| u(y): distribution of y in U | |
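
The EM estimation can be sketched as below. The sketch adds one assumption that the slide does not state: the k comparisons are independent within M and within U, so that m(y) and u(y) factorize into per-variable agreement probabilities `m[j]` and `u[j]`. Starting values, the clipping constant `eps` and the data are illustrative choices.

```python
def em_fellegi_sunter(ys, n_iter=200, p=0.1, m_init=0.9, u_init=0.1, eps=1e-6):
    """Estimate p, m[j] = P(y_j = 1 | M) and u[j] = P(y_j = 1 | U) by EM.

    ys: list of 0/1 comparison vectors, one per pair.
    Assumes the key comparisons are independent given the match status.
    """
    k, n = len(ys[0]), len(ys)
    m, u = [m_init] * k, [u_init] * k

    def density(y, probs):
        value = 1.0
        for y_j, q in zip(y, probs):
            value *= q if y_j else 1.0 - q
        return value

    def clip(q):
        # Keep probabilities away from 0 and 1 so every density stays positive.
        return min(max(q, eps), 1.0 - eps)

    for _ in range(n_iter):
        # E-step: posterior probability that each pair is a match (C = 1).
        g = []
        for y in ys:
            num = p * density(y, m)
            g.append(num / (num + (1.0 - p) * density(y, u)))
        # M-step: weighted relative frequencies.
        total = sum(g)
        p = clip(total / n)
        m = [clip(sum(gi * y[j] for gi, y in zip(g, ys)) / total) for j in range(k)]
        u = [clip(sum((1.0 - gi) * y[j] for gi, y in zip(g, ys)) / (n - total))
             for j in range(k)]
    return p, m, u


# Invented comparison data: 300 pairs on 3 key variables.
ys = ([(1, 1, 1)] * 24 + [(1, 1, 0)] * 2 + [(1, 0, 1)] * 2 + [(0, 1, 1)] * 2
      + [(0, 0, 0)] * 200 + [(1, 0, 0)] * 25 + [(0, 1, 0)] * 25 + [(0, 0, 1)] * 20)
p_hat, m_hat, u_hat = em_fellegi_sunter(ys)
```

On these data the estimated agreement probabilities come out high in M and low in U, which is the separation between the two mixture components that the weights rely on.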
31
Statistical procedure
A pair is assigned to M or U in the following way:
1) For every comparison y, assign a weight $t(y) = m(y)/u(y)$, where m and u are estimated.
2) Assign the pairs with a large weight to M and the pairs with a small weight to U.
3) There can be a class of weights t where it is better to avoid definitive decisions (m and u are similar).
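
A minimal sketch of this decision rule follows. It works with log t(y) rather than t(y), and it assumes independent key comparisons so that the weight is a sum over the key variables; the probabilities and the two thresholds `upper` and `lower` are hypothetical values chosen only for illustration.

```python
import math


def match_weight(y, m, u):
    """log t(y) = log m(y) - log u(y), assuming independent key comparisons."""
    weight = 0.0
    for y_j, m_j, u_j in zip(y, m, u):
        if y_j:
            weight += math.log(m_j / u_j)
        else:
            weight += math.log((1.0 - m_j) / (1.0 - u_j))
    return weight


def classify(y, m, u, upper, lower):
    """Assign a pair to M, to U, or leave it undecided."""
    weight = match_weight(y, m, u)
    if weight >= upper:
        return "match"
    if weight <= lower:
        return "non-match"
    return "undecided"


m_est = [0.9, 0.9, 0.9]  # illustrative estimates of P(agreement | M)
u_est = [0.1, 0.1, 0.1]  # illustrative estimates of P(agreement | U)
decisions = {y: classify(y, m_est, u_est, upper=3.0, lower=-3.0)
             for y in [(1, 1, 1), (1, 1, 0), (0, 0, 0)]}
```

Pairs left undecided are typically sent to clerical review; how the thresholds are set determines the trade-off between wrong links and missed links.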
32
Statistical procedure
The procedure is the following. Note that,
generally, probabilities of mismatching are still
not considered
33
Open problems
• Several aspects of probabilistic record linkage still need further investigation. Two of them are related to record linkage quality
• a) What model should be considered?
• a1) on the pairs relationship (Copas and Hilton, 1990)
• a2) on the key variables relationship (Thibaudeau, 1993)
• b) How can probabilities of mismatching be used for a statistical analysis of a linked data file? (Scheuren and Winkler, 1993, 1997)
• Copas J.R., Hilton F.J. (1990). Record linkage: statistical models for matching computer records. Journal of the Royal Statistical Society, Series A, 153, 287-320.
• Thibaudeau Y. (1993). The discrimination power of dependency structures in record linkage. Survey Methodology, 19, 31-38.
• Scheuren F., Winkler W.E. (1993). Regression analysis of data files that are computer matched. Survey Methodology, 19, 39-58.
• Scheuren F., Winkler W.E. (1997). Regression analysis of data files that are computer matched - part II. Survey Methodology, 23, 157-165.

34
Statistical matching
• What kind of integration should be considered if
the analysis involves two variables observed in
two independent sample surveys?
• Let A and B be two samples of size nA and nB
respectively, drawn from the same population.
• Some variables X are observed in both samples
• Variables Y are observed only in A
• Variables Z are observed only in B.
• Statistical matching aims at determining information on (X, Y, Z), or at least on the pairs of variables which are not observed jointly (Y, Z)

35
Statistical matching
• It is very improbable that the two samples
observe the same units, hence record linkage is
useless.

36
Some statistical matching applications 1
• The objective of the integration of the Time Use Survey (TUS) and of the Labour Force Survey (LFS) is to create, at a micro level, a synthetic file of both surveys that allows the study of the relationships between variables measured in each specific survey.
• By using together the data relative to the specific variables of both surveys, one would be able to analyse the characteristics of employment and the time balances at the same time.
• Information on labour force units and on the organisation of their life times will help enhance the analyses of the labour market
• The analyses of the working condition characteristics that result from the labour force survey will integrate the TUS more general analysis of the quality of life

37
Some statistical matching applications 1
• The possibilities for a reciprocal enrichment have been largely recognised (see the 17th International Conference of Labour Statisticians in 2003 and the 2003 and 2004 works of the Paris Group). The emphasis was indeed put on how the integration of the two surveys could contribute to analysing the different participation modalities in the labour market determined by hour and contract flexibility.
• Among the issues raised by researchers on time use, we list the following two:
• the usefulness and limitations involved in using and combining various sources, such as labour force and time-use surveys, for improving data quality
• Time-use surveys are useful, especially for measuring hours worked of workers in the informal economy, in home-based work, and by the hidden or undeclared workforce, as well as to measure absence from work

38
Some statistical matching applications 1
• Specific variables in the TUS (Y): they make it possible to estimate the time dedicated to daily work and to study its level of "fragmentation" (number of intervals/interruptions), flexibility (exact start and end of working hours) and intra-relations with the other life times
• Specific variables in the LFS (Z): the vastness of the information gathered allows us to examine the peculiar aspects of the Italian participation in the labour market: professional condition, economic activity sector, type of working hours, job duration, profession carried out, etc. Moreover, it is also possible to investigate dimensions relative to the quality of the job

39
Some statistical matching applications 2
• The Social Policy Simulation Database and Model
(SPSD/M) is a micro computer-based product
designed to assist those interested in analyzing
the financial interactions of governments and
nglish/spsd/spsdm.htm).
• It can help one to assess the cost implications or income redistributive effects of changes in the personal taxation and cash transfer system.
• The SPSD is a non-confidential, statistically
representative database of individuals in their
family context, with enough information on each
• individual to compute taxes paid to and cash
• government.

40
Some statistical matching applications 2
• The SPSM is a static accounting model which
processes each individual and family on the SPSD,
calculates taxes and transfers using legislated
or proposed programs and algorithms, and reports
on the results.
• It gives the user a high degree of control over
the inputs and outputs to the model and can allow
the user to modify existing tax/transfer programs
or test proposals for entirely new programs. The
model can be run using a visual interface and it
comes with full documentation.

41
Some statistical matching applications 2
• In order to apply the algorithms for microsimulation of tax-transfer benefits policies, it is necessary to have a data set
• population. This data set should contain information on structural (age, sex, ...), economic (income, house ownership, car ownership, ...), health-related (permanent illnesses, child care, ...) and social (elder assistance, cultural-educational benefits, ...) variables (among others).
• There is no single data set that contains all the variables that can influence the fiscal policy of a state
• In Canada 4 samples are integrated (Survey of Consumer Finances, Tax return data, Unemployment insurance claim histories, Family expenditure survey)
• Common variables: some socio-demographic variables
• Interest is on the relation between the distinct variables in the different samples

42
Example (Coli et al, 2006)
• The new European System of Accounts (ESA95) is a detailed source of information on all the economic agents, such as households and enterprises. The social accounting matrix (SAM) has a relevant role.
• Module on households: it includes the amount of expenditures and income, per type of household
• Coli A., Tartamella F., Sacco G., Faiella I., D'Orazio M., Di Zio M., Scanu M., Siciliani I., Colombini S., Masi A. (2006). La costruzione di un Archivio di microdati sulle famiglie italiane ottenuto integrando l'indagine ISTAT sui consumi delle famiglie italiane e l'Indagine Banca d'Italia sui bilanci delle famiglie italiane, Documenti ISTAT, n.12/2006.

43
Example
• Problem
• Income is observed in a Bank of Italy survey
• Expenditures are observed in an Istat survey
• The two samples are composed of different households, hence record linkage is useless

44
• The first statistical matching solution was imputation of missing data. Usually, distance hot deck was used.
• In practice, this method mimics record linkage: instead of matching records of the same unit, this approach matches records of similar units, where similarity is in terms of the common variables in the two files.
• The procedure is
• 1) Compute the distances between the matching variables for every pair of records
• 2) Every record in A is associated with the record in B at minimum distance
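
The two steps above can be sketched as follows. The sketch assumes numeric matching variables on comparable scales, squared Euclidean distance, and unconstrained matching (a donor in B can be used more than once); the variable names and values are invented.

```python
def distance_hot_deck(recipients, donors, match_vars, donate_var):
    """Impute donate_var in each recipient from its nearest donor.

    Distance is the squared Euclidean distance on match_vars; ties are
    resolved in favour of the first donor found.
    """
    completed = []
    for rec in recipients:
        nearest = min(
            donors,
            key=lambda d: sum((rec[v] - d[v]) ** 2 for v in match_vars),
        )
        completed.append(dict(rec, **{donate_var: nearest[donate_var]}))
    return completed


# File A observes income, file B observes expenditure (hypothetical values).
file_a = [
    {"surface": 60, "members": 2, "income": 1800},
    {"surface": 120, "members": 4, "income": 3500},
]
file_b = [
    {"surface": 65, "members": 2, "expenditure": 1500},
    {"surface": 110, "members": 5, "expenditure": 2900},
    {"surface": 200, "members": 3, "expenditure": 4000},
]
synthetic = distance_hot_deck(file_a, file_b, ["surface", "members"], "expenditure")
```

The result is a synthetic file in which every record of A carries its own income and the expenditure of a similar household from B.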

45
• The inferential path is the following

46
• An estimation procedure is applied under specific models that account for the presence of missing items. The simplest model is conditional independence of the never jointly observed variables (e.g., income and expenditures) given the matching variables.
• Example
• Y: income, Z: expenditures, X: house surface
• (X,Y,Z) is distributed as a multivariate normal with parameters
• Mean vector $\mu$
• Variance matrix $\Sigma$

47
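• Estimate the regression equation on A: $Y = \alpha_Y + \beta_{YX} X$
• Impute Y in B: $\hat{Y}_b = \hat{\alpha}_Y + \hat{\beta}_{YX} X_b$, $b = 1, \dots, n_B$
• Estimate the regression equation on B: $Z = \alpha_Z + \beta_{ZX} X$
• Impute Z in A: $\hat{Z}_a = \hat{\alpha}_Z + \hat{\beta}_{ZX} X_a$, $a = 1, \dots, n_A$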
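
The four steps above can be sketched with ordinary least squares on a single matching variable X. The data are invented and lie exactly on two lines so that the result is easy to check; the sketch imputes the regression prediction without a residual term, which is a simplification.

```python
def fit_line(x, y):
    """Ordinary least squares for y = alpha + beta * x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    beta = sxy / sxx
    return mean_y - beta * mean_x, beta


# Sample A observes (X, Y); sample B observes (X, Z). Invented values.
x_a, y_a = [50, 80, 100, 120], [2000, 2600, 3000, 3400]
x_b, z_b = [60, 90, 150], [1400, 1850, 2750]

alpha_y, beta_yx = fit_line(x_a, y_a)          # regression of Y on X, estimated on A
y_imputed_in_b = [alpha_y + beta_yx * x for x in x_b]

alpha_z, beta_zx = fit_line(x_b, z_b)          # regression of Z on X, estimated on B
z_imputed_in_a = [alpha_z + beta_zx * x for x in x_a]
```

Since each imputed value depends on X only, Y and Z end up related in the completed files solely through X: this is the conditional independence assumption at work.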

48
• The inferential mechanism assumes that
• Y and Z are independent given X
• (the regression coefficient of Z on Y given X does not appear)

49
• This method can also be applied with this inferential scheme: the problem is which hypotheses are made before the analysis phase

50
• We do not hypothesize any model. A set of values is estimated, one for every plausible model given the observed data
• Example
• When matching two sample surveys on farms
following contingency table for farms
• Y: presence of cattle (FSS)
• Z: class of intermediate consumption (from FADN)
• Using the common variables
• X1: Utilized Agricultural Area (UAA)
• X2: Livestock Size Unit (LSU)
• X3: geographical characteristics

51
Example
• We consider all the models that we can estimate from the observed data in the two surveys
• In practice, the available data allow us to say that the estimate of the number of farms with at least one cow (Y = 1) in the lowest class of intermediate consumption (Z = 1) is between 2.9 and 4.9
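
One common way to obtain such an interval for a cell of a contingency table is to use Fréchet bounds conditional on the common variables. The sketch below assumes this technique; the slides do not state exactly how the reported interval was computed, and the probabilities are invented rather than taken from FSS or FADN.

```python
def frechet_bounds(p_x, p_y_given_x, p_z_given_x):
    """Bounds for P(Y = 1, Z = 1) compatible with the observed margins.

    p_x[i]         : P(X = i), estimable from both samples
    p_y_given_x[i] : P(Y = 1 | X = i), estimable from the sample observing Y
    p_z_given_x[i] : P(Z = 1 | X = i), estimable from the sample observing Z
    Returns (lower bound, value under conditional independence, upper bound).
    """
    cells = list(zip(p_x, p_y_given_x, p_z_given_x))
    lower = sum(px * max(0.0, py + pz - 1.0) for px, py, pz in cells)
    upper = sum(px * min(py, pz) for px, py, pz in cells)
    under_cia = sum(px * py * pz for px, py, pz in cells)
    return lower, under_cia, upper


# Invented probabilities for a binary X.
low, cia, high = frechet_bounds(
    p_x=[0.5, 0.5],
    p_y_given_x=[0.2, 0.8],
    p_z_given_x=[0.3, 0.6],
)
```

Every joint distribution that agrees with the observed (X, Y) and (X, Z) margins gives a value between `low` and `high`; the conditional independence model picks one point inside that range.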

52
Inferential machine
• The inferential machine does not use any specific
model

It is possible to simulate data including
uncertainty on the data generation model (e.g. by
multiple imputation)
53
Quotation (Manski, 1995)
• "The pressure to produce answers, without qualifications, seems particularly intense in the environs of Washington, D.C. A perhaps apocryphal, but quite believable, story circulates about an economist's attempt to describe his uncertainty about a forecast to President Lyndon Johnson. The economist presented his forecast as a likely range of values for the quantity under discussion. Johnson is said to have replied, 'Ranges are for cattle. Give me a number.'"
• Manski, C. F. (1995) Identification Problems in the Social Sciences, Harvard University Press.
• Manski and other authors show that in a wide range of applied areas (econometrics, sociology, psychometrics) there is a problem of identifiability of the models of interest, usually caused by the presence of missing data. The statistical matching problem is an example of this.

54
Why statistical matching?
• Applications in Istat
• SAM
• Joint analysis FADN / FSS
• Joint use of Time Use / Labour force
• Objectives
• Estimates of parameters of variables that are not jointly observed
• Creation of synthetic data (e.g. a data set for microsimulation)

55
Open problems
• Uncertainty estimate (D'Orazio et al, 2006)
• Variability of uncertainty (Imbens and Manski, 2004)
• Use of samples drawn according to complex survey designs (Rubin, 1986; Renssen, 1998)
• Use of nonparametric methods (Marella et al, 2008; Conti et al, 2008)
• Conti P.L., Marella D., Scanu M. (2008). Evaluation of matching noise for imputation techniques based on the local linear regression estimator. Computational Statistics and Data Analysis, 53, 354-365.
• D'Orazio M., Di Zio M., Scanu M. (2006). Statistical Matching for Categorical Data: Displaying Uncertainty and Using Logical Constraints, Journal of Official Statistics, 22, 137-157.
• Imbens, G.W., Manski, C. F. (2004). Confidence intervals for partially identified parameters. Econometrica, Vol. 72, No. 6 (November, 2004), 1845-1857.
• Marella D., Scanu M., Conti P.L. (2008). On the matching noise of some nonparametric imputation procedures, Statistics and Probability Letters, 78, 1593-1600.
• Renssen, R.H. (1998) Use of statistical matching techniques in calibration estimation. Survey Methodology 24, 171-183.
• Rubin, D.B. (1986) Statistical matching using file concatenation with adjusted weights and multiple imputations. Journal of Business and Economic Statistics 4, 87-94.

56
Micro integration processing
• It can be applied every time a complete data set (micro level) is produced by any kind of method. Up to now, applied after exact record
• Micro integration processing consists of putting in place all the necessary actions aimed at ensuring better quality of the matched results, such as quality and timeliness of the matched files. It includes
• defining checks,
• editing procedures to get better estimates,
• imputation procedures to get better estimates.

57
Micro integration processing
• It should be kept in mind that some sources are
more reliable than others.
• Some sources have a better coverage than others,
and there may even be conflicting information
between sources.
• So, it is important to recognize the strong and
weak points of all the data sources used.

58
Micro integration processing
• Since there are differences between sources, a
micro integration process is needed to check data
and adjust incorrect data. It is believed that
integrated data will provide far more reliable
results, because they are based on an optimal
amount of information. Also the coverage of (sub)
populations will be better, because when data are
missing in one source, another source can be
used. Another advantage of integration is that
users of statistical information will get one
figure on each social phenomenon, instead of a
confusing number of different figures depending
on which source has been used.

59
Micro integration processing
• During the micro integration of the data sources the following steps have to be taken (Van der Laan, 2000):
• a. harmonisation of statistical units
• b. harmonisation of reference periods
• c. completion of populations (coverage)
• d. harmonisation of variables, in case of
differences in definition
• e. harmonisation of classifications
• f. adjustment for measurement errors, when
corresponding variables still do not have the
same value after harmonisation for differences in
definitions
• g. imputations in the case of item nonresponse
• h. derivation of (new) variables: creation of variables out of different data sources
• i. checks for overall consistency.
• All steps are controlled by a set of integration
rules and fully automated.

60
Example: micro integration processing
• From Schulte Nordholt, Linder (2007) Statistical Journal of the IAOS 24, 163-171
• Suppose that someone becomes unemployed at the
end of November and gets unemployment benefits
from the beginning of December. The jobs register
may indicate that this person has lost the job at
the end of the year, perhaps due to
administrative delay or because of payments after
job termination. The registration of benefits is
believed to be more accurate. When confronting
these facts the integrator could decide to
change the date of termination of the job to the
end of November, because it is unlikely that the
person simultaneously had a job and benefits in
December. Such decisions are made with the utmost
care. As soon as there are convincing counter
indications of other jobs register variables,
indicating that the job was still there in
December, the termination date will, in general,

61
Example: micro integration processing
• Method: definition of rules for the creation of a usable complete data set after the linkage process.
• If these approaches are not applied, the integrated data set can contain conflicting information at the micro level.
• These approaches are still strictly based on knowledge of the quality of the data sets.
• Proposition for a possible next ESSnet on integration: study the links between imputation and editing activities and

62
Other supporting slides
63
Macro integration: coherence of estimates
• Sometimes it is useful to integrate aggregate data, where aggregates are computed from different sample surveys.
• For instance, to include a set of tables in an information system
• A problem is the coherence of information in different tables.
• The adopted solution is at the estimate level, for instance with calibration procedures (e.g. the Virtual census in the Netherlands)

64
Project
• The objective of a project is to gather the developments in two distinct areas
• Probabilistic expert systems: these are graphical models, characterized by the presence of an easy updating system of the joint distribution of a set of variables, once one of them is updated. These models have been used for a class of estimators that includes poststratification estimators
• Statistical information systems: SIS for the production of statistical output (Istar), with the objective to integrate and manage statistical data given and validated by the Istat production areas, in order to produce purposeful output for the end users

65
Objectives and open problems
• Objectives
• To develop a statistical information system for agriculture data, managing tables from FADN, FSS, and lists used for sampling (containing census and archive data)
• To manage coherence between different tables
• To update information on data from the most
recent survey and to visualize what changes
happen to the other tables
• To allow simulations (for policy making)
• Problems
• Use of graphical models for complex survey data
• To link the selection of tables to the updating
algorithm
• To update more than one table at the same time

66
Some practical aspects for integration Software
• There exist different software tools for record
• Relais: http://www.istat.it/strumenti/metodi/software/analisi_dati/relais/
• R package for statistical matching
• http://cran.r-project.org/index.html
• Look for StatMatch
• Probabilistic expert systems: Hugin (it does not work with complex survey data)

67
Bibliography
• Batini C., Scannapieco M. (2006) Data Quality, Springer Verlag, Heidelberg.
• Scanu M. (2003) Metodi statistici per il record linkage, collana Metodi e Norme n. 16, Istat.
• D'Orazio M., Di Zio M., Scanu M. (2006) Statistical Matching: Theory and Practice, J. Wiley & Sons, Chichester.
• Ballin M., De Francisci S., Scanu M., Tininini L., Vicard P. (2009) Integrated statistical systems: an approach to preserve coherence between a set of surveys based on the use of probabilistic expert systems, NTTS 2009, Bruxelles.

68
Is this conditional independence?
69
And this?
70
Statistical methods of integration
• Sometimes a shorter track is used.
• Note! The automatic methods correspond to a
specific data generating model

71
Statistical methods of integration
72
Statistical methods of integration
• The last approach is very appealing:
• Estimate a data generating model from the two data samples at hand
• Use this estimate for the estimation of aggregate data (e.g. contingency tables on non jointly observed variables)
• If necessary, develop a complete data set by simulation from the estimated model: the integrated data generating mechanism is the nearest to the data generating model, according to the optimality properties of the model estimator
• Attention! Issue 1 includes hypotheses that cannot be tested on the available data (this is true for record linkage and, more dramatically, for statistical matching)