Data integration: an overview on statistical methodologies and applications - PowerPoint PPT Presentation

Data integration: an overview on statistical methodologies and applications. Mauro Scanu Istat Central Unit on User Needs, Integration and Territorial Statistics – PowerPoint PPT presentation

Slides: 73
Provided by: ist129
Learn more at: http://stat.gov.pl

1
Data integration: an overview on statistical
methodologies and applications
  • Mauro Scanu
  • Istat
  • Central Unit on User Needs, Integration and
    Territorial Statistics
  • scanu_at_istat.it

2
Summary
  • In what sense are methods for integration
    statistical?
  • Record linkage: definition, examples, methods,
    objectives and open problems
  • Statistical matching: definition, examples,
    methods, objectives and open problems
  • Micro integration processing: definition,
    examples, methods, objectives and open problems
  • Other statistical integration methods?

3
Methods for integration 1
  • Generally speaking, integration of two data sets
    is understood as integration at the level of
    single units: the objective is the detection of
    those records in the different data sets that
    belong to the same statistical unit. This allows
    the reconstruction of a unique record that
    contains all the information collected on that
    unit in the different data sources.
  • Here, instead, let us distinguish two different
    objectives: micro and macro
  • Micro: the objective is the development of a
    complete data set
  • Macro: the objective is the development of an
    aggregate (for example, a contingency table)

4
Methods for integration 2
  • Further, the methods of integration can be split
    into automatic and statistical methods
  • The automatic methods rely on a priori rules for
    the linkage of the data records
  • The statistical methods include a formal
    estimation or test procedure that is applied to
    the available data. This estimation or test
    procedure
  • can be chosen according to optimality criteria,
  • and is associated with an estimation error.
  • This talk restricts attention to the (micro
    and macro) statistical methods of integration

5
Statistical methods
  • Classical inference
  • 1) There exists a data generating model
  • 2) The observed sample is an image of the data
    generating model
  • 3) We estimate the model from the observed sample

6
Statistical methods of integration
  • If a method of integration is used, it is
    necessary to include an intermediate phase.
  • The final data set is a blurred image of the data
    generating model

7
Statistical methods of integration
  • Statistical methods for integration can be
    organized according to the available input

Input | Output | Method
Two data sets that observe (partially) overlapping groups of units | Micro | Record linkage
Two independent samples | Macro/micro | Statistical matching
Sets of estimates from different surveys that are not coherent | Macro | Calibration methods, graphical methods
8
Record linkage
  • Input: two data sets on overlapping sets of
    units.
  • Problem: lack of a unique and correct record
    identifier
  • Alternative: sets of variables that (jointly) are
    able to identify units
  • Attention: variables can have problems!
  • Objective: the largest number of correct links,
    the lowest number of wrong links

9
Book of life
  • Dunn (1946) describes record linkage in this
    way:
  • "Each person in the world creates a book of life.
    The book starts with the birth and ends with the
    death. Its pages are made up of all the principal
    events of life. Record linkage is the name given
    to the process of assembling the pages of this
    book into one volume. The person retains the same
    identity throughout the book. Except for
    advancing age, he is the same person."
  • Dunn (1946) "Record Linkage". American Journal
    of Public Health 36 (12), 1412-1416.

10
When a unique identifier is lacking
  • If a record identifier is missing or cannot be
    used, it is necessary to use the common variables
    in the two files.
  • The problem is that these variables can be
    unstable:
  • Time changes (age, address, educational level)
  • Errors in data entry and coding
  • Correct answers but different codification (e.g.
    address)
  • Missing items

11
Main motivations for record linkage
  • According to Fellegi (1997), the development of
    tools for integration is due to the intersection
    of these facts
  • Occasion: construction of large databases
  • Tool: the computer
  • Need: new information needs
  • Fellegi (1997) Record Linkage and Public
    Policy: A Dynamic Evolution. In Alvey, Jamerson
    (eds) Record Linkage Techniques, Proceedings of
    an international workshop and exposition,
    Arlington (USA) 20-21 March 1997.

12
Why record linkage? Some examples
  • To have joint information on two or more
    variables observed in distinct data sources
  • To enumerate a population
  • To substitute (parts of) surveys with archives
  • To create a list of a population
  • Other official statistics objectives (imputation
    and editing to enhance micro data quality;
    studying the risk of identification of the
    released micro data)

13
Example 1: analysis of mortality
  • Problem: to analyze the risk factors jointly
    with the event "death".
  • The risk factors are observed in ad hoc surveys
    (e.g. those on nutrition habits, work conditions,
    etc.)
  • The event "death" (some months after the survey
    is conducted) can be taken from administrative
    archives
  • These two sources (survey on the risk factors and
    death archive) should be fused so that each
    unit observed in the risk factor survey can be
    associated with a new dichotomous variable (equal
    to 1 if the person is dead and 0 otherwise).

14
Example 2: to enumerate a population
  • Problem: what is the number of residents in
    Italy?
  • Often the number of residents is found in two
    steps, by means of a procedure known as
    capture-recapture. This method is usually
    applied to determine the size of animal
    populations.
  • Population census
  • Post enumeration survey (PES, some months after
    the census) to evaluate census quality and give
    an accurate estimate of the population size
  • USA: Post Enumeration Survey in 1990, Accuracy
    and Coverage Evaluation in 2000
  • Italy: Indagine di Copertura del Censimento
    in 2001

15
Example 2: to enumerate a population
  • The result of the comparison between Census and
    post enumeration survey is a 2×2 table

16
Example 2: to enumerate a population
  • In short, for any distinct unit it is necessary
    to understand whether it was observed
  • 1) both in the census and in the PES
  • 2) only in the census
  • 3) only in the PES
  • These three values make it possible to estimate
    (with an appropriate model) the fourth value.
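
A minimal sketch of how the fourth cell could be estimated, assuming the simplest capture-recapture model (the Lincoln-Petersen estimator: census and PES captures are independent, every unit has the same capture probability, and the linkage between the two sources is error-free). The slides only mention "an appropriate model", so this choice, the function name and the counts are illustrative assumptions.

```python
def lincoln_petersen(n_both, n_census_only, n_pes_only):
    """Estimate population size from the three observed cells of the 2x2 table.

    Assumes independent captures and perfect linkage (Lincoln-Petersen model).
    """
    if n_both <= 0:
        raise ValueError("need at least one unit observed in both sources")
    n_census = n_both + n_census_only            # units counted by the census
    n_pes = n_both + n_pes_only                  # units counted by the PES
    n_hat = n_census * n_pes / n_both            # estimated population size
    n_missed = n_census_only * n_pes_only / n_both  # estimated fourth cell
    return n_hat, n_missed


# Hypothetical counts, for illustration only
n_hat, n_missed = lincoln_petersen(n_both=900, n_census_only=50, n_pes_only=60)
print(round(n_hat, 1), round(n_missed, 1))
```

The estimated fourth cell plus the three observed cells adds up to the estimated population size, which is a quick consistency check on the computation.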

17
Example 3: surveys and archives
  • Problem: is it possible to use administrative
    archives and sample surveys jointly?
  • At the micro level this means modifying the
    questionnaire of a survey by dropping those
    questions whose answers are already available in
    some administrative archives (reduction of the
    response burden)
  • E.g., for enterprises
  • Social security archives, chambers of commerce,

18
Example 4: creation of a list
  • Problem: what is the set of the active
    enterprises in Italy?
  • In Istat, ASIA (Archivio Statistico delle Imprese
    Attive) is the most important example of the
    creation of a list of units (the enterprises
    active at a given time) by fusing different
    archives.
  • It is necessary to pay attention to
  • Enterprises which are present in more than one
    archive (deduplication)
  • Non-active enterprises
  • Newborn enterprises
  • Transformations (that can lead to a new
    enterprise or to a continuation of the previous
    one)

19
Example 5: imputation and editing
  • Problem: to enhance microdata quality
  • Micro integration in the Netherlands (virtual
    census, social statistical database)
  • This will be seen later, when dealing with micro
    integration processing

20
Example 6: privacy
  • Problem: is there a measure of the degree
    of identification of the released microdata?
  • In order to evaluate whether a method for
    protecting data against disclosure is good, it is
    possible to compare two data sets (the true and
    the protected ones) and detect how many modified
    records are easily linked to the true ones.

21
Record linkage steps
The record linkage techniques are a
multidisciplinary set of methods and practices
[Diagram: the phases of RECORD LINKAGE and the main options for each]
  • DECISION MODEL CHOICE: Fellegi-Sunter, exact,
    knowledge based, mixed
  • SEARCH SPACE REDUCTION: Sorted Neighbourhood
    Method, blocking, hierarchical grouping
  • PRE-PROCESSING: conversion of upper/lower
    cases, replacement of null strings,
    standardization, parsing
  • COMPARISON FUNCTION CHOICE: edit distance,
    Smith-Waterman, q-grams, Jaro string comparator,
    Soundex code, TF-IDF

Tiziana Tuoto, FCSM 2007, Arlington, November 6
2007
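
To make two of the listed comparison functions concrete, here is a small sketch of the edit (Levenshtein) distance and of a q-gram similarity, preceded by a simple pre-processing step. The function names, the choice q = 2 and the Dice-style normalisation of the q-gram score are illustrative assumptions, not taken from the slides.

```python
def standardize(s):
    """Simple pre-processing: trim, collapse inner spaces, convert to upper case."""
    return " ".join(s.split()).upper()


def edit_distance(s, t):
    """Levenshtein distance: minimum number of insertions, deletions, substitutions."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def qgram_similarity(s, t, q=2):
    """Share of common q-grams (sets, so repeated q-grams count once)."""
    a = {s[i:i + q] for i in range(len(s) - q + 1)}
    b = {t[i:i + q] for i in range(len(t) - q + 1)}
    if not a and not b:
        return 1.0
    return 2 * len(a & b) / (len(a) + len(b))


x, y = standardize(" rossi "), standardize("Rosi")
print(edit_distance(x, y), round(qgram_similarity(x, y), 3))
```

Both functions return a graded measure of disagreement or agreement, which is what allows a linkage procedure to tolerate typing errors in names and addresses instead of requiring exact equality.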
22
Example (Fortini, 2008)
  • A census is sometimes associated with a post
    enumeration survey (PES), in order to assess the
    actual census coverage.
  • For this purpose, a capture-recapture approach
    is generally considered.
  • It is necessary to find out how many individuals
    have been observed
  • in both the census and the PES
  • only in the census
  • only in the PES
  • These figures allow one to estimate how many
    individuals have been observed in neither the
    census nor the PES
  • In: ESSnet Statistical Methodology Project on
    Integration of Survey and Administrative Data.
    Report of WP2: Recommendations on the use of
    methodologies for the integration of surveys and
    administrative data, 2008

23
Record linkage workflow for Census - PES
[Figure: workflow diagram with Steps 1, 2, 3.a, 3.b, 4.a, 4.b and 5]
24
Problem: lack of identifiers
  • The difference between step 1 and step 2 is that
  • Step 1 identifies all those households that
    coincide on all these variables
  • Name, surname and date of birth of the household
    head
  • Address
  • Number of male and female components
  • Step 2 uses the same keys, but admits the
    possibility of differences in the values of the
    variables, due to changes or errors

25
Probabilistic record linkage
  • For every pair of records from the two data
    sets, it is necessary to estimate
  • The probability that the differences between what
    is observed on the two records are due to chance,
    when the two records belong to the same unit
  • The probability that the two records belong to
    different units
  • These probabilities are compared: this comparison
    is the basis for the decision whether a pair of
    records is a match or not
  • The estimation of these probabilities is the
    statistical step in the probabilistic record
    linkage method

26
Statistical step
  • Data set A with na units.
  • Data set B with nb units.
  • K key variables (they jointly make an identifier)

27
Statistical procedure
  • The key variables of the two records in a pair
    (a,b) are compared:
  • $y_{ab} = f(x^A_a, x^B_b)$
  • The function f(.) should register how much the
    key variables observed in the two units
    differ.
  • For instance, y can be a vector with k
    components, composed of 0s (inequalities) or 1s
    (equalities)
  • The final result is a data set of na x nb
    comparisons
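
The comparison step can be sketched as follows, assuming the simplest choice of f(.): one 0/1 agreement indicator per key variable. The key variables and the records are hypothetical.

```python
from itertools import product

KEYS = ("surname", "name", "birth_year")  # hypothetical key variables (k = 3)


def compare(rec_a, rec_b, keys=KEYS):
    """Comparison vector y_ab: 1 where the key variable agrees, 0 otherwise."""
    return tuple(int(rec_a[k] == rec_b[k]) for k in keys)


file_a = [
    {"surname": "ROSSI", "name": "MARIO", "birth_year": 1970},
    {"surname": "BIANCHI", "name": "ANNA", "birth_year": 1982},
]
file_b = [
    {"surname": "ROSSI", "name": "MARIO", "birth_year": 1970},
    {"surname": "ROSSI", "name": "MARIA", "birth_year": 1970},
    {"surname": "BIANCHI", "name": "ANNA", "birth_year": 1928},
]

# One comparison vector for each of the na x nb pairs
comparisons = {
    (a, b): compare(rec_a, rec_b)
    for (a, rec_a), (b, rec_b) in product(enumerate(file_a), enumerate(file_b))
}
for pair, y in comparisons.items():
    print(pair, y)
```

Because every record of A is compared with every record of B, the number of comparisons grows as na x nb, which is why search space reduction (e.g. blocking) matters in practice.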

28
Statistical procedure
  • The na x nb pairs are split into two sets
  • M: the pairs that are a match
  • U: the unmatched pairs
  • Likely, the comparisons y will follow this
    situation
  • Low levels of diversity for the pairs that are
    a match, (a,b) ∈ M
  • High levels of diversity for the pairs that are
    a non-match, (a,b) ∈ U
  • For instance, if y = (sum of the equalities for
    the k key variables), y tends to assume larger
    values for the pairs in M than for those in U

29
Statistical procedure
If y = (sum of the equalities), the distribution of
y is a mixture of the distribution of y in M
(right) and that in U (left)
30
Statistical procedure
Inclusion of a pair (a,b) in M or U is a missing
value (latent variable). Let C denote the status
of a pair: C = 1 if (a,b) is in M, C = 0 if (a,b)
is in U.
Likelihood: the product over the na x nb pairs of
$P(Y = y, C = c) = [p\, m(y)]^{c}\, [(1-p)\, u(y)]^{1-c}$
Estimation method: maximum likelihood on a
partially observed data set (EM algorithm,
Expectation-Maximization)
Parameters | Data
p: fraction of matches among the na x nb pairs | Y: observed
m(y): distribution of y in M | C: missing (latent)
u(y): distribution of y in U |
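
A hedged sketch of the EM estimation of p, m and u. It adds an assumption that the slides do not state: the k key-variable agreements are independent within M and within U, so that m(y) and u(y) factor into per-variable probabilities. The function name, starting values and comparison patterns are made up for illustration.

```python
def em_fellegi_sunter(patterns, n_iter=200, eps=1e-6):
    """EM for a two-class (M/U) mixture of independent 0/1 agreement indicators.

    Returns p (share of matches) and, for each key variable j,
    m[j] = P(agreement | M) and u[j] = P(agreement | U).
    """
    n, k = len(patterns), len(patterns[0])
    p, m, u = 0.1, [0.9] * k, [0.1] * k  # arbitrary starting values

    def clamp(v):
        # keep probabilities strictly inside (0, 1) to avoid divisions by zero
        return min(max(v, eps), 1 - eps)

    for _ in range(n_iter):
        # E-step: posterior probability that each pair is a match (C = 1)
        g = []
        for y in patterns:
            pm, pu = p, 1 - p
            for yj, mj, uj in zip(y, m, u):
                pm *= mj if yj else 1 - mj
                pu *= uj if yj else 1 - uj
            g.append(pm / (pm + pu))
        # M-step: weighted relative frequencies
        total = sum(g)
        p = clamp(total / n)
        m = [clamp(sum(gi * y[j] for gi, y in zip(g, patterns)) / total)
             for j in range(k)]
        u = [clamp(sum((1 - gi) * y[j] for gi, y in zip(g, patterns)) / (n - total))
             for j in range(k)]
    return p, m, u


# Hypothetical comparison vectors for 101 pairs and k = 3 key variables
patterns = ([(1, 1, 1)] * 8 + [(1, 1, 0), (1, 0, 1), (0, 1, 1)]
            + [(1, 0, 0), (0, 1, 0), (0, 0, 1)] * 10 + [(0, 0, 0)] * 60)
p_hat, m_hat, u_hat = em_fellegi_sunter(patterns)
print(round(p_hat, 3), [round(v, 3) for v in m_hat], [round(v, 3) for v in u_hat])
```

The status C of each pair is never observed, so the E-step replaces it with its posterior expectation given the current parameters, and the M-step re-estimates p, m and u as if those expectations were data.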
31
Statistical procedure
A pair is assigned to M or U in the following
way:
1) For every comparison y, assign a weight
$t(y) = m(y)/u(y)$, where m and u are estimated
2) Assign the pairs with a large weight to M and
the pairs with a small weight to U
3) There can be a class of weights t where it
is better to avoid definitive decisions (m and u
are similar)
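
The decision rule can be sketched as below, again assuming independent key variables so that the weight is a product of per-variable ratios. The values of m and u and the two thresholds are hypothetical; in practice the thresholds would be chosen from the error rates one is willing to tolerate.

```python
def match_weight(y, m, u):
    """Weight t(y) = m(y) / u(y) under independence of the key variables."""
    t = 1.0
    for yj, mj, uj in zip(y, m, u):
        t *= (mj / uj) if yj else ((1 - mj) / (1 - uj))
    return t


def classify(t, lower=1.0, upper=100.0):
    """Three-way decision: match, non-match, or no definitive decision."""
    if t >= upper:
        return "match"
    if t <= lower:
        return "non-match"
    return "undecided"


m = (0.9, 0.9, 0.9)  # hypothetical P(agreement | M) for each key variable
u = (0.1, 0.1, 0.1)  # hypothetical P(agreement | U) for each key variable
for y in [(1, 1, 1), (1, 1, 0), (0, 0, 0)]:
    t = match_weight(y, m, u)
    print(y, round(t, 4), classify(t))
```

Pairs that fall between the two thresholds are the ones usually sent to clerical review.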
32
Statistical procedure
The procedure is the following. Note that,
generally, probabilities of mismatching are still
not considered
33
Open problems
  • Several aspects of probabilistic record linkage
    still need to be better investigated. Two of them
    are related to record linkage quality
  • a) What model should be considered?
  • a1) on the pairs relationship (Copas and Hilton,
    1990)
  • a2) on the key variables relationship
    (Thibaudeau, 1993)
  • b) How can probabilities of mismatching be used
    for a statistical analysis of a linked data file?
    (Scheuren and Winkler, 1993, 1997)
  • Copas J.R., Hilton F.J. (1990). Record linkage:
    statistical models for matching computer
    records. Journal of the Royal Statistical
    Society, Series A, 153, 287-320.
  • Thibaudeau Y. (1993). The discrimination power
    of dependency structures in record linkage.
    Survey Methodology, 19, 31-38.
  • Scheuren F., Winkler W.E. (1993). Regression
    analysis of data files that are computer
    matched. Survey Methodology, 19, 39-58.
  • Scheuren F., Winkler W.E. (1997). Regression
    analysis of data files that are computer matched
    - part II. Survey Methodology, 23, 157-165.

34
Statistical matching
  • What kind of integration should be considered if
    the analysis involves two variables observed in
    two independent sample surveys?
  • Let A and B be two samples of size nA and nB
    respectively, drawn from the same population.
  • Some variables X are observed in both samples
  • Variables Y are observed only in A
  • Variables Z are observed only in B.
  • Statistical matching aims at determining
    information on (X, Y, Z), or at least on the
    pairs of variables which are not observed jointly
    (Y, Z)

35
Statistical matching
  • It is very improbable that the two samples
    observe the same units, hence record linkage is
    useless.

36
Some statistical matching applications 1
  • The objective of the integration of the Time Use
    Survey (TUS) and of the Labour Force Survey (LFS)
    is to create, at a micro level, a synthetic file
    of both surveys that allows the study of the
    relationships between variables measured in each
    specific survey.
  • By using together the data relative to the
    specific variables of both surveys, one would be
    able to analyse the characteristics of employment
    and the time balances at the same time.
  • Information on labour force units and the
    organisation of their life times will help
    enhance the analyses of the labour market
  • The analyses of the working condition
    characteristics that result from the labour force
    survey will integrate the TUS more general
    analysis of the quality of life

37
Some statistical matching applications 1
  • The possibilities for a reciprocal enrichment
    have been largely recognised (see the 17th
    International Conference of Labour Statisticians
    in 2003 and the 2003 and 2004 works of the Paris
    Group). The emphasis was indeed put on how the
    integration of the two surveys could contribute
    to analysing the different participation
    modalities in the labour market determined by
    hour and contract flexibility.
  • Among the issues raised by researchers on time
    use, we list the following two:
  • the usefulness and limitations involved in using
    and combining various sources, such as labour
    force and time-use surveys, for improving data
    quality
  • Time-use surveys are useful, especially for
    measuring hours worked of workers in the informal
    economy, in home-based work, and by the hidden or
    undeclared workforce, as well as to measure
    absence from work

38
Some statistical matching applications 1
  • Specific variables in the TUS (Y): they make it
    possible to estimate the time dedicated to daily
    work and to study its level of "fragmentation"
    (number of intervals/interruptions), flexibility
    (exact start and end of working hours) and
    intra-relations with the other life times
  • Specific variables in the LFS (Z): the vastness
    of the information gathered allows us to examine
    the peculiar aspects of Italian participation
    in the labour market: professional condition,
    economic activity sector, type of working hours,
    job duration, profession carried out, etc.
    Moreover, it is also possible to investigate
    dimensions relative to the quality of the job

39
Some statistical matching applications 2
  • The Social Policy Simulation Database and Model
    (SPSD/M) is a micro computer-based product
    designed to assist those interested in analyzing
    the financial interactions of governments and
    individuals in Canada (see
    http://www.statcan.ca/english/spsd/spsdm.htm).
  • It can help one to assess the cost implications
    or income redistributive effects of changes in
    the personal taxation and cash transfer system.
  • The SPSD is a non-confidential, statistically
    representative database of individuals in their
    family context, with enough information on each
    individual to compute taxes paid to and cash
    transfers received from government.

40
Some statistical matching applications 2
  • The SPSM is a static accounting model which
    processes each individual and family on the SPSD,
    calculates taxes and transfers using legislated
    or proposed programs and algorithms, and reports
    on the results.
  • It gives the user a high degree of control over
    the inputs and outputs to the model and can allow
    the user to modify existing tax/transfer programs
    or test proposals for entirely new programs. The
    model can be run using a visual interface and it
    comes with full documentation.

41
Some statistical matching applications 2
  • In order to apply the algorithms for
    microsimulation of tax-transfer benefits
    policies, it is necessary to have a data set
    representative of the Canadian population. This
    data set should contain information on structural
    (age, sex, ...), economic (income, house
    ownership, car ownership, ...), health-related
    (permanent illnesses, child care, ...) and social
    (elder assistance, cultural-educational benefits,
    ...) variables (among others).
  • There is no single data set that contains all
    the variables that can influence the fiscal
    policy of a state
  • In Canada 4 samples are integrated (Survey of
    Consumer Finances, tax return data, unemployment
    insurance claim histories, Family Expenditure
    Survey)
  • Common variables: some socio-demographic
    variables
  • Interest is in the relation between the distinct
    variables in the different samples

42
Example (Coli et al, 2006)
  • The new European System of Accounts (ESA95)
    is a detailed source of information on all the
    economic agents, such as households and
    enterprises. The social accounting matrix (SAM)
    has a relevant role.
  • Module on households: it includes the amount of
    expenditures and income, per typology of
    household
  • Coli A., Tartamella F., Sacco G., Faiella I.,
    D'Orazio M., Di Zio M., Scanu M., Siciliani I.,
    Colombini S., Masi A. (2006). La costruzione di
    un Archivio di microdati sulle famiglie italiane
    ottenuto integrando l'indagine ISTAT sui consumi
    delle famiglie italiane e l'Indagine Banca
    d'Italia sui bilanci delle famiglie italiane,
    Documenti ISTAT, n.12/2006.

43
Example
  • Problem
  • Income is observed in a Bank of Italy survey
  • Expenditures are observed on an Istat survey
  • The two samples are composed of different
    households, hence record linkage is useless

44
Adopted solutions 1
  • The first statistical matching solution was
    imputation of missing data. Usually, distance
    hot deck was used.
  • In practice, this method mimics record linkage:
    instead of matching records of the same unit,
    this approach matches records of similar units,
    where similarity is in terms of the common
    variables in the two files.
  • The procedure is
  • 1) Compute the distances between the matching
    variables for every pair of records
  • 2) Every record in A is associated with the record
    in B at minimum distance
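
The two steps can be sketched as an unconstrained nearest-neighbour (distance) hot deck. The Manhattan distance, the variable names and the records are illustrative assumptions; a real application would usually standardise the matching variables first.

```python
def distance_hot_deck(recipients, donors, x_vars, z_var):
    """For each record in A (recipient), impute z from the closest record in B.

    Distance: sum of absolute differences on the matching variables.
    Unconstrained version: the same donor can be used more than once.
    """
    completed = []
    for rec in recipients:
        donor = min(donors, key=lambda d: sum(abs(rec[x] - d[x]) for x in x_vars))
        completed.append({**rec, z_var: donor[z_var]})
    return completed


# Hypothetical files: A observes (x, y), B observes (x, z)
file_a = [{"x": 50, "y": 1000}, {"x": 120, "y": 2500}]
file_b = [{"x": 55, "z": 800}, {"x": 110, "z": 1900}, {"x": 200, "z": 3000}]

synthetic = distance_hot_deck(file_a, file_b, x_vars=["x"], z_var="z")
for row in synthetic:
    print(row)
```

The output is a synthetic file where Y is observed and Z is imputed: the joint behaviour of Y and Z in it reflects only what the matching variables can explain.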

45
Adopted solutions 1
  • The inferential path is the following

46
Adopted solutions 2
  • An estimation procedure is applied under
    specific models that take into account the
    presence of missing items. The simplest model is
    conditional independence of the never jointly
    observed variables (e.g., income and
    expenditures) given the matching variables.
  • Example
  • Y: income, Z: expenditures, X: house surface
  • (X,Y,Z) is distributed as a multivariate normal
    with parameters
  • Mean vector $\mu$
  • Variance matrix $\Sigma$

47
Adopted solutions 2
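  • Estimate the regression equation on A: $Y = \alpha_Y + \beta_{YX} X$
  • Impute Y in B: $\tilde{Y}_b = \hat{\alpha}_Y + \hat{\beta}_{YX} X_b$, $b = 1, \dots, n_B$
  • Estimate the regression equation in B: $Z = \alpha_Z + \beta_{ZX} X$
  • Impute Z in A: $\tilde{Z}_a = \hat{\alpha}_Z + \hat{\beta}_{ZX} X_a$, $a = 1, \dots, n_A$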
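
A minimal sketch of these four steps with a single matching variable X and ordinary least squares, imputing the fitted value on the regression line (no residual is added). The data are hypothetical and were chosen to lie exactly on a line so that the result is easy to check by hand.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = alpha + beta * x; returns (alpha, beta)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    beta = sxy / sxx
    return mean_y - beta * mean_x, beta


# Hypothetical samples: A observes (X, Y), B observes (X, Z)
x_a, y_a = [50, 80, 100, 120], [35, 50, 60, 70]
x_b, z_b = [60, 90, 150], [17, 23, 35]

alpha_y, beta_y = fit_line(x_a, y_a)            # regression of Y on X, from A
alpha_z, beta_z = fit_line(x_b, z_b)            # regression of Z on X, from B

y_imputed_in_b = [alpha_y + beta_y * x for x in x_b]
z_imputed_in_a = [alpha_z + beta_z * x for x in x_a]
print(y_imputed_in_b, z_imputed_in_a)
```

Since each imputed value depends on X only, the completed files carry no information on the relation between Y and Z beyond what X explains, which is exactly the conditional independence assumption.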

48
Adopted solutions 2
  • The inferential mechanism assumes that
  • Y and Z are independent given X
  • (there is no regression coefficient of Z on
    Y given X)

49
Adopted solutions 2
  • This method can be applied also with this
    inferential scheme: the problem is what
    hypotheses are made before the analysis phase

50
Adopted solutions 3
  • We do not hypothesize any model. A set of values
    is estimated, one for every plausible model
    given the observed data
  • Example
  • When matching two sample surveys on farms
    (Rica-Rea - FADN and SPA - FSS), the following
    contingency table for farms was requested
  • Y: presence of cattle (FSS)
  • Z: class of intermediate consumption (from FADN)
  • Using the common variables
  • X1: Utilized Agricultural Area (UAA)
  • X2: Livestock Size Unit (LSU)
  • X3: geographical characteristics

51
Example
  • We consider all the models that we can estimate
    from the observed data in the two surveys
  • In practice, the available data allow us to say
    that the estimate of the number of farms with at
    least one cow (Y = 1) in the lowest class of
    intermediate consumption (Z = 1) is between 2.9
    and 4.9
52
Inferential machine
  • The inferential machine does not use any specific
    model

It is possible to simulate data including
uncertainty on the data generation model (e.g. by
multiple imputation)
53
Quotation (Manski, 1995)
  • "The pressure to produce answers, without
    qualifications, seems particularly intense in the
    environs of Washington, D.C. A perhaps
    apocryphal, but quite believable, story
    circulates about an economist's attempt to
    describe his uncertainty about a forecast to
    President Lyndon Johnson. The economist presented
    his forecast as a likely range of values for the
    quantity under discussion. Johnson is said to
    have replied, 'Ranges are for cattle. Give me a
    number.'"
  • Manski, C. F. (1995) Identification problems in
    the Social Sciences, Harvard University Press.
  • Manski and other authors show that in a wide
    range of applied areas (econometrics, sociology,
    psychometrics) there is a problem of
    identifiability of the models of interest,
    usually caused by the presence of missing data.
    The statistical matching problem is an example of
    this.

54
Why statistical matching?
  • Applications in Istat
  • SAM
  • Joint analysis FADN / FSS
  • Joint use of Time Use / Labour force
  • Objectives
  • Estimates of parameters of not jointly observed
    variables
  • Creation of synthetic data (e.g. data set for
    microsimulation)

55
Open problems
  • Uncertainty estimate (D'Orazio et al, 2006)
  • Variability of uncertainty (Imbens and Manski,
    2004)
  • Use of samples drawn according to complex survey
    designs (Rubin, 1986; Renssen, 1998)
  • Use of nonparametric methods (Marella et al,
    2008; Conti et al, 2008)
  • Conti P.L., Marella D., Scanu M. (2008).
    Evaluation of matching noise for imputation
    techniques based on the local linear regression
    estimator. Computational Statistics and Data
    Analysis, 53, 354-365.
  • D'Orazio M., Di Zio M., Scanu M. (2006).
    Statistical Matching for Categorical Data:
    Displaying Uncertainty and Using Logical
    Constraints. Journal of Official Statistics, 22,
    137-157.
  • Imbens G.W., Manski C.F. (2004). Confidence
    intervals for partially identified parameters.
    Econometrica, 72 (6), 1845-1857.
  • Marella D., Scanu M., Conti P.L. (2008). On the
    matching noise of some nonparametric imputation
    procedures. Statistics and Probability Letters,
    78, 1593-1600.
  • Renssen R.H. (1998). Use of statistical matching
    techniques in calibration estimation. Survey
    Methodology, 24, 171-183.
  • Rubin D.B. (1986). Statistical matching using
    file concatenation with adjusted weights and
    multiple imputations. Journal of Business and
    Economic Statistics, 4, 87-94.

56
Micro integration processing
  • It can be applied every time a complete data set
    (micro level) is produced by any kind of method.
    Up to now, it has been applied after exact record
    linkage
  • Micro integration processing consists of putting
    in place all the necessary actions aimed at
    ensuring better quality of the matched results
    (quality and timeliness of the matched files). It
    includes
  • defining checks,
  • editing procedures to get better estimates,
  • imputation procedures to get better estimates.

57
Micro integration processing
  • It should be kept in mind that some sources are
    more reliable than others.
  • Some sources have a better coverage than others,
    and there may even be conflicting information
    between sources.
  • So, it is important to recognize the strong and
    weak points of all the data sources used.

58
Micro integration processing
  • Since there are differences between sources, a
    micro integration process is needed to check data
    and adjust incorrect data. It is believed that
    integrated data will provide far more reliable
    results, because they are based on an optimal
    amount of information. Also the coverage of (sub)
    populations will be better, because when data are
    missing in one source, another source can be
    used. Another advantage of integration is that
    users of statistical information will get one
    figure on each social phenomenon, instead of a
    confusing number of different figures depending
    on which source has been used.

59
Micro integration processing
  • During the micro integration of the data sources
    the following steps have to be taken (Van der
    Laan, 2000)
  • a. harmonisation of statistical units
  • b. harmonisation of reference periods
  • c. completion of populations (coverage)
  • d. harmonisation of variables, in case of
    differences in definition
  • e. harmonisation of classifications
  • f. adjustment for measurement errors, when
    corresponding variables still do not have the
    same value after harmonisation for differences in
    definitions
  • g. imputations in the case of item nonresponse
  • h. derivation of (new) variables: creation of
    variables out of different data sources
  • i. checks for overall consistency.
  • All steps are controlled by a set of integration
    rules and fully automated.

60
Example: micro integration processing
  • From Schulte Nordholt, Linder (2007) Statistical
    Journal of the IAOS 24, 163-171
  • Suppose that someone becomes unemployed at the
    end of November and gets unemployment benefits
    from the beginning of December. The jobs register
    may indicate that this person has lost the job at
    the end of the year, perhaps due to
    administrative delay or because of payments after
    job termination. The registration of benefits is
    believed to be more accurate. When confronting
    these facts the integrator could decide to
    change the date of termination of the job to the
    end of November, because it is unlikely that the
    person simultaneously had a job and benefits in
    December. Such decisions are made with the utmost
    care. As soon as there are convincing counter
    indications of other jobs register variables,
    indicating that the job was still there in
    December, the termination date will, in general,
    not be adjusted.
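
An integration rule of this kind could be automated along the following lines. This is only a sketch: the function name, the boolean flag standing for "convincing counter indications" and the dates are hypothetical.

```python
from datetime import date, timedelta


def adjust_job_end(job_end, benefit_start, counter_indication=False):
    """Move the job termination date to the day before benefits start.

    The benefits register is treated as the more reliable source. The date is
    left unchanged when there is no overlap or when other register variables
    indicate that the job really continued.
    """
    if counter_indication or job_end < benefit_start:
        return job_end
    return benefit_start - timedelta(days=1)


# Hypothetical case: job registered until 31 December, benefits from 1 December
print(adjust_job_end(date(2006, 12, 31), date(2006, 12, 1)))
print(adjust_job_end(date(2006, 12, 31), date(2006, 12, 1), counter_indication=True))
```

Writing the rule as a function makes the priority between sources explicit and repeatable, which is what allows the micro integration steps to be fully automated.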

61
Example: micro integration processing
  • Method: definition of rules for the creation of a
    usable complete data set after the linkage
    process.
  • If these approaches are not applied, the
    integrated data set can contain conflicting
    information at the micro level.
  • These approaches are still strictly based on
    knowledge of the quality of the data sets.
  • Proposal for a possible next ESSnet on
    integration: study the links between imputation
    and editing activities and

62
Other supporting slides
63
Macro integration: coherence of estimates
  • Sometimes it is useful to integrate aggregate
    data, where aggregates are computed from
    different sample surveys.
  • For instance, to include a set of tables in an
    information system
  • A problem is the coherence of information in
    different tables.
  • The adopted solution works at the estimation
    level: for instance, with calibration procedures
    (e.g. the Virtual Census in the Netherlands)
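
One simple calibration technique is raking (iterative proportional fitting), which rescales a table until its margins agree with totals taken from other sources. The slides do not say which calibration procedure is used, so this is only an illustration; the table and the target margins are made up.

```python
def rake(table, row_targets, col_targets, n_iter=50):
    """Iterative proportional fitting of a two-way table to given margins."""
    t = [row[:] for row in table]  # work on a copy
    for _ in range(n_iter):
        # scale each row to its target total
        for i, target in enumerate(row_targets):
            s = sum(t[i])
            t[i] = [v * target / s for v in t[i]]
        # scale each column to its target total
        for j, target in enumerate(col_targets):
            s = sum(row[j] for row in t)
            for row in t:
                row[j] *= target / s
    return t


# Hypothetical survey table and margins from two other (coherent) sources
survey_table = [[40.0, 10.0], [20.0, 30.0]]
calibrated = rake(survey_table, row_targets=[60.0, 40.0], col_targets=[55.0, 45.0])
for row in calibrated:
    print([round(v, 2) for v in row])
```

After calibration the table reproduces both sets of margins, so the estimates published from the different sources no longer contradict each other.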

64
Project
  • The objective of the project is to bring together
    the developments in two distinct areas
  • Probabilistic expert systems: these are graphical
    models, characterized by the presence of an easy
    updating system of the joint distribution of a
    set of variables, once one of them is updated.
    These models have been used for a class of
    estimators that includes poststratification
    estimators
  • Statistical information systems: SIS for the
    production of statistical output (Istar), with the
    objective to integrate and manage statistical
    data supplied and validated by the Istat production
    areas, in order to produce purposeful output for
    the end users

65
Objectives and open problems
  • Objectives
  • To develop a statistical information system for
    agriculture data, managing tables from FADN, FSS,
    and lists used for sampling (containing census
    and archive data)
  • To manage coherence between different tables
  • To update information on data from the most
    recent survey and to visualize what changes
    happen to the other tables
  • To allow simulations (for policy making)
  • Problems
  • Use of graphical models for complex survey data
  • To link the selection of tables to the updating
    algorithm
  • To update more than one table at the same time

66
Some practical aspects for integration: software
  • There exist different software tools for record
    linkage and statistical matching
  • Relais: http://www.istat.it/strumenti/metodi/software/analisi_dati/relais/
  • R package for statistical matching:
    http://cran.r-project.org/index.html (look for
    StatMatch)
  • Probabilistic expert systems: Hugin (it does not
    work with complex survey data)

67
Bibliography
  • Batini C., Scannapieco M. (2006) Data Quality,
    Springer Verlag, Heidelberg.
  • Scanu M. (2003) Metodi statistici per il record
    linkage, collana Metodi e Norme n. 16, Istat.
  • D'Orazio M., Di Zio M., Scanu M. (2006)
    Statistical matching: theory and practice, J.
    Wiley & Sons, Chichester.
  • Ballin M., De Francisci S., Scanu M., Tininini
    L., Vicard P. (2009) Integrated statistical
    systems: an approach to preserve coherence
    between a set of surveys based on the use of
    probabilistic expert systems, NTTS 2009,
    Bruxelles.

68
Is this conditional independence?
69
And this?
70
Statistical methods of integration
  • Sometimes a shorter track is used.
  • Note! The automatic methods correspond to a
    specific data generating model

71
Statistical methods of integration
72
Statistical methods of integration
  • The last approach is very appealing
  • 1) Estimate a data generating model from the two
    data samples at hand
  • 2) Use this estimate for the estimation of
    aggregate data (e.g. contingency tables on non
    jointly observed variables)
  • 3) If necessary, develop a complete data set by
    simulation from the estimated model: the
    integrated data generating mechanism is the
    nearest to the data generating model, according
    to the optimality properties of the model
    estimator
  • Attention! Step 1 includes hypotheses that
    cannot be tested on the available data (this is
    true for record linkage and, more dramatically,
    for statistical matching)