Title: Sampling strategies for the estimation of long form census variables
1 Sampling strategies for the estimation of long
form census variables
- Francesco Borrelli Istat
- Giancarlo Carbonetti Istat
- Luana De Felici Istat
- Claudia De Vitiis Istat
- Francesca Inglese Istat
- Fabrizio Solari Istat
- SCORUS Conference Darmstadt, October 17-19 2007
2 Introduction /1
- The Italian Institute of Statistics (Istat) has
recently started a study project to evaluate
alternative census methods in order to improve
the efficiency of the survey operations and
reduce the statistical burden.
- This work concerns only the 2011 Population
Census in Italy.
3 Introduction /2
- One of the innovations is the possibility of
dividing the overall set of census variables into
two subsets: the first containing the demographic
census variables and the second the remaining
variables (educational level, employment status,
commuting).
- Traditional census methods will then be used
only for the first set of variables, while Istat
will carry out a sample survey for the second set
of variables in the municipalities with more than
10,000 inhabitants.
- In the smallest municipalities, census forms
covering all the variables (long form) will be
submitted in the traditional way.
- In the other municipalities, reduced census forms
(short form) will be submitted to all households,
and the long form will be given only to a sample
of them.
4 Aim of the study
- The study takes into account several sampling
strategies for the submission of the long form.
- A simulation study has been carried out to
evaluate the efficiency of the sampling
strategies.
- The proposed sampling designs have to be coherent
with the census framework and have to produce
statistical information for small sub-regions
with a good level of accuracy.
5 Alternative strategies
- The sampling strategies under analysis are:
- a) Sampling design of households: the sample is
drawn from Administrative Registers.
- b) Area frame sampling: the sample consists of
enumeration areas (clusters of households)
selected from the Digital Georeferenced Database.
6 Some aspects of the sampling design for the long
form
- Domains: sub-municipality areas (sub-areas)
- Target variables: cross-classification of
educational level, employment status and
commuting, with demographic variables
- Sampling units: households or enumeration areas,
depending on the strategy
- Sampling rate: the long form is submitted to
about 1/3 of the households in each municipality
- Estimator: final weights are computed from the
sampling weights by means of a calibration
process, so that the weighted sample is more
representative
7 Sampling design of households
- Simple design, without replacement and with equal
probability of selection for each household in
the register (CCSFAM).
- Stratified design, without replacement and with
equal probability of selection for each household
in each stratum:
  - stratification by household size (STRNCOMP)
  - stratification by age of head of household
  (STRETACAP).
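The two household designs above can be sketched as follows. This is a minimal illustration, not the study's SAS code: the register layout, the field names `id` and `size`, the helper names, and the use of the same 1/3 rate in every stratum are all assumptions made for the example.

```python
import random

def simple_sample(households, rate, rng):
    """Simple random sample without replacement (CCSFAM-like)."""
    n = round(len(households) * rate)
    return rng.sample(households, n)

def stratified_sample(households, stratum_of, rate, rng):
    """Simple random sample without replacement inside each stratum.

    Assumption: the same sampling rate is applied in every stratum.
    """
    strata = {}
    for h in households:
        strata.setdefault(stratum_of(h), []).append(h)
    sample = []
    for key in sorted(strata):
        members = strata[key]
        sample.extend(rng.sample(members, round(len(members) * rate)))
    return sample

rng = random.Random(2011)
# Hypothetical register: 300 households with sizes 1..5 (60 of each size).
register = [{"id": i, "size": 1 + i % 5} for i in range(300)]

srs = simple_sample(register, 1 / 3, rng)
# Stratification by household size, in the spirit of STRNCOMP.
strat = stratified_sample(register, lambda h: h["size"], 1 / 3, rng)
```

Stratifying by age of head of household (STRETACAP) would only change the `stratum_of` function passed in.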
8 Area frame sampling design
- Simple design, without replacement and with
equal probability of selection for each
enumeration area (CCSSEZ).
- Stratified design, without replacement and with
equal probability of selection for each
enumeration area in each stratum, from the area
frame sorted by population size:
  - stratification into three strata having
  approximately the same population size (but
  unequal number of enumeration areas) (STRSPOP)
  - stratification into three strata having
  approximately equal number of enumeration areas
  (but substantially unequal population size)
  (STRSSEZ).
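The difference between the two stratification criteria can be seen on a toy area frame. The sketch below is only an illustration under assumed data: the function names, the cut rule based on cumulative population, and the population figures are hypothetical, not taken from the study.

```python
def strata_equal_count(populations, k=3):
    """Split areas, sorted by population, into k strata with equal
    numbers of areas (STRSSEZ-like)."""
    order = sorted(range(len(populations)), key=lambda i: populations[i])
    n = len(order)
    strata = [[] for _ in range(k)]
    for rank, i in enumerate(order):
        strata[rank * k // n].append(i)
    return strata

def strata_equal_population(populations, k=3):
    """Split areas, sorted by population, into k strata with roughly
    equal total population (STRSPOP-like)."""
    order = sorted(range(len(populations)), key=lambda i: populations[i])
    total = sum(populations)
    strata = [[] for _ in range(k)]
    cumulative = 0
    for i in order:
        # The stratum is decided by the population accumulated so far.
        s = min(k - 1, cumulative * k // total)
        strata[s].append(i)
        cumulative += populations[i]
    return strata

# Hypothetical populations of nine enumeration areas.
pops = [20, 20, 30, 30, 40, 60, 90, 110, 200]

by_count = strata_equal_count(pops)
by_pop = strata_equal_population(pops)
count_totals = [sum(pops[i] for i in s) for s in by_count]
pop_totals = [sum(pops[i] for i in s) for s in by_pop]
```

With these figures the equal-count strata hold three areas each but very different populations, while the equal-population strata hold six, two and one areas.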
9 Main aspects of the simulation study
- Source of data: 2001 population census data
- Variables under study: cross-classification of
educational level, employment status and
commuting, with sex (90 dichotomous variables in
total)
- Municipalities: Milano, Perugia and Aosta, chosen
to represent large, medium and small population
size municipalities
- Calibration constraints: cross-classification of
sex and age, and cross-classification of sex and
civil status (40 variables in total)
- Software: Genesees v3.0, developed at Istat, has
been used to compute the final weights in the
calibration process.
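To make the idea of calibration concrete, the sketch below uses raking (iterative proportional fitting), one common calibration technique. It is not a description of what Genesees does internally; the function name, the two toy margins (sex and an age class) and all totals are assumptions chosen for the example.

```python
def rake(weights, groups_a, groups_b, totals_a, totals_b, iterations=50):
    """Adjust weights so that weighted counts match known totals on two margins."""
    w = list(weights)
    for _ in range(iterations):
        for groups, totals in ((groups_a, totals_a), (groups_b, totals_b)):
            # Current weighted count in each category of this margin.
            current = {}
            for wi, g in zip(w, groups):
                current[g] = current.get(g, 0.0) + wi
            # Rescale each weight by known total / current weighted total.
            w = [wi * totals[g] / current[g] for wi, g in zip(w, groups)]
    return w

# Toy sample: six persons, design weight 3 (sampling rate of about 1/3).
design_weights = [3.0] * 6
sex = ["F", "F", "F", "M", "M", "M"]
age = ["young", "old", "old", "young", "young", "old"]
# Hypothetical known population totals for the two margins.
known_sex = {"F": 10.0, "M": 8.0}
known_age = {"young": 9.0, "old": 9.0}

final = rake(design_weights, sex, age, known_sex, known_age)
```

After raking, the weighted counts by sex and by age class reproduce the known totals, which is the sense in which calibrated weights make the sample more representative.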
10 Simulation algorithm
- The computational algorithm, implemented in SAS
code, consists of the following steps (for each
municipality and for each alternative sampling
design):
- 1) selection of a sample (of households or
enumeration areas)
- 2) computation of final weights
- 3) calculation of the estimates of the relative
frequencies p for each dichotomous target
variable
- 4) iteration of steps 1), 2) and 3) for a fixed
number of replications (1,000 sampling
replications)
- 5) computation, for each dichotomous target
variable, of the mean and standard error of the
estimates from the simulated sampling
distribution.
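The five steps can be sketched in Python for a single dichotomous variable. This is a simplified stand-in for the SAS implementation, under stated assumptions: units are sampled by simple random sampling without replacement, step 2 uses the design weight N/n with no calibration, and the population of 900 units with true p = 0.2 is invented for the example.

```python
import random
import statistics

def simulate(population, rate, replications, rng):
    """Return mean and standard error of the estimated proportion."""
    n = round(len(population) * rate)
    estimates = []
    for _ in range(replications):                      # step 4: replicate
        sample = rng.sample(population, n)             # step 1: draw a sample
        weight = len(population) / n                   # step 2: design weight only
        estimated_total = weight * sum(sample)
        estimates.append(estimated_total / len(population))  # step 3: estimate p
    mean = statistics.fmean(estimates)                 # step 5: mean ...
    se = statistics.stdev(estimates)                   # ... and standard error
    return mean, se

rng = random.Random(2001)
# Hypothetical population: 900 units, 180 of which have the characteristic.
population = [1] * 180 + [0] * 720

mean_p, se_p = simulate(population, 1 / 3, 1000, rng)
cv = se_p / mean_p  # empirical coefficient of variation of the estimate
```

In the study the same loop would run for each municipality, each sampling design and each of the 90 target variables, with calibrated final weights in step 2.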
11 Evaluation criterion: the coefficient of variation
- In order to compare the sampling strategies we
have used the coefficient of variation (cv) as
evaluation criterion; it measures the accuracy of
the sampling estimates.
- The distribution of the empirical cvs over all
90 dichotomous target variables has been
determined. After dividing the dichotomous
variables into classes according to the value of
p, we have studied the distribution of the cvs
of the variables in the same class.
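Assuming the usual empirical definition, the cv of an estimated relative frequency over the R simulated samples can be written as below. The symbols R, \hat{p}_r and \bar{p} are notation introduced here for illustration, not taken from the slides.

```latex
\bar{p} = \frac{1}{R}\sum_{r=1}^{R}\hat{p}_r, \qquad
\widehat{se}(\hat{p}) = \sqrt{\frac{1}{R-1}\sum_{r=1}^{R}\left(\hat{p}_r-\bar{p}\right)^2}, \qquad
cv(\hat{p}) = \frac{\widehat{se}(\hat{p})}{\bar{p}}
```

Here \hat{p}_r is the estimate obtained in replication r, and R would be the number of sampling replications (1,000 in this study).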
12 Main results /1
- The median cv decreases for increasing values
of p
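A hedged illustration of why this pattern is expected: under simple random sampling without replacement of n units out of N (sampling fraction f = n/N), ignoring the factor N/(N-1) and the effect of calibration, the cv of an estimated proportion is approximately

```latex
cv(\hat{p}) \approx \sqrt{\frac{(1-f)\,(1-p)}{n\,p}}
```

which decreases as p increases for fixed n and f. This is a textbook approximation, not a formula taken from the study.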
13 Scatter plot of cv and p for each sub-area.
CCSSEZ design. City of Perugia.
14 Main results /1
- The median cv decreases for increasing values
of p
- Due to the cluster effect, sampling designs of
households result in smaller sampling errors
than sampling designs of enumeration areas
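The cluster effect can be summarised by the standard design effect approximation for clusters of roughly equal size. The symbols \bar{m} (average number of units per cluster) and \rho (intra-cluster correlation of the target variable) are notation assumed here, and the study itself does not report this formula.

```latex
deff \approx 1 + (\bar{m} - 1)\,\rho, \qquad
cv_{\text{cluster}}(\hat{p}) \approx \sqrt{deff}\; cv_{\text{srs}}(\hat{p})
```

Since an enumeration area contains many households, even a small positive \rho gives a design effect above 1, which is consistent with the larger cvs observed for the area frame designs.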
15 Distribution of median cv for classes of p for
all the alternative sampling strategies
(estimation at sub-area level). City of Milano.
16 Distribution of median cv for classes of p for
all the alternative sampling strategies
(estimation at municipality level). City of
Perugia.
17 Main results /1
- The median cv decreases for increasing values
of p
- Due to the cluster effect, sampling designs of
households result in smaller sampling errors
than sampling designs of enumeration areas
- The stratification of the households does not
seem to produce a significant reduction of
sampling errors, at either municipality or
sub-municipality level
18 Distribution of cv (min, median and max) for
classes of p for all the household sampling
designs (estimation at sub-area level). City of
Milano.
19 Distribution of cv (min, median and max) for
classes of p for all the household sampling
designs (estimation at municipality level). City
of Milano.
20 Main results /1
- The median cv decreases for increasing values
of p
- Due to the cluster effect, sampling designs of
households result in smaller sampling errors
than sampling designs of enumeration areas
- The stratification of the households does not
seem to produce a significant reduction of
sampling errors, at either municipality or
sub-municipality level
- As far as the stratification of the enumeration
areas is concerned, there are no clear
differences among the sampling strategies for
sub-area level estimation. For municipality
level estimation, stratification yields smaller
values of the cv only when p > 5
21 Distribution of cv (min, median and max) for
classes of p for all the area frame sampling
designs (estimation at municipality level). City
of Aosta.
22 Main results /2
- Considering the same sampling design for all the
municipalities, we can observe that:
  - the median cvs at sub-area level do not differ
  substantially from one municipality to another
23 Comparison of the distribution of median cv for
classes of p for estimation at sub-area level for
the cities of Milano, Perugia and Aosta. Sampling
designs CCSFAM and CCSSEZ (one panel each).
24 Main results /2
- Considering the same sampling design for all the
municipalities, we can observe that:
  - the median cvs at sub-area level do not differ
  substantially from one municipality to another
  - the median cvs at municipality level differ
  markedly, with, as expected, the smallest values
  in the largest municipality
25 Distribution of cv (min, median and max) for
classes of p for estimation at municipality
level. Sampling design CCSFAM. Milano, Perugia
and Aosta.
26 Comparison of the distribution of median cv for
classes of p for estimation at municipality level
for the cities of Milano, Perugia and Aosta.
Sampling designs CCSFAM and CCSSEZ (one panel
each).
27 Main results /2
- Considering the same sampling design for all the
municipalities, we can observe that:
  - the median cvs at sub-area level do not differ
  substantially from one municipality to another
  - the median cvs at municipality level differ
  markedly, with the smallest values in the
  largest municipality
  - the municipality level estimates display
  smaller cv values than the sub-area level
  estimates
28 Median cv for three classes of p for estimation
at sub-area level and municipality level for each
sampling design (Milano, Perugia and Aosta).
29 Main results /2
- Considering the same sampling design for all the
municipalities, we can observe that:
  - the median cvs at sub-area level do not differ
  substantially from one municipality to another
  - the median cvs at municipality level differ
  markedly, with the smallest values in the
  largest municipality
  - the municipality level estimates display
  smaller cv values than the sub-area level
  estimates
  - the larger the municipality population size,
  the bigger the reduction of cv values when
  moving from sub-area level estimation to
  municipality level estimation
30 Comparison of median cv for two classes of p
between estimation at sub-area level and
municipality level for each sampling design
(Milano, Perugia and Aosta).
31 Main results /2
- Considering the same sampling design for all the
municipalities, we can observe that:
  - the median cvs at sub-area level do not differ
  substantially from one municipality to another
  - the median cvs at municipality level differ
  markedly, with the smallest values in the
  largest municipality
  - the municipality level estimates display
  smaller cv values than the sub-area level
  estimates
  - the larger the municipality population size,
  the bigger the reduction of cv values when
  moving from sub-area level estimation to
  municipality level estimation
  - for sub-area level estimation, cv values are
  small for large sub-areas (> 10,000
  inhabitants) or for sub-areas with a large
  number of enumeration areas (> 50).
32 Distribution of median cv for four classes of p
and three classes of sub-areas (according to
population size) for all the sampling designs.
City of Milano.
33 Distribution of median cv for four classes of p
and three classes of sub-areas (according to
number of enumeration areas) for all the sampling
designs. City of Milano.
34 Comparison of the distributions of median cv for
three classes of sub-areas (according to
population size). CCSFAM sampling design. City of
Milano.
35 Comparison of the distributions of median cv for
three classes of sub-areas (according to number
of enumeration areas). CCSFAM sampling design.
City of Milano.
36 Conclusions
- The results seem to encourage the use of sampling
techniques for the adoption of the long form
strategy in the population census.
- Sampling of households produces more efficient
estimates than sampling of enumeration areas.
Area frame sampling, however, represents a
practicable solution for producing estimates of
reliable quality.
- The statistical information about households
available in administrative registers is too
limited to allow efficient stratification.
- For area frame sampling, the best stratification
criterion seems to be dividing the enumeration
areas into three strata with approximately equal
population size.
- More accurate estimates are observed for the
largest sub-areas (in terms of population size or
number of enumeration areas).
37 Final remark
The comparative analysis shows that household
sampling and area frame sampling could be applied
together in a complex sampling strategy using the
long form approach in the 2011 population census
in Italy. In particular, where reliable
administrative registers are not available, area
frame sampling seems to be a good alternative
solution.
For any questions please contact Giancarlo
Carbonetti at the following e-mail address:
carbonet_at_istat.it