Title: Two-Stage Cluster Sampling from Equal Clusters
1- Two-Stage Cluster Sampling from Equal Clusters
- Basic ideas
- Two-stage sampling is a natural extension of
one-stage cluster sampling when a hierarchical
frame is used.
- Use of two-stage sampling will improve design
efficiency when the within-cluster (PSU)
variance is small. There is no reason to observe all
elements in a cluster when the elements in the
cluster are alike.
2- In the first stage, m clusters are randomly
sampled from M clusters, and in the second stage
$\bar n$ units are randomly sampled from the $\bar N$ units
in each sampled cluster.
- Simple two-stage cluster sampling serves as a
basic model for multi-stage sampling designs.
3- Number of possible samples
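- The total number of possible such samples will
be $\binom{M}{m}\binom{\bar N}{\bar n}^{m}$, where M and m are the
numbers of clusters in the population and in the sample, and
$\bar N$ and $\bar n$ are the numbers of units per cluster in the
population and in the subsample.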
4- For example, there are 27 possible samples in a
two-stage sample of m = 2 and $\bar n$ = 2 from a
population of M = 3 and $\bar N$ = 3:
$\binom{3}{2}\binom{3}{2}^{2} = 3 \times 9 = 27$
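The count of 27 can be checked both by the product formula and by brute-force enumeration. The following is a minimal sketch; the function names are illustrative and not part of the handout.

```python
from itertools import combinations, product
from math import comb


def count_two_stage_samples(M, m, N_bar, n_bar):
    """Number of distinct two-stage samples: C(M, m) * C(N_bar, n_bar)**m."""
    return comb(M, m) * comb(N_bar, n_bar) ** m


def enumerate_two_stage_samples(M, m, N_bar, n_bar):
    """Yield every sample as a tuple of (cluster index, unit indices) pairs."""
    unit_choices = list(combinations(range(N_bar), n_bar))
    for clusters in combinations(range(M), m):
        # Each sampled cluster independently receives one subsample of units.
        for units in product(unit_choices, repeat=m):
            yield tuple(zip(clusters, units))


n_formula = count_two_stage_samples(3, 2, 3, 2)
n_enumerated = sum(1 for _ in enumerate_two_stage_samples(3, 2, 3, 2))
print(n_formula, n_enumerated)
```

Both numbers agree for the M = 3, m = 2 design; for realistic designs only the formula is practical, since the count grows very quickly.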
5- Unbiased estimators
- Under the simple two-stage sampling design
described above, unbiased estimates can be
obtained by applying the theory of simple random
sampling in two stages.
- $\bar x = \sum_i \sum_j x_{ij}/(m \bar n)$ (mean per unit in the sample) is an
unbiased estimate of $\bar X$ (mean per
unit in the population)
- $\bar N \bar x$ is an unbiased estimate of
$\bar N \bar X$ (mean per cluster in the population)
- $M \bar N \bar x$ is an unbiased estimate of
X (population total)
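The three estimators differ only by constant multipliers of the sample mean per unit. A minimal sketch, assuming equal subsample sizes; the function name and the sample values are hypothetical:

```python
def two_stage_estimates(sample, M, N_bar):
    """Point estimates from a simple two-stage sample of equal clusters.

    sample: list of m lists, each holding the n_bar values observed
            in one sampled cluster.
    Returns (mean per unit, mean per cluster, population total).
    """
    m = len(sample)
    n_bar = len(sample[0])
    if any(len(cluster) != n_bar for cluster in sample):
        raise ValueError("equal subsample sizes are assumed")
    mean_per_unit = sum(sum(cluster) for cluster in sample) / (m * n_bar)
    mean_per_cluster = N_bar * mean_per_unit
    total = M * N_bar * mean_per_unit
    return mean_per_unit, mean_per_cluster, total


# Hypothetical sample: 2 clusters with 2 units each, from M = 3 clusters of 3 units
sample = [[1, 7], [5, 8]]
mean_unit, mean_cluster, total = two_stage_estimates(sample, M=3, N_bar=3)
print(mean_unit, mean_cluster, total)
```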
6- Sampling variance
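- True sampling variance of $\bar x$ (mean per unit)
can be obtained by
$V(\bar x) = (1 - f_1)\dfrac{S_b^2}{m} + (1 - f_2)\dfrac{S_w^2}{m \bar n}$, with $f_1 = m/M$ and $f_2 = \bar n/\bar N$,
- where $S_b^2 = \sum_i (\bar X_i - \bar X)^2/(M - 1)$ is variance among clusters (first
stage units) and $S_w^2 = \sum_i \sum_j (X_{ij} - \bar X_i)^2/(M(\bar N - 1))$ is variance among
second-stage units within clusters.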
7- Sampling variance for $\bar N \bar x$ (mean per cluster)
can be obtained by multiplying $V(\bar x)$ by
$\bar N^2$.
- Sampling variance for $M \bar N \bar x$ (population total)
can be obtained by multiplying $V(\bar x)$ by
$M^2 \bar N^2$.
8- Estimator of sampling variance
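- The unbiased estimator of $V(\bar x)$ can be
obtained by substituting $S_b^2$ (variance among clusters) and
$S_w^2$ (variance within clusters) with the following quantities:
$\hat S_b^2 = s_b^2 - (1 - f_2)\, s_w^2/\bar n$ and $\hat S_w^2 = s_w^2$, with $f_2 = \bar n/\bar N$
9- Where
$s_b^2 = \sum_i (\bar x_i - \bar x)^2/(m - 1)$ and $s_w^2 = \sum_i \sum_j (x_{ij} - \bar x_i)^2/(m(\bar n - 1))$
- Note that $S_b^2$ cannot be directly estimated
from $s_b^2$ alone, whereas $S_w^2$ can be
estimated directly from $s_w^2$.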
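10- Then we get the estimator of sampling variance
$v(\bar x) = (1 - f_1)\dfrac{s_b^2}{m} + f_1 (1 - f_2)\dfrac{s_w^2}{m \bar n}$, with $f_1 = m/M$ and $f_2 = \bar n/\bar N$
- Variance estimators for the mean per cluster and the population total can be
obtained by multiplying the above estimator by $\bar N^2$ and $M^2 \bar N^2$
respectively.
- Note that the contribution from the second
term will be very small as
long as $f_1$ is a small fraction.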
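The estimator can be computed term by term so that the size of the second-stage contribution is visible. The sketch below assumes the two-term form $(1 - f_1)s_b^2/m + f_1(1 - f_2)s_w^2/(m\bar n)$, with $s_b^2$ the variance of the sampled cluster means and $s_w^2$ the pooled within-cluster variance; the function name and sample values are hypothetical.

```python
def two_stage_variance_terms(sample, M, N_bar):
    """Return (first term, second term) of the estimated variance of the
    sample mean per unit, for a simple two-stage sample of equal clusters.

    sample: list of m lists, each with the n_bar values from one sampled cluster.
    """
    m = len(sample)
    n_bar = len(sample[0])
    f1 = m / M            # first-stage sampling fraction
    f2 = n_bar / N_bar    # second-stage sampling fraction

    cluster_means = [sum(c) / n_bar for c in sample]
    x_bar = sum(cluster_means) / m

    # Variance among sampled cluster means
    s_b2 = sum((xm - x_bar) ** 2 for xm in cluster_means) / (m - 1)
    # Pooled variance among units within sampled clusters
    s_w2 = sum((x - xm) ** 2
               for c, xm in zip(sample, cluster_means)
               for x in c) / (m * (n_bar - 1))

    first = (1 - f1) * s_b2 / m
    second = f1 * (1 - f2) * s_w2 / (m * n_bar)
    return first, second


# Hypothetical sample: 2 clusters with 2 units each, from M = 3 clusters of 3 units
first, second = two_stage_variance_terms([[1, 7], [5, 8]], M=3, N_bar=3)
print(first, second, first + second)
```

In this toy design the first-stage fraction is 2/3, so the second term is not negligible; with a small first-stage fraction it shrinks accordingly.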
11- The above variance formula for two-stage design
can easily be extended to a simple three-stage
design by adding a third term, but the
contribution from the third term will be even
smaller than that of the second term.
- The formula given in Box 10.2 (page 281) is the
ultimate cluster approximation, which is based
on the first term of the above estimator, with
the substitution of
12- Example to verify the above formulas
- Consider a two-stage sample of m = 2 and $\bar n$ = 2
taken from a population of M = 3 and $\bar N$ = 3.
Suppose the population consists of 3 clusters
with the following values:
- Cluster 1: values 1, 6, 7; total 14; mean 4.67
- Cluster 2: values 2, 5, 8; total 15; mean 5.00
- Cluster 3: values 3, 4, 9; total 16; mean 5.33
13- There are 27 possible samples
14- Sample estimates for these possible samples
are shown in the attachment along with various
expected values calculated from the possible
samples.
- This example demonstrates how the above
formulas work and how the ultimate cluster
approximation given in the text (Box 10.2)
compares with the exact formula.
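The verification can also be carried out by enumerating all 27 samples directly. This is a sketch under stated assumptions: it takes the true variance to be $(1 - f_1)S_b^2/m + (1 - f_2)S_w^2/(m\bar n)$ and the estimator to be $(1 - f_1)s_b^2/m + f_1(1 - f_2)s_w^2/(m\bar n)$, with the variances defined on cluster means per unit. It does not reproduce the attachment and does not cover the Box 10.2 approximation.

```python
from itertools import combinations, product

population = [[1, 6, 7], [2, 5, 8], [3, 4, 9]]   # the three clusters above
M, N_bar, m, n_bar = 3, 3, 2, 2
f1, f2 = m / M, n_bar / N_bar


def mean(values):
    values = list(values)
    return sum(values) / len(values)


# Population quantities
cluster_means = [mean(c) for c in population]
X_bar = mean(cluster_means)
S_b2 = sum((cm - X_bar) ** 2 for cm in cluster_means) / (M - 1)
S_w2 = sum((x - cm) ** 2
           for c, cm in zip(population, cluster_means)
           for x in c) / (M * (N_bar - 1))
V_formula = (1 - f1) * S_b2 / m + (1 - f2) * S_w2 / (m * n_bar)

# Enumerate every possible two-stage sample
estimates, variance_estimates = [], []
for clusters in combinations(range(M), m):
    subsamples = [list(combinations(population[i], n_bar)) for i in clusters]
    for sample in product(*subsamples):
        means = [mean(c) for c in sample]
        x_bar = mean(means)
        s_b2 = sum((xm - x_bar) ** 2 for xm in means) / (m - 1)
        s_w2 = sum((x - xm) ** 2
                   for c, xm in zip(sample, means)
                   for x in c) / (m * (n_bar - 1))
        estimates.append(x_bar)
        variance_estimates.append(
            (1 - f1) * s_b2 / m + f1 * (1 - f2) * s_w2 / (m * n_bar))

# Exact variance over the sampling distribution (all samples equally likely)
V_exact = mean((e - X_bar) ** 2 for e in estimates)

print(len(estimates), mean(estimates))
print(V_exact, V_formula, mean(variance_estimates))
```

The average of the 27 sample means equals the population mean of 5, and the enumerated variance, the formula value, and the average of the 27 variance estimates all coincide (about 0.8426).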
15- Example for applying the estimators
- A set of 20,000 medical records is stored in
400 file drawers, each containing 50 records.
In drawing a two-stage sample, 5 records are
randomly selected from each of 80 randomly
selected drawers (M = 400, m = 80, $\bar N$ = 50, and
$\bar n$ = 5). For one variable X, we obtained 15.2,
9050 and 805.
- Using the two-term estimator,
16- Using the ultimate cluster approximation in
the text,
- Using the first term only,
- Using the first term, ignoring the finite population correction (fpc),
17- Sampling variance can be estimated quite
satisfactorily, ignoring the second stage units.
- The ultimate cluster approximation appears to
be a good compromise.
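One way to see why the second-stage units matter so little in the medical records design is to work out the weights on the two variance components from the sampling fractions alone. This is a sketch that assumes the two-term estimator has the form $v(\bar x) = (1 - f_1)s_b^2/m + f_1(1 - f_2)s_w^2/(m\bar n)$; the symbols $s_b^2$ and $s_w^2$ are assumed names for the sample between-cluster and within-cluster variances.

```latex
\begin{align*}
f_1 &= \frac{m}{M} = \frac{80}{400} = 0.2, &
f_2 &= \frac{\bar n}{\bar N} = \frac{5}{50} = 0.1, \\
\frac{1 - f_1}{m} &= \frac{0.8}{80} = 0.01, &
\frac{f_1 (1 - f_2)}{m \bar n} &= \frac{0.2 \times 0.9}{80 \times 5} = 0.00045 .
\end{align*}
```

Under this form the within-cluster component is weighted about 22 times less than the between-cluster component, and dropping the fpc changes the first weight from 0.01 to 1/80 = 0.0125.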
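18- The case of proportion in two-stage sampling
- The unbiased estimate of population proportion
is $p = \sum_i p_i / m$, where $p_i = a_i/\bar n$ is the sample
proportion in cluster i and $a_i$ is the number of sampled units
in cluster i that have the attribute
- Estimator of sampling variance
19- Where and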
20- Example for the case of proportion
- A large store handles about 20,000 accounts receivable
per month. A 2% sample ($\bar n$ = 400) was verified
every other month for the last 4 years (M = 48 and
m = 24). The number of accounts found to be in
error per month was as follows: 0, 0,
1, 1, 2, 4, 4, 5, 5, 5, 5, 6, 6, 6,
7, 7, 8, 9, 9, 10, 10, 13, 14, 17 (sum = 154)
- p = 154/(24 × 400) = 0.016, or 1.6%. Applying the above formula,
21- 95% confidence interval: (1.21%, 2.00%)
- 95% confidence
interval: (1.14%, 2.06%), using the textbook
formula (see STATA output)
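The point estimate and the wider of the two intervals can be reproduced from the monthly counts. This sketch rests on an assumption: that the textbook interval treats the 24 monthly proportions as ultimate clusters, ignores the fpc, and uses a t quantile with m - 1 = 23 degrees of freedom. The quantile 2.069 is hard-coded from a standard t table rather than computed.

```python
from math import sqrt

# Accounts found in error in each of the 24 sampled months
errors = [0, 0, 1, 1, 2, 4, 4, 5, 5, 5, 5, 6, 6, 6,
          7, 7, 8, 9, 9, 10, 10, 13, 14, 17]
n_bar = 400                      # accounts verified per sampled month
m = len(errors)                  # sampled months

p_i = [a / n_bar for a in errors]
p = sum(p_i) / m                 # overall proportion in error

# Between-month variance of the monthly proportions
s_b2 = sum((pi - p) ** 2 for pi in p_i) / (m - 1)

# Assumed ultimate-cluster standard error: first term only, fpc ignored
se = sqrt(s_b2 / m)

T_975_23DF = 2.069               # assumption: tabulated t quantile, 23 df
lower, upper = p - T_975_23DF * se, p + T_975_23DF * se

print(round(100 * p, 2), round(100 * lower, 2), round(100 * upper, 2))
```

Under these assumptions the output matches the 1.6% estimate and the (1.14%, 2.06%) interval quoted above.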
22- Optimal subsample size
- One of the key questions in designing a
two-stage cluster sample is how to determine the
sample size in the second stage.
- Optimal choice of $\bar n$ depends on within-cluster
variance and the relative costs of
survey at both stages.
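23- Considering a cost function of $C = C_1 m + C_2 m \bar n$
($C_1$ = unit cost per cluster, $C_2$ = unit cost per element)
and sampling variance of $V(\bar x) = (1 - f_1) S_b^2/m + (1 - f_2) S_w^2/(m \bar n)$,
variance would be minimized for fixed
C (C would be minimized for fixed V) when the
following condition is satisfied (by using the
Cauchy-Schwarz inequality):
$\bar n_{opt} = \sqrt{\dfrac{C_1}{C_2} \cdot \dfrac{S_w^2}{S_b^2 - S_w^2/\bar N}}$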
24- This can be calculated from sample data by
substituting the population variance components with
their estimates calculated from sample data.
- This suggests choosing a larger $\bar n$ when
the intraclass correlation is small (large
within-cluster variability) and when the unit
cost for a cluster is large relative to the unit cost
for an element.
25- The intraclass correlation can be calculated
by formula (10.10) on page 298. The same
formula can be used for sample data by
substituting the population quantities with
their sample estimates.
26- Example for optimal subsample size
consideration
- Let us calculate the optimal $\bar n$
for the above example of store accounts, assuming
$C_1 = 100$ and $C_2 = 10$.
27- This suggests that a 0.225% (45) sample will be
sufficient, instead of the 2% sample (400).
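The figure of about 45 accounts per month can be reproduced from the monthly error counts. This is a sketch, not necessarily the handout's own calculation: it assumes the optimum is $\bar n_{opt} = \sqrt{(C_1/C_2)\, S_w^2/(S_b^2 - S_w^2/\bar N)}$ and that the variance components are estimated by $s_w^2$ and $s_b^2 - (1 - f_2)s_w^2/\bar n$, which makes the denominator simplify to $s_b^2 - s_w^2/\bar n$.

```python
from math import ceil, sqrt

# Accounts found in error in each of the 24 sampled months
errors = [0, 0, 1, 1, 2, 4, 4, 5, 5, 5, 5, 6, 6, 6,
          7, 7, 8, 9, 9, 10, 10, 13, 14, 17]
n_bar = 400          # accounts verified per sampled month
C1, C2 = 100, 10     # unit cost per month (cluster) and per account (element)
m = len(errors)

p_i = [a / n_bar for a in errors]
p = sum(p_i) / m

# Variance among monthly proportions
s_b2 = sum((pi - p) ** 2 for pi in p_i) / (m - 1)
# Pooled within-month variance of the 0/1 error indicator
s_w2 = sum(n_bar * pi * (1 - pi) / (n_bar - 1) for pi in p_i) / m

# Estimated denominator S_b^2 - S_w^2 / N_bar (see the assumption above)
denominator = s_b2 - s_w2 / n_bar

n_opt = sqrt((C1 / C2) * s_w2 / denominator)
print(round(n_opt, 2), ceil(n_opt))
```

Rounded up, this gives 45 accounts per month, which is 0.225% of 20,000.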
28- Notes on definition of between-cluster variation
- The following notes will help you relate the
textbook definition to the convention you are
familiar with and understand the different
definitions used in other books.
- In analysis of variance, the mean square
between groups is defined as $\bar N \sum_i (\bar X_i - \bar X)^2/(M - 1)$
(for equal-size groups or clusters)
29- The definition of between cluster variation
used in the text is
- Between group sum of squares can be obtained by
or
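30- Cochran defines between cluster variation as
$\sum_i (\bar X_i - \bar X)^2/(M - 1)$, the variance among cluster means per unit (written here in this handout's notation)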
- If we use the definition used in the text the
formula on the first page (last column) of
handout should be changed to