Variance Estimation in the Presence of Nearest Neighbor Imputed Data

Description:

Presented at the Nearest Neighbors Workshop, Minneapolis, MN, Aug. 28, 2006.

1
Variance Estimation in the Presence of Nearest
Neighbor Imputed Data

Temesgen H., B.N.I. Eskelson, and T.M. Barrett
Dept. of Forest Resources, OSU, Corvallis, OR
PNW Research Station, Anchorage, AK
Presented at Nearest Neighbors Workshop,
Minneapolis, MN
Aug. 28, 2006

2
Overview

I) Background
- Nearest Neighbor (NN) Imputation
II) Variance Estimation Methods for imputed data
III) Numerical Example
- Objectives/Methods/ Results /Summary
IV) Challenges and Opportunities
V) References

3
I. Background

For natural resource planning, forested land is
divided into polygons (stands) with the same age,
species composition, etc.
A complete census of aerial variables (X) is
obtained using photos and remote sensing.
Ground-based inventory data (e.g., tree lists)
are available for some stands.

4
Nearest neighbor imputation

Non-sampled polygons lack ground data → missing
by design
NN methods (e.g., MSN, KNN, GNN) are used
to populate forested land with detailed
ground-based information (Y).
For landscape-level analysis, observed and
imputed values are used to compute point
estimates (e.g., means and totals) and confidence
intervals (reliability).
Imputation adds variance on top of the
sampling variance.
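
The imputation step can be sketched as follows. This is a minimal single-nearest-neighbor example, assuming Euclidean distance on standardized aerial variables; MSN, KNN, and GNN each define distance and donor selection differently, so this is not any of those methods specifically. The function name `nn_impute` and the toy data are hypothetical.

```python
import numpy as np

def nn_impute(x_ref, y_ref, x_tgt):
    """Impute Y for target polygons from the nearest reference polygon in X space.

    x_ref: (n_ref, p) aerial variables of sampled (reference) polygons
    y_ref: (n_ref,)   ground variable observed on reference polygons
    x_tgt: (n_tgt, p) aerial variables of non-sampled (target) polygons
    """
    x_ref = np.asarray(x_ref, dtype=float)
    x_tgt = np.asarray(x_tgt, dtype=float)
    y_ref = np.asarray(y_ref, dtype=float)

    # Standardize both sets with the reference means and standard deviations
    # (an assumption; other scalings or weightings are possible).
    mu = x_ref.mean(axis=0)
    sd = x_ref.std(axis=0, ddof=1)
    sd[sd == 0] = 1.0
    z_ref = (x_ref - mu) / sd
    z_tgt = (x_tgt - mu) / sd

    # Squared Euclidean distance from every target to every reference polygon.
    d2 = ((z_tgt[:, None, :] - z_ref[None, :, :]) ** 2).sum(axis=2)

    # The donor is the closest reference polygon; its Y value is copied over.
    donor = d2.argmin(axis=1)
    return y_ref[donor], donor

# Toy data: three reference polygons, two target polygons.
x_ref = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]]
y_ref = [10.0, 20.0, 30.0]
x_tgt = [[0.9, 1.1], [4.0, 6.0]]
y_imp, donor = nn_impute(x_ref, y_ref, x_tgt)
```

Because each imputed value is a copy of an observed one, the combined data set contains repeated donors, which is one reason ordinary variance formulas do not apply directly.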

5
Variance estimation for imputed data

Treating imputed values as observed values and
using ordinary variance formulas yields biased and
inconsistent variance estimates
→ invalid inferences and reliability estimates.
For design-based and other surveys, variance
estimation for imputed data has been examined by
Shao and Sitter (1996), Montaquila (1997), Sitter
and Rao (1997), Chen and Shao (2001), etc.

6
Why is variance estimation for imputed data
important?

to make valid inferences and reliability
estimates

"The oldest and simplest device for misleading
folks is the barefaced lie. A method that is
nearly as effective and far more subtle is to
report a sample estimate without any indication
of its reliability."
(Frank Freese 1967, p. 11)

7
II. Variance Estimation Methods Used for Imputed
Data

1. Model-assisted approaches (Rancourt et al. 1999)
2. Multiple imputation (Rubin 1996)
3. Replication methods such as the jackknife (Chen and
Shao 2001) and bootstrap (Shao and Sitter 1996)
4. All cases imputation variance estimator
(Montaquila 1997)
Method (1) loses the nonparametric nature of NN
imputation, method (2) is mainly used for random
imputation, and method (3) requires extensive
computation but provides valid variance estimates
for NN-imputed data.

8
II. Variance Estimation Methods (Contd)
3. Replication Methods

Pros
do not require any explicit model or variance
estimator, non-parametric
do not depend on mechanism or process of missing
data. In some conditions, they might be the only
method for estimating reliability

Cons
do not partition the variance estimate into its
components (sampling error and imputation error)
fail when sample size is small
computationally intensive

9
The Process for Generating Bootstrap Samples

1. From the combined (observed and imputed) data,
randomly select a sample of size n with
replacement (n is the number of polygons).
2. Compute a bootstrap mean and variance
from the bootstrap sample.
3. Repeat steps (1) and (2) k times.
The Shao-Sitter bootstrap variance estimation
method requires that each bootstrap sample
have a proportion of reference and target
polygons similar to that of the combined data.
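
The resampling steps above can be sketched as follows. This is an illustration only, assuming the quantity of interest is the mean and that the bootstrap variance is taken across the k bootstrap means. The `strata` argument is a hypothetical way to keep the reference/target proportion fixed by resampling within each group. The published Shao-Sitter procedure also re-imputes within each bootstrap sample, which this sketch does not do.

```python
import numpy as np

def bootstrap_variance(y, k=1000, strata=None, seed=0):
    """Bootstrap variance of the mean of y (observed plus imputed values).

    strata=None : naive bootstrap, all n polygons resampled together.
    strata given: labels such as 0 = reference, 1 = target; polygons are
                  resampled within each group, so every bootstrap sample
                  keeps the reference/target proportion of the combined data.
    """
    rng = np.random.default_rng(seed)
    y = np.asarray(y, dtype=float)
    n = y.size
    if strata is None:
        groups = [np.arange(n)]
    else:
        strata = np.asarray(strata)
        groups = [np.flatnonzero(strata == s) for s in np.unique(strata)]

    means = np.empty(k)
    for b in range(k):
        # Step 1: draw n polygons with replacement (within groups if given).
        idx = np.concatenate(
            [rng.choice(g, size=g.size, replace=True) for g in groups]
        )
        # Step 2: compute the bootstrap mean for this sample.
        means[b] = y[idx].mean()
    # Step 3 is the loop itself; the spread of the k means is the estimate.
    return means.var(ddof=1)

# Toy usage: 50 "reference" values followed by 50 "target" values.
y = np.arange(100.0)
labels = np.repeat([0, 1], 50)
v_naive = bootstrap_variance(y, k=2000)
v_strat = bootstrap_variance(y, k=2000, strata=labels)
```

In this toy data the two groups differ strongly in level, so fixing the group proportions removes between-group variation and gives a smaller value than the naive version; with real imputed data the direction of the difference depends on the data.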

10
II. Variance Estimation Methods (Contd)
4. All Cases Imputation (ACI, after Montaquila 1997)

Imputes a variable of interest to both sampled
and non-sampled polygons, and then uses the
relationship between imputed and observed
values for the sampled polygons to estimate
variance.

Population total (?) to be estimated (under
simple random sampling)

R: sampled (reference) polygons
T: non-sampled (target) polygons
e_j: imputation error
a: area of a polygon
[symbol not preserved]: planning area
[symbol not preserved]: indicates imputed value
If all polygons are sampled,

11
4. All Cases Imputation Variance Estimator
(contd)
Let
The variance estimator for population total can
be decomposed as
A nonzero covariance exists between repeatedly
selected reference polygons and imputation
error.

12
4. All Cases Imputation Variance Estimator
(contd)
Since yi is not observed for target polygons, the
imputation error variance and covariance are
estimated using sampled polygons.
The ACI variance estimator for the population
total is partitioned into three components:
Sampling error
Imputation error
Imputation covariance error
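
As a reading aid, the three components follow the pattern of the general identity for the variance of a sum. The symbols here are assumptions for illustration and are not the slide's notation: S stands for the sampling-error part of the estimated total and I for the imputation-error part. This is the generic identity only, not the ACI estimator itself, whose formula is not preserved in this transcript.

```latex
\operatorname{Var}(S + I)
  = \underbrace{\operatorname{Var}(S)}_{\text{sampling error}}
  + \underbrace{\operatorname{Var}(I)}_{\text{imputation error}}
  + \underbrace{2\,\operatorname{Cov}(S, I)}_{\text{imputation covariance error}}
```

When the covariance term is nonzero, adding the sampling and imputation variances alone would misstate the total.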
13
4. All Cases Imputation Variance Estimator
(contd)

Advantages
helps to identify components of the variance
estimate (sampling error, imputation error, and
imputation covariance error)
can be easily extended to different sampling
designs
not computationally intensive

14
III. Numerical Example

Objectives
Examine the performance of selected variance
estimation methods in the presence of NN imputed
data.
Examine components of variance estimated for NN
imputed data.

15
III. Study area
16
Methods

Data
An average of 4 species per polygon was
observed.
326 polygons were selected, and ground (number of
trees/ha, basal area/ha, and volume/ha) and
aerial (stand age, slope, aspect, site index,
Douglas fir, big leaf maple, etc.) variables
were extracted.

Correlations, 326 Stands

17
Data Summary, 326 Stands

18
Most similar neighbor (MSN) imputation

Data were split into reference (with X and Y
sets) and target (only with X sets) polygons
Most similar neighbor (MSN) (Moeur and Stage
1995) imputation was used to extend ground data
to non-sampled polygons
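
The split into reference and target polygons can be sketched as below. The source does not say how polygons were assigned, so a simple random split is assumed, with the imputation rate taken to be the fraction of polygons treated as targets. The function name `split_reference_target` is hypothetical.

```python
import numpy as np

def split_reference_target(n_polygons, imputation_rate, seed=0):
    """Randomly assign polygon indices to reference and target sets.

    imputation_rate: assumed to be the fraction of polygons whose ground
    data are withheld (targets) and later filled in by imputation.
    """
    rng = np.random.default_rng(seed)
    n_target = int(round(n_polygons * imputation_rate))
    perm = rng.permutation(n_polygons)
    target = np.sort(perm[:n_target])      # only X is used for these
    reference = np.sort(perm[n_target:])   # both X and Y are used for these
    return reference, target

# With the 326 polygons in this example and half of them treated as targets:
reference, target = split_reference_target(326, 0.5)
```

Because ground data exist for all 326 polygons, the withheld target values remain available for checking imputation error afterwards.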

19
Variables used to impute ground based inventory
variables
20
MSN imputation (contd)

Three imputation rates (20%, 50%, and 80%) were
examined to extend ground data to non-sampled
polygons.
After MSN imputation, five variance estimation
methods were examined:
1. Naïve variance estimator
2. All Cases Imputation (ACI) variance estimator
3. Naïve Jackknife
4. Naïve Bootstrap
5. Shao-Sitter Bootstrap
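
Methods 1 and 3 can be sketched as follows, taking "naïve" to mean that imputed values are treated as if they were observed. Simple random sampling without a finite population correction is assumed, and the function names are hypothetical. For the mean, the delete-one jackknife reproduces the ordinary formula exactly, so both naïve estimators ignore imputation in the same way.

```python
import numpy as np

def naive_variance_of_mean(y):
    """Ordinary variance of the mean, s^2 / n, over observed + imputed values."""
    y = np.asarray(y, dtype=float)
    return y.var(ddof=1) / y.size

def naive_jackknife_variance_of_mean(y):
    """Delete-one jackknife variance of the mean, with no re-imputation."""
    y = np.asarray(y, dtype=float)
    n = y.size
    # Mean of the data with polygon i left out, for every i at once.
    loo_means = (y.sum() - y) / (n - 1)
    return (n - 1) / n * ((loo_means - loo_means.mean()) ** 2).sum()

# Toy combined data set (values are made up for illustration).
y = np.array([1.0, 2.0, 3.0, 4.0, 10.0])
v_naive = naive_variance_of_mean(y)
v_jack = naive_jackknife_variance_of_mean(y)
```

Both calls give the same value for this toy input, which illustrates why a jackknife has to be adjusted for imputation before it can capture imputation error.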

21
For each imputation rate

Variance was estimated, and the performance of the
ACI, jackknife, and bootstrap methods was
compared for each ground variable (BA, VOL, and
TPH).
Total variance was partitioned into sampling
error, imputation error, and imputation
covariance error.

22
Results

23
Variance of the mean for ground variables at the 50%
imputation rate

Sizeable differences were observed among the
variance estimation methods.
The variances estimated using the ACI and
Shao-Sitter methods were higher than those
estimated by the naïve jackknife and bootstrap
methods.

24
Results (Contd): Components of Variance (Std²)

The variance of imputation error (Imp) and
imputation covariance error (Cov_imp) increased
as the imputation rate increased.
The variance of imputation error is not severe
when the imputation rate is low.

25
Summary

The ACI variance estimator partitions the total
variance. As expected, sampling error,
imputation error, and imputation covariance error
increased as the imputation rate increased.
The magnitude of the imputation variance can guide
future forest inventory and planning endeavors.
In selecting a variance estimator, trade-offs
between accuracy, cost, and simplicity should be
considered (Wolter 1985).

26
IV. Challenges and opportunities

1. Extending the ACI variance estimator to
stratified and multi-stage sampling designs and
multivariate data.
2. Examining other replication methods (e.g.,
partial replication methods).
3. Updating variance estimate in presence of over
time.

27
V. References

Chen, J. and Shao, J. 2001. Jackknife variance
estimation for nearest-neighbor imputation. Journal of
the American Statistical Association 96(453): 260-269.
Freese, F. 1967. Elementary Statistical Methods
for Foresters. US Dept. of Agriculture, Forest
Service. Agriculture Handbook 317. 87 pp.
Moeur, M. and Stage, A.R. 1995. Most similar
neighbour: an improved sampling inference
procedure for natural resource planning. Forest
Science 41: 337-359.
Montaquila, J. 1997. A new approach to variance
estimation in the presence of imputed data. PhD
dissertation. American University. 149 pp.
Rancourt, E. 1999. Estimation with
nearest-neighbor imputation at Statistics Canada.
Proceedings of the Section on Survey Research
Methods, American Statistical Association:
131-138.
Rubin, D.B. 1996. Multiple imputation after 18
years. Journal of the American Statistical
Association 91: 473-489.
Shao, J. and Sitter, R.R. 1996. Bootstrap for
imputed survey data. Journal of the American
Statistical Association 91(435): 1278-1287.
Wolter, K.M. 1985. Introduction to Variance
Estimation. Springer, NY. 427 pp.