Using Stata for Subpopulation Analysis of Complex Sample Survey Data

1
Using Stata for Subpopulation Analysis of Complex
Sample Survey Data
  • Brady T. West
  • PhD Student
  • Michigan Program in Survey Methodology
  • July 30, 2009 (2009 Stata Conference)

2
Presentation Outline
  1. Introduction: Subclass Analysis Issues
  2. Kish's Taxonomy of Subclasses
  3. Two Alternative Approaches to Inference
  4. Variance Estimation and Methods for Singletons
  5. Examples using NHANES and NHAMCS Data
  6. Suggestions for Practice
  7. Directions for Future Research

3
Subclass Analysis Issues
  • Analysts of large, complex sample survey data
    sets are often interested in making inferences
    about subpopulations of the original population
    that the sample was selected from (e.g.,
    Caucasian Females)
  • These subpopulations are referred to
    interchangeably in various literatures as
    subgroups, subclasses, subpopulations, domains,
    and subdomains, leading to confusion among
    analysts of survey data

4
Subclass Analysis Issues, cont'd
  • Software procedures for analysis of complex
    sample survey data are becoming more powerful,
    flexible, and widely available, offering analysts
    several options
  • Analysts need to be careful when analyzing
    subclasses, and be aware of the alternative
    approaches to subclass analysis that are possible
    and their implications for inference

5
Kish's Taxonomy of Subclasses
  • Design Domains: Restricted to specific strata
    according to the complex sample design (usually
    geographically, e.g., Texas)
  • Cross-Classes: Broadly distributed (in theory)
    across the strata and primary sampling units
    defining a complex sample (e.g.,
    African-Americans over age 50)
  • Mixed Classes: Disproportionately distributed
    across the complex sample design (e.g., Hispanics
    in a sample including Los Angeles as a stratum)
  • See Kish (1987), Statistical Design for Research

6
Design Domains (X = Sample Element in Subclass)
Stratum   PSU 1            PSU 2
1         XXXXXXXXXXX      XXXXXXXXX
2         XXXXXXXXXX       XXXXXXXXXXXX
3
4
5

7
Cross-Classes
Stratum   PSU 1            PSU 2
1         XXXXXXXXXXXX     XXXXX
2         XXXX             XXXXXXX
3         XXXXXXXXXXX      XXXXXXXXX
4         XXXXXX           XXXXX
5         XXXXXXXXXX       XXXXXXXXXXXX

8
Mixed Classes
Stratum   PSU 1            PSU 2
1         XXXXXXXXXXXXXX   XXXXXXXXXXXXX
2         X
3         XXXXXXXXXXXXX    XXXXXXXXXX
4         XX
5         XXXXXXXXXXXXXX   XXXXXXXXXXXX

9
Applying Kish's Taxonomy
  • The type of subclass is critical for determining
    an appropriate analysis approach
  • Two possible approaches to inference are motivated
    by the taxonomy:
  • 1. Unconditional approach (cross-classes, mixed
    classes)
  • 2. Conditional approach (design domains)

10
The Unconditional Approach
  • Appropriate for Cross-Classes, and in some cases
    Mixed Classes: the subclass of interest
    theoretically can appear in all design strata and
    primary sampling units (PSUs)
  • KEY POINT: Allow the software to process the
    entire survey data set, and recognize all
    possible design strata and PSUs. DO NOT delete
    sample cases not in the subclass!

11
The Unconditional Approach
  • Rationale: estimated variances for sample
    estimates of subclass parameters (based on
    within-stratum variance between PSUs) need to
    reflect sample-to-sample variability based on the
    full complex design
  • In other words, if a particular subclass does not
    appear in a PSU in any given sample (although in
    theory it could have), that PSU should contribute
    0 to variance estimates, rather than be ignored
    completely!
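
The "contribute 0" rule can be written out for the simplest case, a weighted subclass total. What follows is a sketch of the standard with-replacement ultimate-cluster (linearization) variance estimator, not a formula taken from the slides. The notation is introduced here for illustration: h indexes strata, α indexes the a_h PSUs in stratum h, w is the sampling weight, and δ equals 1 for sample elements in the subclass and 0 otherwise.

```latex
\begin{align*}
z_{h\alpha} &= \sum_{i} w_{h\alpha i}\,\delta_{h\alpha i}\,y_{h\alpha i},
&
\hat{Y}_d &= \sum_{h=1}^{H}\sum_{\alpha=1}^{a_h} z_{h\alpha}, \\
\bar{z}_h &= \frac{1}{a_h}\sum_{\alpha=1}^{a_h} z_{h\alpha},
&
v(\hat{Y}_d) &= \sum_{h=1}^{H}\frac{a_h}{a_h-1}
  \sum_{\alpha=1}^{a_h}\left(z_{h\alpha}-\bar{z}_h\right)^2 .
\end{align*}
```

A PSU with no subclass members has z = 0 but still counts toward a_h and the stratum mean of z, so it adds a nonzero squared deviation. Deleting that PSU changes both a_h and the stratum mean, and a stratum left with a_h = 1 makes the factor a_h / (a_h - 1) undefined.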

12
The Unconditional Approach
  • Further, the subclass sample size in each stratum
    is going to be a random variable, and theoretical
    sample-to-sample variance in realizations of this
    random variable should be incorporated into any
    variance estimation procedures

13
The Unconditional Approach
  • If cross-classes (or in some cases mixed classes)
    are being analyzed, and PSUs where the subclass
    does not appear (by random chance) are deleted,
    problems arise
  • Some strata may appear to have only one PSU by
    design (preventing variance estimation unless an
    ad hoc approach is used)
  • Entire design strata may be dropped, impacting
    variance estimates and calculations of degrees of
    freedom

14
The Unconditional Approach: General Stata Code
  • svy, subpop(indicator): command varlist, options
  • indicator: an indicator variable for the
    subpopulation, or an if condition, e.g.,
    if male == 1
  • svy: mean varlist, over(groupvar)
  • svy: proportion varlist, over(groupvar)
  • Stata drops strata with no subpopulation
    observations from degrees of freedom calculations
  • Exercise: repeat 10 times really fast
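
The general syntax above can be tried on the NHANES II extract that Stata distributes as an example data set. This is a minimal sketch under stated assumptions: the variable names (psu, finalwgt, strata, race, age, bpsystol, sex) belong to that example file and differ from the presenter's data set, the subclass definition is illustrative, and webuse needs internet access.

```stata
* Sketch only: Stata's example NHANES II extract, not the presenter's file.
webuse nhanes2, clear

* Declare the full complex design once, for all sample cases.
svyset psu [pweight = finalwgt], strata(strata)

* Subclass indicator: 1 = in subclass, 0 = not in subclass,
* missing when either input is missing (illustrative definition).
generate byte subp = (race == 2 & age >= 50) if !missing(race, age)

* Unconditional analysis: every stratum and PSU stays in the computation;
* only the estimate is restricted to the subclass.
svy, subpop(subp): mean bpsystol

* Subclass means by group, still using the full design.
svy, subpop(subp): mean bpsystol, over(sex)
```

The key design choice is that no observations are dropped before estimation, so variance estimation still sees PSUs that happen to contain no subclass members.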

15
The Conditional Approach
  • Appropriate for Design Domains, where a subclass
    cannot appear outside of specific design strata
  • The rationale behind the unconditional approach
    no longer applies
  • Certain design strata should not contribute to
    variance estimation or calculation of degrees of
    freedom

16
The Conditional Approach
  • Restrict the analysis to only those design strata
    where the subclass of interest exists
  • Variance estimates reflecting sample-to-sample
    variability should only be based on those design
    strata where the subclass can appear (unlike the
    unconditional approach)
  • Subclass sample sizes in design domains are
    assumed to be fixed, by design

17
The Conditional Approach General Stata Code
  • svy: command varlist if (condition), options
  • (condition) might be male == 1, or a more complex
    combination of conditions (e.g., male == 1 &
    age > 50 & age < 90)
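
A minimal sketch of the conditional syntax, again assuming Stata's example NHANES II extract (variables psu, finalwgt, strata, sex, age, bpsystol are from that file, not the presenter's). The condition only illustrates the syntax; a subclass like this one is a cross-class, not a design domain.

```stata
* Sketch only: Stata's example NHANES II extract, not the presenter's file.
webuse nhanes2, clear
svyset psu [pweight = finalwgt], strata(strata)

* Conditional analysis: cases failing the condition are excluded
* before variance estimation, so only the remaining PSUs are used.
svy: mean bpsystol if sex == 1 & age > 50 & age < 90

* Inspect what the variance estimator actually used.
display "strata used: " e(N_strata)
display "PSUs used:   " e(N_psu)
display "design df:   " e(df_r)
```

Comparing these stored results with those from a svy, subpop() run on the same condition is a quick way to see whether PSUs or strata were lost.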

18
Variance Estimation Methods
  • All of these issues are only relevant when using
    Taylor Series Linearization, which is the default
    for variance estimation in Stata
  • Conditional analyses are OK to perform when using
    replication methods, such as Balanced Repeated
    Replication or Jackknife Repeated Replication
    (Rust and Rao, 1996)

19
Ad-hoc Fixes for Singleton Clusters in Stata 10.1
  • Stata 10.1 provides users with four ad-hoc fixes
    for the problem where strata are identified with
    only a single ultimate cluster for variance
    estimation in a subpopulation analysis:
  • Report Missing Standard Errors, singleunit(missing)
    (not really a fix)
  • Treat Units as Certainty Units, singleunit(certainty),
    which contribute nothing to the standard error
  • Scale Variance using Certainty Units,
    singleunit(scaled), which uses the average
    variance from each stratum with multiple PSUs for
    each stratum with only a single PSU
  • Center at the Grand Mean, singleunit(centered),
    where the variance contribution comes from a
    deviation from the grand mean instead of the
    stratum mean
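
The four options can be compared side by side by looping over them. This is a sketch, assuming Stata's example NHANES II extract (variable names are from that file) and an illustrative subclass. Whether any stratum actually ends up with a single PSU depends on the data and the subclass chosen, so the four runs may or may not differ.

```stata
* Sketch only: Stata's example NHANES II extract, not the presenter's file.
webuse nhanes2, clear
generate byte subp = (race == 2 & age >= 50) if !missing(race, age)

* Re-declare the design with each singleunit() method, then rerun
* the same conditional analysis and compare the standard errors.
foreach method in missing certainty scaled centered {
    svyset psu [pweight = finalwgt], strata(strata) singleunit(`method')
    display as text "singleunit(`method')"
    svy: mean bpsystol if subp == 1
}
```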

20
Example: The NHANES Data
  • We first consider examples based on the NHANES II
    data set, collected from a nationally
    representative multistage probability sample of
    the U.S. population from 1976-1980 (oldie but a
    goodie)
  • Briefly, a sample of the U.S. population was
    given medical examinations in an effort to assess
    the health of the U.S. population

21
Example: NHANES Analysis
  • Analysis Subclass: African-Americans ages 50 and
    above (this is a cross-class of the U.S.
    population, which can theoretically appear in all
    design strata and PSUs)
  • Analysis Objective: Estimate the mean systolic
    blood pressure of this subclass and an
    appropriate standard error
  • See West et al. (2007) for more details

22
Conditional Approach: Stata Code for NHANES Analysis
  • svyset ppsu [pweight = fwgtexam], strata(stratum)
    singleunit(missing)
  • svyset ppsu [pweight = fwgtexam], strata(stratum)
    singleunit(centered)
  • Also singleunit(certainty), singleunit(scaled)
  • gen b50subp = (race == 2 & ager > 50)
  • svy: mean bpsyst if b50subp == 1

23
Conditional Approach Results
Method       Est. Mean   TSL SE   Design DF
Missing SE   144.09      .        50 - 29 = 21
Centered     144.09      1.66     50 - 29 = 21
Certainty    144.09      1.62     50 - 29 = 21
Scaled       144.09      1.90     50 - 29 = 21

24
Conditional Approach?
  • This approach would not be appropriate for this
    particular subclass
  • Computed standard errors would generally be
    biased downward, because additional sources of
    sample-to-sample variability are ignored when
    following this approach
  • Same issues apply for analytic models
  • Evidence that the scaled ad-hoc fix may be
    overly conservative!

25
Unconditional Approach: Stata Code for NHANES Analysis
  • svyset ppsu [pweight = fwgtexam], strata(stratum)
    singleunit(missing)
  • Note: the choice of singleunit() option does not
    matter when following this approach!
  • gen b50subp = (race == 2 & ager > 50)
  • svy, subpop(b50subp): mean bpsyst

26
Unconditional Approach Results
Method       Est. Mean   TSL SE   Design DF
Missing SE   144.09      1.66     58 - 29 = 29
Centered     144.09      1.66     58 - 29 = 29
Certainty    144.09      1.66     58 - 29 = 29
Scaled       144.09      1.66     58 - 29 = 29
Note: Stata dropped three strata with no sample
units in the subpopulation.

27
Unconditional Approach?
  • This approach would be the appropriate choice for
    a cross-class such as African-Americans over the
    age of 50
  • Inferences are theoretically appropriate
  • Same idea for analytic models
  • Results suggest that the centered and
    certainty ad-hoc fixes for conditional analyses
    are reasonable

28
Example: The NHAMCS Data
  • Analysis Subclass: Visits to Emergency
    Departments (ED) by African-American men ages 60
    and above (this is another cross-class of the
    U.S. population, which can theoretically appear
    in all NHAMCS design strata and PSUs)
  • Analysis Objective: Estimate the percentage of
    all ED visits by members of this subclass for
    dizziness and/or vertigo in 2004
  • See West et al. (2008) for more details

29
Stata Code for NHAMCS Analyses
  • svyset cpsum [pweight = patwt], strata(cstratm)
    singleunit()
  • generate subc = (settype == 3 & sex == 2 &
    agecat == 5 & race == 2)
  • Conditional: svy: tabulate dizzyrfv if subc == 1,
    se ci percent
  • Unconditional: svy, subpop(subc): tabulate
    dizzyrfv, se ci percent

30
NHAMCS Analysis Results
Method          Est.   TSL SE   Design DF
Missing SE      4.82   1.576    106
Centered        4.82   1.576    106
Certainty       4.82   1.576    106
Scaled          4.82   1.576    106
Unconditional   4.82   1.590    286

31
NHAMCS Analysis Implications
  • No problems with strata having only a single
    ultimate cluster: the ad-hoc fixes all give the
    same results
  • Weighted point estimates are identical
  • Substantially fewer design-based degrees of
    freedom when following the conditional approach:
    the full complex design will not be reflected in
    estimation of sample-to-sample variance (many
    ultimate clusters are lost)
  • Conditional analysis assumes that each sample
    will be of fixed size n = 397 for variance
    estimation purposes: no random variance!
  • Conditional analysis results in overly liberal
    inferences

32
Suggestions for Practice
  • Consider Kish's Taxonomy when determining an
    appropriate subclass analysis approach
  • Utilize the appropriate software options for
    unconditional analyses when analyzing
    cross-classes
  • Be careful with missing values when creating the
    subpopulation indicator
  • The unconditional analysis approach generally
    works fine for both cases (when in doubt, use
    this approach)
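
The missing-value warning comes from how Stata evaluates comparisons: a numeric missing value is treated as larger than any number, so an expression such as age > 50 is true when age is missing. A minimal sketch of the difference, assuming Stata's example NHANES II extract (variable names race and age are from that file; the subclass is illustrative):

```stata
* Sketch only: Stata's example NHANES II extract, not the presenter's file.
webuse nhanes2, clear

* Risky: a case with missing age satisfies age > 50 and would be
* coded 1 (in the subclass) whenever race == 2.
generate byte subp_risky = (race == 2 & age > 50)

* Safer: leave the indicator missing when any input is missing.
generate byte subp = (race == 2 & age > 50) if !missing(race, age)

* Cross-check the two versions, showing missing values explicitly.
tabulate subp_risky subp, missing
```

If the two indicators disagree anywhere in the cross-tabulation, those cases have missing inputs and deserve a deliberate decision before the indicator is passed to subpop().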

33
Directions for Future Research
  • More appropriate calculation / estimation of
    design-based and effective degrees of freedom for
    sparse subclasses or mixed classes
  • Development of analytic theory for interval
    estimation when working with small subclasses,
    which does not rely on asymptotic results

34
References
  • Kish, L. 1987. Statistical Design for Research.
    New York: Wiley.
  • Rust, K. F., and J. N. K. Rao. 1996. Variance
    estimation for complex surveys using replication.
    Statistical Methods in Medical Research 5:
    283-310.
  • West, B.T., Berglund, P., and Heeringa, S.G.
    2008. A Closer Examination of Subpopulation
    Analysis of Complex Sample Survey Data. The Stata
    Journal, 8(3), 1-12.
  • West, B.T., Berglund, P., and Heeringa, S.G.
    2007. Alternative Approaches to Subclass Analysis
    of Complex Sample Survey Data. Proceedings of the
    2007 Joint Statistical Meetings.

35
Questions / Thank You!
  • For additional questions, comments, or electronic
    copies of these slides or the papers, please send
    an email to bwest_at_umich.edu