1
Analysis of multiple informant/multiple source
data in Stata
  • Nicholas J. Horton
  • Department of Mathematics
  • Smith College, Northampton MA
  • Garrett M. Fitzmaurice
  • Harvard University
  • nhorton at email.smith.edu
  • http://www.biostat.harvard.edu/multinform

2
Acknowledgements
  • Joint research project with Nan Laird and
    colleagues, Harvard School of Public Health
  • Jane Murphy and the Stirling County Study for use
    of their example dataset (see Horton et al AJE,
    2001 for more details)
  • Supported by NIH grant R01-MH54693

3
Outline
  • Motivation for multiple source data
  • Examples of multiple sources/informants
  • Models for correlated multiple source data
  • Accounting for complex survey design
  • Accounting for incomplete/missing data
  • Example (Stirling County Study)
  • Conclusions

4
Why multiple source data?
  • to provide better measures of some underlying
    construct that is difficult to measure or likely
    to be missing
  • also known as multiple informant reports, proxy
    reports, co-informants, etc.
  • discordance is expected, otherwise there is no
    need to collect multiple reports
  • Statistical framework developed in Horton and
    Fitzmaurice (SIM tutorial, 2004)

5
Definition of multiple source data
  • data obtained from multiple informants or raters
    (e.g., self-reports, family members, health care
    providers, teachers)
  • or via different/parallel instruments or methods
    (e.g., symptom rating scales, standardized
    diagnostic interviews, or clinical diagnoses)
  • None of the reports is a gold standard
  • We consider multiple source data that are
    commensurate (multiple measures of the same
    underlying variable on a similar scale)

6
Examples of multiple source data
  • child psychopathology (ask parents, teachers and
    children about underlying psychological state)
  • service utilization studies (collect information
    from subjects and databases)
  • medical comorbidity (query providers and charts
    to assess medical problems)

7
Examples of multiple source data (cont.)
  • adherence studies (collect self-report of
    adherence, electronic pill caps (MEMS), plus
    pharmacy records)
  • nutritional epidemiology (utilize multiple
    dietary instruments such as food frequency
    questionnaires, 24-hour recalls, food diaries)

8
Incomplete/missing reports
  • Multiple source reports are commonly incomplete
    since, by definition, they are collected from
    sources other than the primary subject of the
    study
  • This missingness may be by design or happenstance
    (or both!)

9
Example missing source reports
  • Consider service utilization studies that collect
    information from subjects and databases
  • Subjects may be lost to follow-up (or only
    contacted periodically)
  • Databases may be incomplete (lack of consent,
    lack of appropriate coverage)

10
Analytic approach
  • Multiple sources can provide information on
    outcomes or predictors (risk factors)
  • Multiple source outcome: what is the prevalence
    of child psychopathology? (measured using
    parallel parent and teacher reports)
  • Fitzmaurice et al (AJE, 1995), Horton et al
    (HSOR, 2002), Horton and Fitzmaurice (SIM
    tutorial, 2004)

11
Analytic approach (cont.)
  • Multiple source predictor: what are the odds of
    developing depression in adulthood, conditional
    on parallel reports of anxiety (collected from a
    child and a parent)?
  • Examples: Horton et al (AJE, 2001), Lash et al
    (AJE, 2003), Liddicoat et al (JGIM, 2004), Horton
    and Fitzmaurice (SIM tutorial, 2004)
  • We will focus on an example using multiple source
    predictors

12
Notation
  • Let Y denote a univariate outcome for a given
    subject
  • Let X_l denote the lth multiple source
    predictor (l = 1, ..., L)
  • Let Z denote a vector of other covariates for the
    subject
  • To simplify exposition, we consider two sources
    with dichotomous reports (L = 2)

13
Questions to consider
  • Are the sources reporting on the same underlying
    construct (are they commensurate or
    interchangeable?)
  • Is it possible to combine the reports in some
    fashion?
  • How to handle missing reports?

14
Analytic approaches
  • Reviewed in Horton, Laird and Zahner (IJMPR,
    1999)
  • Use only one source
  • Fit separate models

15
Analytic approaches (cont.)
  • Combine (pool) the reports in some fashion
  • Include both reports in the model

16
Analytic approaches (cont.)
  • We considered simultaneous estimation of the
    marginal models
  • Non-standard application of GEE
  • Method independently suggested by Pepe et al
    (SIM, 1999)
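
One way to picture simultaneous estimation of the marginal models is to stack the data so that each subject contributes one record per source, then fit a single regression with source interactions and a variance estimate clustered on subject. The sketch below uses simulated data, and every variable name in it (id, y, x1, x2, z, src2) is hypothetical. It uses an ordinary logit with cluster-robust standard errors, which corresponds to a GEE with an independence working correlation, rather than the survey-weighted fit used later in the talk.

```stata
* Simulated data; all variable names are hypothetical.
clear
set seed 20050711
set obs 500
generate id = _n
generate z = rnormal()
generate byte truex = runiform() < 0.3
* two error-prone dichotomous reports of the same underlying construct
generate byte x1 = cond(runiform() < 0.85, truex, 1 - truex)
generate byte x2 = cond(runiform() < 0.80, truex, 1 - truex)
generate byte y = runiform() < invlogit(-1 + 0.8*truex + 0.3*z)

* stack the reports: one record per subject per source
reshape long x, i(id) j(source)
generate byte src2 = (source == 2)
generate byte x_src2 = x * src2
generate double z_src2 = z * src2

* separate-parameter marginal models, estimated together;
* clustering on id reflects that each outcome is used twice
logit y x z src2 x_src2 z_src2, vce(cluster id)

* joint test of any source differences
test src2 x_src2 z_src2

* shared-parameter (pooled) model, reasonable if the test is not significant
logit y x z, vce(cluster id)
```

The joint test plays the role of the source-effect test described in the talk: if it gives no evidence of differences, the source terms can be dropped in favor of the shared-parameter fit.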

17
Advantages of new approach
  • can be used to test for source differences in
    association with the outcome
  • can test if the effects of other risk factors on
    the outcome differ by source

18
Advantages of new approach (cont.)
  • allows different source effects where necessary
  • a pooled model can be fit if no significant
    source effects (potentially more efficient)
  • can be fit using general purpose statistical
    software (Stata and others)

19
Accounting for survey design
  • Many health services or epidemiologic studies
    arise from complex survey samples
  • Need to address stratification, multi-stage
    clustering and unequal sampling weights
  • Failing to properly account for survey design may
    lead to bias and incorrect estimation of
    variability

20
Accounting for survey design (cont.)
  • Estimation proceeds using the approximate (quasi)
    log-likelihood (weighted version of the usual
    score equations for a GLM, accounting for the
    multi-stage clustering, including multiple source
    reports)
  • Can be fit using general purpose statistical
    software (elegant and powerful implementation in
    Stata)
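
As a rough guide to what "weighted version of the usual score equations" means here, the estimating equations can be sketched as below. The notation is assumed for illustration: w_i is the sampling weight for subject i, mu_il is the mean of the outcome given the lth source report and the other covariates, and v_il is the GLM variance function. An independence working correlation across sources is assumed.

```latex
U(\beta) \;=\; \sum_{i} w_i \sum_{l=1}^{L}
  \left( \frac{\partial \mu_{il}}{\partial \beta} \right)^{\!\top}
  v_{il}^{-1} \, \bigl( Y_i - \mu_{il} \bigr) \;=\; 0,
\qquad \mu_{il} = E(Y_i \mid X_{il}, Z_i)
```

Treating the subject as the sampling unit in the variance calculation is what accounts for the correlation among the multiple records that each subject contributes.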

21
Accounting for incomplete source reports
  • Missing source reports are missing predictors
  • Use weighted estimating equation methodology of
    Robins et al (JASA, 1994) and Xie and Paik
    (Biometrics, 1997), applied by Horton et al
    (AJE, 2001)
  • Adds an additional missingness weight
  • Complications to variance estimation

22
Example: Stirling County Study
  • Outcome: time to event (death) over 16-year
    follow-up period (1952-1968) (n = 1079)
  • multiple source predictors: partially observed
    dichotomous physician report or self-report of
    psychiatric disorder (dpax)
  • other predictors: age (3 categories), gender
  • statistical model: piecewise exponential survival
    with 4 intervals each of 4 years duration
    (subjects contribute time at risk in each
    interval)
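
A piecewise exponential model can be fit as a Poisson regression once follow-up is split into intervals, with one record per subject per interval and the time at risk entered as exposure. The sketch below shows one way to build such records with stsplit on simulated data. The names event, atrisk, int1, int2 and int3 mirror those used later in the talk, but the data and the remaining names (time, died, interval) are hypothetical.

```stata
* Simulated survival data; time and died are hypothetical names.
clear
set seed 2005
set obs 200
generate id = _n
* exponential event times (hazard 0.05 per year), censored at 16 years
generate time = min(16, -ln(runiform())/0.05)
generate byte died = time < 16

* declare survival data and split follow-up at 4, 8 and 12 years
stset time, failure(died) id(id)
stsplit interval, at(4 8 12)

* one record per subject per interval entered
generate atrisk = _t - _t0          // years at risk within the interval
generate byte event = _d            // 1 if the death occurred in this interval
generate byte int1 = interval == 0
generate byte int2 = interval == 4
generate byte int3 = interval == 8

* piecewise exponential model as a Poisson regression (last interval is reference)
poisson event int1 int2 int3, exposure(atrisk)
```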

23
Stirling County survey design
[Figure: sampling design diagram. Strata (Stratum 1, ..., Stratum k, ..., Stratum K) contain PSUs (PSU 1, ..., PSU j, ..., PSU J); each PSU has a self-report and a physician report.]

24
Implementation in Stata
  • Specify primary sampling unit (subject),
    probability sampling weights (weight) and
    stratification variable (district)
  • svyset id [pweight=weight], strata(district)
  • Describe the sampling design
  • svydes

25
    Survey: Describing stage 1 sampling units

          pweight: weight
              VCE: linearized
         Strata 1: district
             SU 1: id
            FPC 1: <zero>

                                          #Obs per Unit
                                  ----------------------------
    Stratum    #Units     #Obs      min       mean      max
    --------  --------  --------  --------  --------  --------
           1        93       654         2       7.0         8
           2        37       284         4       7.7         8
           3        51       346         2       6.8         8
           4       202      1488         2       7.4         8
           5       291      2104         2       7.2         8
           6       128       946         2       7.4         8
           7        50       374         4       7.5         8

26
Implementation in Stata (cont.)
  • xi: svy: poisson event dpax int1 int2 int3 female
    ageind1 ageind2 diag i.diag*ageind1
    i.diag*ageind2 i.dpax*female i.dpax*ageind1
    i.dpax*ageind2 i.dpax*diag, exposure(atrisk)

27
Implementation in Stata (cont.)
  • Can then test for significant informant effects
    (any term involving dpax, the self-report
    indicator, in the model)

    test dpax = 0
    test _IdpaXfemal_1, accumulate
    test _IdpaXagein_1, accumulate
    test _IdpaXageina1, accumulate
    test _IdpaXdiag_1, accumulate

28
Results (separate parameters)
  • Initially fit model with separate parameters
  • No evidence for source interactions
  • Implies that the association between risk factors
    and mortality did not differ by source
  • Dropped these terms from the model, yielding
    parsimonious shared parameter model with smaller
    standard errors

29
Implementation (shared parameter)
  • xi: svy: poisson event int1 int2 int3 female
    ageind1 ageind2 diag i.diag*ageind1
    i.diag*ageind2, exposure(atrisk)

30
Results (shared parameter)
    Survey: Poisson regression

    Number of strata   =         9          Number of obs      =       7420
    Number of PSUs     =      1079          Population size    =  64723.522
                                            Design df          =       1070
                                            F(   9,   1062)    =      21.94
                                            Prob > F           =     0.0000

    ------------------------------------------------------------------------------
                 |             Linearized
           event |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
            int1 |  -.9594993   .2058191    -4.66   0.000    -1.363354   -.5556444
            int2 |  -.5680445   .1936756    -2.93   0.003    -.9480716   -.1880174
            int3 |   -.360743   .2002561    -1.80   0.072    -.7536821    .0321962
          female |  -.1298938   .1493215    -0.87   0.385      -.42289    .1631024
         ageind1 |   2.484883   .2820244     8.81   0.000     1.931499    3.038266
         ageind2 |   3.530875   .2894511    12.20   0.000     2.962919    4.098831
            diag |    1.62166   .3256041     4.98   0.000      .982765    2.260555

31
Results (shared parameters)
Parameter (log MRR)       Estimate (SE)
female                    -0.13 (0.15)
mid-age                    2.48 (0.28)
older-age                  3.53 (0.29)
diagnosis                  1.62 (0.33)
diagnosis*mid-age         -1.35 (0.38)
diagnosis*older-age       -1.31 (0.46)
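
Because the table reports log mortality rate ratios, the MRR for diagnosis within each age group is the exponential of the diagnosis coefficient plus the relevant interaction. A small sketch of that arithmetic, using the rounded estimates from the table (so the results can differ slightly in the second decimal place from values computed at full precision):

```stata
* MRR for diagnosis by age group, from the rounded log-MRR estimates
display "youngest:  " %5.2f exp(1.62)
display "mid-age:   " %5.2f exp(1.62 - 1.35)
display "older-age: " %5.2f exp(1.62 - 1.31)
```

These work out to roughly 5.05, 1.31 and 1.36, which is the sense in which the association between diagnosis and mortality is strongest in the youngest group.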
32
Interpretation of results (annual mortality rate)
                 Age < 50    Age > 70
Diagnosis = 0      0.001       0.056
Diagnosis = 1      0.007       0.093

33
Results (2 df test of interaction of age and diagnosis)

    . test _IdiaXagein_1 = 0

    Adjusted Wald test

     ( 1)  [event]_IdiaXagein_1 = 0

           F(  1,  1070) =   12.65
                Prob > F =    0.0004

    . test _IdiaXageina1, accumulate

    Adjusted Wald test

     ( 1)  [event]_IdiaXagein_1 = 0
     ( 2)  [event]_IdiaXageina1 = 0

           F(  2,  1069) =    6.67
                Prob > F =    0.0013

34
Results (calculation of MRR and 95% CI)

    . lincom diag, eform

     ( 1)  [event]diag = 0

    ------------------------------------------------------------------------------
           event |     exp(b)   Std. Err.      t    P>|t|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
             (1) |     5.0615     1.6480     4.98   0.000       2.6718     9.5884
    ------------------------------------------------------------------------------

    . lincom diag + _IdiaXagein_1, eform

     ( 1)  [event]diag + [event]_IdiaXagein_1 = 0

    ------------------------------------------------------------------------------
           event |     exp(b)   Std. Err.      t    P>|t|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
             (1) |     1.3102     .25297     1.40   0.162       .89703     1.9137
    ------------------------------------------------------------------------------

35
Conclusions
  • new methods of analysis of multiple source data
    are available
  • can be implemented using existing software
  • methods allow the assessment of the relative
    association of each source
  • each source yielded similar conclusions: the
    association between psychiatric disorder and
    mortality is stronger for younger subjects
  • unified model has less variability, pools
    information after testing for systematic
    differences

36
Conclusions (cont.)
  • methods account for complex survey designs
  • methods allow partially observed subjects to
    contribute, under MAR assumptions (Little and
    Rubin book)
  • multiple source reports arise in many settings
    (not just for children anymore!)

37
Future work
  • Maximum-likelihood estimation instead of GEE
    approach
      • May yield efficiency gains
      • Particularly useful for missing reports
  • Non-commensurate reports
      • Different scales
      • Different underlying constructs
      • Consider latent variable models (e.g., work of
        Normand and colleagues)
      • See also gllamm and forthcoming Stata book by
        Rabe-Hesketh and Skrondal

38
Analysis of multiple informant/multiple source
data in Stata
  • Nicholas Horton
  • Department of Mathematics
  • Smith College, Northampton MA
  • nhorton at email.smith.edu
  • http://www.biostat.harvard.edu/multinform