Covariate information in complex event history data: some thoughts arising from a case study

1
Covariate information in complex event history
data: some thoughts arising from a case study
  • Elja Arjas
  • Department of Mathematics and Statistics,
    University of Helsinki
  • and
  • National Public Health Institute (KTL)
  • Based on ongoing joint work with Olli Saarela and
    Sangita Kulathinal

2
Background and motivation
  • Assessment of risk factors of cardiovascular
    diseases (e.g. coronary heart disease, stroke)
  • Traditional approach for cohort analysis: hazard
    regression model, with covariates (e.g. blood
    pressure, cholesterol level, or body mass index)
    measured only at the baseline
  • Adding a genetic component: usually candidate
    loci, potentially causative on the basis of the
    available information about their function.

3
Emphasis on causal ideas
  • Stressing probabilistic predictions: How would
    the probability of the outcome change if a
    covariate had a different value?
  • Association vs. causation: the issue of
    confounding (change by intervention,
    do-conditioning, Pearl 2000).

4
Considering causal effects
  • Compare, e.g., predictive probabilities of future
    response y
  • p(y | data, attrib, hist, do(exposure))
  • vs.
  • p(y | data, attrib, hist, do(exposure'))
  • for a generic individual (or, for an
    equivalence class of exchangeable individuals)
    characterized by attributes and past history used
    in conditioning (cf. Arjas and Parner 2004).
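
One way to make this comparison concrete is to write the two exposure levels as a and a' and take the difference of the predictive probabilities. The labels a, a' and the symbol Δ are introduced here for illustration and do not appear in the slides:

```latex
\Delta(a, a') =
  p\bigl(y \mid \text{data}, \text{attrib}, \text{hist}, do(\text{exposure} = a)\bigr)
  - p\bigl(y \mid \text{data}, \text{attrib}, \text{hist}, do(\text{exposure} = a')\bigr)
```

A ratio of the two probabilities would be an equally reasonable contrast. In either form the comparison is made for the same conditioning information, so only the intervened exposure value differs.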

5
Causal ideas
  • Causal mechanisms can involve pathways that are
    • direct, in the sense that they influence, in the
      postulated model structure, directly the outcome
      variable, or
    • indirect, in that their effect on the outcome is
      mediated via the levels of the measured risk
      factors.

6
MORGAM study
  • Evans et al. (2005)
  • Individuals of different ages in a cohort are
    monitored for
    • (fatal and non-fatal) occurrences of coronary
      heart disease (CHD) or stroke,
    • death from other causes.
  • Information on risk factors such as
    • smoking status,
    • blood pressure (BP),
    • body mass index (BMI),
    • total cholesterol and HDL cholesterol, and
    • possible earlier occurrences (yes/no) of CHD or
      stroke
    is collected at cohort baseline.

7
Genetic information
  • SNP (single nucleotide polymorphism) level
    genotype data from candidate loci, e.g. loci
    • functionally connected to, e.g., blood clotting,
    • associated with cardiovascular diseases,
    • associated with increased lipid levels.
  • Due to the cost involved, genotyping is only done
    on
    • all known cases of CHD or stroke, and
    • individuals belonging to a random subset of the
      original cohort.
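
The genotyping rule above (all cases plus a random subcohort) can be sketched as follows. This is a minimal illustration, not the MORGAM sampling procedure: the function name, the cohort size, the case identifiers and the 10% subcohort fraction are all hypothetical.

```python
import random


def case_cohort_set(cohort_ids, case_ids, subcohort_fraction, seed=0):
    """Return (genotyped, subcohort): all cases plus a random subcohort.

    The subcohort is drawn from the whole cohort without regard to case
    status, so it may contain some of the cases.
    """
    rng = random.Random(seed)
    n_sub = round(subcohort_fraction * len(cohort_ids))
    subcohort = set(rng.sample(sorted(cohort_ids), n_sub))
    genotyped = subcohort | set(case_ids)
    return genotyped, subcohort


# Hypothetical cohort of 1000 individuals with four known cases.
cohort = list(range(1000))
cases = {3, 57, 412, 888}
genotyped, subcohort = case_cohort_set(cohort, cases, subcohort_fraction=0.1)
print(len(subcohort), len(genotyped))
```

Everyone outside `genotyped` still contributes follow-up and baseline covariate data, but has no genotype information at all.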

8
Information missing
  • There is
    • no genetic information of any kind available on
      most members of the original cohort, and even for
      those belonging to the case-cohort set, only on
      the chosen candidate loci,
    • no knowledge of early fatal occurrences of CHD or
      stroke from outside the cohort.

9
Graphical representation
[Figure: graph with the nodes "candidate gene", "underlying covariate process", "covariate measurement", "measurement error variance", "time (age)", "parameters of interest" and "event endpoint"]

10
Aspects to be considered...
  • Time (problem 1)
    • BMI, BP and cholesterol level do not remain
      constant over time: they are individually varying
      stochastic processes.
    • Even an accurate measurement at a particular time
      cannot be directly related to the endpoints as a
      "cause".
    • The interpretation, and value for a causal
      analysis, of covariate measurements made in the
      past will generally depend on how long ago they
      were measured.

11
Further aspects
  • Feedback to covariate values from earlier
    events (problem 2)
    • Covariate values of individuals who had
      experienced a CHD event or stroke already before
      being recruited to the cohort may have been
      influenced by this event (e.g., the person quits
      smoking, changes diet, or gets medication to
      lower blood pressure).
  • Influence of an earlier treatment (problem 3)
    • After a first occurrence of non-fatal CHD or
      stroke, the risk for later similar events or
      death is likely to be more strongly influenced by
      the availability and success of the acute medical
      treatment than by the values of the measured risk
      factors/covariates.

12
Further aspects
  • Potential confounding issue (problem 4)
    • The considered candidate loci can influence both
      the values of the measured covariates and those
      of the outcome variables. If this is not properly
      accounted for in the modelling and analysis of
      data, they become a potential source of
      confounding in an observational study.
    • Here also: how about the rest of the
      genome, outside the selected candidate loci?

13
Further aspects
  • Large dimension of parameter space (problem 5)
    • The degree of SNP-based polymorphism present in
      the data generally far exceeds the numbers for
      which it would be possible, given the amount of
      data, to reliably estimate risks associated with
      individual genotypes.
    • Particularly problematic in this sense is the
      MHC/HLA region.

14
Some shortcuts
  • Problem 2
    • Ignore the current status covariate information
      that may have been influenced by the earlier
      occurrence, only keeping information on
      covariates that do not change in time (age, sex,
      genotype).
  • Problem 3
    • Consider follow-up data only up to the first
      occurrence of CHD or stroke.
  • Problems 1, 4 and 5
    • Try something more systematic. For problem 5,
      apply a monotonicity postulate and consequent
      partial ordering of risks. For problems 1 and 4,
      treat the missing covariate information in a
      distributional form (using data augmentation and
      MCMC).

15
Problem 5: dimension
  • Partial ordering
    • The two variants (alleles) of a biallelic SNP are
      labeled as 0 and 1, with 0 for the "common" and 1
      for the "rare" form.
    • Within each gene (more generally, linkage group),
      arrange the sequence of SNP genotypes (pairs of
      the form 00, 01, 10 and 11), each determined from
      the same SNP locus, into haplotypes. (Alleles
      belonging to the same, maternal or paternal,
      chromosome form a haplotype.)
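
When genotypes are observed without phase, several haplotype pairs can be compatible with the same sequence of SNP genotypes. The sketch below enumerates them by brute force. It is only an illustration of that ambiguity: the slides do not say how phase is resolved in the study, and the function name and input encoding (one unordered allele pair per SNP) are assumptions made here.

```python
from itertools import product


def compatible_diplotypes(genotypes):
    """Enumerate haplotype pairs compatible with unphased SNP genotypes.

    genotypes: list of allele pairs, one per SNP, e.g. [(0, 1), (1, 1)].
    Returns a set of frozensets; each frozenset holds the one or two
    distinct haplotypes (tuples of 0/1 alleles) making up a diplotype.
    """
    result = set()
    # At each SNP, either allele of the pair may sit on the first chromosome.
    for flips in product([False, True], repeat=len(genotypes)):
        h1 = tuple(b if f else a for (a, b), f in zip(genotypes, flips))
        h2 = tuple(a if f else b for (a, b), f in zip(genotypes, flips))
        result.add(frozenset((h1, h2)))
    return result


# Two heterozygous SNPs (first and third) give two possible phasings.
diplotypes = compatible_diplotypes([(0, 1), (1, 1), (0, 1)])
for d in sorted(sorted(d) for d in diplotypes):
    print(d)
```

With k heterozygous SNPs there are 2^(k-1) distinct phasings for k ≥ 1, which is one reason the number of haplotype classes grows quickly.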

16
Problem 5: dimension (2)
  • Denote by (-, ø, +) the less risky,
    neutral and more risky allele, respectively.
  • For each pair of alleles, there are three
    possibilities:
    • allele 0 is less risky than allele 1 (-+),
    • no effect (øø), and
    • allele 1 is less risky than allele 0 (+-).
  • Postulate: this ordering of alleles is extendible
    to a partial ordering of haplotype risks. For
    example, haplotype h1 is more risky than
    haplotype h2 if all its alleles are either more
    risky or neutral compared to the corresponding
    alleles in h2, and at least one is more risky.
  • Haplotypes can then be classified into groups,
    each being represented by a vector with elements
    chosen from {-, ø, +}. Modelling of risks is then
    done via such classes.
  • Extend this partial ordering into a partial
    ordering between two haplotype pairs (diplotypes).
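
The haplotype comparison described above can be sketched in a few lines. This is a minimal illustration under assumptions made here: the per-locus allele ordering is encoded as +1 (allele 1 more risky), 0 (no effect) or -1 (allele 1 less risky), and the function names are hypothetical.

```python
RANK = {"-": -1, "ø": 0, "+": 1}


def risk_labels(haplotype, ordering):
    """Map a haplotype (tuple of 0/1 alleles) to a vector over {-, ø, +}."""
    labels = []
    for allele, o in zip(haplotype, ordering):
        if o == 0:
            labels.append("ø")  # locus has no effect
        elif (o > 0) == (allele == 1):
            labels.append("+")  # carries the more risky allele
        else:
            labels.append("-")  # carries the less risky allele
    return tuple(labels)


def compare(h1, h2, ordering):
    """Return 1 if h1 is more risky than h2, -1 if less risky,
    0 if both fall in the same risk class, None if not comparable."""
    l1, l2 = risk_labels(h1, ordering), risk_labels(h2, ordering)
    diffs = [RANK[a] - RANK[b] for a, b in zip(l1, l2)]
    if all(d == 0 for d in diffs):
        return 0
    if all(d >= 0 for d in diffs):
        return 1
    if all(d <= 0 for d in diffs):
        return -1
    return None


# Hypothetical ordering for three loci: allele 1 risky, neutral, allele 0 risky.
ordering = (1, 0, -1)
print(risk_labels((1, 0, 0), ordering))
print(compare((1, 0, 0), (0, 1, 0), ordering))
print(compare((0, 0, 0), (1, 1, 1), ordering))
```

Haplotypes that differ only at neutral loci receive the same label vector, so they fall into one class; pairs that are riskier at one locus and less risky at another are left unordered, which is what makes the ordering partial.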

17
Problem 5: dimension (3)

18
Problem 5: dimension (4)

19
Problem 5: dimension (5)
[Figure: graph with the nodes "event endpoint", "genotype", "diplotype", "restrictions for parameters from the allele ordering", "population haplotype frequencies", "ordering of alleles of causal loci", "number of causal loci" and "location of causal loci"]

20
Problem 1: time
  • Regression dilution
    • Measuring time-dependent and individually
      varying covariates (such as BP, cholesterol level
      and BMI) at a single time point generally leads
      to an underestimation of the effect size.
    • But what should one do if for each individual
      there is only a single covariate measurement in
      the data?
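
Regression dilution can be illustrated with a small simulation. The sketch below deliberately swaps the hazard regression of the slides for a simple linear model, and treats within-person variation as additive noise around a long-term covariate level; the sample size, variances and true slope are arbitrary choices made here.

```python
import random


def ols_slope(x, y):
    """Least-squares slope of y on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx


rng = random.Random(1)
n, beta = 20000, 1.0
sd_true, sd_error = 1.0, 1.0

# Long-term covariate level, and a single noisy measurement of it.
x_true = [rng.gauss(0, sd_true) for _ in range(n)]
x_obs = [x + rng.gauss(0, sd_error) for x in x_true]
# The outcome depends on the long-term level, not on the measurement.
y = [beta * x + rng.gauss(0, 0.5) for x in x_true]

slope_true = ols_slope(x_true, y)
slope_obs = ols_slope(x_obs, y)
# Expected attenuation factor in this linear setting.
reliability = sd_true**2 / (sd_true**2 + sd_error**2)
print(slope_true, slope_obs, reliability * beta)
```

In this setting the slope estimated from the single measurement shrinks towards zero by roughly the reliability ratio, here one half, while the slope based on the long-term level stays near the true value.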

21
Problem 1: time (2)
  • Modelling the underlying covariate process
    • For dealing with time-dependent covariates in an
      explicit form, one needs a generator (stochastic
      intensities) for the covariate process, considered
      as a function of pre-t histories, as well as
      corresponding stochastic intensities for the
      endpoint (T, X) itself.
    • One possibility is to apply the Marked Point
      Process (MPP) framework. The considered
      endpoint, with a corresponding description of the
      outcome, can then be embedded into this process
      in a natural way as a marked point (T, X).

22
Problem 1: time (3)
  • Measurement error
    • If the covariate measurements also involve a
      random error, we need a measurement model. The
      model parameters can be estimated if there are
      additional data available on the progression of
      the covariates.
  • Numerical implementation
    • Using MCMC and data augmentation methods, but
      practical implementation can be difficult.
  • Dependence of the covariates on genotype
    information?
    • Fortunately, only long time averages of
      covariates are likely to be of importance for the
      considered endpoints. But the potential confounding
      problem remains.

23
Problem 4: missing data, confounding
  • Genetic factors are potential confounders in
    causal questions. If the relevant genotype
    information is known and its role has been
    properly accounted for in the statistical model,
    this problem can be dealt with by proper
    conditioning on such information.
  • But what to do when a majority of the cohort
    members, as in MORGAM, have not been genotyped?
  • Usual solution: restrict the analysis only to
    those individuals who have been genotyped. But
    then the relevant follow-up and covariate
    information that exists on the other cohort
    members will not be used in the analysis at all.

24
Problem 4: missing data, confounding
  • Treat also problem 4 as a missing data problem,
    considering a probability model for the missing
    genotypes and applying "full likelihood" and
    Bayesian inference (Kulathinal and Arjas 2006,
    cf. Scheike and Martinussen 2004). This solution
    involves considering the unknown genotypes in a
    distributional form.
  • Note, however, that a person's genotype, the
    measured risk factors and phenotype (time to
    event and event type) may all be statistically
    dependent on each other. Therefore the likelihood
    contribution from an individual who has not been
    genotyped involves an integration with respect to
    a (conditional) genotype distribution (which is
    generally different for different individuals).
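
For a discrete genotype, the integration mentioned above reduces to a sum over the possible genotypes. The sketch below shows the idea under strong simplifying assumptions that are not taken from the slides: a constant (exponential) hazard per genotype, a single event type, and no modelling of the case-cohort sampling mechanism. All names and numbers are hypothetical.

```python
import math


def event_likelihood(time, event, hazard):
    """Likelihood of an exponential follow-up time.

    event is 1 if the event occurred at `time`, 0 if follow-up was censored.
    """
    return (hazard ** event) * math.exp(-hazard * time)


def likelihood_contribution(time, event, genotype, genotype_probs, hazards):
    """Full-likelihood contribution of one individual.

    genotype_probs is the individual's own genotype distribution, which
    may depend on that person's covariates. genotype is None when the
    person has not been genotyped.
    """
    if genotype is not None:
        p = genotype_probs[genotype]
        return p * event_likelihood(time, event, hazards[genotype])
    # Not genotyped: sum over the possible genotypes.
    return sum(
        p * event_likelihood(time, event, hazards[g])
        for g, p in genotype_probs.items()
    )


# Hypothetical genotype distribution and genotype-specific hazards.
probs = {"aa": 0.49, "aA": 0.42, "AA": 0.09}
hazards = {"aa": 0.010, "aA": 0.015, "AA": 0.030}

observed = likelihood_contribution(12.0, 1, "aA", probs, hazards)
marginal = likelihood_contribution(12.0, 1, None, probs, hazards)
print(observed, marginal)
```

In a data augmentation scheme the sum would typically not be evaluated directly; instead the missing genotype would be sampled within the MCMC run along with the model parameters.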

25
Problem 4: missing data, confounding
  • In general, and depending on the information
    available, one can consider different levels of
    conditioning in the predictive probabilities
    p(y | data, attrib, hist, do(exposure)).
  • Depending on such a level, the interpretation of
    the results from causal analysis will differ,
    with more detailed conditioning taking us closer
    to the individual causal effect, which, however,
    can never be reached by a statistical analysis
    of data.

26
Problem 4: missing data, confounding
  • More detailed conditioning is also attractive as
    a recipe against potential confounders ("no
    unmeasured confounders" postulate).
  • Playing with finer level conditioning by using
    latent variable modelling can be attractive, but
    also risky if there is very little data, noisy
    data, or no data at all to support such modelling
    efforts.
  • In essence, such finer level predictive
    probabilities are calibrated against data that
    are actually observed.

27
Take-home messages
  • Careful consideration of sources of information
    is important.
  • Interpretation of results is often facilitated by
    establishing intuitive links to causal "what if"
    ideas (do-conditioning).
  • Less emphasis on inference (particularly
    statistical significance testing) concerning
    individual regression coefficients.

28
Take-home messages (2)
  • General modelling approach based on MPPs is
    useful, offering possibilities to consider
    conditioning of probabilities on different levels
    of information.
  • Bayesian approach, and applying MCMC for
    numerical computations, provides a flexible
    framework for statistical inference, keeping it
    within the domain of probability.