Title: Covariate information in complex event history data: some thoughts arising from a case study
1. Covariate information in complex event history data: some thoughts arising from a case study
- Elja Arjas
- Department of Mathematics and Statistics, University of Helsinki, and National Public Health Institute (KTL)
- Based on ongoing joint work with Olli Saarela and Sangita Kulathinal
2. Background and motivation
- Assessment of risk factors of cardiovascular diseases (e.g. coronary heart disease, stroke).
- Traditional approach for cohort analysis: hazard regression model, with covariates (e.g. blood pressure, cholesterol level, or body mass index) measured only at the baseline.
- Adding a genetic component: usually candidate loci, potentially causative on the basis of the available information about their function.
3. Emphasis on causal ideas
- Stressing probabilistic predictions: how would the probability of the outcome change if a covariate had a different value?
- Association vs. causation: the issue of confounding (change by intervention, do-conditioning, Pearl 2000).
4. Considering causal effects
- Compare, e.g., predictive probabilities of future response y:
  - p(y | data, attrib, hist, do(exposure))
  - vs.
  - p(y | data, attrib, hist, do(exposure'))
- for a generic individual (or for an equivalence class of exchangeable individuals) characterized by the attributes and past history used in conditioning (cf. Arjas and Parner 2004).
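The comparison above can be illustrated numerically. The following is a minimal sketch, not the model used in the study: it assumes a hypothetical constant-hazard model whose log-hazard is linear in exposure and age, and a few made-up posterior draws of the coefficients. The names `event_prob`, `predictive_risk` and `posterior_draws` are illustrative assumptions, and the difference has a causal reading only if the assumed model has no unmeasured confounders.

```python
import math

def event_prob(beta0, beta_exposure, beta_age, exposure, age, horizon):
    """P(event within `horizon` years) under a constant hazard
    exp(beta0 + beta_exposure*exposure + beta_age*age). Hypothetical model."""
    hazard = math.exp(beta0 + beta_exposure * exposure + beta_age * age)
    return 1.0 - math.exp(-hazard * horizon)

def predictive_risk(posterior_draws, exposure, age, horizon):
    """Average the event probability over posterior draws of the parameters,
    with the exposure value fixed by intervention, do(exposure)."""
    probs = [event_prob(b0, be, ba, exposure, age, horizon)
             for (b0, be, ba) in posterior_draws]
    return sum(probs) / len(probs)

# Made-up posterior draws (beta0, beta_exposure, beta_age), for illustration only.
posterior_draws = [(-6.0, 0.50, 0.04), (-5.8, 0.45, 0.04), (-6.2, 0.55, 0.05)]

# Same attributes (age 50, 10-year horizon), two different interventions.
risk_exposed = predictive_risk(posterior_draws, exposure=1, age=50, horizon=10)
risk_unexposed = predictive_risk(posterior_draws, exposure=0, age=50, horizon=10)
causal_contrast = risk_exposed - risk_unexposed
print(f"{risk_exposed:.3f} vs. {risk_unexposed:.3f} (difference {causal_contrast:.3f})")
```

Averaging over the draws, rather than plugging in a point estimate, keeps the comparison a statement about predictive probabilities.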
5. Causal ideas
- Causal mechanisms can involve pathways that are
  - direct, in the sense that they influence the outcome variable directly in the postulated model structure, or
  - indirect, in that their effect on the outcome is mediated via the levels of the measured risk factors.
6. MORGAM study
- Evans et al. (2005)
- Individuals of different ages in a cohort are monitored for
  - (fatal and non-fatal) occurrences of coronary heart disease (CHD) or stroke,
  - death from other causes.
- Information on the following risk factors is collected at cohort baseline:
  - smoking status,
  - blood pressure (BP),
  - body mass index (BMI),
  - total cholesterol and HDL cholesterol, and
  - possible earlier occurrences (yes/no) of CHD or stroke.
7. Genetic information
- SNP (single nucleotide polymorphism) level genotype data from candidate loci, e.g.
  - functionally connected e.g. to blood clotting,
  - associated with cardiovascular diseases,
  - associated with increased lipid levels.
- Due to the cost involved, genotyping is only done on
  - all known cases of CHD or stroke, and
  - individuals belonging to a random subset of the original cohort.
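The genotyping rule described here (all known cases plus a random subset of the cohort) can be sketched as follows. This is only an illustration of the selection logic under simple random sampling; the function name `case_cohort_set`, the sampling fraction and the ids are hypothetical, and the actual MORGAM sampling scheme may differ.

```python
import random

def case_cohort_set(cohort_ids, case_ids, subcohort_fraction, seed=0):
    """Return (ids to genotype, random subcohort): the subcohort is a simple
    random sample of the cohort, and all known cases are added to it."""
    rng = random.Random(seed)
    n_sub = round(subcohort_fraction * len(cohort_ids))
    subcohort = set(rng.sample(sorted(cohort_ids), n_sub))
    return subcohort | set(case_ids), subcohort

cohort_ids = list(range(1000))
case_ids = [3, 57, 412, 998]   # hypothetical CHD/stroke cases
genotyped, subcohort = case_cohort_set(cohort_ids, case_ids, subcohort_fraction=0.1)
print(len(subcohort), len(genotyped))
```

Cases that happen to fall in the random subset are counted once, so the genotyped set is at most the subcohort size plus the number of cases.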
8. Information missing
- There is
  - no genetic information of any kind available on most members of the original cohort, and even for those belonging to the case-cohort set, only on the chosen candidate loci,
  - no knowledge of early fatal occurrences of CHD or stroke from outside the cohort.
9. Graphical representation
[Figure: graphical model with nodes labelled event endpoint, parameters of interest, time (age), underlying covariate process, candidate gene, measurement error variance, covariate measurement]
10. Aspects to be considered...
- Time
  - BMI, BP and cholesterol level do not remain constant over time: they are individually varying stochastic processes.
  - Even an accurate measurement at a particular time cannot be directly related to the endpoints as a "cause".
  - The interpretation, and value for a causal analysis, of covariate measurements made in the past will generally depend on how long ago they were measured.
11. Further aspects
- Feedback to covariate values from earlier events
  - Covariate values of individuals who had experienced a CHD event or stroke before being recruited to the cohort may have been influenced by this event (e.g., the person quits smoking, changes diet, or gets medication to lower blood pressure).
- Influence of an earlier treatment
  - After a first occurrence of non-fatal CHD or stroke, the risk of later similar events or death is likely to be more strongly influenced by the availability and success of the acute medical treatment than by the values of the measured risk factors/covariates.
12. Further aspects
- Potential confounding issue
  - The considered candidate loci can influence both the values of the measured covariates and those of the outcome variables. If this is not properly accounted for in the modelling and analysis of data, they become a potential source of confounding in an observational study.
  - Here also: how about the rest of the genome, outside the selected candidate loci?
13. Further aspects
- Large dimension of parameter space
  - The degree of SNP-based polymorphism present in the data generally far exceeds the numbers for which it would be possible, given the amount of data, to reliably estimate risks associated with individual genotypes.
  - Particularly problematic in this sense is the MHC/HLA region.
14. Some shortcuts
- Problem 2
  - Ignore the current-status covariate information that may have been influenced by the earlier occurrence, keeping only information on covariates that do not change in time (age, sex, genotype).
- Problem 3
  - Consider follow-up data only up to the first occurrence of CHD or stroke.
- Problems 1, 4 and 5
  - Try something more systematic: for problem 5, apply a monotonicity postulate and the consequent partial ordering of risks. For problems 1 and 4, treat the missing covariate information in a distributional form (using data augmentation and MCMC).
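The shortcut for problem 3 amounts to truncating each individual's follow-up at the first relevant event. A minimal sketch of that reduction is given below; the event labels ("CHD", "stroke", "death_other"), the record layout and the function name `first_event_record` are assumptions made for illustration.

```python
def first_event_record(events, end_of_followup):
    """Reduce an individual's event list [(age, kind), ...] to follow-up up to
    the first CHD/stroke event, death from other causes, or end of follow-up."""
    for age, kind in sorted(events):
        if age > end_of_followup:
            break
        if kind in ("CHD", "stroke", "death_other"):
            return age, kind
    return end_of_followup, "censored"

# Hypothetical individual: a later stroke and death are ignored once CHD has occurred.
events = [(66.0, "stroke"), (61.5, "CHD"), (70.0, "death_other")]
print(first_event_record(events, end_of_followup=75.0))   # (61.5, 'CHD')
```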
15. Problem 5: dimension
- Partial ordering
  - The two variants (alleles) of a biallelic SNP are labeled 0 and 1, with 0 for the "common" and 1 for the "rare" form.
  - Within each gene (more generally, linkage group), arrange the sequence of SNP genotypes (pairs of the form 00, 01, 10 and 11), each determined from the same SNP locus, into haplotypes. (Alleles belonging to the same chromosome, maternal or paternal, form a haplotype.)
16. Problem 5: dimension (2)
- Denote by (-, ø, +) the less risky, neutral and more risky allele, respectively.
- For each pair of alleles, there are three possibilities:
  - allele 0 is less risky than allele 1 (-+),
  - no effect (øø), and
  - allele 1 is less risky than allele 0 (+-).
- Postulate: this ordering of alleles is extendible to a partial ordering of haplotype risks. For example, haplotype h1 is more risky than haplotype h2 if all its alleles are either more risky or neutral compared to the corresponding alleles in h2, and at least one is more risky.
- Haplotypes can then be classified into groups, each being represented by a vector with elements chosen from {-, ø, +}. Modelling of risks is then done via such classes.
- Extend this partial ordering into a partial ordering of haplotype pairs (diplotypes).
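The allele-based partial ordering of haplotypes can be made concrete with a short sketch. It assumes haplotypes are coded as strings of 0/1 alleles, one per SNP locus, and that the ordering at each locus is given as a two-character label for alleles 0 and 1; the letter "o" stands in for the neutral symbol ø. The function names and the example ordering are hypothetical.

```python
RANK = {"-": -1, "o": 0, "+": 1}   # "o" stands for the neutral symbol ø

def risk_vector(haplotype, allele_ordering):
    """Map a haplotype (string of alleles '0'/'1', one per SNP locus) to its
    vector of risk labels. allele_ordering[j] is '-+', 'oo' or '+-': the
    labels of allele 0 and allele 1 at locus j."""
    return tuple(allele_ordering[j][int(a)] for j, a in enumerate(haplotype))

def more_risky(h1, h2, allele_ordering):
    """True if h1 is at least as risky as h2 at every locus and strictly more
    risky at one or more loci. False also covers incomparable pairs."""
    r1 = [RANK[s] for s in risk_vector(h1, allele_ordering)]
    r2 = [RANK[s] for s in risk_vector(h2, allele_ordering)]
    return (all(a >= b for a, b in zip(r1, r2))
            and any(a > b for a, b in zip(r1, r2)))

ordering = ["-+", "oo", "+-"]   # hypothetical ordering at three loci
print(risk_vector("101", ordering))        # ('+', 'o', '-')
print(more_risky("100", "001", ordering))  # True
print(more_risky("101", "000", ordering))  # False: the pair is incomparable
```

Haplotypes that map to the same risk vector (here "100" and "110", which differ only at the neutral locus) fall into the same class, which is what reduces the number of risk parameters.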
17. Problem 5: dimension (3)
18. Problem 5: dimension (4)
19. Problem 5: dimension (5)
[Figure: graphical model with nodes labelled event endpoint, genotype, diplotype, restrictions for parameters from the allele ordering, population haplotype frequencies, ordering of alleles of causal loci, number of causal loci, location of causal loci]
20. Problem 1: time
- Regression dilution
  - Measuring time-dependent and individually varying covariates (such as BP, cholesterol level and BMI) at a single time point generally leads to an underestimation of the effect size.
  - But what should one do if for each individual there is only a single covariate measurement in the data?
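Regression dilution is easy to reproduce in a small simulation. The sketch below uses an ordinary linear regression rather than a hazard model, purely to keep the example short; the variances, sample size and variable names are assumptions chosen for illustration.

```python
import random

def slope(xs, ys):
    """Ordinary least-squares slope of ys on xs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx

rng = random.Random(1)
n = 20000
true_effect = 1.0
# Long-term ("usual") covariate level of each individual, variance 1.
usual_level = [rng.gauss(0.0, 1.0) for _ in range(n)]
# One reading per individual: usual level plus within-person variation, variance 1.
single_reading = [u + rng.gauss(0.0, 1.0) for u in usual_level]
# The outcome depends on the usual level, not on the single reading.
outcome = [true_effect * u + rng.gauss(0.0, 1.0) for u in usual_level]

slope_usual = slope(usual_level, outcome)
slope_single = slope(single_reading, outcome)
print(round(slope_usual, 2), round(slope_single, 2))
```

With equal variances for the usual level and the within-person variation, the slope on the single reading is attenuated by the factor 1/(1+1) = 0.5 in expectation, so the second estimate should come out near half the first.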
21. Problem 1: time (2)
- Modelling the underlying covariate process
  - To deal with time-dependent covariates in an explicit form, one needs a generator (stochastic intensities) for the covariate process, considered as a function of pre-t histories, as well as corresponding stochastic intensities for the end point (T, X) itself.
  - One possibility is to apply the marked point process (MPP) framework. The considered end point, with a corresponding description of the outcome, can then be embedded into this process in a natural way as a marked point (T, X).
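As a data structure, a marked point process history is simply a time-ordered collection of (T, X) pairs, and the pre-t history is the part of it observed before time t. The sketch below shows one hypothetical way to represent this; the class name `MarkedPoint`, the mark labels and the ages are illustrative assumptions, and no intensity model is implemented.

```python
from dataclasses import dataclass
from typing import List

@dataclass(frozen=True)
class MarkedPoint:
    time: float   # T: age (or time) at which the event occurs
    mark: str     # X: description of the event

def pre_t_history(points: List[MarkedPoint], t: float) -> List[MarkedPoint]:
    """Marked points strictly before t, in time order: the information on
    which the intensities at time t are allowed to depend."""
    return sorted((p for p in points if p.time < t), key=lambda p: p.time)

# Hypothetical individual history; the end point is just another marked point.
history = [
    MarkedPoint(62.3, "non-fatal CHD"),
    MarkedPoint(48.0, "baseline covariate measurement"),
    MarkedPoint(70.1, "death, other causes"),
]
print([p.mark for p in pre_t_history(history, 65.0)])
# ['baseline covariate measurement', 'non-fatal CHD']
```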
22. Problem 1: time (3)
- Measurement error
  - If the covariate measurements also involve a random error, we need a measurement model. The model parameters can be estimated if there are additional data available on the progression of the covariates.
- Numerical implementation
  - Using MCMC and data augmentation methods, but practical implementation can be difficult.
- Dependence of the covariates on genotype information?
  - Fortunately, only long-time averages of covariates are likely to be of importance for the considered endpoints. But the potential confounding problem remains.
23. Problem 4: missing data, confounding
- Genetic factors are potential confounders in causal questions. If the relevant genotype information is known and its role has been properly accounted for in the statistical model, this problem can be dealt with by proper conditioning on such information.
- But what to do when a majority of the cohort members, as in MORGAM, have not been genotyped?
- Usual solution: restrict the analysis to those individuals who have been genotyped. But then the relevant follow-up and covariate information that exists on the other cohort members will not be used in the analysis at all.
24. Problem 4: missing data, confounding
- Treat problem 4 also as a missing data problem, considering a probability model for the missing genotypes and applying "full likelihood" and Bayesian inference (Kulathinal and Arjas 2006, cf. Scheike and Martinussen 2004). This solution involves considering the unknown genotypes in a distributional form.
- Note, however, that a person's genotype, the measured risk factors and phenotype (time to event and event type) may all be statistically dependent on each other. Therefore the likelihood contribution from an individual who has not been genotyped involves an integration with respect to a (conditional) genotype distribution (which is generally different for different individuals).
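For a discrete genotype, the integration mentioned above is a weighted sum over the possible genotypes. The following sketch illustrates this for a single individual under a hypothetical constant-hazard model; the genotype coding (0, 1 or 2 copies of the rare allele), the conditional genotype probabilities and all parameter values are assumptions for illustration, not the model of Kulathinal and Arjas.

```python
import math

def likelihood_given_genotype(time, event, genotype, base_hazard, beta):
    """Constant-hazard contribution: hazard**event * exp(-hazard * time),
    with hazard = base_hazard * exp(beta * genotype). Hypothetical model."""
    hazard = base_hazard * math.exp(beta * genotype)
    return (hazard ** event) * math.exp(-hazard * time)

def likelihood_contribution(time, event, genotype, genotype_probs, base_hazard, beta):
    """If the genotype is observed, condition on it. If it is missing (None),
    sum over the individual's conditional genotype distribution."""
    if genotype is not None:
        return likelihood_given_genotype(time, event, genotype, base_hazard, beta)
    return sum(p * likelihood_given_genotype(time, event, g, base_hazard, beta)
               for g, p in genotype_probs.items())

# Hypothetical conditional genotype distribution for one ungenotyped individual,
# censored event-free after 8 years of follow-up.
genotype_probs = {0: 0.64, 1: 0.32, 2: 0.04}
L_missing = likelihood_contribution(8.0, 0, None, genotype_probs, base_hazard=0.01, beta=0.7)
L_known = likelihood_contribution(8.0, 0, 1, genotype_probs, base_hazard=0.01, beta=0.7)
print(L_missing, L_known)
```

In a full analysis `genotype_probs` would itself depend on the individual's covariates and on unknown parameters such as population haplotype frequencies, which is why the distribution generally differs between individuals.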
25. Problem 4: missing data, confounding
- In general, and depending on the information available, one can consider different levels of conditioning in the predictive probabilities p(y | data, attrib, hist, do(exposure)).
- Depending on such a level, the interpretation of the results from causal analysis will differ, with more detailed conditioning taking us closer to individual causal effects, which, however, can never be reached by a statistical analysis of data.
26. Problem 4: missing data, confounding
- More detailed conditioning is also attractive as a recipe against potential confounders ("no unmeasured confounders" postulate).
- Playing with finer-level conditioning by using latent variable modelling can be attractive, but also risky if there is very little data, noisy data, or no data at all to support such modelling efforts.
- In essence, such finer-level predictive probabilities are calibrated against data that are actually observed.
27. Take-home messages
- Careful consideration of sources of information is important.
- Interpretation of results is often facilitated by establishing intuitive links to causal "what if" ideas (do-conditioning).
- Less emphasis on inference (particularly statistical significance testing) concerning individual regression coefficients.
28. Take-home messages (2)
- A general modelling approach based on MPPs is useful, offering possibilities to consider conditioning of probabilities on different levels of information.
- The Bayesian approach, with MCMC for numerical computations, provides a flexible framework for statistical inference, keeping it within the domain of probability.