Covariate information in complex event history data: some thoughts arising from a case study


1
Covariate information in complex event history
data - some thoughts arising from a case study

Elja Arjas
Department of Mathematics and Statistics,
University of Helsinki
and
National Public Health Institute (KTL)
Based on ongoing joint work with Olli Saarela and
Sangita Kulathinal

2
Background and motivation

Assessment of risk factors of cardiovascular
diseases (e.g. coronary heart disease, stroke)
Traditional approach for cohort analysis: hazard
regression model, with covariates (e.g. blood
pressure, cholesterol level, or body mass index)
measured only at the baseline.
Adding a genetic component: usually candidate
loci, potentially causative on the basis of the
available information about their function.

3
Emphasis on causal ideas

Stressing probabilistic predictions: How would
the probability of the outcome change if a
covariate had a different value?
Association vs. causation: the issue of
confounding (change by intervention,
do-conditioning, Pearl 2000).

4
Considering causal effects

Compare, e.g., predictive probabilities of future
response y
p(y | data, attrib, hist, do(exposure))
vs.
p(y | data, attrib, hist, do(no exposure))
for a generic individual (or, for an
equivalence class of exchangeable individuals)
characterized by attributes and past history used
in conditioning (cf. Arjas and Parner 2004).
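
The contrast between conditioning on an observed exposure and conditioning on an intervention can be illustrated with a minimal sketch. It assumes a single measured binary confounder G (for instance a genotype), a binary exposure X and a binary outcome Y, and it uses the standard adjustment formula over G. All probabilities and function names are hypothetical and chosen only for illustration; they are not taken from the case study.

```python
# Hypothetical probabilities, for illustration only.
p_g = {0: 0.7, 1: 0.3}              # P(G = g)
p_x_given_g = {0: 0.2, 1: 0.6}      # P(X = 1 | G = g)
p_y_given_xg = {                    # P(Y = 1 | X = x, G = g), keyed by (x, g)
    (0, 0): 0.05, (1, 0): 0.10,
    (0, 1): 0.20, (1, 1): 0.30,
}


def p_y_given_x(x):
    """Observational P(Y=1 | X=x): G is weighted by P(G=g | X=x)."""
    def p_x(g):
        return p_x_given_g[g] if x == 1 else 1 - p_x_given_g[g]

    joint = {g: p_g[g] * p_x(g) for g in p_g}   # P(G=g, X=x)
    total = sum(joint.values())                 # P(X=x)
    return sum(p_y_given_xg[(x, g)] * joint[g] / total for g in p_g)


def p_y_do_x(x):
    """Interventional P(Y=1 | do(X=x)): G keeps its marginal distribution."""
    return sum(p_y_given_xg[(x, g)] * p_g[g] for g in p_g)


observed_difference = p_y_given_x(1) - p_y_given_x(0)
causal_difference = p_y_do_x(1) - p_y_do_x(0)
print(observed_difference, causal_difference)
```

With these numbers the observed difference is larger than the interventional one, because G raises both the probability of exposure and the probability of the outcome.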

5
Causal ideas

Causal mechanisms can involve pathways that are
direct in the sense that they influence, in the
postulated model structure, directly the outcome
variable, or
indirect in that their effect on the outcome is
mediated via the levels of the measured risk
factors.

6
MORGAM study

Evans et al. (2005)
Individuals of different ages in a cohort are
monitored for
(fatal and non-fatal) occurrences of coronary
heart disease (CHD) or stroke, and
death from other causes.
Information on risk factors such as
smoking status,
blood pressure (BP),
body mass index (BMI),
total cholesterol and HDL cholesterol and
possible earlier occurrences (yes/no) of CHD or
stroke
is collected at cohort baseline.

7
Genetic information

SNP (single nucleotide polymorphism) level
genotype data from candidate loci, for example loci
functionally connected, e.g., to blood clotting,
associated with cardiovascular diseases,
associated with increased lipid levels.
Due to the cost involved, genotyping is only done
on
all known cases of CHD or stroke, and
individuals belonging to a random subset of the
original cohort.
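
The selection of individuals for genotyping can be sketched as follows. This is a minimal illustration of a case-cohort design as described above (all known cases plus a random subset of the cohort); the function name, the sampling fraction and the identifiers are hypothetical, and the actual MORGAM sampling scheme may differ in detail.

```python
import random


def case_cohort_set(cohort_ids, case_ids, subcohort_fraction, seed=0):
    """Return the ids to genotype: a random subcohort plus all known cases.

    The subcohort is drawn from the whole cohort without regard to case
    status, so some cases may also belong to the subcohort.
    """
    rng = random.Random(seed)
    n_sub = round(subcohort_fraction * len(cohort_ids))
    subcohort = set(rng.sample(sorted(cohort_ids), n_sub))
    return subcohort | set(case_ids)


# Hypothetical cohort of 1000 individuals with three known cases.
cohort = list(range(1000))
cases = {3, 50, 999}
genotyped = case_cohort_set(cohort, cases, subcohort_fraction=0.1)
print(len(genotyped), "of", len(cohort), "individuals would be genotyped")
```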

8
Information missing

There is
no genetic information of any kind available on
most members of the original cohort, and even for
those belonging to the case-cohort set, only on
the chosen candidate loci;
no knowledge of early fatal occurrences of CHD or
stroke from outside the cohort.

9
Graphical representation

[Figure: graph linking event endpoint, parameters of interest, time (age), underlying covariate process, candidate gene, measurement error variance, covariate measurement]

10
Aspects to be considered...

Time
BMI, BP and cholesterol level do not remain
constant over time: they are individually varying
stochastic processes.
Even an accurate measurement at a particular time
cannot be directly related to the endpoints as a
"cause".
The interpretation, and value for a causal
analysis, of covariate measurements made in the
past will generally depend on how long ago they
were measured.

11
Further aspects

Feedback to covariate values from earlier
events
Covariate values of individuals who had
experienced a CHD event or stroke already before
being recruited to the cohort may have been
influenced by this event (e.g., the person quits
smoking, changes diet, or gets medication to
lower blood pressure).
Influence of an earlier treatment
After a first occurrence of non-fatal CHD or
stroke, the risk for later similar events or
death is likely to be more strongly influenced by
the availability and success of the acute medical
treatment than by the values of the measured risk
factors/covariates.

12
Further aspects

Potential confounding issue
The considered candidate loci can influence both
the values of the measured covariates and those
of the outcome variables. If this is not properly
accounted for in the modelling and analysis of
data, they become a potential source of
confounding in an observational study.
Here also: how about the rest of the
genome, outside the selected candidate loci?

13
Further aspects

Large dimension of parameter space
The degree of SNP-based polymorphisms present in
the data generally exceeds by far numbers for
which it would be possible, given the amounts of
data, to reliably estimate risks associated with
individual genotypes.
Particularly problematic in this sense is the
MHC/HLA region.

14
Some shortcuts

Problem 2:
Ignore the current status covariate information
that may have been influenced by the earlier
occurrence, only keeping information on
covariates that do not change in time (age, sex,
genotype).
Problem 3:
Consider follow-up data only up to the first
occurrence of CHD or stroke.
Problems 1, 4 and 5:
Try something more systematic. For problem 5,
apply a monotonicity postulate and consequent
partial ordering of risks. For problems 1 and 4,
treat the missing covariate information in a
distributional form (using data augmentation and
MCMC).

15
Problem 5: dimension

Partial ordering
The two variants (alleles) of a biallelic SNP are
labeled as 0 and 1, with 0 for the "common" and 1
for the "rare" form.
Within each gene (more generally, linkage group),
arrange the sequence of SNP genotypes (pairs of
the form 00, 01, 10 and 11), each determined from
the same SNP locus, into haplotypes. (Alleles
belonging to the same - maternal or paternal -
chromosome form a haplotype.)

16
Problem 5: dimension (2)

Use (-, ø, +) to denote a less risky,
neutral and more risky allele, respectively.
For each pair of alleles, there are three
possibilities:
allele 0 is less risky than allele 1 (-+),
no effect (øø) and
allele 1 is less risky than allele 0 (+-).
Postulate: this ordering of alleles is extendible
to a partial ordering of haplotype risks. For
example, haplotype h1 is more risky than
haplotype h2 if all its alleles are either more
risky or neutral compared to the corresponding
alleles in h2, and at least one is more risky.
Haplotypes can then be classified into groups,
each being represented by a vector with elements
chosen from {-, ø, +}. Modelling of risks is then
done via such classes.
Extend this partial ordering into a partial
ordering between haplotype pairs (diplotypes).
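
The postulated partial ordering of haplotypes can be sketched in a few lines. This is an illustrative reading of the rule stated above, not the authors' implementation: the encoding of the allele ordering as +1, 0, -1 per locus and the function names are assumptions made for the example.

```python
def risk_labels(haplotype, direction):
    """Map each allele of a haplotype to -1 (less risky), 0 (neutral), +1 (more risky).

    haplotype: tuple of alleles coded 0 or 1, one per SNP locus.
    direction: per locus, +1 if allele 1 is the more risky one,
               -1 if allele 0 is the more risky one, 0 if there is no effect.
    """
    return tuple(d * (1 if a == 1 else -1) for a, d in zip(haplotype, direction))


def compare(h1, h2, direction):
    """Compare two haplotypes under the partial ordering of risks."""
    l1, l2 = risk_labels(h1, direction), risk_labels(h2, direction)
    at_least = all(a >= b for a, b in zip(l1, l2))
    at_most = all(a <= b for a, b in zip(l1, l2))
    if at_least and at_most:
        return "equal"          # same label vector, hence same risk class
    if at_least:
        return "more risky"
    if at_most:
        return "less risky"
    return "incomparable"       # the ordering is only partial


# Hypothetical ordering for three loci: allele 1 risky, no effect, allele 0 risky.
direction = (1, 0, -1)
print(compare((1, 0, 0), (0, 0, 0), direction))  # differs only at locus 1
print(compare((1, 1, 1), (0, 0, 0), direction))  # loci 1 and 3 point in opposite ways
print(compare((0, 0, 0), (0, 1, 0), direction))  # differs only at the neutral locus
```

Haplotypes with identical label vectors fall into the same class, which is what reduces the number of risk parameters to be modelled.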

17
Problem 5: dimension (3)

18
Problem 5: dimension (4)

19
Problem 5: dimension (5)

[Figure: graph linking event endpoint, genotype, diplotype, restrictions for parameters from the allele ordering, population haplotype frequencies, ordering of alleles of causal loci, number of causal loci, location of causal loci]

20
Problem 1: time

Regression dilution
Measuring time-dependent and individually
varying covariates (such as BP, cholesterol level
and BMI) at a single time point generally leads
to an underestimation of the effect size.
But what should one do if for each individual
there is only a single covariate measurement in
the data?
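
Regression dilution can be seen in a small simulation. The sketch below is a simplification: it uses ordinary linear regression with a continuous outcome rather than a hazard regression model, and it assumes that a single measurement equals the individual's long-term level plus independent noise of the same variance. All values are simulated and hypothetical.

```python
import random


def ols_slope(x, y):
    """Least-squares slope of y on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx


rng = random.Random(1)
n = 20000
true_slope = 1.0

# Long-term ("usual") covariate level of each individual, variance 1.
usual = [rng.gauss(0, 1) for _ in range(n)]
# A single measurement: usual level plus within-person variation, variance 1.
single = [u + rng.gauss(0, 1) for u in usual]
# The outcome depends on the usual level, not on the one-off measurement.
outcome = [true_slope * u + rng.gauss(0, 0.5) for u in usual]

slope_usual = ols_slope(usual, outcome)
slope_single = ols_slope(single, outcome)
print(slope_usual, slope_single)
```

Under these assumptions the slope estimated from the single measurement is attenuated by the factor var(usual) / (var(usual) + var(noise)), which is one half here.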

21
Problem 1: time (2)

Modelling the underlying covariate process
For dealing with time-dependent covariates in an
explicit form, one needs a generator (stochastic
intensities) for the covariate process considered
as a function of pre-t histories, as well as
corresponding stochastic intensities for the end
point (T, X) itself.
One possibility is to apply the Marked Point
Process (MPP) framework. The considered end
point, with a corresponding description of the
outcome, can then be embedded into this process
in a natural way as a marked point (T, X).

22
Problem 1: time (3)

Measurement error
If the covariate measurements also involve a
random error, we need a measurement model. The
model parameters can be estimated if there are
additional data available on the progression of
the covariates.
Numerical implementation
Using MCMC and data augmentation methods, but
practical implementation can be difficult.
Dependence of the covariates on genotype
information?
Fortunately, only long time averages of
covariates are likely to be of importance for the
considered endpoints. But the potential confounding
problem remains.

23
Problem 4: missing data, confounding

Genetic factors are potential confounders in
causal questions. If the relevant genotype
information is known and its role has been
properly accounted for in the statistical model,
this problem can be dealt with by proper
conditioning on such information.
But what to do when a majority of the cohort
members, as in MORGAM, have not been genotyped?
Usual solution: restrict the analysis only to
those individuals who have been genotyped. But
then the relevant follow-up and covariate
information that exists on the other cohort
members will not be used in the analysis at all.

24
Problem 4: missing data, confounding

Treat also problem 4 as a missing data problem,
considering a probability model for the missing
genotypes and applying "full likelihood" and
Bayesian inference (Kulathinal and Arjas 2006,
cf. Scheike and Martinussen 2004). This solution
involves considering the unknown genotypes in a
distributional form.
Note, however, that a person's genotype, the
measured risk factors and phenotype (time to
event and event type) may all be statistically
dependent on each other. Therefore the likelihood
contribution from an individual who has not been
genotyped involves an integration with respect to
a (conditional) genotype distribution (which is
generally different for different individuals).
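
The integration over the unknown genotype can be sketched for a discrete genotype, where the integral becomes a sum. This is a toy illustration under strong assumptions that are not from the source: a constant genotype-specific hazard, a single event type, and a given conditional genotype distribution for the individual. The function names and all numbers are hypothetical.

```python
import math


def event_likelihood(time, event, hazard):
    """Likelihood of follow-up (time, event) under a constant hazard.

    event is 1 if the event was observed at `time`, 0 if censored.
    """
    return (hazard ** event) * math.exp(-hazard * time)


def individual_likelihood(time, event, hazards, genotype_probs, genotype=None):
    """Full-likelihood contribution of one individual.

    genotype_probs is the (conditional) genotype distribution for this
    individual. If the genotype is observed, it contributes jointly with
    the follow-up data; if not, the genotype is summed out.
    """
    if genotype is not None:
        return genotype_probs[genotype] * event_likelihood(time, event, hazards[genotype])
    return sum(
        genotype_probs[g] * event_likelihood(time, event, hazards[g])
        for g in genotype_probs
    )


# Hypothetical genotype-specific hazards and genotype distribution.
hazards = {"00": 0.01, "01": 0.02, "11": 0.04}
probs = {"00": 0.64, "01": 0.32, "11": 0.04}

ungenotyped = individual_likelihood(5.0, 1, hazards, probs)
genotyped = individual_likelihood(5.0, 1, hazards, probs, genotype="01")
print(ungenotyped, genotyped)
```

In a data augmentation scheme the sum would not be evaluated directly; instead the missing genotypes would be sampled alongside the model parameters.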

25
Problem 4: missing data, confounding

In general, and depending on the information
available, one can consider different levels of
conditioning in the predictive probabilities
p(y | data, attrib, hist,
do(exposure)).
Depending on such a level, the interpretation of
the results from causal analysis will differ,
with more detailed conditioning taking us closer
to individual causal effects, which, however,
can never be reached by a statistical analysis
of data.

26
Problem 4: missing data, confounding

More detailed conditioning is also attractive as
a recipe against potential confounders ("no
unmeasured confounders" postulate).
Playing with finer level conditioning by using
latent variable modelling can be attractive, but
also risky if there is very little data, noisy
data, or no data at all to support such modelling
efforts.
In essence, such finer level predictive
probabilities are calibrated against data that
are actually observed.

27
Take-home messages

Careful consideration of sources of information
is important.
Interpretation of results is often facilitated by
establishing intuitive links to causal "what if"
ideas (do-conditioning).
Less emphasis on inference (particularly
statistical significance testing) concerning
individual regression coefficients.

28
Take-home messages (2)

General modelling approach based on MPPs is
useful, offering possibilities to consider
conditioning of probabilities on different levels
of information.
Bayesian approach, and applying MCMC for
numerical computations, provides a flexible
framework for statistical inference, keeping it
within the domain of probability.
