Bojan Basrak - PowerPoint PPT Presentation

Slides: 39

Provided by: Bas90

1
EXTREME VALUES, COPULAS AND GENETIC MAPPING

Bojan Basrak
Department of Mathematics,
University of Zagreb, Croatia

EVA 2005, Gothenburg
2
Genetic mapping

A genetic map gives the relative positions of genes
on the chromosomes, with distances between them
typically measured in centimorgans (cM).
Linkage analysis aims to find the approximate
location of genes associated with certain traits
in plants and animals.
It is a statistical method that compares the genetic
similarity between two individuals (at a marker)
to the similarity of their physical or psychological
traits (phenotype).
Among the most studied traits are inheritable
diseases.

3
QTL

Quantitative trait: a measurable trait that shows
continuous variation, e.g. skin pigmentation,
height, cholesterol, etc.
Quantitative traits are normally influenced by
several genes and the environment.
QTL, or quantitative trait locus: a locus (or a
gene) affecting a quantitative trait.
There is even The Journal of Quantitative Trait
Loci.

Genetic similarity between two individuals at a
given locus is typically measured by a number
called identity by descent (IBD) status.
Two genes of two different people are IBD if one
is a physical copy of the other, or if they are
both copies of the same ancestral gene.
For any two people, IBD status is a number in the
set {0, 1, 2}. In real life, this number typically
needs to be estimated.

Linkage analysis is very effective with Mendelian
inheritance.
Mapping genes involved in inheritable diseases
can be done by comparing IBD status of affected
relatives (e.g. breast cancer).
Mapping QTLs in animals or plants is performed by
arranging a cross between two inbred strains,
which are substantially different in a
quantitative trait (e.g. tomato fruit mass or pH).

6
IBD status of two half sibs
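[Figure: the mother's chromosomes and the chromosomes of two half sibs (Sib 1 and Sib 2) after two meioses and some other developments. X(t) = number of alleles identical by descent at locus t; horizontal axis: distance in Morgans, with two marked loci t and s such that X(t) = 0 and X(s) = 1.]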
7

Recombinations, or more specifically, the locations
of crossovers in meiosis, are frequently modelled
by a stochastic process (the standard choice is the
Poisson process, suggested by Haldane in 1919).
The process (X(t)) is an ON-OFF process in the
case of half sibs, or a sum of two independent such
processes in the case of siblings.
In particular, under the Poisson process model,
(X(t)) is a stationary Markov process. Moreover,
X(t) is Bernoulli distributed for each t in the
case of half sibs.

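In the Haldane model, we have (for half sibs)
P(X(t) = X(s)) = θ^2 + (1 − θ)^2,
where
θ = (1 − exp(−2|t − s|)) / 2
is the recombination probability (t and s in Morgans).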
For simplicity, we assume that IBD status is
known at each marker (i.e. markers are completely
genetically informative).
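
The ON-OFF structure of the half-sib IBD process can be checked by simulation. The following is a minimal sketch, assuming crossovers in each of the two meioses form independent Poisson processes with rate 1 per Morgan (the Haldane model) and that X(t) = 1 exactly when both sibs received the same grandparental allele from the shared parent. All function names are hypothetical and introduced only for illustration.

```python
import math
import random


def haldane_theta(d):
    """Haldane recombination probability for loci d Morgans apart."""
    return 0.5 * (1.0 - math.exp(-2.0 * d))


def agreement_prob(d):
    """P(X(s) = X(t)) for half sibs: both meioses recombine, or neither does."""
    theta = haldane_theta(d)
    return theta ** 2 + (1.0 - theta) ** 2


def transmitted_origin(loci, rng):
    """Grandparental origin (0 or 1) passed on in one meiosis at each locus.

    loci must be sorted and non-negative (in Morgans). Crossovers are the
    points of a rate-1 Poisson process, simulated through exponential gaps.
    """
    state = rng.randint(0, 1)              # uniform start gives stationarity
    next_crossover = rng.expovariate(1.0)
    origins = []
    for t in loci:
        while next_crossover <= t:
            state = 1 - state              # each crossover switches the origin
            next_crossover += rng.expovariate(1.0)
        origins.append(state)
    return origins


def half_sib_ibd(loci, rng):
    """IBD status X(t) in {0, 1} of two half sibs at each locus."""
    first = transmitted_origin(loci, rng)
    second = transmitted_origin(loci, rng)
    return [int(a == b) for a, b in zip(first, second)]


def estimate_agreement(d, n_pairs, seed=0):
    """Monte Carlo estimate of P(X(0) = X(d)) from n_pairs simulated half-sib pairs."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_pairs):
        x0, xd = half_sib_ibd([0.0, d], rng)
        hits += (x0 == xd)
    return hits / n_pairs
```

Comparing `estimate_agreement(d, n_pairs)` with `agreement_prob(d)` for a few distances d gives a quick sanity check; under these assumptions the correlation of X(s) and X(t) is 2 P(X(s) = X(t)) − 1 = exp(−4|t − s|).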

The human genome consists of over 3 × 10^9 base pairs
(in two copies) on 23 chromosomes. The average
length of a chromosome is 140 cM.
The total length of the female (autosomal) genome is
4296 cM.
The total length of the male genome is 2851 cM.
That is, there is 1 expected crossover per 105
Mb in males and per 88 Mb in females. Thus, on the
human genome, 1 cM approximately equals 1 Mb.

10
Data

From n sib-pairs we observe
- a sequence of iid phenotype pairs (Y1, Y2), one pair
per sib-pair, with continuous marginal distribution,
and
- a sequence of iid IBD processes (X(t)), one process
per sib-pair.

11
[Figure: two panels, "IBD = 1 at t" and "IBD = 0 at t".]
12
Haseman-Elston

In 1972, Haseman and Elston suggested testing whether
there is a linear regression with negative slope between
the IBD status at a marker and the squared difference
of the two sibs' phenotypes.
Soon, this became the standard tool for mapping
QTLs in human genetics.
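
The Haseman-Elston idea can be sketched on simulated data. This is an illustration only, assuming full sibs with IBD status 0, 1, 2 occurring with probabilities 1/4, 1/2, 1/4, and standard normal phenotypes whose correlation given IBD status x is taken from a user-supplied table. The names `simulate_sib_pairs`, `haseman_elston_slope` and the correlation values are hypothetical choices, not taken from the presentation.

```python
import math
import random


def simulate_sib_pairs(n, rho_by_ibd, seed=0):
    """Return n triples (x, y1, y2): IBD status and two phenotypes.

    Given x, the phenotypes are standard normal with correlation
    rho_by_ibd[x] (an assumption made for this illustration).
    """
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x = rng.choices([0, 1, 2], weights=[1, 2, 1])[0]
        rho = rho_by_ibd[x]
        z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        y1 = z1
        y2 = rho * z1 + math.sqrt(1.0 - rho ** 2) * z2
        data.append((x, y1, y2))
    return data


def haseman_elston_slope(data):
    """Least-squares slope of the squared phenotype difference on IBD status."""
    xs = [x for x, _, _ in data]
    sq = [(y1 - y2) ** 2 for _, y1, y2 in data]
    mean_x = sum(xs) / len(xs)
    mean_sq = sum(sq) / len(sq)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (s - mean_sq) for x, s in zip(xs, sq))
    return sxy / sxx


if __name__ == "__main__":
    linked = simulate_sib_pairs(20000, {0: 0.2, 1: 0.4, 2: 0.6})
    unlinked = simulate_sib_pairs(20000, {0: 0.4, 1: 0.4, 2: 0.4}, seed=1)
    print("slope with linkage:   ", haseman_elston_slope(linked))
    print("slope without linkage:", haseman_elston_slope(unlinked))
```

With unit variances, E[(Y1 − Y2)^2 | x] = 2(1 − ρ(x)), so a correlation that grows with x produces a negative slope, while a constant correlation gives a slope near zero.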

13
Variance Components Model

The variance components model (Fulker and Cherny)
essentially assumes that, conditionally on the IBD
status x, the joint distribution of the phenotypes is
bivariate normal, with the same marginal distributions
and a correlation that depends on x.

14
Linkage Analysis

The main question
Does higher IBD status mean stronger dependence
between the two trait values?
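In the variance components model this translates into
the test of H0: the correlation does not depend on the
IBD status,
against HA: the correlation increases with the IBD status.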

15
Test statistic

Statistical test is based on the log-likelihood
ratio statistic
Or (equivalently) on the efficient score statistic

Where
is the score function, and
is appropriate entry of Fisher information matrix
and
needs to be estimated in practice.

17
[Figure: the test statistic Z(t) along the genome, with its maximum attained at tmax.]
18
Significance in genome-wide scans

If we have more than one marker, we need to deal
with the issue of multiple testing. The solution
of this problem depends on the intermarker
spacings and the sample size.
One could use permutation tests or other
simulation-based methods to obtain p-values.
If the sample size is large, one can apply a nice
asymptotic theory that determines significance
thresholds from the analysis of extremes of
certain Gaussian processes (see Lander and
Botstein; Siegmund et al.).

For an illustration, we assume that the markers
are dense, that is, IBD status is measured
continuously along the genome. It turns out that,
under our assumptions and the null hypothesis, one
can show that the process of test statistics (Z(t))
converges in distribution to an Ornstein-Uhlenbeck
process with mean zero and covariance function
exp(−r|t − s|), for a constant r > 0,
over each chromosome.

Now, approximate thresholds for a given
significance level can be obtained by studying
extremes of the Ornstein-Uhlenbeck process (cf.
Leadbetter et al.) over a finite interval. Hence, we
get an approximation to the probability that the
maximum of the process exceeds a threshold b.
For 23 human chromosomes with an average length of
140 cM and significance level 0.05, we get the
threshold b = 4.08 (3.62 on the LOD scale).
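
The quoted threshold can be reproduced numerically. The sketch below is not necessarily the exact formula on the slide. It assumes a commonly used large-threshold approximation for the maximum of a stationary Ornstein-Uhlenbeck process with covariance exp(−r|t − s|), applied to C independent chromosomes of total length L: P(max Z > b) ≈ 1 − exp(−C (1 − Φ(b)) − r L b φ(b)). It also assumes r = 4 per Morgan, which matches the correlation exp(−4|t − s|) of the IBD process under the Haldane model, and the conversion LOD = b^2 / (2 ln 10). Function names are hypothetical.

```python
import math


def normal_upper_tail(b):
    """1 - Phi(b) for the standard normal distribution."""
    return 0.5 * math.erfc(b / math.sqrt(2.0))


def normal_density(b):
    """Standard normal density phi(b)."""
    return math.exp(-0.5 * b * b) / math.sqrt(2.0 * math.pi)


def exceedance_prob(b, n_chrom, total_morgans, rate):
    """Approximate P(max Z(t) > b) over the whole genome (assumed formula)."""
    expected = (n_chrom * normal_upper_tail(b)
                + rate * total_morgans * b * normal_density(b))
    return 1.0 - math.exp(-expected)


def genome_wide_threshold(alpha, n_chrom, total_morgans, rate):
    """Solve exceedance_prob(b) = alpha by bisection on [1, 10].

    The approximation is decreasing in b for b >= 1, so bisection applies.
    """
    lo, hi = 1.0, 10.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if exceedance_prob(mid, n_chrom, total_morgans, rate) > alpha:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def lod_score(b):
    """Convert a normal-scale threshold to the LOD scale."""
    return b * b / (2.0 * math.log(10.0))


if __name__ == "__main__":
    # 23 chromosomes of average length 140 cM = 1.40 Morgans
    b = genome_wide_threshold(0.05, 23, 23 * 1.40, 4.0)
    print(round(b, 2), round(lod_score(b), 2))
```

Under these assumptions the solution agrees with the values stated above, b close to 4.08 and a LOD score close to 3.62.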

21
Other models

The asymptotic theory does not change for other,
more realistic models of the recombination
process (e.g. the Kosambi model or the chi-squared
model), since the asymptotic results for extremes
of Gaussian processes depend only on the local
behavior of the autocorrelation function of the
process.
However, for all of these models it holds that
corr(X(s), X(t)) ≈ 1 − r|t − s| as |t − s| converges
to 0. So in the limit we obtain a Gaussian process
with the same behavior of autocorrelations.

22
Disadvantages

The normality assumption is frequently questionable.
Correlation can be a very bad measure of
dependence if this assumption does not hold.
Risch and Zhang (1995) note:
"The majority of such pairs provide little power
to detect linkage; only pairs that are concordant
for high values, low values, or extremely
discordant pairs (for example, one in the top 10
percent and the other in the bottom 10 percent of the
distribution) provide substantial power."

23
Copula

The copula of a random pair (Y1, Y2) is the
distribution function C of the random vector
(F1(Y1), F2(Y2)),
where we assume that the marginal distributions
F1 and F2 of Y1 and Y2 are invertible. Hence the
marginal distributions of the copula are both
uniform on [0,1].
It is well known that the distribution of a
random pair splits into two marginal
distributions and the copula. Also, the copula is
invariant under continuous increasing
transformations.

It is straightforward to check that
P(Y1 ≤ y1, Y2 ≤ y2) = C(F1(y1), F2(y2)),
i.e. the distribution of a random pair splits
into two marginal distributions and the copula.
The copula is invariant under monotone
transformations, that is,
(Y1, Y2) and (h(Y1), h(Y2))
have the same copula, for any increasing function h.
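
The invariance property has a simple empirical counterpart: a rank-based estimate of the copula does not change when an increasing transformation is applied to the data, because ranks are preserved. The sketch below illustrates this with hypothetical helper names. Pseudo-observations are ranks divided by n + 1, and the empirical copula is the joint distribution function of these pseudo-observations.

```python
import math
import random


def pseudo_observations(values):
    """Ranks rescaled to (0, 1): rank / (n + 1), assuming no ties."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0] * n
    for rank, i in enumerate(order, start=1):
        ranks[i] = rank
    return [r / (n + 1) for r in ranks]


def empirical_copula(y1, y2, u, v):
    """Fraction of pairs whose pseudo-observations are <= (u, v)."""
    p1 = pseudo_observations(y1)
    p2 = pseudo_observations(y2)
    count = sum(1 for a, b in zip(p1, p2) if a <= u and b <= v)
    return count / len(y1)


if __name__ == "__main__":
    rng = random.Random(0)
    y1 = [rng.gauss(0.0, 1.0) for _ in range(1000)]
    y2 = [0.5 * a + rng.gauss(0.0, 1.0) for a in y1]
    # h = exp is increasing, so the empirical copula is unchanged
    before = empirical_copula(y1, y2, 0.3, 0.7)
    after = empirical_copula([math.exp(a) for a in y1],
                             [math.exp(b) for b in y2], 0.3, 0.7)
    print(before, after)
```

The two printed values coincide, although the transformed data have very different (log-normal type) marginal distributions.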

25
Basic Examples
26
Linkage analysis rephrased

The main question
Does higher IBD status mean stronger dependence
between the two trait values?
could be rephrased as
Does higher IBD status mean that the two trait
values have a more diagonalized copula?
Note: marginal distributions do not change with
IBD status.

27
Normal Copula

The normal copula is the copula of a normally
distributed random vector. Thus, if (Y1, Y2) is
bivariate normal with correlation ρ,
then the random vector (Y1, Y2) has the bivariate
normal copula.
Since it depends only on ρ, we denote it by C_ρ.

28
Bivariate Normal Copula
29
New Model

Assume that the pair (Y1, Y2) has
the same copula as in the variance components
model, i.e. a bivariate normal copula,
conditionally on the IBD status x,
and the same (but arbitrary) continuous marginal
distribution, i.e. F1 = F2.

The model is not so new after all: equivalently,
there is an increasing h such that (h(Y1), h(Y2))
satisfies the assumption of the v.c. model.
Suppose that h(Y1) has the standard normal
distribution function Φ; then F1(y) = Φ(h(y)).
That is, h(y) = Φ^(-1)(F1(y)).

We can proceed in two ways:
- we could guess (estimate) h, or
- we could guess (estimate) F1.
The first method is already frequently applied in
practice, while the second one is easier to justify
using the empirical distribution function of the
phenotypes.
To estimate F1 we may use data from a larger
sample if available.

32
Transformation

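In practice we might have only the 2n phenotypes of
the n sib-pairs to estimate the marginal distribution.
So we could use their (rescaled) empirical distribution
function in place of F1.
Transformed phenotypes are then the corresponding
normal scores.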

If , one can show the following
Theorem
as
Observe that we essentially use the van der Waerden
normal scores rank correlation coefficient to
measure dependence between the traits.
Klaassen and Wellner (1997) showed that this is
an asymptotically efficient estimator of the
correlation parameter in the bivariate normal copula
model.
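
The transformation and the resulting normal scores correlation can be sketched as follows. This is a minimal illustration, assuming that the 2n phenotypes are pooled (since F1 = F2), that the empirical distribution function is rescaled as rank / (2n + 1), and that there are no ties. The function names are hypothetical.

```python
import math
import random
from statistics import NormalDist


def normal_scores(values):
    """Van der Waerden scores: inverse normal cdf of rank / (N + 1)."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    std_normal = NormalDist()
    scores = [0.0] * n
    for rank, i in enumerate(order, start=1):
        scores[i] = std_normal.inv_cdf(rank / (n + 1))
    return scores


def pearson(u, v):
    """Ordinary sample correlation coefficient."""
    mean_u = sum(u) / len(u)
    mean_v = sum(v) / len(v)
    suv = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    suu = sum((a - mean_u) ** 2 for a in u)
    svv = sum((b - mean_v) ** 2 for b in v)
    return suv / math.sqrt(suu * svv)


def normal_scores_correlation(y1, y2):
    """Pool the 2n phenotypes, transform to normal scores, correlate within pairs."""
    n = len(y1)
    scores = normal_scores(list(y1) + list(y2))
    return pearson(scores[:n], scores[n:])


if __name__ == "__main__":
    rng = random.Random(0)
    rho = 0.6
    z1 = [rng.gauss(0.0, 1.0) for _ in range(5000)]
    z2 = [rho * a + math.sqrt(1 - rho ** 2) * rng.gauss(0.0, 1.0) for a in z1]
    # Skewed marginals with the same normal copula
    y1 = [math.exp(a) for a in z1]
    y2 = [math.exp(b) for b in z2]
    print(pearson(y1, y2), normal_scores_correlation(y1, y2))
```

In this example the ordinary correlation of the skewed phenotypes is distorted by the marginal distribution, while the normal scores correlation depends only on ranks and so recovers the copula parameter.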

Hence, it is also an efficient estimator of the
maximum correlation coefficient.
For a pair of random variables Y1 and Y2, the
maximum correlation coefficient is defined as
sup corr(a(Y1), b(Y2)),
where the supremum is taken over all real
transformations a and b such that a(Y1) and b(Y2)
have finite nonzero variance.

35
Simulation study
36
Application - Lp(a)

Twin data on lipoprotein levels, collected in 4
populations in three countries (Australia, the
Netherlands, Sweden).
Analysis was performed using the variance
components method and published by Beekman et al.
(2003).

37
Ad hoc transformation
38
Lp(a) - chromosome 1
39
Lp(a) - chromosome 6
40
Discussion

The normal copula based method has correct
critical levels under the null hypothesis for any
marginal distribution. Its power seems to be
close to optimal.
The method easily extends to general pedigrees,
discrete data, multiple QTLs, etc.
It is straightforward to implement in any
existing software.
Other families of copulas (Clayton, Gumbel, etc.)
could be more suitable in certain applications.

41
Discrete data

In biomedical applications, phenotypes are
frequently measured on some ordinal scale, that
is, they take values in {1, ..., l} for some natural
number l.
If we want to detect whether higher IBD status
translates into more similar phenotypic values, we
may apply nonparametric methods, or discretize
some parametric family of copulas and test whether
the parameters change with IBD status.

42
Discrete data
43
Acknowledgments