About This Presentation
Title:

Coalescence

Description:

Haplotype trees are not new in population genetics; they have been around in the ... It is dangerous to equate a haplotype tree to a species tree. ... – PowerPoint PPT presentation

Slides: 56
Provided by: alantem
Transcript and Presenter's Notes

Title: Coalescence


1
Coalescence
DNA Replication
DNA Coalescence
A coalescent event occurs when two lineages of
DNA molecules merge back into a single DNA
molecule at some time in the past.
2
Gene Tree (all copies of homologous DNA coalesce
to a common ancestral molecule)
COALESCENCE OF n COPIES OF HOMOLOGOUS DNA
3
Coalescence in an Ideal Population of N with
Ploidy Level x
  • Each act of reproduction is equally likely to
    involve any of the N individuals, with each
    reproductive event being an independent event
  • Under these conditions, the probability that two
    gametes are drawn from the same parental
    individual is 1/N
  • With ploidy level x, the probability of identity
    by descent/coalescence from the previous
    generation is (1/x)(1/N) = 1/(xN)
  • In practice, real populations are not ideal, so
    pretend the population is ideal but with an
    inbreeding effective size of an idealized
    population of size Nef. Therefore, the prob. of
    coalescence in one generation is 1/(xNef)

4
Sample Two Genes at Random
The probability of coalescence exactly t
generations ago is the probability of no
coalescence for the first t-1 generations in the
past followed by a coalescent event at generation
t: [1 - 1/(xNef)]^(t-1) [1/(xNef)]
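The verbal definition above is a geometric distribution. A minimal Python sketch, assuming the per-generation coalescence probability 1/(xNef) from the ideal-population slide (the function name and the example values x = 2, Nef = 50 are illustrative and not from the slides):

```python
def prob_coalescence_at(t, x, nef):
    """Probability that two sampled genes coalesce exactly t generations ago."""
    p = 1.0 / (x * nef)              # chance of coalescence in any one generation
    return (1.0 - p) ** (t - 1) * p  # t-1 misses followed by one coalescence

# Illustrative values only: diploid (x = 2) with Nef = 50.
probs = [prob_coalescence_at(t, x=2, nef=50) for t in range(1, 5001)]
total = sum(probs)                   # approaches 1 as the range of t grows
```

Summing over enough generations should approach 1, which is a quick check that the expression is a proper probability distribution.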
5
Sample Two Genes at Random
The average time to coalescence is xNef generations.
The variance of time to coalescence of two genes
(σ²ct) is the average or expectation of (t - xNef)^2
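As a rough check on these two quantities, the sketch below uses the textbook mean and variance of a geometric distribution with success probability p = 1/(xNef). Treating the coalescence time as exactly geometric is an assumption carried over from the ideal-population model, and the function name and example values are hypothetical.

```python
def pairwise_coalescence_moments(x, nef):
    """Mean and variance of the coalescence time (in generations) for two genes."""
    p = 1.0 / (x * nef)            # per-generation coalescence probability
    mean = 1.0 / p                 # equals xNef
    variance = (1.0 - p) / p ** 2  # equals xNef(xNef - 1), close to (xNef)^2
    return mean, variance

# Illustrative values only: x = 2, Nef = 50.
mean, variance = pairwise_coalescence_moments(x=2, nef=50)
```

The variance grows with the square of Nef while the mean grows only linearly, so the standard deviation is about as large as the mean itself.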

6
Sample n Genes at Random
7
Sample n Genes at Random
8
Sample n Genes at Random
Once the first coalescent event has occurred, we
now have n-1 gene lineages, and therefore we
simply repeat all the calculations with n-1
rather than n. In general, the expected time and
variance between the (k-1)th coalescent event and
the kth event are
9
Sample n Genes at Random
The average times to the first and last
coalescence are
2xNef/[n(n-1)] and 2xNef(1-1/n)
  • Let n = 10 and x = 2, then the time span covered by
    coalescent events is expected to range from
    0.0444Nef to 3.6Nef.
  • Let n = 100, then the time span covered by
    coalescent events is expected to range from
    0.0004Nef to 3.96Nef.
  • These equations imply that you do not need large
    samples to cover deep (old) coalescent events,
    but if you want to sample recent coalescent
    events, large sample sizes are critical.
  • For n large, the expected coalescent time for all
    genes is 2xNef
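The numbers in these bullets can be reproduced with a short Python sketch. It assumes the usual large-population approximation that, with j lineages present, the expected wait to the next coalescent event is 2xNef/[j(j-1)]. That interval formula is an assumption here, chosen because it agrees with the first-coalescence expression above, and the function names are illustrative.

```python
def expected_interval(k, n, x, nef):
    """Expected generations between the (k-1)th and kth coalescent events."""
    lineages = n - k + 1  # lineages present just before the kth event
    return 2.0 * x * nef / (lineages * (lineages - 1))

def expected_first_and_last(n, x, nef):
    """Expected times to the first coalescence and to the last (common ancestor)."""
    first = expected_interval(1, n, x, nef)
    # The n-1 intervals add up to 2xNef(1 - 1/n).
    last = sum(expected_interval(k, n, x, nef) for k in range(1, n))
    return first, last

# Setting nef = 1.0 reports the times in units of Nef.
first_10, last_10 = expected_first_and_last(n=10, x=2, nef=1.0)
first_100, last_100 = expected_first_and_last(n=100, x=2, nef=1.0)
```

Going from n = 10 to n = 100 moves the expected first coalescence about a hundredfold closer to the present but barely changes the time to the common ancestor, which is the point of the third bullet.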

10
Sample n Genes at Random
The variance of time to coalescence of n genes
is
  • Note that in both the 2- and n-sample cases, the
    mean coalescent times are proportional to Nef and
    the variances are proportional to Nef^2.
  • The Standard Molecular Clock is a Poisson Clock
    in Which the Mean = Variance.
  • The Coalescent is a noisy evolutionary process
    with much inherent variation that cannot be
    eliminated by large n's; it is innate to the
    evolutionary process itself and is called
    evolutionary stochasticity.

11
Buri's Experiment on Genetic Drift
12
Fixation (Coalescence) Times in 105 Replicates of
the Same Evolutionary Process
Problem: No Replication With Most Real Data
Sets. Only 1 Realization.
13
Evolutionary Stochasticity
Using the standard molecular clock and an
estimator of μ of 10^-8 per year, the time to
coalescence of all mtDNA to a common ancestral
molecule has been estimated to be 290,000 years
ago (Stoneking et al. 1986). This figure of
290,000 however is subject to much error because
of evolutionary stochasticity. When evolutionary
stochasticity is taken into account (ignoring
sampling error, measurement error, and the
considerable ambiguity in μ), the 95% confidence
interval around 290,000 is 152,000 years to
473,000 years (Templeton 1993) -- a span of over
300,000 years!
14
Coalescence of mtDNA in an Ideal Population of
N♀ haploids
  • Each act of reproduction is equally likely to
    involve any of the N♀ individuals, with each
    reproductive event being an independent event
  • Under these conditions, the probability that two
    gametes are drawn from the same parental
    individual is 1/N♀
  • Under haploidy, the probability of identity by
    descent/coalescence from the previous generation
    is (1)(1/N♀) = 1/N♀
  • In practice, real populations are not ideal, so
    pretend the population is ideal but with an
    inbreeding effective size of an idealized
    population of size Nef♀. Therefore, the prob. of
    coalescence in one generation is 1/Nef♀

15
Expected Coalescence Times for a Large Sample of
Genes
16
Estimated Coalescence Times for 24 Human Loci
[Figure: bar chart of TMRCA (in millions of years, axis from 0 to 9) by locus, with bars grouped as uniparental haploid DNA regions, X-linked loci, and autosomal loci. Locus labels: FIX, CCR5, ECP, EDN, HFE, MX1, MAO, APLX, FUT6, FUT2, G6PD, MC1R, Y-DNA, mtDNA, Xq13.3, AMELX, PDHA1, MS205, Lactase, TNFSF5, Hb-Beta, CYP1A2, HS571B2, RRM2P4, MSN/ALAS2.]
17
Coalescence With Mutation
18
Mutation Creates Variation and Destroys Identity
by Descent
19
Coalescence Before Mutation


20
Mutation Before Coalescence
Mutation


21
Mutation and Coalescence: Genetic Diversity =
Expected Heterozygosity (where θ = 2xNefμ)
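The slide's equation is not in the transcript. A minimal sketch, assuming the standard result that the chance of a mutation occurring before coalescence (the expected heterozygosity) is θ/(θ + 1), with θ read as 2xNefμ. Both the formula and that reading of θ are assumptions here, and the function name and parameter values are hypothetical.

```python
def expected_heterozygosity(x, nef, mu):
    """Expected heterozygosity under mutation and drift, H = theta / (theta + 1)."""
    theta = 2.0 * x * nef * mu  # assumed reading of the slide's garbled symbol
    return theta / (theta + 1.0)

# Hypothetical values: diploid autosomal locus, Nef = 10,000, mu = 1e-8 per site.
h_low = expected_heterozygosity(x=2, nef=10_000, mu=1e-8)
# When theta = 1, mutation and coalescence are equally likely to come first.
h_half = expected_heterozygosity(x=2, nef=25_000, mu=1e-5)
```

Heterozygosity rises with either a larger effective size or a higher mutation rate, since both make it more likely that a mutation intervenes before two lineages coalesce.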
22
Gene Vs. Allele (Haplotype) Tree
23
Gene Trees vs. Haplotype Trees
Gene trees are genealogies of genes. They
describe how different copies at a homologous
gene locus are related by ordering coalescent
events. The only branches in the gene tree that
we can observe from sequence data are those
marked by a mutation. All branches in the gene
tree that are caused by DNA replication without
mutation are not observable. Therefore, the tree
observable from sequence data retains only those
branches in the gene tree associated with a
mutational change. This lower resolution tree is
called an allele or haplotype tree. The allele
or haplotype tree is the gene tree in which all
branches not marked by a mutational event are
collapsed together.
24
Unrooted Haplotype Tree
25
Haplotype trees are not new in population
genetics; they have been around in the form of
inversion trees since the 1930s.
26
Haplotype Trees Can Coalesce Both Within And
Between Species
27
Ebersberger et al. (2007) Estimated Trees From
23,210 DNA Sequences In Apes & Rhesus Monkey.
Below Are The Numbers That Significantly Resolved
the Species Tree
28
Haplotype Trees ≠ Species or Population Trees
29
It is dangerous to equate a haplotype tree to a
species tree. It is NEVER justified to equate a
haplotype tree to a tree of populations within a
species because the problem of lineage sorting is
greater and the time between events is shorter.
Moreover, a population tree need not exist at
all.
30
Homoplasy & The Infinite Sites Model
  • Homoplasy is the phenomenon of independent
    mutations (& many gene conversion events)
    yielding the same genetic state.
  • Homoplasy represents a major difficulty when
    trying to reconstruct evolutionary trees, whether
    they are haplotype trees or the more traditional
    species trees of evolutionary biology.
  • It is common in coalescent theory (and molecular
    evolution in general) to assume the infinite
    sites model in which each mutation occurs at a
    new nucleotide site.
  • Under this model, there is no homoplasy because
    no nucleotide site can ever mutate more than
    once. Each mutation creates a new haplotype.

31
Homoplasy & The Infinite Sites Model
32
Homoplasy & The Infinite Sites Model
33
E. g., Apoprotein E Gene Region
No recombination has been detected in this region.
34
The Apo-protein E Haplotype Tree
35
The Apo-protein E Haplotype Tree
Use a Finite Sites mutation model that allows
homoplasy. One can show that the probability of
homoplasy between two nodes increases with the
number of observed mutational differences.
Therefore, allocate homoplasies to longer
branches. This is called Statistical Parsimony
because you can use models to calculate the
probability of violating parsimony for a given
branch length.
36
The Apo-protein E Statistical Parsimony Haplotype
Tree
Homoplasy is still common, as shown by circled
mutations.
In this case, most of the homoplasy is associated
with Alu sequences, a common repeat type in the
human genome that is known to cause local gene
conversion, which mimics the effects of parallel
mutations.
37
Estimated Times To Common Ancestor (Method of
Takahata et al. 2001)
Dhc = Nuc. Diff. Between Humans & Chimps
Dh = Nuc. Diff. Within Humans
TMRCA = 12 Dh/Dhc
6 Million Years Ago
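The estimator on this slide is simple enough to sketch in Python. Reading the constant 12 as 2 × 6 million years (twice the human-chimp divergence time shown on the slide) is an assumption, not something the transcript states, and the function name and the input values below are hypothetical.

```python
def tmrca_million_years(dh, dhc, split_my=6.0):
    """TMRCA in millions of years from within-human and human-chimp differences.

    dh       -- nucleotide difference within humans
    dhc      -- nucleotide difference between humans and chimps
    split_my -- assumed human-chimp divergence time, in millions of years
    """
    return 2.0 * split_my * dh / dhc  # reduces to 12 * Dh / Dhc when split_my = 6

# Hypothetical inputs, chosen only to show the arithmetic.
tmrca = tmrca_million_years(dh=0.001, dhc=0.012)
```

Because the estimate is a ratio, it does not need a separate value for the mutation rate. It leans on the assumed divergence date instead.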
38
The Apo-protein E Haplotype Coalescent
[Figure: the Apo-protein E haplotype tree drawn against a time axis in Years (x 10^5) running from 0 to 3.2. Branches are labeled with the positions of their mutations and tips with haplotype numbers, grouped under the ε2, ε3 and ε4 alleles.]
39
Estimate the distribution of the age of the
haplotype or clade as a Gamma Distribution
(Kimura, 1970) with mean T = 4N (or N for mtDNA)
and variance T^2/(1+k) (Tajima, 1983), where k is
the average pairwise divergence among present day
haplotypes derived from the haplotype being aged,
measured as the number of nucleotide
differences. NOTE: VARIANCE INCREASES WITH
INCREASING T AND DECREASING k!
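A gamma distribution with mean T and variance T^2/(1+k) has shape 1 + k and scale T/(1 + k). The sketch below, using only the Python standard library, turns that into an approximate 95% interval by simulation. The age T = 200,000 years and k = 3 are hypothetical values, not estimates from the slides.

```python
import random

def gamma_age_parameters(t_mean, k):
    """Shape and scale of a gamma with mean t_mean and variance t_mean**2 / (1 + k)."""
    shape = 1.0 + k
    scale = t_mean / shape
    return shape, scale

# Hypothetical inputs: estimated age of 200,000 years, k = 3 pairwise differences.
shape, scale = gamma_age_parameters(t_mean=200_000, k=3)

rng = random.Random(0)  # fixed seed so the run is repeatable
ages = sorted(rng.gammavariate(shape, scale) for _ in range(20_000))
lower = ages[int(0.025 * len(ages))]  # approximate 2.5th percentile
upper = ages[int(0.975 * len(ages))]  # approximate 97.5th percentile
```

With small k the interval is wide and strongly skewed toward older ages, which is why haplotypes with few derived differences are so hard to date.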
40
The Apo-protein E Haplotype Coalescent
[Figure: the same haplotype coalescent, with the probability density f(t) of each estimated age plotted against Years (x 10^5).]
41
Because of Deviations From The Infinite Sites
Model, Corrections Must Also be Made in How We
Count the Number of Mutations That Occurred in
The Coalescent Process.
42
The Basic Idea of Coalescence Is That Any Two
Copies of Homologous DNA Will Coalesce Back To An
Ancestral Molecule Either Within Or Between
Species
Time
t
43
Mutations Can Accumulate in the Two DNA Lineages
During This Time, t, to Coalescence. We Quantify
This Mutational Accumulation Through A Molecule
Genetic Distance
Time
t
X Mutations
Y Mutations
44
Molecule Genetic Distance = X + Y. If μ = the
neutral substitution rate, then the Expected
Value of X = μt and the Expected Value of Y = μt,
So the Expected Value of the Genetic Distance =
2μt
Complication: Only Under The Infinite Sites
Model Are X + Y Directly Observable; Otherwise
X + Y ≥ The Observed Number of Differences.
Time
t
X Mutations
Y Mutations
Use Models of DNA Mutation To Correct For
Undercounting
45
Molecule Genetic Distance = X + Y = 2μt: THE
JUKES-CANTOR GENETIC DISTANCE
Consider a single nucleotide site that has a
probability μ of mutating per unit time (only
neutral mutations are allowed). This model
assumes that when a nucleotide site mutates it is
equally likely to mutate to any of the three
other nucleotide states. Suppose further that
mutation is such a rare occurrence that in any
time unit it is only likely for at most one DNA
lineage to mutate and not both. Finally, let pt
be the probability that the nucleotide site is in
the same state in the two DNA molecules being
compared given they coalesced t time units ago.
Note that pt refers to identity by state and is
observable from the current sequences. Then,
46
Molecule Genetic Distance = X + Y = 2μt: THE
JUKES-CANTOR GENETIC DISTANCE
Approximating the above by a differential
equation yields
Extract 2μt from the equation given above
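A sketch of the standard Jukes-Cantor algebra under the assumptions stated on the previous slide. The symbols follow the slide text, but the equations themselves are a reconstruction and are not copied from the slides. The recurrence says the two molecules stay identical if neither mutates, or become identical if one of them mutates into the other's state.

```latex
\begin{align}
p_{t+1} &= (1 - 2\mu)\,p_t + \tfrac{2\mu}{3}\,(1 - p_t) \\
\frac{dp_t}{dt} &\approx p_{t+1} - p_t = \tfrac{2\mu}{3}\,(1 - 4p_t) \\
p_t &= \tfrac{1}{4} + \tfrac{3}{4}\,e^{-8\mu t/3} \qquad (p_0 = 1) \\
2\mu t &= -\tfrac{3}{4}\,\ln\!\left(\frac{4p_t - 1}{3}\right)
\end{align}
```

As t grows, p_t falls toward 1/4, the chance that two unrelated sequences share a nucleotide when all four states are equally likely.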
47
Molecule Genetic Distance = X + Y = 2μt: THE
JUKES-CANTOR GENETIC DISTANCE
The above equation refers to only a single
nucleotide, so pt is either 0 or 1. Hence, this
equation will not yield biologically meaningful
results when applied to just a single nucleotide.
Therefore, Jukes and Cantor (1969) assumed that
the same set of assumptions is valid for all the
nucleotides in the sequenced portion of the two
molecules being compared. Defining π as the
observed number of nucleotides that are different
divided by the total number of nucleotides being
compared, Jukes and Cantor noted that pt is
estimated by 1-π. Hence, substituting 1-π for pt
yields Dt = 2μt = -(3/4) ln(1 - (4/3)π)
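In code, the Jukes-Cantor correction is a single line. A minimal Python sketch, taking π as the observed proportion of differing sites. The function name and the example value π = 0.05 are illustrative.

```python
import math

def jukes_cantor_distance(pi):
    """Jukes-Cantor distance (an estimate of 2*mu*t) from the proportion pi."""
    if not 0.0 <= pi < 0.75:
        # At pi >= 3/4 the sequences look random and the logarithm is undefined.
        raise ValueError("pi must lie in [0, 0.75)")
    return -0.75 * math.log(1.0 - 4.0 * pi / 3.0)

# Illustrative value: 5% of compared sites differ.
d = jukes_cantor_distance(0.05)
```

The corrected distance is always at least as large as π itself, because it adds back the multiple hits that the raw count of differences cannot see.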
48
Molecule Genetic Distance = X + Y = 2μt: THE
KIMURA 2-PARAMETER GENETIC DISTANCE
The Jukes and Cantor genetic distance model
assumes neutrality and that mutations occur with
equal probability to all 3 alternative nucleotide
states. However, for some DNA, there can be a
strong transition bias (e.g., mtDNA)
where α is the rate of transition substitutions,
and 2β is the rate of transversion substitutions.
The total rate of substitution (mutation) is
μ = α + 2β
49
Molecule Genetic Distance = X + Y = 2μt: THE
KIMURA 2-PARAMETER GENETIC DISTANCE
Kimura (J. Mol. Evol. 16: 111-120, 1980) showed
that GENETIC DISTANCE Dt = 2(α + 2β)t =
-(1/2) ln(1 - 2P - Q) - (1/4) ln(1 - 2Q), where P is the
observed proportion of homologous nucleotide
sites that differ by a transition, and Q is the
observed proportion of homologous nucleotide
sites that differ by a transversion.
Note that if α = β (no transition bias), then we
expect P = Q/2, so π = P + Q = (3/2)Q, or Q = (2/3)π.
This yields the Jukes and Cantor distance, which
is therefore a special case of the Kimura
Distance.
If α >> β (large transition bias), as t gets
large, P converges to 1/4 regardless of time,
while Q is still sensitive to time. Therefore,
for large times and with molecules showing an
extreme transition bias, the distances depend
increasingly only on the transversions.
Therefore, you can get a big discrepancy between
these two distances when a transition bias exists
and when t is large enough.
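The relationship between the two distances can be checked numerically. The Python sketch below implements the Kimura 2-parameter formula and compares it with Jukes-Cantor. The function names and the P and Q values are hypothetical, chosen only to illustrate the no-bias and strong-bias cases.

```python
import math

def kimura_2p_distance(p, q):
    """Kimura 2-parameter distance from transition (p) and transversion (q) proportions."""
    a = 1.0 - 2.0 * p - q
    b = 1.0 - 2.0 * q
    if a <= 0.0 or b <= 0.0:
        raise ValueError("p and q are too large for the logarithms to be defined")
    return -0.5 * math.log(a) - 0.25 * math.log(b)

def jukes_cantor_distance(pi):
    """Jukes-Cantor distance from the total proportion of differing sites."""
    return -0.75 * math.log(1.0 - 4.0 * pi / 3.0)

# No transition bias: P = Q/2, so the two distances should coincide.
k_equal = kimura_2p_distance(p=0.02, q=0.04)
jc_equal = jukes_cantor_distance(0.02 + 0.04)

# Strong transition bias (hypothetical proportions): the distances diverge.
k_biased = kimura_2p_distance(p=0.20, q=0.02)
jc_biased = jukes_cantor_distance(0.20 + 0.02)
```

In the biased case the Kimura distance comes out larger than Jukes-Cantor for the same total proportion of differences, since it allows for transitions that have begun to saturate.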
50
Molecule Genetic Distance = X + Y = 2μt
You can have up to a 12 parameter model for just
a single nucleotide (a parameter for each
arrowhead). You can add many more parameters if
you consider more than 1 nucleotide at a time.
If distances are small (Dt ≤ 0.05), most
alternatives give about the same value, so people
mostly use Jukes and Cantor, the simplest
distance. Above 0.05, you need to investigate
the properties of your data set more carefully.
ModelTest can help you do this (I emphasize help
because ModelTest gives some statistical criteria
for evaluating 56 different models -- but
conflicts frequently arise across criteria, so
judgment is still needed).
LOOK AT YOUR DATA!
51
Recombination Can Create Complex Networks Which
Destroy the Treeness of the Relationships Among
Haplotypes.
52
(Templeton et al., AMJHG 66: 69-83, 2000)
LD in the human LPL gene
Recombination is not uniformly distributed in
the human genome, but rather is concentrated into
hotspots that separate regions of low to
no recombination.
[Figure legend: Significant D; Non-significant D; Too Few Observations for any D to be significant]
53
Because of the random mating equation
Dt = D0(1-r)^t, Linkage Disequilibrium Is Often
Interpreted As An Indicator of the Amount of
Recombination. This Is Justifiable When
Recombination Is Common Relative To
Mutation. However, in regions of little to no
recombination, the pattern of disequilibrium is
determined primarily by the historical conditions
that existed at the time of mutation, that is the
Haplotype Tree.
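The decay equation is easy to explore numerically. A minimal Python sketch of Dt = D0(1-r)^t under random mating. The function name and the starting value D0 = 0.2 are hypothetical.

```python
def disequilibrium_after(d0, r, t):
    """Linkage disequilibrium after t generations of random mating."""
    return d0 * (1.0 - r) ** t

# Unlinked loci (r = 0.5): D halves every generation.
d_unlinked = disequilibrium_after(d0=0.2, r=0.5, t=3)

# Tightly linked sites (r = 0.0001): D barely changes over 100 generations.
d_linked = disequilibrium_after(d0=0.2, r=0.0001, t=100)
```

When r is close to zero, D stays near its initial value for a very long time, so what is observed mostly reflects the conditions under which the mutation arose rather than the recombination rate.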
54
Apoprotein E Gene Region
Note: African-Americans Have More D Than
Europeans & EA Because of Admixture; Not All D
Reflects Linkage
55
The Apo-protein E Haplotype Tree
[Figure: haplotype network for the Apo-protein E region. Nodes are haplotypes numbered 1 to 31, and branches are labeled with the positions of the mutations that define them (for example 560, 624, 1998, 4951 and 5361), several of which recur on more than one branch.]