Coalescence

DNA Replication

DNA Coalescence

A coalescent event occurs when two lineages of

DNA molecules merge back into a single DNA

molecule at some time in the past.

Gene Tree (all copies of homologous DNA coalesce

to a common ancestral molecule)

COALESCENCE OF n COPIES OF HOMOLOGOUS DNA

Coalescence in an Ideal Population of N with

Ploidy Level x

- Each act of reproduction is equally likely to involve any of the N individuals, with each reproductive event being an independent event.

- Under these conditions, the probability that two gametes are drawn from the same parental individual is 1/N.

- With ploidy level x, the probability of identity by descent/coalescence from the previous generation is (1/x)(1/N) = 1/(xN).

- In practice, real populations are not ideal, so pretend the population is ideal but with an inbreeding effective size of an idealized population of size Nef. Therefore, the prob. of coalescence in one generation is 1/(xNef).

Sample Two Genes at Random

The probability of coalescence exactly t generations ago is the probability of no coalescence for the first t-1 generations in the past followed by a coalescent event at generation t:

$$\Pr(\text{coalescence at } t) = \left(1 - \frac{1}{xN_{ef}}\right)^{t-1}\frac{1}{xN_{ef}}$$

Sample Two Genes at Random

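The average time to coalescence is

$$E(t) = \sum_{t=1}^{\infty} t\left(1 - \frac{1}{xN_{ef}}\right)^{t-1}\frac{1}{xN_{ef}} = xN_{ef}$$

The variance of time to coalescence of two genes (σ²ct) is the average or expectation of (t - xNef)²:

$$\sigma^2_{ct} = E\left[(t - xN_{ef})^2\right] = xN_{ef}(xN_{ef} - 1)$$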
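The two-gene case can be checked numerically. The sketch below assumes only what the slides state, a per-generation coalescence probability of 1/(xNef), which makes the waiting time geometric. The mean xNef and variance xNef(xNef - 1) are the standard geometric results. The population values and the truncation horizon are hypothetical, chosen for illustration.

```python
def coalescence_pmf(t, x, n_ef):
    """P(two genes coalesce exactly t generations ago)."""
    p = 1.0 / (x * n_ef)              # per-generation coalescence probability
    return (1.0 - p) ** (t - 1) * p   # t-1 failures, then one coalescence


def coalescence_mean_var(x, n_ef):
    """Closed-form mean and variance of the geometric waiting time."""
    a = x * n_ef
    return a, a * (a - 1)


# Hypothetical diploid population (x = 2) with Nef = 50.
x, n_ef = 2, 50
horizon = 20000                       # far enough back that the tail is negligible
times = range(1, horizon + 1)
probs = [coalescence_pmf(t, x, n_ef) for t in times]

total = sum(probs)
mean = sum(t * q for t, q in zip(times, probs))
var = sum((t - mean) ** 2 * q for t, q in zip(times, probs))

print(total, mean, var)
print(coalescence_mean_var(x, n_ef))
```

The brute-force sums and the closed forms should agree closely. The standard deviation is almost as large as the mean, which is the first sign of how noisy coalescence times are.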

Sample n Genes at Random

Once the first coalescent event has occurred, we now have n-1 gene lineages, and therefore we simply repeat all the calculations with n-1 rather than n. In general, the expected time and variance between the (k-1)th coalescent event and the kth event are

$$E = \frac{2xN_{ef}}{(n-k+1)(n-k)}, \qquad \sigma^2 = \frac{4x^2N_{ef}^2}{(n-k+1)^2(n-k)^2}$$

Sample n Genes at Random

The average times to the first and last coalescence are 2xNef/[n(n-1)] and 2xNef(1-1/n).

- Let n = 10 and x = 2; then the time span covered by coalescent events is expected to range from 0.0444Nef to 3.6Nef.

- Let n = 100; then the time span covered by coalescent events is expected to range from 0.0004Nef to 3.96Nef.

- These equations imply that you do not need large samples to cover deep (old) coalescent events, but if you want to sample recent coalescent events, large sample sizes are critical.

- For n large, the expected coalescent time for all genes is 2xNef.
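The worked numbers above follow directly from the two expressions 2xNef/[n(n-1)] and 2xNef(1-1/n). A minimal sketch, with times expressed in units of Nef generations (the function names are illustrative, not from the slides):

```python
def first_coalescence_time(n, x=2):
    """Expected time to the first coalescent event, in units of Nef."""
    return 2 * x / (n * (n - 1))


def last_coalescence_time(n, x=2):
    """Expected time to the last coalescent event, in units of Nef."""
    return 2 * x * (1 - 1 / n)


for n in (10, 100, 1000):
    first = first_coalescence_time(n)
    last = last_coalescence_time(n)
    print(f"n={n}: first ~ {first:.4f} Nef, last ~ {last:.3f} Nef")
```

Raising n from 10 to 100 moves the last coalescence only from 3.6Nef to 3.96Nef, while the first coalescence becomes roughly a hundred times more recent. This is why large samples matter for recent events but not for old ones.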

Sample n Genes at Random

The variance of time to coalescence of n genes is

$$\sigma^2_{ct} = 4x^2N_{ef}^2\sum_{i=2}^{n}\frac{1}{i^2(i-1)^2}$$

- Note that in both the 2- and n-sample cases, the mean coalescent times are proportional to Nef and the variances are proportional to Nef².

- The Standard Molecular Clock is a Poisson Clock in Which the Mean = Variance.

- The Coalescent is a noisy evolutionary process with much inherent variation that cannot be eliminated by large n's; it is innate to the evolutionary process itself and is called evolutionary stochasticity.
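The point that larger samples do not remove this noise can be illustrated numerically. The sketch below assumes the usual large-population approximation: while i lineages remain, the waiting time has mean 2xNef/[i(i-1)] and variance equal to the square of that mean, and successive waiting times are independent. Function names and parameter values are illustrative.

```python
import math


def tmrca_mean(n, x, n_ef):
    """Expected time for n sampled genes to coalesce to one ancestor."""
    return sum(2 * x * n_ef / (i * (i - 1)) for i in range(2, n + 1))


def tmrca_variance(n, x, n_ef):
    """Variance of that time: sum of squared interval means (assumption above)."""
    return sum((2 * x * n_ef / (i * (i - 1))) ** 2 for i in range(2, n + 1))


def tmrca_cv(n, x, n_ef):
    """Coefficient of variation = standard deviation / mean."""
    return math.sqrt(tmrca_variance(n, x, n_ef)) / tmrca_mean(n, x, n_ef)


for n in (2, 10, 100, 1000):
    print(n, round(tmrca_cv(n, x=2, n_ef=10_000), 3))
```

Under these assumptions the coefficient of variation falls from 1 at n = 2 but levels off a little above 0.5, and it does not depend on Nef. No sample size drives it toward zero.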

Buri's Experiment on Genetic Drift

Fixation (Coalescence) Times in 105 Replicates of

the Same Evolutionary Process

Problem: No Replication With Most Real Data Sets. Only 1 Realization.

Evolutionary Stochasticity

Using the standard molecular clock and an

estimator of μ of 10^-8 per year, the time to

coalescence of all mtDNA to a common ancestral

molecule has been estimated to be 290,000 years

ago (Stoneking et al. 1986). This figure of

290,000 however is subject to much error because

of evolutionary stochasticity. When evolutionary

stochasticity is taken into account (ignoring

sampling error, measurement error, and the

considerable ambiguity in μ), the 95% confidence

interval around 290,000 is 152,000 years to

473,000 years (Templeton 1993) -- a span of over

300,000 years!

Coalescence of mtDNA in an Ideal Population of N♀ haploids

- Each act of reproduction is equally likely to involve any of the N♀ individuals, with each reproductive event being an independent event.

- Under these conditions, the probability that two gametes are drawn from the same parental individual is 1/N♀.

- Under haploidy, the probability of identity by descent/coalescence from the previous generation is (1)(1/N♀) = 1/N♀.

- In practice, real populations are not ideal, so pretend the population is ideal but with an inbreeding effective size of an idealized population of size Nef♀. Therefore, the prob. of coalescence in one generation is 1/Nef♀.

Expected Coalescence Times for a Large Sample of

Genes

Estimated Coalescence Times for 24 Human Loci

[Figure: bar chart of TMRCA (In Millions of Years, axis 0 to 9) by Locus, grouped into Uniparental Haploid DNA Regions, X-Linked Loci, and Autosomal Loci. Locus labels: FIX, CCR5, ECP, EDN, HFE, MX1, MAO, APLX, FUT6, FUT2, G6PD, MC1R, Y-DNA, mtDNA, Xq13.3, AMELX, PDHA1, MS205, Lactase, TNFSF5, Hb-Beta, CYP1A2, HS571B2, RRM2P4, MSN/ALAS2.]

Coalescence With Mutation

Mutation Creates Variation and Destroys Identity

by Descent

Coalescence Before Mutation

Mutation Before Coalescence

Mutation

Mutation and Coalescence: Genetic Diversity

Expected Heterozygosity: H = θ/(θ + 1) (where θ = 2xNefμ)
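To see how mutation and effective size jointly set diversity, the sketch below assumes the standard infinite-alleles expression H = θ/(θ + 1) with θ = 2xNefμ, where μ is the mutation rate per generation. The parameter values are hypothetical.

```python
def theta(x, n_ef, mu):
    """Scaled mutation rate theta = 2 * x * Nef * mu (assumed definition)."""
    return 2 * x * n_ef * mu


def expected_heterozygosity(x, n_ef, mu):
    """Expected heterozygosity H = theta / (theta + 1)."""
    th = theta(x, n_ef, mu)
    return th / (th + 1)


# Hypothetical diploid locus: Nef = 10,000 and mu = 1e-5 per generation.
print(expected_heterozygosity(x=2, n_ef=10_000, mu=1e-5))

# Same locus treated as haploid (x = 1): smaller theta, less diversity.
print(expected_heterozygosity(x=1, n_ef=10_000, mu=1e-5))
```

H rises toward 1 as θ grows, because mutation is then more likely than coalescence to be the first event going back in time.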

Gene Vs. Allele (Haplotype) Tree

Gene Trees vs. Haplotype Trees

Gene trees are genealogies of genes. They

describe how different copies at a homologous

gene locus are related by ordering coalescent

events. The only branches in the gene tree that

we can observe from sequence data are those

marked by a mutation. All branches in the gene

tree that are caused by DNA replication without

mutation are not observable. Therefore, the tree

observable from sequence data retains only those

branches in the gene tree associated with a

mutational change. This lower resolution tree is

called an allele or haplotype tree. The allele

or haplotype tree is the gene tree in which all

branches not marked by a mutational event are

collapsed together.

Unrooted Haplotype Tree

Haplotype trees are not new in population

genetics; they have been around in the form of

inversion trees since the 1930s.

Haplotype Trees Can Coalesce Both Within And

Between Species

Ebersberger et al. (2007) Estimated Trees From 23,210 DNA Sequences In Apes and Rhesus Monkey. Below Are The Numbers That Significantly Resolved the Species Tree.

Haplotype Trees ≠ Species or Population Trees

It is dangerous to equate a haplotype tree to a

species tree. It is NEVER justified to equate a

haplotype tree to a tree of populations within a

species because the problem of lineage sorting is

greater and the time between events is shorter.

Moreover, a population tree need not exist at

all.

Homoplasy and The Infinite Sites Model

- Homoplasy is the phenomenon of independent mutations (and many gene conversion events) yielding the same genetic state.

- Homoplasy represents a major difficulty when trying to reconstruct evolutionary trees, whether they are haplotype trees or the more traditional species trees of evolutionary biology.

- It is common in coalescent theory (and molecular evolution in general) to assume the infinite sites model in which each mutation occurs at a new nucleotide site.

- Under this model, there is no homoplasy because no nucleotide site can ever mutate more than once. Each mutation creates a new haplotype.

E. g., Apoprotein E Gene Region

No recombination has been detected in this region.

The Apo-protein E Haplotype Tree

Use a Finite Sites mutation model that allows homoplasy. It can be shown that the probability of homoplasy between two nodes increases with increasing number of observed mutational differences. Therefore, allocate homoplasies to longer branches. This is called Statistical Parsimony because you can use models to calculate the probability of violating parsimony for a given branch length.

The Apo-protein E Statistical Parsimony Haplotype

Tree

Homoplasy is still common, as shown by circled

mutations.

In this case, most of the homoplasy is associated

with Alu sequences, a common repeat type in the

human genome that is known to cause local gene

conversion, which mimics the effects of parallel

mutations.

Estimated Times To Common Ancestor (Method of Takahata et al. 2001)

Dhc = Nuc. Diff. Between Humans and Chimps

Dh = Nuc. Diff. Within Humans

TMRCA = 12 Dh/Dhc

Human-chimp divergence: 6 Million Years Ago
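The estimator above is a single ratio. The sketch below assumes that the factor 12 already incorporates the 6-million-year calibration, so that the result is in millions of years. That reading, the function name, and the input values are illustrative assumptions, not taken from Takahata et al.

```python
def tmrca_million_years(d_h, d_hc, scale=12.0):
    """TMRCA = scale * Dh / Dhc, assumed to be in millions of years.

    d_h  : nucleotide differences within humans
    d_hc : nucleotide differences between humans and chimps
    scale: the constant 12 from the slide (interpretation is an assumption)
    """
    if d_hc <= 0:
        raise ValueError("d_hc must be positive")
    return scale * d_h / d_hc


# Hypothetical values: within-human diversity one twelfth of the
# human-chimp divergence gives a TMRCA of about 1 million years.
print(tmrca_million_years(d_h=0.001, d_hc=0.012))
```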

The Apo-protein E Haplotype Coalescent

[Figure: the Apo-protein E haplotype tree drawn against a time axis in Years (x 10^5), from 0 to 3.2, with mutations labeled by nucleotide position, numbered haplotypes at the tips, and the ε2, ε3, and ε4 alleles marked.]

Estimate the distribution of the age of the haplotype or clade as a Gamma Distribution (Kimura, 1970) with mean T = 4N (or N for mtDNA) and variance T²/(1+k) (Tajima, 1983), where k is the average pairwise divergence among present day haplotypes derived from the haplotype being aged, measured as the number of nucleotide differences.

NOTE: VARIANCE INCREASES WITH INCREASING T AND DECREASING k!
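A gamma distribution with mean T and variance T²/(1+k) is fixed by two parameters. In the usual shape/scale form, matching moments gives shape = 1 + k and scale = T/(1 + k). The sketch below derives those parameters, which is a moment-matching assumption rather than the authors' stated procedure. The T and k values are hypothetical.

```python
import math


def gamma_age_parameters(T, k):
    """Shape and scale of a gamma with mean T and variance T**2 / (1 + k)."""
    shape = 1.0 + k
    scale = T / shape
    return shape, scale


def age_cv(k):
    """Coefficient of variation of the age estimate: 1 / sqrt(1 + k)."""
    return 1.0 / math.sqrt(1.0 + k)


# Hypothetical clade aged at T = 200,000 years.
for k in (0.5, 2.0, 8.0):
    shape, scale = gamma_age_parameters(200_000, k)
    sd = math.sqrt(shape) * scale     # gamma sd = sqrt(shape) * scale
    print(k, round(shape, 2), round(scale), round(sd), round(age_cv(k), 3))
```

The relative error depends only on k. A clade with few accumulated differences has a wide age distribution whatever its estimated age.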

[Figure: the same Apo-protein E haplotype coalescent, with the estimated age distributions f(t) plotted against Years (x 10^5).]

Because of Deviations From The Infinite Sites

Model, Corrections Must Also be Made in How We

Count the Number of Mutations That Occurred in

The Coalescent Process.

The Basic Idea of Coalescence Is That Any Two

Copies of Homologous DNA Will Coalesce Back To An

Ancestral Molecule Either Within Or Between

Species

[Figure: two homologous DNA copies traced back over time t to their common ancestral molecule.]

Mutations Can Accumulate in the Two DNA Lineages

During This Time, t, to Coalescence. We Quantify

This Mutational Accumulation Through A Molecule

Genetic Distance

[Figure: the two lineages over time t, one accumulating X mutations and the other Y mutations.]

Molecule Genetic Distance = X + Y. If μ = the neutral substitution rate, then the Expected Value of X = μt and the Expected Value of Y = μt, so the Expected Value of the Genetic Distance = 2μt.

Complication: Only Under The Infinite Sites Model Are X + Y Directly Observable. Otherwise X + Y ≥ The Observed Number of Differences.

Use Models of DNA Mutation To Correct For

Undercounting

Molecule Genetic Distance = X + Y = 2μt: THE JUKES-CANTOR GENETIC DISTANCE

Consider a single nucleotide site that has a probability μ of mutating per unit time (only neutral mutations are allowed). This model assumes that when a nucleotide site mutates it is equally likely to mutate to any of the three other nucleotide states. Suppose further that mutation is such a rare occurrence that in any time unit it is only likely for at most one DNA lineage to mutate and not both. Finally, let p_t be the probability that the nucleotide site is in the same state in the two DNA molecules being compared given they coalesced t time units ago. Note that p_t refers to identity by state and is observable from the current sequences. Then,

$$p_{t+1} = (1 - 2\mu)\,p_t + \frac{2\mu}{3}\,(1 - p_t)$$

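Approximating the above by a differential equation yields

$$\frac{dp_t}{dt} = \frac{2\mu}{3} - \frac{8\mu}{3}\,p_t, \qquad p_t = \frac{1}{4}\left(1 + 3e^{-8\mu t/3}\right)$$

Extracting 2μt from the equation given above:

$$2\mu t = -\frac{3}{4}\ln\left(\frac{4p_t - 1}{3}\right)$$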

The above equation refers to only a single nucleotide, so p_t is either 0 or 1. Hence, this equation will not yield biologically meaningful results when applied to just a single nucleotide. Therefore, Jukes and Cantor (1969) assumed that the same set of assumptions is valid for all the nucleotides in the sequenced portion of the two molecules being compared. Defining π as the observed number of nucleotides that are different divided by the total number of nucleotides being compared, Jukes and Cantor noted that p_t is estimated by 1-π. Hence, substituting 1-π for p_t yields

$$D_{JC} = 2\mu t = -\frac{3}{4}\ln\left(1 - \frac{4}{3}\pi\right)$$
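The correction is easy to compute. The sketch below assumes the standard Jukes-Cantor form D = -(3/4) ln(1 - 4π/3), together with the matching identity probability p_t = (1/4)(1 + 3e^(-4D/3)) with D = 2μt. It checks that the two are inverses. Function names are illustrative.

```python
import math


def jukes_cantor_distance(pi):
    """Jukes-Cantor distance from the observed proportion of differing sites."""
    if not 0 <= pi < 0.75:
        raise ValueError("pi must be in [0, 0.75) for the log to be defined")
    return -0.75 * math.log(1 - 4 * pi / 3)


def expected_identity(distance):
    """Expected identity by state p_t for a true distance D = 2 * mu * t."""
    return 0.25 * (1 + 3 * math.exp(-4 * distance / 3))


# Round trip: a true distance of 0.3 produces fewer observed differences
# than 0.3, and the correction recovers the true value.
true_d = 0.3
observed_pi = 1 - expected_identity(true_d)
print(observed_pi, jukes_cantor_distance(observed_pi))

# For small pi the correction is slight; it grows as pi nears saturation.
for pi in (0.01, 0.05, 0.30, 0.60):
    print(pi, round(jukes_cantor_distance(pi), 4))
```

The observed proportion π always undercounts the true number of substitutions per site, and the gap widens quickly as π approaches its saturation value of 3/4.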

Molecule Genetic Distance = X + Y = 2μt: THE KIMURA 2-PARAMETER GENETIC DISTANCE

The Jukes and Cantor genetic distance model assumes neutrality and that mutations occur with equal probability to all 3 alternative nucleotide states. However, for some DNA, there can be a strong transition bias (e.g., mtDNA). The Kimura 2-parameter model allows for this bias, where α is the rate of transition substitutions, and 2β is the rate of transversion substitutions. The total rate of substitution (mutation) is μ = α + 2β.

Kimura (J. Mol. Evol. 16: 111-120, 1980) showed that

$$D_t = 2(\alpha + 2\beta)t = -\frac{1}{2}\ln(1 - 2P - Q) - \frac{1}{4}\ln(1 - 2Q)$$

where P is the observed proportion of homologous nucleotide sites that differ by a transition, and Q is the observed proportion of homologous nucleotide sites that differ by a transversion.

Note that if α = β (no transition bias), then we expect P = Q/2, so π = P + Q = (3/2)Q, or Q = (2/3)π. This yields the Jukes and Cantor distance, which is therefore a special case of the Kimura distance.

If α >> β (large transition bias), as t gets large, P converges to 1/4 regardless of time, while Q is still sensitive to time. Therefore, for large times and with molecules showing an extreme transition bias, the distances depend increasingly only on the transversions. Therefore, you can get a big discrepancy between these two distances when a transition bias exists and when t is large enough.
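Both claims can be checked numerically. The sketch below uses the Kimura expression -(1/2) ln(1 - 2P - Q) - (1/4) ln(1 - 2Q) and, as an assumption for comparison, the standard Jukes-Cantor form -(3/4) ln(1 - 4π/3). The P and Q values are hypothetical.

```python
import math


def kimura_2p_distance(P, Q):
    """Kimura 2-parameter distance from transition (P) and transversion (Q) proportions."""
    a = 1 - 2 * P - Q
    b = 1 - 2 * Q
    if a <= 0 or b <= 0:
        raise ValueError("P and Q are too large for the distance to be defined")
    return -0.5 * math.log(a) - 0.25 * math.log(b)


def jukes_cantor_distance(pi):
    """Jukes-Cantor distance from the total proportion of differing sites."""
    return -0.75 * math.log(1 - 4 * pi / 3)


# No transition bias: P = Q/2, so the two distances coincide.
print(kimura_2p_distance(0.05, 0.10), jukes_cantor_distance(0.15))

# Same total difference (pi = 0.15) but mostly transitions:
# the Kimura distance is now larger than the Jukes-Cantor distance.
print(kimura_2p_distance(0.12, 0.03), jukes_cantor_distance(0.15))
```

With the same overall π, concentrating the differences in transitions raises the Kimura distance, because transitions saturate faster and hide more multiple hits.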

Molecule Genetic Distance = X + Y = 2μt

You can have up to a 12 parameter model for just

a single nucleotide (a parameter for each

arrowhead). You can add many more parameters if

you consider more than 1 nucleotide at a time.

If distances are small (Dt ≤ 0.05), most

alternatives give about the same value, so people

mostly use Jukes and Cantor, the simplest

distance. Above 0.05, you need to investigate

the properties of your data set more carefully.

ModelTest can help you do this (I emphasize help

because ModelTest gives some statistical criteria

for evaluating 56 different models -- but

conflicts frequently arise across criteria, so

judgment is still needed).

LOOK AT YOUR DATA!

Recombination Can Create Complex Networks Which

Destroy the Treeness of the Relationships Among

Haplotypes.

(Templeton et al., AMJHG 66: 69-83, 2000)

LD in the human LPL gene

Recombination is not uniformly distributed in the human genome, but rather is concentrated into hotspots that separate regions of low to no recombination.

Significant D

Non-significant D

Too Few Observations for any D to be

significant

Because of the random mating equation D_t = D_0(1-r)^t, Linkage Disequilibrium Is Often Interpreted As An Indicator of the Amount of Recombination. This Is Justifiable When Recombination Is Common Relative To Mutation. However, in regions of little to no recombination, the pattern of disequilibrium is determined primarily by the historical conditions that existed at the time of mutation, that is, the Haplotype Tree.
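The random mating equation D_t = D_0(1-r)^t shows why disequilibrium persists where recombination is rare. A minimal sketch, with hypothetical D_0 and r values and illustrative function names:

```python
import math


def ld_after(d0, r, t):
    """Linkage disequilibrium after t generations of random mating."""
    return d0 * (1 - r) ** t


def generations_to_fraction(r, fraction):
    """Generations needed for D to decay to the given fraction of D_0."""
    if not 0 < r < 1:
        raise ValueError("r must be strictly between 0 and 1")
    return math.log(fraction) / math.log(1 - r)


# Unlinked loci (r = 0.5) lose half their disequilibrium every generation.
print(ld_after(0.25, 0.5, 1))

# Tightly linked sites decay far more slowly.
for r in (0.5, 0.01, 0.0001):
    print(r, round(generations_to_fraction(r, 0.5), 1))
```

At r = 0.0001 the half-life is thousands of generations, long enough for the disequilibrium pattern to reflect the history of the mutations rather than the recombination rate.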

Apoprotein E Gene Region

Note: African-Americans Have More D Than Europeans and EA Because of Admixture. Not All D Reflects Linkage.

The Apo-protein E Haplotype Tree

[Figure: the Apo-protein E haplotype tree, with haplotypes numbered 1 to 31 and branches labeled by the nucleotide positions of the mutations (for example 560, 624, 1998, 4951, 5361).]