Title: CSE 291: Advanced Topics in Computational Biology L3: population substructure admixture mapping
1 CSE 291: Advanced Topics in Computational Biology
L3: population sub-structure/admixture mapping
- Vineet Bafna/Pavel Pevzner
www.cse.ucsd.edu/classes/sp05/cse291-a
2 Ancestral Recombination Graph
3 Review: Coalescent theory applications
- Coalescent simulations allow us to test various hypotheses. The coalescent/ARG is usually not inferred, unlike in phylogenies.
4 Coalescent theory example
- Ex: 1400 bp at the Sod locus in Drosophila.
- 10 taxa
- 5 were identical. The other 5 had 55 mutations.
- Q: Is this a chance event, or is there selection for this haplotype?
5 Coalescent application
- 10,000 coalescent simulations were performed on 10 taxa.
- 55 mutations were placed on the coalescent branches.
- Count the number of times 5 lineages are identical.
- The event happened in 1.1% of the cases.
- Conclusion: selection, or some other mechanism, explains this data.
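The test above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the procedure used in the original study: it assumes a standard neutral coalescent with no recombination, an infinite-sites model with exactly 55 mutations dropped on branches in proportion to their length, and it counts a simulation as a "hit" when at least 5 of the 10 sampled sequences are identical. The function names are hypothetical.

```python
import random
from collections import Counter

def simulate_haplotypes(n, num_mutations, rng):
    """One neutral coalescent tree for n samples (no recombination), with a
    fixed number of infinite-sites mutations placed on its branches."""
    # Each active lineage is the set of sample indices that descend from it.
    lineages = [frozenset([i]) for i in range(n)]
    birth = {lin: 0.0 for lin in lineages}
    branches = []  # (descendant set, branch length)
    t = 0.0
    while len(lineages) > 1:
        k = len(lineages)
        # Waiting time to the next coalescence: rate k choose 2.
        t += rng.expovariate(k * (k - 1) / 2.0)
        a, b = rng.sample(lineages, 2)
        for child in (a, b):
            branches.append((child, t - birth[child]))
            lineages.remove(child)
        parent = a | b
        birth[parent] = t
        lineages.append(parent)
    # Drop each mutation on a branch chosen in proportion to its length;
    # every sample below that branch inherits the mutation.
    weights = [length for _, length in branches]
    haplotypes = [set() for _ in range(n)]
    for m in range(num_mutations):
        descendants, _ = rng.choices(branches, weights=weights, k=1)[0]
        for i in descendants:
            haplotypes[i].add(m)
    return [frozenset(h) for h in haplotypes]

def largest_identical_group(haplotypes):
    """Size of the largest set of identical sampled sequences."""
    return max(Counter(haplotypes).values())

def estimate_probability(n=10, num_mutations=55, group=5, trials=2000, seed=0):
    """Fraction of simulations in which at least `group` sequences are identical."""
    rng = random.Random(seed)
    hits = sum(
        largest_identical_group(simulate_haplotypes(n, num_mutations, rng)) >= group
        for _ in range(trials)
    )
    return hits / trials

if __name__ == "__main__":
    print(estimate_probability())
```

A small estimated fraction would suggest, as on the slide, that the observed pattern is unlikely under neutrality alone. The exact value depends on the modelling assumptions listed above.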
6 Coalescent example: Out of Africa hypothesis
- Looking at lineage-specific mutations might help discard the candelabra model. How?
- How do we decide between the multi-regional and Out-of-Africa models? How do we decide if the ancestor was African?
7 Coalescent simulation
- Example 2
- Sample from a region. What is the recombination rate in this region?
- Gabriel et al., Science 2002.
- 3 populations were sampled at multiple regions spanning the genome
- 54 regions (average size 250 kb)
- SNP density: 1 per 2 kb
- 90 individuals from Nigeria (Yoruban)
- 93 Europeans
- 42 Asian
- 50 African American
8 Population-specific recombination
- D' was used as the LD measure between SNP pairs.
- Each SNP pair fell into one of the following categories:
- Strong LD
- Strong evidence for recombination
- Others (13% of cases)
- Can this be used for the Out-of-Africa hypothesis?
Gabriel et al., Science 2002
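For reference, the standard normalized LD measure D' can be written in terms of the raw coefficient D. The symbols below are assumptions chosen to match the two-marker notation used later in these slides: p_1 and q_1 are the frequencies of allele 1 at the first and second marker, and P_{11} is the frequency of the 1..1 haplotype.

```latex
D = P_{11} - p_1 q_1, \qquad
D' = \frac{D}{D_{\max}}, \qquad
D_{\max} =
\begin{cases}
\min\bigl(p_1 (1 - q_1),\; (1 - p_1)\, q_1\bigr) & \text{if } D > 0,\\[4pt]
\min\bigl(p_1 q_1,\; (1 - p_1)(1 - q_1)\bigr) & \text{if } D < 0.
\end{cases}
```

With this normalization |D'| lies between 0 and 1, and |D'| = 1 when at most three of the four possible two-marker haplotypes are observed.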
9 Haplotype Blocks
- A haplotype block is a region of low recombination.
- Define a region as a block if less than 5% of the pairs show strong recombination.
- Much of the genome is in blocks.
- The distribution of block sizes varies across populations.
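The block criterion above can be expressed as a small check. This is a minimal sketch that takes the slide's rule literally: the category labels and the function name are hypothetical, and it counts all SNP pairs in the region, which is an assumption, since the slide does not say whether uninformative pairs are excluded.

```python
def is_haplotype_block(pair_labels, max_recomb_fraction=0.05):
    """Return True if fewer than 5% of the SNP pairs in a region show
    strong evidence for recombination.

    pair_labels: one label per SNP pair, each one of
    "strong_ld", "recombination", or "other" (hypothetical label names).
    """
    if not pair_labels:
        raise ValueError("need at least one SNP pair")
    n_recomb = sum(1 for label in pair_labels if label == "recombination")
    return n_recomb / len(pair_labels) < max_recomb_fraction


if __name__ == "__main__":
    region = ["strong_ld"] * 97 + ["recombination"] * 3
    print(is_haplotype_block(region))  # 3% of pairs show recombination
```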
10 Testing Out-of-Africa
- Generate simulations with and without migration.
- Check the size of haplotype blocks.
- Does it vary when migrations are allowed?
- When the new population has a bottleneck?
- If there were a bottleneck that created the European and Asian populations, can we say anything about the frequency of alleles that are African-specific?
- Should they be high frequency, or low frequency, in African populations?
11 Haplotype Block implications
- The genome is mostly partitioned into haplotype blocks.
- Within a block, there is extensive LD.
- Is this good, or bad, for association mapping?
12 Population Sub-structure
13 Population sub-structure can increase LD
- Consider two populations that were isolated and evolving independently.
- They might have different allele frequencies in some regions.
- Pick two regions that are far apart (LD is very low, close to 0).
14 Recent ad-mixing of population
- If the populations came together recently (e.g., African and European populations), artificial LD might be created.
- $|D| = 0.15$ (instead of 0.01), a more than 10-fold increase.
- This spurious LD might lead to false associations.
- Other genetic events can cause LD to arise, and one needs to be careful.
[Figure: two-marker haplotypes in the two populations]
Pop. A: 0..1 0..1 0..0 1..1 0..1 0..1 0..1 0..1 0..1 0..1
Pop. B: 1..0 1..0 0..0 1..1 1..0 1..0 1..0 1..0 1..0 1..0
Pop. A+B: $p_1 = 0.5$, $q_1 = 0.5$, $P_{11} = 0.1$, $D = P_{11} - p_1 q_1 = 0.1 - 0.25 = -0.15$
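The arithmetic on this slide can be checked with a short script. This is a sketch that assumes the 10-haplotype-per-population version of the example (eight 0..1, one 0..0, one 1..1 in the first population, and eight 1..0, one 0..0, one 1..1 in the second), and uses $D = P_{11} - p_1 q_1$. The function name is hypothetical.

```python
def ld_coefficient(haplotypes):
    """D = P11 - p1*q1 for two biallelic markers.

    haplotypes: list of (allele at marker 1, allele at marker 2) tuples,
    with alleles coded 0/1.
    """
    n = len(haplotypes)
    p1 = sum(h[0] for h in haplotypes) / n   # frequency of allele 1 at marker 1
    q1 = sum(h[1] for h in haplotypes) / n   # frequency of allele 1 at marker 2
    p11 = sum(1 for h in haplotypes if h == (1, 1)) / n
    return p11 - p1 * q1


# Assumed data: the two isolated populations of the slide example.
pop_a = [(0, 1)] * 8 + [(0, 0), (1, 1)]
pop_b = [(1, 0)] * 8 + [(0, 0), (1, 1)]

if __name__ == "__main__":
    print(round(ld_coefficient(pop_a), 2))          # 0.01
    print(round(ld_coefficient(pop_b), 2))          # 0.01
    print(round(ld_coefficient(pop_a + pop_b), 2))  # -0.15
```

Within each population D is close to 0, while the pooled sample shows |D| = 0.15, even though the two markers are unlinked by assumption.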
15 Determining population sub-structure
- Given a mix of people, can you sub-divide them into ethnic populations?
- Turn the problem of spurious LD into a clue.
- Find markers that are too far apart to show LD.
- If they do show LD (correlation), that shows the existence of multiple populations.
- Sub-divide them into populations so that LD disappears.
16 Determining Population sub-structure
- Same example as before
- The two markers are too far apart to show any LD, yet they do show LD.
- However, if you split them so that all 0..1 are in one population and all 1..0 are in another, LD disappears.
17 Iterative algorithm for population sub-structure
- Define
- $N$: number of individuals (each has a single chromosome)
- $k$: number of sub-populations
- $Z \in \{1..k\}^N$ is a vector giving the sub-population of each individual
- $Z_i = k \Rightarrow$ individual $i$ is assigned to population $k$
- $X_{i,j}$: allelic value for individual $i$ in position $j$
- $P_{k,j,l}$: frequency of allele $l$ at position $j$ in population $k$
18 Example
- Ex: consider the following assignment
- $P_{1,1,0} = 0.9$
- $P_{2,1,0} = 0.1$
[Figure: haplotypes with their population assignment Z]
Haplotypes 1-10: 0..1 0..1 0..0 1..1 0..1 0..1 0..1 0..1 0..1 0..1
Z for 1-10: 1 1 1 1 1 1 1 1 1 1
Haplotypes 11-20: 1..0 1..0 0..0 1..1 1..0 1..0 1..0 1..0 1..0 1..0
Z for 11-20: 2 2 2 2 2 2 2 2 2 2
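The two frequencies in this example follow directly from counting. The sketch below assumes the data layout of the figure (the first ten haplotypes assigned to population 1, the last ten to population 2). The dictionary keyed by (population, position, allele) and the function name are illustrative choices, not notation from the lecture.

```python
from collections import defaultdict

def allele_frequencies(X, Z):
    """Return {(k, j, l): frequency of allele l at position j in population k}.

    X: list of haplotypes (tuples of 0/1 alleles); Z: population label per haplotype.
    Positions j are numbered from 1 to match the slide notation.
    """
    counts = defaultdict(int)
    totals = defaultdict(int)
    for hap, k in zip(X, Z):
        for j, allele in enumerate(hap, start=1):
            counts[(k, j, allele)] += 1
            totals[(k, j)] += 1
    return {key: c / totals[key[:2]] for key, c in counts.items()}


# Assumed data: the 20 two-marker haplotypes of the example.
X = [(0, 1)] * 2 + [(0, 0), (1, 1)] + [(0, 1)] * 6 \
  + [(1, 0)] * 2 + [(0, 0), (1, 1)] + [(1, 0)] * 6
Z = [1] * 10 + [2] * 10

if __name__ == "__main__":
    P = allele_frequencies(X, Z)
    print(P[(1, 1, 0)])  # 0.9
    print(P[(2, 1, 0)])  # 0.1
```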
19 Goal
- $X$ is known.
- $P$, $Z$ are unknown.
- The goal is to estimate $\Pr(P,Z \mid X)$.
- Various learning techniques can be employed:
- $\max_{P,Z} \Pr(X \mid P,Z)$ (maximum likelihood estimate)
- $\max_{P,Z} \Pr(X \mid P,Z)\,\Pr(P,Z)$ (MAP)
- Sample $P,Z$ from $\Pr(P,Z \mid X)$
- Here a Bayesian (MCMC) scheme is employed to sample from $\Pr(P,Z \mid X)$. We will only consider a simplified version.
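To make the three options concrete, it may help to write the likelihood explicitly. The form below is a sketch under the assumption that, within a population, positions are independent (linkage equilibrium) and each individual contributes a single chromosome. It uses the symbols $X_{i,j}$, $Z_i$ and $P_{k,j,l}$ defined earlier.

```latex
\Pr(X \mid P, Z) \;=\; \prod_{i=1}^{N} \prod_{j} P_{Z_i,\, j,\, X_{i,j}},
\qquad
\Pr(P, Z \mid X) \;\propto\; \Pr(X \mid P, Z)\,\Pr(P, Z).
```

Maximum likelihood maximizes the first expression, MAP maximizes the right-hand side of the second, and the sampling approach draws $(P, Z)$ from the left-hand side of the second.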
20 Algorithm: Structure
- Iteratively estimate
- $(Z^{(0)},P^{(0)}), (Z^{(1)},P^{(1)}), \ldots, (Z^{(m)},P^{(m)})$
- After convergence, $Z^{(m)}$ is the answer.
- Iteration
- Guess $Z^{(0)}$
- For $m = 1, 2, \ldots$
- Sample $P^{(m)}$ from $\Pr(P \mid X, Z^{(m-1)})$
- Sample $Z^{(m)}$ from $\Pr(Z \mid X, P^{(m)})$
- How is this sampling done?
21 Example
- Choose $Z$ at random, so each individual is assigned to one of 2 populations. See example.
- Now, we need to sample $P^{(1)}$ from $\Pr(P \mid X, Z^{(0)})$
- Simply count
- $N_{k,j,l}$: number of people in population $k$ who have allele $l$ in position $j$
- $p_{k,j,l} = N_{k,j,l} / N_{k,j,\cdot}$, where $N_{k,j,\cdot} = \sum_l N_{k,j,l}$
[Figure: haplotypes with a random assignment Z]
Haplotypes 1-10: 0..1 0..1 0..0 1..1 0..1 0..1 0..1 0..1 0..1 0..1
Z for 1-10: 1 2 2 1 1 2 1 2 1 2
Haplotypes 11-20: 1..0 1..0 0..0 1..1 1..0 1..0 1..0 1..0 1..0 1..0
Z for 11-20: 1 2 2 1 1 2 1 2 2 1
22 Example
- $N_{k,j,l}$: number of people in population $k$ who have allele $l$ in position $j$
- $p_{k,j,l} = N_{k,j,l} / N_{k,j,\cdot}$
- $N_{1,1,0} = 4$
- $N_{1,1,1} = 6$
- $p_{1,1,0} = 4/10$
- $p_{1,2,0} = 4/10$
- Thus, we can sample $P^{(m)}$
[Figure: same haplotypes and random assignment Z as in the example on slide 21]
23 Sampling Z
- $\Pr[Z_1 = 1] = \Pr[\text{"01" belongs to population 1}]$?
- We know that each position should be in linkage equilibrium and independent.
- $\Pr[\text{"01"} \mid \text{Population 1}] = p_{1,1,0}\, p_{1,2,1} = (4/10)(6/10) = 0.24$
- $\Pr[\text{"01"} \mid \text{Population 2}] = p_{2,1,0}\, p_{2,2,1} = (6/10)(4/10) = 0.24$
- $\Pr[Z_1 = 1] = 0.24/(0.24 + 0.24) = 0.5$
Assuming HWE and LE
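The numbers on this slide can be reproduced from the random assignment used in the running example. The sketch below assumes the 20 haplotypes and the random labels shown in the example figure, and a uniform prior over the two populations. The function names are hypothetical.

```python
def freq(X, Z, k, j, allele):
    """p_{k,j,allele}: frequency of `allele` at position j (1-based) in population k."""
    members = [hap for hap, z in zip(X, Z) if z == k]
    return sum(1 for hap in members if hap[j - 1] == allele) / len(members)


def assignment_posterior(hap, X, Z, populations=(1, 2)):
    """Pr[individual with haplotype `hap` belongs to population k], for each k,
    assuming independent positions within a population and a uniform prior."""
    likelihoods = []
    for k in populations:
        lik = 1.0
        for j, allele in enumerate(hap, start=1):
            lik *= freq(X, Z, k, j, allele)
        likelihoods.append(lik)
    total = sum(likelihoods)
    return [lik / total for lik in likelihoods]


# Assumed data: haplotypes and random assignment from the example figure.
X = [(0, 1)] * 2 + [(0, 0), (1, 1)] + [(0, 1)] * 6 \
  + [(1, 0)] * 2 + [(0, 0), (1, 1)] + [(1, 0)] * 6
Z0 = [1, 2, 2, 1, 1, 2, 1, 2, 1, 2,
      1, 2, 2, 1, 1, 2, 1, 2, 2, 1]

if __name__ == "__main__":
    print(freq(X, Z0, 1, 1, 0))                 # 0.4
    print(assignment_posterior((0, 1), X, Z0))  # [0.5, 0.5]
```

With a perfectly balanced random start the posterior is uninformative. The sampler only makes progress once a random draw breaks the symmetry.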
24 Sampling
- Suppose, during the iteration, there is a bias.
- Then, in the next step of sampling $Z$, we will do the right thing.
- $\Pr[\text{"01"} \mid \text{pop. 1}] = p_{1,1,0}\, p_{1,2,1} = 0.7 \times 0.7 = 0.49$
- $\Pr[\text{"01"} \mid \text{pop. 2}] = p_{2,1,0}\, p_{2,2,1} = 0.3 \times 0.3 = 0.09$
- $\Pr[Z_1 = 1] = 0.49/(0.49 + 0.09) \approx 0.84$
- $\Pr[Z_6 = 1] = 0.49/(0.49 + 0.09) \approx 0.84$
- Eventually all 01 will become one population, and all 10 will become a second population.
[Figure: haplotypes with a biased assignment Z]
Haplotypes 1-10: 0..1 0..1 0..0 1..1 0..1 0..1 0..1 0..1 0..1 0..1
Z for 1-10: 1 1 1 2 1 2 1 2 1 1
Haplotypes 11-20: 1..0 1..0 0..0 1..1 1..0 1..0 1..0 1..0 1..0 1..0
Z for 11-20: 2 2 2 1 2 2 1 2 2 1
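Putting the two steps together gives a simplified version of the iterative scheme. The sketch below is an illustration under several assumptions, not the full Bayesian sampler: markers are biallelic (0/1), populations are labelled from 0, and P is estimated by counting with a pseudocount (added here so that an empty or monomorphic population never produces a zero probability) rather than being drawn from its posterior distribution. All function names are hypothetical.

```python
import random

def estimate_p(X, Z, k, pseudocount=1.0):
    """P[pop][j]: estimated frequency of allele 1 at position j in population pop."""
    num_pos = len(X[0])
    P = []
    for pop in range(k):
        members = [hap for hap, z in zip(X, Z) if z == pop]
        P.append([
            (sum(hap[j] for hap in members) + pseudocount)
            / (len(members) + 2 * pseudocount)
            for j in range(num_pos)
        ])
    return P

def haplotype_likelihood(hap, freqs):
    """Pr[hap | population], assuming positions are independent (LE)."""
    lik = 1.0
    for allele, f1 in zip(hap, freqs):
        lik *= f1 if allele == 1 else 1.0 - f1
    return lik

def sample_z(X, P, rng):
    """Draw a population for each individual in proportion to its likelihood."""
    Z = []
    for hap in X:
        weights = [haplotype_likelihood(hap, freqs) for freqs in P]
        Z.append(rng.choices(range(len(P)), weights=weights, k=1)[0])
    return Z

def infer_substructure(X, k=2, iterations=200, seed=0):
    """Alternate between estimating P given Z and sampling Z given P."""
    rng = random.Random(seed)
    Z = [rng.randrange(k) for _ in X]   # guess Z(0) at random
    for _ in range(iterations):
        P = estimate_p(X, Z, k)
        Z = sample_z(X, P, rng)
    return Z

# Assumed data: the 20 two-marker haplotypes of the running example.
X = [(0, 1)] * 2 + [(0, 0), (1, 1)] + [(0, 1)] * 6 \
  + [(1, 0)] * 2 + [(0, 0), (1, 1)] + [(1, 0)] * 6

if __name__ == "__main__":
    print(infer_substructure(X))
```

Because Z is sampled rather than maximized, a single final draw may still misplace a few individuals. Averaging assignments over many late iterations is a common way to summarize the output.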
25 Allowing for admixture
- Define $q_{i,k}$ as the fraction of individual $i$'s genome that originated from population $k$.
- Iteration
- Guess $Z^{(0)}$
- For $m = 1, 2, \ldots$
- Sample $P^{(m)}, Q^{(m)}$ from $\Pr(P,Q \mid X, Z^{(m-1)})$
- Sample $Z^{(m)}$ from $\Pr(Z \mid X, P^{(m)}, Q^{(m)})$
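26 Estimating Z (admixture case)
- Instead of estimating $\Pr(Z(i) = k \mid X,P,Q)$ (the origin of individual $i$ is $k$), we estimate $\Pr(Z(i,j,l) = k \mid X,P,Q)$.
[Figure: the two chromosome copies (i,1) and (i,2) of individual i, with position j marked]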
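The slide gives no formula for this per-allele probability. One common way to write it, used here purely as an assumption, is to weight each population's allele frequency by the individual's admixture fraction: the probability that an observed allele copy came from population k is proportional to $q_{i,k}$ times the frequency of that allele in population k. A minimal sketch with hypothetical names:

```python
def allele_origin_posterior(q_i, allele_freqs):
    """Posterior over populations of origin for one observed allele copy.

    q_i[k]: assumed admixture fraction of individual i from population k.
    allele_freqs[k]: frequency of the observed allele at this position in population k.
    Returns probabilities proportional to q_i[k] * allele_freqs[k].
    """
    weights = [q * f for q, f in zip(q_i, allele_freqs)]
    total = sum(weights)
    if total == 0:
        raise ValueError("observed allele has zero probability in every population")
    return [w / total for w in weights]


if __name__ == "__main__":
    # An individual with equal ancestry from two populations carrying an
    # allele that is more common in the first population.
    print(allele_origin_posterior([0.5, 0.5], [0.7, 0.3]))
```

Under this formulation, different positions of the same individual can be assigned to different populations, which is what distinguishes the admixture case from the earlier per-individual assignment.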
27 Results on admixture prediction
28 Results: Thrush data
- For each individual, q(i) is plotted as the distance to the opposite side of the triangle.
- The assignment is reliable, and there is evidence of admixture.
29 Population Structure
- 377 locations (loci) were sampled in 1000 people from 52 populations.
- 6 genetic clusters were obtained, which corresponded to 5 geographic regions (Rosenberg et al., Science 2002).
[Figure: cluster labels Oceania, Eurasia, East Asia, America, Africa]
30 Population sub-structure: research problem
- Systematically explore the effect of admixture. Can admixture be predicted for a locus, or for an individual?
- The sampling approach may or may not be appropriate. Formulate as an optimization/learning problem.
- (w/out admixture) Assign individuals to sub-populations so as to maximize linkage equilibrium and Hardy-Weinberg equilibrium in each of the sub-populations.
- (w/ admixture) Assign (individuals, loci) to sub-populations.
31 Admixture mapping