Title: Inference on population structure using multilocus genotype data: STRUCTURE V2.1. Pritchard, J.K., and Wen, W. (2004)
1. Inference on population structure using multi-locus genotype data: STRUCTURE V2.1
Pritchard, J.K., and Wen, W. (2004)
2. Model-based cluster analysis
- We assume that the data for each individual follow some statistical distribution
3. The mixture of normal distributions model (a very simple case)
4. N(0.098, 0.67) and N(5.098, 1.69)
[Figure: dot plot of observations from the two distributions]
5. The probability density functions (p.d.f.) are
6. or the mixture (each individual follows the distribution)
7. $f(y) = 0.7\,N(0.098, 0.67) + 0.3\,N(5.098, 1.69)$
[Figure: dot plot of observations from the mixture]
9. Membership probability
- $P(y_i \in \pi_1) + P(y_i \in \pi_2) = 1$
- $i = 1, 2, \ldots, n$ (individuals)
- if $P(y_i \in \pi_1) > P(y_i \in \pi_2)$ then $y_i \in \pi_1$
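The assignment rule can be sketched in a few lines of Python for the two-component mixture with weights 0.7 and 0.3. This is only an illustration under stated assumptions: the second argument of N(., .) is treated as a variance, the membership probability is computed by Bayes' rule from the mixture weights, and the function names are hypothetical.

```python
import math

def normal_pdf(y, mean, variance):
    """Density of a normal distribution with the given mean and variance."""
    return math.exp(-(y - mean) ** 2 / (2.0 * variance)) / math.sqrt(2.0 * math.pi * variance)

# Component parameters and mixing weights taken from the slides.
# Assumption: the second argument of N(., .) is a variance, not a standard deviation.
COMPONENTS = [(0.098, 0.67), (5.098, 1.69)]
WEIGHTS = [0.7, 0.3]

def membership_probabilities(y):
    """Probability that observation y belongs to each group (Bayes' rule)."""
    joint = [w * normal_pdf(y, m, v) for w, (m, v) in zip(WEIGHTS, COMPONENTS)]
    total = sum(joint)
    return [j / total for j in joint]

def assign(y):
    """Index (0 or 1) of the group with the larger membership probability."""
    probs = membership_probabilities(y)
    return probs.index(max(probs))

if __name__ == "__main__":
    for y in (0.0, 2.5, 5.0):
        print(y, [round(p, 3) for p in membership_probabilities(y)], assign(y))
```

The two probabilities always sum to 1, so comparing them is the same as asking whether the first one exceeds 0.5.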
10. Two variables $X_1$, $X_2$
11. Mixture of three bivariate normal distributions
12. with p.d.f.
13. where
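One standard way to write the density of a mixture of three bivariate normal distributions is given here. The symbols $\lambda_k$ (mixing weights), $\mu_k$ (mean vectors) and $\Sigma_k$ (covariance matrices) are assumed notation and may differ from the notation used on the slides.

```latex
f(x) = \sum_{k=1}^{3} \lambda_k \, \phi(x \mid \mu_k, \Sigma_k),
\qquad \lambda_k \ge 0, \quad \sum_{k=1}^{3} \lambda_k = 1,
\qquad x = (X_1, X_2)^{\top}

\phi(x \mid \mu_k, \Sigma_k) =
\frac{1}{2\pi \, |\Sigma_k|^{1/2}}
\exp\!\left( -\tfrac{1}{2} (x - \mu_k)^{\top} \Sigma_k^{-1} (x - \mu_k) \right)
```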
14. Inference on population structure using multi-locus genotype data: STRUCTURE V2.1
Pritchard, J.K., and Wen, W. (2004)
- Pritchard, Stephens, and Donnelly (2000)
- Falush, Stephens, and Pritchard (2003)
15. Main objective
- Assign individuals to populations on the basis of their genotypes, while simultaneously estimating population allele frequencies
16. Other objectives
- Begin with a set of predefined populations and classify individuals of unknown origin
- Identify the extent of admixture of individuals
- Infer the origin of particular loci in the sampled individuals
17. STRUCTURE is a model-based method of clustering
- (we must make assumptions about many parameters and distributions)
18. Four basic models
- Model without admixture
  - each individual is assumed to originate in one (and only one) of K populations
- Model with admixture
  - each individual is assumed to have inherited some proportion of its ancestry from each of K populations
19. Four basic models
- Linkage model
  - Chunks of chromosomes are derived as intact units from one or another of the K populations, and all allele copies on the same chunk derive from the same population.
  - The model takes the resulting correlations in ancestry into account
20. Four basic models
- F model
  - The populations are assumed to have all diverged from a common ancestral population at the same time, but the model allows them to have experienced different amounts of drift since the divergence event
21. Assumptions
- Our main modeling assumptions are Hardy-Weinberg equilibrium within populations and complete linkage equilibrium between loci within populations
- Loosely speaking, the idea here is that the model accounts for the presence of Hardy-Weinberg disequilibrium (HWD) or linkage disequilibrium (LD) by introducing population structure, and attempts to find population groupings that (as far as possible) are not in disequilibrium
22. Data
- Consider a sample of N individuals, each genotyped at L loci
- Assume that the individuals represent a mixture of K unobserved populations (K unknown)
- If diploid, we have an $N \times 2L$ data matrix X
- If n-ploid, X is $N \times$ (remainder of the expression is missing from the transcript)
- where $J_l$ is the number of alleles at the $l$-th locus
23. [Table: column layout of X, with two allele copies (j = 1, j = 2) for each locus l = 1, ..., L]
X is $N \times 2L$
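To make the $N \times 2L$ layout concrete, here is a minimal sketch that flattens diploid genotypes into one row per individual, with two consecutive columns per locus. The individuals, allele codes and the helper name `build_data_matrix` are hypothetical and chosen only for illustration.

```python
# Hypothetical genotypes: one (allele copy 1, allele copy 2) pair per locus.
genotypes = {
    "ind1": [(1, 2), (3, 3), (2, 1)],
    "ind2": [(2, 2), (1, 3), (1, 1)],
}

def build_data_matrix(genotypes):
    """Flatten per-locus allele pairs into one row of 2L entries per individual."""
    names = sorted(genotypes)
    matrix = []
    for name in names:
        row = []
        for allele_1, allele_2 in genotypes[name]:
            row.extend([allele_1, allele_2])
        matrix.append(row)
    return names, matrix

if __name__ == "__main__":
    names, X = build_data_matrix(genotypes)
    for name, row in zip(names, X):
        print(name, row)
```

With N = 2 individuals and L = 3 loci, the resulting matrix has 2 rows and 6 columns.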
24. Example
25. Model without admixture
- each individual is assumed to originate in one (and only one) of K populations
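26. P-matrix (allele frequencies by population)
[Table: column layout of P, with alleles j = 1 and j = 2 for each locus l = 1, ..., L]
$p_{klj}$ is the frequency of the $j$-th allele at the $l$-th locus in the $k$-th population, $k = 1, 2, \ldots, K$; $l = 1, 2, \ldots, L$; $j = 1, 2$ (diploid)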
27. z-vector (membership of the $i$-th individual in the $k$-th population)
- If the $i$-th individual is a member of the $k$-th population, then $z^{(i)} = k$
- $P(z^{(i)} = k)$ is the membership probability
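Under the model without admixture, the membership probability of one individual can be sketched as follows if the allele frequencies are treated as known. This is a simplification: the method itself estimates the frequencies at the same time. The sketch assumes independent loci and allele copies (Hardy-Weinberg and linkage equilibrium), a uniform prior over the K populations, and a hypothetical frequency table `P[k][l][j]`.

```python
import math

# Hypothetical allele frequencies P[k][l][j]:
# K = 2 populations, L = 2 loci, 2 alleles per locus.
P = [
    [[0.9, 0.1], [0.8, 0.2]],  # population 0
    [[0.2, 0.8], [0.3, 0.7]],  # population 1
]

def log_likelihood(genotype, pop_freqs):
    """Log probability of a diploid genotype given one population's frequencies."""
    total = 0.0
    for locus, (a1, a2) in enumerate(genotype):
        total += math.log(pop_freqs[locus][a1]) + math.log(pop_freqs[locus][a2])
    return total

def membership_posterior(genotype, P):
    """P(z = k | genotype, P) for each k, assuming a uniform prior over populations."""
    logs = [log_likelihood(genotype, pop_freqs) for pop_freqs in P]
    top = max(logs)  # subtract the maximum for numerical stability
    weights = [math.exp(value - top) for value in logs]
    total = sum(weights)
    return [w / total for w in weights]

if __name__ == "__main__":
    genotype = [(0, 0), (0, 1)]  # allele indices at locus 0 and locus 1
    print(membership_posterior(genotype, P))
```

Working in log space keeps the product over many loci from underflowing.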
28. Model with admixture
- each individual is assumed to have inherited some
proportion of its ancestry from each of K
populations
29. The P-matrix is the same as in the model above
Q-matrix (proportion of the genome of the $i$-th individual inherited from the $k$-th population), $i = 1, 2, \ldots, N$; $k = 1, 2, \ldots, K$
30. Z-matrix
$z_l^{(i,j)}$ is equal to $k$ if the $j$-th allele copy at the $l$-th locus of the $i$-th individual originated in the $k$-th population, $k = 1, 2, \ldots, K$; $l = 1, 2, \ldots, L$; $j = 1, 2$ (diploid)
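The roles of P, Q and Z in the admixture model can be illustrated by simulating one diploid individual: for each allele copy, first draw its population of origin from the individual's ancestry proportions, then draw the allele from that population's frequencies. This is a sketch under assumptions: unlinked loci, a hypothetical frequency table `P[k][l][j]`, and a hypothetical ancestry vector `q`.

```python
import random

def simulate_admixed_genotype(q, P, rng):
    """Draw origins z and alleles x for one diploid individual.

    q[k] is the ancestry proportion from population k;
    P[k][l][j] is the frequency of allele j at locus l in population k.
    """
    populations = list(range(len(q)))
    n_loci = len(P[0])
    z, x = [], []
    for locus in range(n_loci):
        z_locus, x_locus = [], []
        for _copy in range(2):  # two allele copies per locus (diploid)
            k = rng.choices(populations, weights=q)[0]
            alleles = list(range(len(P[k][locus])))
            j = rng.choices(alleles, weights=P[k][locus])[0]
            z_locus.append(k)
            x_locus.append(j)
        z.append(tuple(z_locus))
        x.append(tuple(x_locus))
    return z, x

if __name__ == "__main__":
    # Hypothetical values: K = 2 populations, L = 3 loci, 2 alleles per locus.
    P = [
        [[0.9, 0.1], [0.8, 0.2], [0.7, 0.3]],
        [[0.2, 0.8], [0.3, 0.7], [0.1, 0.9]],
    ]
    q = [0.75, 0.25]
    z, x = simulate_admixed_genotype(q, P, random.Random(1))
    print("origins:", z)
    print("alleles:", x)
```

Setting one entry of `q` to 1 recovers the model without admixture, since every allele copy then comes from the same population.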
31. F model
- The populations are assumed to have all diverged from a common ancestral population at the same time, but the model allows them to have experienced different amounts of drift since the divergence event
32. Ancestral population (diploid)
Conditional on $P_A$
33. F-model
$F_k$ is the drift rate of the $k$-th population and is related to Wright's $F_{ST}$
[Diagram relating $p_{klj}$ and $p_{Alj}$ through $F_k$]
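For orientation, the relation between the ancestral and population allele frequencies can be written as a Dirichlet distribution, assuming the parameterization of Falush, Stephens, and Pritchard (2003). Here $\mathcal{D}$ denotes the Dirichlet distribution and $J_l$ the number of alleles at locus $l$; the slide may have used a different form.

```latex
p_{kl\cdot} \mid p_{Al\cdot}, F_k \;\sim\;
\mathcal{D}\!\left(
  p_{Al1}\,\frac{1 - F_k}{F_k},\;
  p_{Al2}\,\frac{1 - F_k}{F_k},\;
  \ldots,\;
  p_{AlJ_l}\,\frac{1 - F_k}{F_k}
\right)
```

Under this form, a small $F_k$ gives large Dirichlet parameters, so the frequencies in population $k$ stay close to the ancestral ones, while a large $F_k$ allows them to drift far away.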
34. Interpreting $F_{ST}$
- Can range from 0 (no genetic differentiation) to 1 (fixation of alternative alleles).
- Wright's guidelines
  - 0 - 0.05, little differentiation
  - 0.05 - 0.15, moderate
  - 0.15 - 0.25, great
  - > 0.25, very great
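Wright's guidelines translate directly into a small lookup function. The handling of the boundary values 0.05 and 0.15 is an assumption (each upper bound is treated as inclusive), and the function name is hypothetical.

```python
def wright_category(fst):
    """Qualitative label for an F_ST value, following Wright's guidelines.

    Assumption: each upper bound (0.05, 0.15, 0.25) is inclusive.
    """
    if not 0.0 <= fst <= 1.0:
        raise ValueError("F_ST must lie between 0 and 1")
    if fst <= 0.05:
        return "little differentiation"
    if fst <= 0.15:
        return "moderate"
    if fst <= 0.25:
        return "great"
    return "very great"

if __name__ == "__main__":
    for value in (0.02, 0.10, 0.20, 0.40):
        print(value, wright_category(value))
```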
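35. The Dirichlet distribution
- The probability density of the Dirichlet distribution for variables $p = (p_1, p_2, \ldots, p_n)$
- with parameters $u = (u_1, u_2, \ldots, u_n)$
- is defined by
$$f(p; u) = \frac{\Gamma\left(\sum_{i=1}^{n} u_i\right)}{\prod_{i=1}^{n} \Gamma(u_i)} \prod_{i=1}^{n} p_i^{u_i - 1}, \qquad p_i \ge 0, \quad \sum_{i=1}^{n} p_i = 1, \quad u_i > 0$$
- The parameters $u_i$ can be interpreted as "prior observation counts"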
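The "prior observation counts" reading can be made concrete with the standard conjugate update: when allele counts are observed under a Dirichlet prior, the posterior is again Dirichlet with the prior parameters and the observed counts added together. The sketch below computes the resulting posterior mean frequencies; the numbers and the function name are hypothetical.

```python
def posterior_mean(prior_counts, observed_counts):
    """Posterior mean allele frequencies under a Dirichlet prior.

    The posterior parameters are prior_counts[i] + observed_counts[i],
    and the mean of a Dirichlet is each parameter divided by their sum.
    """
    posterior = [u + n for u, n in zip(prior_counts, observed_counts)]
    total = sum(posterior)
    return [c / total for c in posterior]

if __name__ == "__main__":
    prior = [1.0, 1.0, 1.0]   # hypothetical prior: one pseudo-observation per allele
    observed = [3, 0, 1]      # hypothetical allele counts in a sample
    print(posterior_mean(prior, observed))
```

An allele that was never observed still receives a non-zero estimated frequency, because its prior count acts as an observation made in advance.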
37. Kullback-Leibler
- The Kullback-Leibler divergence is non-negative and equals 0 only when the two distributions are identical. It is a measure of the discriminative power between the probability distributions of the two classes
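For two discrete distributions over the same set of outcomes, the divergence can be computed as the sum of $p_i \log(p_i / q_i)$. The sketch below uses natural logarithms and hypothetical example distributions; the convention for zero probabilities is stated in the comments.

```python
import math

def kl_divergence(p, q):
    """Kullback-Leibler divergence D(p || q) for discrete distributions.

    Terms with p_i = 0 contribute 0; if p_i > 0 while q_i = 0,
    the divergence is infinite.
    """
    total = 0.0
    for p_i, q_i in zip(p, q):
        if p_i > 0.0:
            if q_i == 0.0:
                return math.inf
            total += p_i * math.log(p_i / q_i)
    return total

if __name__ == "__main__":
    p = [0.5, 0.5]
    q = [0.9, 0.1]
    print(kl_divergence(p, p))  # identical distributions
    print(kl_divergence(p, q))  # different distributions
    print(kl_divergence(q, p))  # the divergence is not symmetric
```

Note that D(p || q) and D(q || p) generally differ, so the order of the two distributions matters.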