Estimating and Reconstructing Recombination in Populations: Problems in Population Genomics

1
Estimating and Reconstructing Recombination in
Populations: Problems in Population Genomics
  • Dan Gusfield
  • UC Davis

Different parts of this work are joint with
Satish Eddhu, Charles Langley, Dean Hickerson,
Yun Song, Yufeng Wu, Z. Ding
INCOB06, December 20, 2006, New Delhi, India
2
What is population genomics?
  • The Human genome sequence is done.
  • Now we want to sequence many individuals in a
    population to correlate similarities and
    differences in their sequences with genetic
    traits (e.g. disease or disease susceptibility).
  • Presently, we can't sequence large numbers of
    individuals, but we can sample the sequences at
    SNP sites.

3
SNP Data
  • A SNP is a Single Nucleotide Polymorphism - a
    site in the genome where two different
    nucleotides appear with sufficient frequency in
    the population (say each with 5% frequency or
    more). Hence binary data.
  • SNP maps are being compiled with a density of
    about 1 site per 1000 bases.
  • SNP data is what is mostly collected in
    populations - it is much cheaper to collect than
    full sequence data, and focuses on variation in
    the population, which is what is of interest.

4
Haplotype Map Project: HAPMAP
  • NIH-led project ($100M) to find common SNP
    haplotypes (SNP sequences) in the human
    population.
  • Association mapping: HAPMAP is used to try to
    associate genetically influenced diseases with
    specific SNP haplotypes, to either find causal
    haplotypes, or to find the region near causal
    mutations.
  • The key to the logic of Association mapping is
    historical recombination in populations. Nature
    has done the experiments, now we try to make
    sense of the results.

5
Our work: Reconstructing the Evolution of SNP
Sequences
  • I. Clean mathematical and algorithmic results:
    galled-trees, near-uniqueness, graph-theory lower
    bound, and the Decomposition theorem.
  • II. Practical computation of lower and upper
    bounds on the number of recombinations needed.
    Construction of (optimal) phylogenetic networks;
    uniform sampling; haplotyping with ARGs.
  • III. Extension to gene conversion.
  • IV. Applications.

6
Perfect Phylogeny: Where it all starts
7
The Perfect Phylogeny Model for the History of
SNP sequences

Only one mutation per site allowed.

[Figure: a perfect phylogeny on sites 1-5. The
ancestral sequence 00000 is at the root, site
mutations 1-5 label the edges, and the extant
sequences are at the leaves.]

The tree derives the set M:
10100
10000
01011
01010
00010
8
When can a set of sequences be derived on a
perfect phylogeny?
  • Classic NASC (necessary and sufficient
    condition): Arrange the sequences in a matrix.
    Then (with no duplicate columns), the sequences
    can be generated on a unique perfect phylogeny if
    and only if no two columns (sites) contain all
    four pairs: 0,0 and 0,1 and 1,0 and 1,1.

This is the 4-Gamete Test.
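
As a minimal sketch of the test, the snippet below checks every pair of columns for all four pairs. The function name `conflicting_pairs` and the string encoding of sequences are illustrative assumptions, not part of the talk; the data is the set M from the perfect phylogeny slide, plus the sequence 10101 that is added on the next slide.

```python
from itertools import combinations

def conflicting_pairs(seqs):
    """Return the 1-based site pairs whose columns contain all four gametes."""
    n_sites = len(seqs[0])
    all_four = {("0", "0"), ("0", "1"), ("1", "0"), ("1", "1")}
    conflicts = []
    for i, j in combinations(range(n_sites), 2):
        gametes = {(s[i], s[j]) for s in seqs}  # pairs seen at sites i, j
        if gametes == all_four:
            conflicts.append((i + 1, j + 1))
    return conflicts

M = ["10100", "10000", "01011", "01010", "00010"]
print(conflicting_pairs(M))              # no pair fails: a perfect phylogeny exists
print(conflicting_pairs(M + ["10101"]))  # the failing pairs include (4, 5)
```

The check is quadratic in the number of sites, which is fine for a sketch; it only decides whether a perfect phylogeny exists and does not build the tree.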
9
A richer model

M with one sequence added:
10100
10000
01011
01010
00010
10101 (added)

[Figure: the perfect phylogeny for the first five
sequences, with sites 12345, root 00000 and site
mutations 1-5 on the edges.]

Pair 4, 5 fails the four-gamete test. The sites
4, 5 conflict.

Real sequence histories often involve
recombination.
10
Sequence Recombination
[Figure: single crossover recombination. Parent P
is 10100, parent S is 01011, and the recombination
point is 5. The result is 10101.]

A recombination of P and S at recombination point
5.
The first 4 sites come from P (Prefix) and the
sites from 5 onward come from S (Suffix).
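
The crossover on this slide can be sketched in a few lines. The function name `recombine` and the 1-based `point` argument are assumptions made for illustration; the inputs are the P and S sequences from the slide.

```python
def recombine(prefix_parent, suffix_parent, point):
    """Single crossover: sites 1..point-1 come from prefix_parent and
    sites point..m come from suffix_parent (sites are 1-based)."""
    if len(prefix_parent) != len(suffix_parent):
        raise ValueError("parents must have the same number of sites")
    if not 1 <= point <= len(prefix_parent):
        raise ValueError("recombination point out of range")
    return prefix_parent[:point - 1] + suffix_parent[point - 1:]

P, S = "10100", "01011"
child = recombine(P, S, 5)
print(child)  # 10101
```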
11
Network with Recombination

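M with the new sequence:
10100 10000 01011 01010 00010 10101 (new)

[Figure: the previous tree (sites 12345, root
00000, site mutations 1-5 on the edges) extended
with one recombination node. Its parents, labeled
P and S, recombine at point 5 to produce 10101.]

The previous tree with one recombination event
now derives all the sequences.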
12
A Phylogenetic Network or ARG
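[Figure: a phylogenetic network (ARG) with root
00000, site mutations 1-5 on the edges, and
recombination nodes whose parents are labeled P
and S. The leaves are:]
a: 00010
b: 10010
c: 00100
d: 10100
e: 01100
f: 01101
g: 00101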
13
Minimizing recombinations in Phylogenetic
networks
  • Problem: given a set of sequences M, find a
    phylogenetic network generating M, minimizing the
    number of recombinations used to generate M.
  • The minimization objective is a rough, but
    useful, reflection of the true number of
    observable recombinations that have occurred
    in the derivation of M.

14
Minimization is an NP-hard Problem
  • There is no known efficient solution to this
    problem and there likely will never be one.

What we can do:
  • Solve special cases optimally with efficient
    algorithms (galled-trees).
  • Solve small data-sets optimally with algorithms
    that are not provably efficient but work well
    in practice.
  • Efficiently compute lower and upper bounds on
    the number of needed recombinations (HapBound,
    SHRUB).
15
Galled-Trees: an efficient special case
16
Definition: Recombination Cycle
  • In a Phylogenetic Network, with a recombination
    node x, if we trace two paths backwards from x,
    then the paths will eventually meet.
  • The cycle specified by those two paths is called
    a recombination cycle.

17
Galled-Trees
  • A phylogenetic network where no recombination
    cycles share an edge is called a galled tree.
  • A cycle in a galled-tree is called a gall.
  • Problem: If haplotype matrix M cannot be
    generated on a true tree, can it be generated on
    a galled-tree?

18
Incompatibility Graph
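[Figure: a network for the sequences below, with
site mutations 1-5 on the edges and recombination
nodes whose parents are labeled p and s, shown
together with the incompatibility graph on sites
1-5.]
a: 00010
b: 10010
c: 00100
d: 10100
e: 01100
f: 01101
g: 00101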
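
A minimal sketch of how an incompatibility graph can be built, assuming the standard definition: one node per site, and an edge between two sites whenever the pair fails the four-gamete test. The function names are illustrative, and the input is the set of sequences a to g listed on this slide.

```python
from itertools import combinations

def incompatibility_graph(seqs):
    """Edges join 1-based sites that fail the four-gamete test."""
    m = len(seqs[0])
    edges = set()
    for i, j in combinations(range(m), 2):
        if len({(s[i], s[j]) for s in seqs}) == 4:  # all four gametes seen
            edges.add((i + 1, j + 1))
    return edges

def components(n_sites, edges):
    """Connected components that contain at least one edge."""
    adj = {v: set() for v in range(1, n_sites + 1)}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    seen, comps = set(), []
    for start in adj:
        if start in seen or not adj[start]:
            continue  # already visited, or an isolated site
        stack, comp = [start], []
        seen.add(start)
        while stack:
            u = stack.pop()
            comp.append(u)
            for w in adj[u] - seen:
                seen.add(w)
                stack.append(w)
        comps.append(sorted(comp))
    return comps

seqs = ["00010", "10010", "00100", "10100", "01100", "01101", "00101"]
edges = incompatibility_graph(seqs)
print(sorted(edges))         # [(1, 3), (1, 4), (2, 5)]
print(components(5, edges))  # [[1, 3, 4], [2, 5]]
```

For this data the conflicting sites fall into two groups, sites 1, 3, 4 and sites 2, 5.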
19
Results about galled-trees
  • Theorem: There is an efficient (provably
    polynomial-time) algorithm to determine whether
    or not any haplotype set M can be derived on a
    galled-tree.
  • Theorem: A galled-tree (if one exists) produced
    by the algorithm minimizes the number of
    recombinations used over all possible
    phylogenetic networks.
  • Theorem: If M can be derived on a galled-tree,
    then the galled-tree is nearly unique. This
    is important for biological conclusions derived
    from the galled-tree.

Gusfield et al. papers from 2003-2005.
20
Elaboration on Near Uniqueness
Theorem: The number of arrangements
(permutations) of the sites on any gall is at
most three, and this happens only if the gall has
two sites. If the gall has more than two sites,
then the number of arrangements is at most
two. If the gall has four or more sites, with at
least two sites on each side of the recombination
point (not the side of the gall), then the
arrangement is forced and unique.

Theorem: All other features of the galled-trees
for M are invariant.
21
Efficient Bounding Algorithms
  • We cannot efficiently compute the exact minimum
    number of needed recombinations, in general, but
    we can efficiently compute close lower and upper
    bounds on the minimum number.
  • The bounds and the computations to obtain them
    have many practical applications.

22
The general composite lower bound method (S.
Myers 2002)
Given a set of intervals on the line, and for
each interval I, a number N(I), which is a
(local) lower bound on the number of
recombinations needed in interval I, define Vmin
as the minimum number of vertical lines needed so
that every interval I intersects at least N(I)
of the vertical lines. Vmin is a valid lower
bound on the total number of recombinations
needed in the whole data. Vmin is called
a composite bound. Vmin is easy to compute by a
left-to-right myopic algorithm.
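
One way to realize the left-to-right myopic algorithm is sketched below. It rests on assumptions that are not stated on the slide: intervals are given as (left site, right site, N(I)) with integer sites, a vertical line sits in the gap between two adjacent sites, and several lines may share the same gap. The function name `composite_bound` is illustrative.

```python
def composite_bound(intervals):
    """intervals: list of (left_site, right_site, local_bound), left < right.
    Gap g lies between sites g and g+1, so interval (l, r) contains gaps
    l..r-1. Returns (Vmin, number of lines placed in each gap)."""
    lines = {}  # gap -> number of vertical lines placed there
    # Sweep left to right by right endpoint.
    for left, right, need in sorted(intervals, key=lambda t: t[1]):
        have = sum(c for gap, c in lines.items() if left <= gap < right)
        if need > have:
            # Put the missing lines as far right as possible, so they
            # also help the intervals that end later.
            lines[right - 1] = lines.get(right - 1, 0) + (need - have)
    return sum(lines.values()), lines

vmin, placed = composite_bound([(1, 3, 1), (2, 5, 2), (4, 6, 1)])
print(vmin, placed)  # 2 {2: 1, 4: 1}
```

Placing each missing line at the rightmost gap of the current interval is the greedy choice: every interval processed later ends at least as far right, so no other position could cover more of them.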
23
The Composite Method (Myers & Griffiths 2003)
1. Given a set of intervals, and
2. for each interval I, a number N(I)
Composite Problem: Find the minimum number of
vertical lines so that every I intersects at
least N(I) vertical lines.
24
Haplotype (local) Lower Bound (S. Myers)
  • Rh = (number of distinct sequences (rows)) -
    (number of distinct sites (columns)) - 1 <=
    minimum number of recombinations needed
    (folklore).
  • Generally Rh is a really bad bound, often
    negative, when used on large intervals, but very
    good when used as a local bound on small
    intervals with the Composite Method, and other
    methods.
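
A minimal sketch of the haplotype bound, assuming sequences are given as 0/1 strings and sites are numbered from 1; the function name `haplotype_bound` is illustrative. The data is the six-sequence example used earlier in the talk.

```python
def haplotype_bound(seqs, sites=None):
    """Rh = distinct rows - distinct columns - 1, on the chosen 1-based sites."""
    sites = list(sites) if sites is not None else list(range(1, len(seqs[0]) + 1))
    rows = {tuple(s[i - 1] for i in sites) for s in seqs}
    cols = {tuple(s[i - 1] for s in seqs) for i in sites}
    return len(rows) - len(cols) - 1

M = ["10100", "10000", "01011", "01010", "00010"]
M6 = M + ["10101"]
print(haplotype_bound(M))           # -1: negative on the tree-like data
print(haplotype_bound(M6))          # 0: all five sites give no information
print(haplotype_bound(M6, [4, 5]))  # 1: restricting to sites 4, 5 gives a real bound
```

The last two lines illustrate the point of the slide: on the full interval the bound is uninformative, while on a small subset of sites it already shows that one recombination is needed.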

25
Composite Subset Method (Myers)
  • Let S be a subset of sites, and Rh(S) be the
    haplotype bound computed on the input sequences
    restricted to the sites in S. If the leftmost
    site in S is L and the rightmost site in S is R,
    then use Rh(S) as a local bound N(I) for interval
    I = [L, R].
  • Compute Rh(S) on many subsets, and then solve the
    composite problem to find a valid composite
    bound.

26
RecMin (S. Myers)
  • Computes local bounds using subsets of sites, but
    limits the size and the span of the subsets.
    Default parameters are s = 6, w = 15 (s = size,
    w = span).
  • Generally, it is impractical to set s and w
    large, so one generally doesn't know if
    increasing the parameters would increase the
    composite bound.
  • Still, RecMin often gives a bound more than three
    times the HK (Hudson-Kaplan) bound. Example: on
    the LPL data, HK gives 22 and RecMin gives 75.

27
Optimal RecMin Bound (ORB)
  • The Optimal RecMin Bound is the lower bound that
    RecMin would produce if both parameters were set
    to their maximum possible values.
  • In general, RecMin cannot compute the ORB in
    practical time.
  • We have developed a practical program, HAPBOUND,
    based on integer linear programming, that is
    guaranteed to compute the ORB, and have
    incorporated ideas that lead to even higher lower
    bounds than the ORB.

28
HapBound: The general approach
  • For an interval of sites I, let H(I) be the
    largest haplotype lower bound obtained from any
    subset of sites in I.
  • We have shown that we can efficiently compute
    H(I) by using integer linear programming.
  • We set N(I) = H(I) in the composite method, and
    the resulting composite bound is the ORB.
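
To make the definitions concrete, the sketch below computes H(I) by exhaustive enumeration of subsets and feeds N(I) = H(I) into the composite method. This is not the integer linear programming formulation that HapBound uses; brute force is swapped in only because it is short, and it is exponential in the interval length. All function names, the 0-based site indexing, and the assumption that vertical lines may share a gap are illustrative choices.

```python
from itertools import combinations

def haplotype_bound(seqs, sites):
    """distinct rows - distinct columns - 1 on the given 0-based sites."""
    rows = {tuple(s[i] for i in sites) for s in seqs}
    cols = {tuple(s[i] for s in seqs) for i in sites}
    return len(rows) - len(cols) - 1

def best_local_bound(seqs, left, right):
    """H(I): largest haplotype bound over all non-empty subsets of the
    sites left..right (0-based, inclusive), found by brute force."""
    sites = range(left, right + 1)
    best = 0  # zero recombinations is always a valid lower bound
    for k in range(1, len(sites) + 1):
        for subset in combinations(sites, k):
            best = max(best, haplotype_bound(seqs, subset))
    return best

def composite_from_local_bounds(seqs):
    """Set N(I) = H(I) for every interval, then solve the composite problem."""
    m = len(seqs[0])
    intervals = [(l, r, best_local_bound(seqs, l, r))
                 for l in range(m) for r in range(l + 1, m)]
    lines = {}  # gap g lies between sites g and g+1
    for l, r, need in sorted(intervals, key=lambda t: t[1]):
        have = sum(c for g, c in lines.items() if l <= g < r)
        if need > have:
            lines[r - 1] = lines.get(r - 1, 0) + need - have
    return sum(lines.values())

M6 = ["10100", "10000", "01011", "01010", "00010", "10101"]
print(composite_from_local_bounds(M6))  # 1
```

For the six-sequence example the result is 1, which matches the network shown earlier that derives all six sequences with a single recombination.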

29
HapBound vs. RecMin on LPL from Clark et al.
2 GHz PC
30
Example where RecMin has difficulty in finding
the ORB on a 25 by 376 data matrix
31
Constructing Optimal Phylogenetic Networks
  • Optimal = minimum number of recombinations.
    Called a Min ARG.

32
Kreitman's 1983 ADH Data
  • 11 sequences, 43 segregating sites
  • Both HapBound and SHRUB took only a fraction of a
    second to analyze this data.
  • Both produced 7 for the number of detected
    recombination events
  • Therefore, independently of all other
    methods, our lower and upper bound methods
    together imply that 7 is the minimum number of
    recombination events.

33
A Min ARG for Kreitman's data
ARG created by SHRUB
34
The Human LPL Data (Nickerson et al. 1998)
(88 Sequences, 88 sites)
Our new lower and upper bounds
Optimal RecMin Bounds
(We ignored insertion/deletion, unphased sites,
and sites with missing data.)
35
Application: Association Mapping
  • Given case-control data M, uniformly sample the
    minimum ARGs (in practice for small windows of
    fixed number of SNPs)
  • Build the marginal tree for each interval
    between adjacent recombination points in the ARG
  • Look for non-random clustering of cases in the
    tree; accumulate statistics over the trees to
    find the best mutation explaining the partition
    into cases and controls.

36
One Min ARG for the data
Input Data:
00101 10001 10011 11111 10000 00110
Seqs 0-2: cases. Seqs 3-5: controls.
37
The marginal tree for the interval past both
breakpoints
Input Data:
00101 10001 10011 11111 10000 00110
Seqs 0-2: cases. Seqs 3-5: controls.
38
(No Transcript)
39
Haplotyping (Phasing) genotypic data using a
Min ARG
40
Genotypes and Haplotypes
  • Each individual has two copies of each
    chromosome.
  • At each site, each chromosome has one of two
    alleles (states) denoted by 0 and 1 (motivated by
    SNPs).

0 1 1 1 0 0 1 1 0
1 1 0 1 0 0 1 0 0
Two haplotypes per individual

Merge the haplotypes

2 1 2 1 0 0 1 2 0
Genotype for the individual
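
The merge step can be sketched as follows, using the convention visible in the example: a site keeps its value where the two haplotypes agree and becomes 2 where they differ. The function name `merge_haplotypes` is an illustrative assumption.

```python
def merge_haplotypes(h1, h2):
    """Genotype: 0 or 1 where the haplotypes agree, 2 where they differ."""
    if len(h1) != len(h2):
        raise ValueError("haplotypes must have the same number of sites")
    return [a if a == b else 2 for a, b in zip(h1, h2)]

h1 = [0, 1, 1, 1, 0, 0, 1, 1, 0]
h2 = [1, 1, 0, 1, 0, 0, 1, 0, 0]
genotype = merge_haplotypes(h1, h2)
print(genotype)  # [2, 1, 2, 1, 0, 0, 1, 2, 0]
```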
41
Haplotyping Problem
  • Biological Problem: For disease association
    studies, haplotype data is more valuable than
    genotype data, but haplotype data is hard to
    collect. Genotype data is easy to collect.
  • Haplotyping (Phasing) Problem: Given a set of n
    genotypes, determine the original set of n
    haplotype pairs that generated the n genotypes.
    This is hopeless without a genetic model for the
    evolution of haplotype sequences.
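
The reason a model is needed can be seen by counting: a genotype with k sites of value 2 is explained by 2^(k-1) different unordered haplotype pairs, and the genotype alone gives no way to choose among them. The sketch below enumerates those pairs; the function name `phasings` and the list encoding are illustrative assumptions.

```python
from itertools import product

def phasings(genotype):
    """All unordered haplotype pairs that merge to the given genotype."""
    het = [i for i, g in enumerate(genotype) if g == 2]
    pairs = []
    # Fix the first heterozygous site to 0 in h1 so that each
    # unordered pair is listed exactly once.
    for bits in product([0, 1], repeat=max(len(het) - 1, 0)):
        choice = dict(zip(het, (0,) + bits))
        h1 = [choice.get(i, g) for i, g in enumerate(genotype)]
        h2 = [1 - choice[i] if i in choice else g
              for i, g in enumerate(genotype)]
        pairs.append((h1, h2))
    return pairs

g = [2, 1, 2, 1, 0, 0, 1, 2, 0]
candidates = phasings(g)
print(len(candidates))  # 4 candidate pairs for 3 heterozygous sites
for h1, h2 in candidates:
    print(h1, h2)
```

With n genotypes the numbers multiply, so the candidate solutions grow exponentially with the number of heterozygous sites.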

42
Haplotyping by Minimizing Recombinations
  • We want to haplotype genotypic data by finding
    those pairs of haplotypes that explain the
    genotypes and minimize the number of
    recombinations needed to derive the haplotypes.
    Minimizing recombination encodes the biology.

43
  • We have a branch and bound algorithm that finds
    the haplotypes minimizing the number of
    recombinations, building a Min ARG for deduced
    haplotypes. It works for genotype data with a
    small number of sites, but a larger number of
    genotypes.

44
Application: Detecting Recombination Hotspots
with Genotype Data
  • Bafna and Bansal (2005) use recombination lower
    bounds to detect recombination hotspots with
    haplotype data.
  • We apply our program to the genotype data.
  • Compute the minimum number of recombinations for
    all small windows with a fixed number of SNPs.
  • Plot a graph showing the minimum level of
    recombinations normalized by physical distance.
  • Initial results show this approach can give good
    estimates of the locations of the recombination
    hotspots.

45
Recombination Hotspots on Jeffreys et al. (2001)
Data
Result from Bafna and Bansal (2005), haplotype
data
Our result on genotype data
46
Application: Missing Data Imputation by
Constructing near-optimal ARGs
For ?? 5.
Datasets with about 1,000 entries
Datasets with about 10,000 entries
47
Haplotyping genotype data via a minimum ARG
  • Compare to the program PHASE, in order to try to
    understand why PHASE is so accurate.
  • Experience shows PHASE may give solutions whose
    recombination is close to the minimum.
  • Example: In all solutions of PHASE for three sets
    of case/control data from Steven Orzack,
    recombinations are minimized.
  • Simulation results: PHASE's solution minimizes
    recombination in 57 of 100 data sets (20 rows
    and 5 sites).

48
I would like to thank the organizers of INCOB
2006 for inviting me, and thank you for your
attention.
Papers and Software on
wwwcsif.cs.ucdavis.edu/gusfield