Title: Inferring Local Tree Topologies for SNP Sequences Under Recombination in a Population
1. Inferring Local Tree Topologies for SNP Sequences Under Recombination in a Population
- Yufeng Wu
- Dept. of Computer Science and Engineering
- University of Connecticut, USA
2. Genetic Variations
DNA sequences (columns are sites):
AATGTAGCCGA
AATATAACCTA
AATGTAGCCGT
AATGTAACCTA
CATATAGCCGT
Each SNP induces a split.
- Single-nucleotide polymorphism (SNP): a site (genomic location) where two types of nucleotides occur frequently in the population.
- Haplotype: a binary vector of SNPs (encoded as 0/1).
- Haplotypes offer hints on genealogy.
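As a rough illustration of the 0/1 encoding and of the split induced by a SNP, the sketch below uses the five sequences from this slide (the last one read as CATATAGCCGT). The function names are hypothetical, every two-allele column is treated as a SNP regardless of allele frequency, and the allele of the first sequence is arbitrarily encoded as 0; all three choices are assumptions made for the example.

```python
def snp_haplotypes(seqs):
    """Return (SNP column indices, binary haplotypes) for aligned sequences.

    Assumption: any column with exactly two alleles counts as a SNP, and the
    allele carried by the first sequence is encoded as 0.
    """
    ref = seqs[0]
    snp_cols = [j for j in range(len(ref)) if len({s[j] for s in seqs}) == 2]
    haps = ["".join("0" if s[j] == ref[j] else "1" for j in snp_cols)
            for s in seqs]
    return snp_cols, haps


def split_at(haps, k):
    """Split of the sequences (by index) induced by the k-th SNP."""
    ones = frozenset(i for i, h in enumerate(haps) if h[k] == "1")
    zeros = frozenset(range(len(haps))) - ones
    return zeros, ones


seqs = ["AATGTAGCCGA", "AATATAACCTA", "AATGTAGCCGT",
        "AATGTAACCTA", "CATATAGCCGT"]
snp_cols, haps = snp_haplotypes(seqs)
print(snp_cols)           # [0, 3, 6, 9, 10]
print(haps)               # ['00000', '01110', '00001', '00110', '11001']
print(split_at(haps, 1))  # sequences 0, 2, 3 versus sequences 1, 4
```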
3. Genealogy: Evolutionary History of Genomic Sequences
- Tells how individuals in a population are related.
- Helps to explain diseases: disease mutations occur on branches, and all descendants carry the mutations.
- Problem: how to determine the genealogy for unrelated individuals?
- Complicated by recombination.
[Figure: genealogy of individuals in the current population, labeled diseased (case) or healthy (control).]
4. Recombination
- One of the principal genetic forces shaping sequence variation within species.
- Two equal-length sequences generate a third, new equal-length sequence in the genealogy.
- Spatial order is important: different parts of the genome are inherited from different ancestors.
110001111111001
000110000001111
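A minimal sketch of a single-crossover recombination between the two sequences shown on this slide. The function name and the breakpoint value are hypothetical; the slide does not say where the crossover falls, so position 4 is chosen only for illustration.

```python
def recombine(left_parent, right_parent, breakpoint):
    """Single-crossover recombinant: sites before the breakpoint come from
    left_parent, sites from the breakpoint onward come from right_parent."""
    if len(left_parent) != len(right_parent):
        raise ValueError("parents must have equal length")
    if not 0 <= breakpoint <= len(left_parent):
        raise ValueError("breakpoint out of range")
    return left_parent[:breakpoint] + right_parent[breakpoint:]


a = "110001111111001"
b = "000110000001111"
child = recombine(a, b, 4)  # breakpoint 4 is an arbitrary choice
print(child)                # 110010000001111
```

The child has the same length as both parents, and each site is inherited from exactly one of them, which is why the spatial order of sites matters.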
5. Ancestral Recombination Graph (ARG)
[Figure: an ARG with mutation and recombination events; node labels 10, 01, 00, 11.]
Sequences: S1 = 00, S2 = 01, S3 = 10, S4 = 11
Assumption: at most one mutation per site.
Sequences: S1 = 00, S2 = 01, S3 = 10, S4 = 10
6. Local Trees
- An ARG represents a set of local trees.
- Each tree covers a contiguous genomic region.
- No recombination between two sites implies the same local tree for both sites.
- Local tree topology: informative and useful.
[Figure: an ARG and its local trees: near sites 1 and 2, near site 2, and to the right of site 3.]
7. Inference of Local Tree Topologies
- Question: given SNP haplotypes, infer local tree topologies (one tree for each SNP site; ignore branch lengths).
- Hein (1990, 1993)
- Enumerate all possible tree topologies at each site
- Song and Hein (2003, 2005)
- Parsimony-based
- Local tree reconstruction can be formulated as inference on a hidden Markov model.
8. Local Tree Topologies
- Key technical difficulty
- Brute-force enumeration of local tree topologies is not feasible when the number of sequences is greater than 9.
- Cannot enumerate all tree topologies.
- Trivial solution: create a tree for a SNP containing the single split induced by the SNP.
- Always correct (assuming one mutation per site).
- But not very informative: need more refined trees!
Example SNP: A = 0, B = 0, C = 1, D = 0, E = 1, F = 0, G = 1, H = 0 (split {C, E, G} versus {A, B, D, F, H})
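To give a feel for why enumeration breaks down, the sketch below counts rooted binary tree topologies on n labeled leaves using the standard formula (2n - 3)!!, and builds the trivial single-split tree for the example SNP on this slide. Treating local trees as rooted is an assumption made here for the count, and the function names are hypothetical.

```python
def num_rooted_binary_topologies(n_leaves):
    """Number of rooted binary tree topologies on n labeled leaves: (2n-3)!!"""
    count = 1
    for k in range(3, 2 * n_leaves - 2, 2):  # odd factors 3, 5, ..., 2n-3
        count *= k
    return count


def trivial_split(snp):
    """The single split induced by one SNP, given as {sequence name: allele}."""
    ones = frozenset(name for name, allele in snp.items() if allele == 1)
    zeros = frozenset(snp) - ones
    return zeros, ones


for n in (5, 9, 10):
    print(n, num_rooted_binary_topologies(n))
# 5 105
# 9 2027025
# 10 34459425

snp = {"A": 0, "B": 0, "C": 1, "D": 0, "E": 1, "F": 0, "G": 1, "H": 0}
zeros, ones = trivial_split(snp)
print(sorted(zeros), sorted(ones))
# ['A', 'B', 'D', 'F', 'H'] ['C', 'E', 'G']
```

The count applies per site, so enumerating candidate topologies at every SNP multiplies the cost by the number of sites.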
9. How to do better? Neighboring Local Trees are Similar!
- Nearby SNP sites provide hints!
- Nearby local trees are often topologically similar.
- Recombination often alters only small parts of the trees.
- Key idea: reconstruct local trees by combining information from multiple nearby SNPs.
10. RENT: REfining Neighboring Trees
- Maintain for each SNP site a (possibly non-binary) tree topology.
- Initialize to a tree containing the split induced by the SNP.
- Gradually refine the trees by adding new splits to them.
- Splits are found by a set of rules (later).
- Splits added early may be more reliable.
- Stop when the trees are binary or enough information is recovered.
11. A Little Background: Compatibility
Matrix M (rows are sequences a-g, columns are sites 1-5):
   1 2 3 4 5
a  0 0 0 1 0
b  1 0 0 1 0
c  0 0 1 0 0
d  1 0 1 0 0
e  0 1 1 0 0
f  0 1 1 0 1
g  0 0 1 0 1
Sites 1 and 2 are compatible, but 1 and 3 are incompatible.
- Two sites (columns) p, q are incompatible if columns p, q contain all four ordered pairs (gametes) 00, 01, 10, 11. Otherwise, p and q are compatible.
- Easily extended to splits.
- A split s is incompatible with tree T if s is incompatible with any one split in T. Two trees are compatible if their splits are pairwise compatible.
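The compatibility test described above can be sketched as follows. The matrix is M from this slide, with its 35 entries read row by row as 7 sequences by 5 sites (an assumption that reproduces the stated result for sites 1, 2 and 3). Representing a tree as a plain list of splits, each split being a 0/1 column, is a simplification for the example, and the function names are hypothetical.

```python
from itertools import combinations

# Matrix M: rows are sequences a-g, columns are sites 1-5.
M = ["00010", "10010", "00100", "10100", "01100", "01101", "00101"]


def column(matrix, j):
    """0-based column j of the matrix as a string over the rows."""
    return "".join(row[j] for row in matrix)


def compatible(col_a, col_b):
    """Four-gamete test: incompatible only if 00, 01, 10 and 11 all occur."""
    return len(set(zip(col_a, col_b))) < 4


def split_compatible_with_tree(split, tree_splits):
    """A split is compatible with a tree if it is compatible with every split in it."""
    return all(compatible(split, s) for s in tree_splits)


def trees_compatible(tree_a, tree_b):
    """Two trees are compatible if their splits are pairwise compatible."""
    return all(compatible(a, b) for a in tree_a for b in tree_b)


incompatible_pairs = [
    (p + 1, q + 1)  # report sites 1-based, as on the slide
    for p, q in combinations(range(len(M[0])), 2)
    if not compatible(column(M, p), column(M, q))
]
print(incompatible_pairs)  # [(1, 3), (1, 4), (2, 5)]
```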
12. Fully-Compatible Region: Simple Case
- A region of consecutive SNP sites in which the SNPs are pairwise compatible.
- May indicate that no topology-altering recombination occurred within the region.
- Rule: for site s, add any such split to the tree at s.
- Compatibility is a very strong property and is unlikely to arise by chance.
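One way to sketch the fully-compatible-region rule is to grow a window of consecutive sites around s for as long as every new site is compatible with all sites already in the window. The greedy order used here (extend right first, then left) is an assumption, since a maximal region around s need not be unique. The data are the 7 x 5 compatibility example matrix M read row by row, and the function names are hypothetical.

```python
# Example matrix: rows are sequences a-g, columns are sites (0-based here).
M = ["00010", "10010", "00100", "10100", "01100", "01101", "00101"]


def sites_compatible(matrix, p, q):
    """Four-gamete test on 0-based columns p and q."""
    return len({(row[p], row[q]) for row in matrix}) < 4


def compatible_region(matrix, s):
    """Greedy fully-compatible window [lo, hi] of consecutive sites around s."""
    n_sites = len(matrix[0])
    lo = hi = s
    # Extend to the right while the next site fits with every site in the window.
    while hi + 1 < n_sites and all(
            sites_compatible(matrix, hi + 1, j) for j in range(lo, hi + 1)):
        hi += 1
    # Then extend to the left under the same condition.
    while lo - 1 >= 0 and all(
            sites_compatible(matrix, lo - 1, j) for j in range(lo, hi + 1)):
        lo -= 1
    return lo, hi


def splits_to_add(matrix, s):
    """Sites whose splits the rule would add to the tree at site s."""
    lo, hi = compatible_region(matrix, s)
    return [j for j in range(lo, hi + 1) if j != s]


print(compatible_region(M, 0))  # (0, 1)
print(compatible_region(M, 1))  # (1, 3)
print(splits_to_add(M, 4))      # [2, 3]
```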
13. Split Propagation: More General Rule
- Three consecutive sites 1, 2 and 3. Sites 1 and 2 are incompatible. Does site 3 matter for the tree at site 1?
- Trees at sites 1 and 2 are different.
- Suppose site 3 is compatible with sites 1 and 2. Then?
- Site 3 may indicate a shared subtree in both trees at sites 1 and 2.
- Rule: a split propagates in both directions until reaching an incompatible tree.
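The propagation rule might be sketched as below. Each tree is represented simply as the set of site indices whose splits it currently contains, sites are processed left to right, and a split stops at the first tree that already holds an incompatible split. The processing order and the data structure are assumptions made for this example, not details given on the slide; the data are the 7 x 5 compatibility example matrix M read row by row.

```python
# Example matrix: rows are sequences a-g, columns are sites (0-based here).
M = ["00010", "10010", "00100", "10100", "01100", "01101", "00101"]


def compatible(col_a, col_b):
    """Four-gamete test on two 0/1 columns."""
    return len(set(zip(col_a, col_b))) < 4


def propagate_splits(matrix):
    """Return, for each site, the set of sites whose splits end up in its tree."""
    n_sites = len(matrix[0])
    cols = ["".join(row[j] for row in matrix) for j in range(n_sites)]
    # Initialize every tree with the split induced by its own SNP.
    trees = [{j} for j in range(n_sites)]
    for s in range(n_sites):
        for step in (-1, 1):  # propagate left, then right
            t = s + step
            while 0 <= t < n_sites and all(
                    compatible(cols[s], cols[k]) for k in trees[t]):
                trees[t].add(s)
                t += step     # stops at the first incompatible tree
    return trees


trees = propagate_splits(M)
print([sorted(t) for t in trees])
# [[0, 1], [0, 1], [1, 2, 3], [1, 2, 3], [2, 3, 4]]
```

Because a split is checked against whatever a neighboring tree already contains, the outcome can depend on the order in which splits are added, which fits the remark that splits added early may be more reliable.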
14. Unique Refinement
- Consider the subtree with leaves 1, 2 and 3.
- Which refinement is more likely?
- Add the split of 1 and 2: the only split that is compatible with the neighboring tree T2.
- Rule: refine a non-binary node by the only split that is compatible with neighboring trees.
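A small sketch of the unique-refinement rule, under several assumptions: trees are treated as rooted and described by their clusters (leaf sets below a node), two clusters are compatible when they are disjoint or nested, and only refinements that join two children of the non-binary node are considered. The neighboring tree T2 is not given in the text, so the cluster {1, 2, 4} used here is hypothetical.

```python
from itertools import combinations


def clusters_compatible(a, b):
    """Two leaf sets fit in one rooted tree if they are disjoint or nested."""
    return a.isdisjoint(b) or a <= b or b <= a


def unique_refinement(children, neighbor_clusters):
    """Return the one pairwise join of children compatible with the neighbor
    tree, or None if zero or several candidates qualify."""
    candidates = [a | b for a, b in combinations(children, 2)]
    ok = [c for c in candidates
          if all(clusters_compatible(c, n) for n in neighbor_clusters)]
    return ok[0] if len(ok) == 1 else None


# Non-binary node with children 1, 2 and 3.
children = [frozenset({1}), frozenset({2}), frozenset({3})]
# Hypothetical neighboring tree T2 containing the cluster {1, 2, 4}.
t2_clusters = [frozenset({1, 2, 4})]

print(unique_refinement(children, t2_clusters))  # frozenset({1, 2})
print(unique_refinement(children, []))           # None: all three joins qualify
```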
15. One Subtree-Prune-Regraft (SPR) Event
- Recombination is simulated by SPR.
- The rest of the two trees (without the pruned subtrees) remains the same.
- Rule: find an identical subtree Ts in neighboring trees T1 and T2, such that the rest of T1 and T2 (with Ts removed) are compatible. Then jointly refine T1 - Ts and T2 - Ts before adding back Ts.
[Figure: subtree to prune.]
More complex rules are possible.
16. Simulation
- Hudson's program ms (with known coalescent local tree topologies): 100 datasets for each setting.
- Handles much larger data, and performs better than or similarly to Song and Hein's method on small data.
- Test: local tree topology recovery, scored by Song and Hein's shared-split measure.
[Figure: two panels of simulation results, labeled ? 15 and ? 50.]
17. Acknowledgement
- Software available upon request.
- More information available at http://www.engr.uconn.edu/ywu
- I want to thank:
- Yun S. Song
- Dan Gusfield