Title: Inference of Complex Genealogical Histories In Populations and Application in Mapping Complex Traits
1. Inference of Complex Genealogical Histories in Populations and Application in Mapping Complex Traits
- Yufeng Wu
- Dept. of Computer Science and Engineering
- University of Connecticut
2. Genealogy: Evolutionary History of Genomic Sequences
- Tells how sequences in a population are related
- Helps to explain diseases: disease mutations occur on branches, and all descendants carry the mutations
- Genealogy is unknown; we only have SNP haplotypes (binary sequences)
- Problem: inference of genealogy for unrelated haplotypes
- Not easy, partly due to recombination
[Figure: genealogy of the sequences in the current population; leaves are marked as diseased (case) or healthy (control)]
3. Recombination
- One of the principal genetic forces shaping sequence variation within species
- Two equal-length sequences generate a third, new equal-length sequence in the genealogy
- Spatial order is important: different parts of the genome are inherited from different ancestors.
110001111111001
1100 00000001111
000110000001111
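As a minimal sketch of the single-crossover picture above, the function below builds a recombinant that takes its prefix from one parent and its suffix from the other. The function name `recombine`, the toy parent sequences, and the breakpoint position are illustrative assumptions and are not taken from the slide.

```python
def recombine(left_parent, right_parent, breakpoint):
    """Return a single-crossover recombinant of two equal-length sequences.

    Sites before `breakpoint` come from `left_parent`; sites from
    `breakpoint` onward come from `right_parent`.
    """
    if len(left_parent) != len(right_parent):
        raise ValueError("parents must have equal length")
    if not 0 <= breakpoint <= len(left_parent):
        raise ValueError("breakpoint out of range")
    return left_parent[:breakpoint] + right_parent[breakpoint:]


# Toy example (hypothetical data): crossover after the third site.
child = recombine("11111111", "00000000", 3)
print(child)  # 11100000
```

Because the prefix and suffix have different ancestors, the order of sites along the sequence determines which ancestor each site is inherited from.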
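4. Ancestral Recombination Graph (ARG)
[Figure: ARG diagrams with mutation and recombination events marked; nodes are labeled with two-site sequences (00, 01, 10, 11). Leaf sequences shown: S1 = 00, S2 = 01, S3 = 10, S4 = 11, and S1 = 00, S2 = 01, S3 = 10, S4 = 10.]
Assumption: at most one mutation per site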
5. What Is the Use of an ARG?
We may look at the ARG directly. But for noisy data there is another way of using ARGs: an ARG represents a set of local trees!
Data: 0000, 0101, 0110, 1110, 1010
Local trees: evolutionary history for different genomic regions between recombination breakpoints.
[Figure: an ARG for the data with its local trees; nodes are labeled with sequences such as 0000, 0100, 0010, 1010, 0110, 0101 and 1110]
6. At Which Local Tree Did Disease Mutations Occur?
- Clear separation of cases/controls is not expected for complex diseases
[Figure: local tree with leaves labeled Case or Control]
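One simple way to score how well a tree branch separates cases from controls is a Pearson chi-square statistic on the 2x2 table "leaf is under the branch" versus "leaf is a case". The sketch below is only an illustration of that idea; the function names, the toy leaves, and the choice of statistic are assumptions, and it is not claimed to be the scoring used by any of the programs discussed in these slides.

```python
def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic (no continuity correction) for [[a, b], [c, d]]."""
    n = a + b + c + d
    margins = (a + b) * (c + d) * (a + c) * (b + d)
    if margins == 0:
        return 0.0  # a row or column is empty: no signal can be measured
    return n * (a * d - b * c) ** 2 / margins


def branch_score(clade, status):
    """Score one branch: `clade` is the set of leaves below it,
    `status` maps each leaf to True (case) or False (control)."""
    a = sum(1 for leaf, case in status.items() if leaf in clade and case)
    b = sum(1 for leaf, case in status.items() if leaf in clade and not case)
    c = sum(1 for leaf, case in status.items() if leaf not in clade and case)
    d = sum(1 for leaf, case in status.items() if leaf not in clade and not case)
    return chi_square_2x2(a, b, c, d)


# Hypothetical toy data: the clade holds three cases, one case lies outside it.
status = {"h1": True, "h2": True, "h3": True,
          "h4": False, "h5": False, "h6": True}
score = branch_score({"h1", "h2", "h3"}, status)
print(score)
```

The imperfect split in the toy data (one case outside the clade) reflects the point above: for complex diseases a branch is expected to enrich for cases, not to separate them cleanly.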
7. How to Infer ARGs
- But we do not know the true ARG!
- Goal: infer ARGs from haplotypes
- First practical ARG association mapping method (Minichiello and Durbin 2006)
  - Uses plausible ARGs: a heuristic
  - Less complex disease model: implicitly assumes one disease mutation with major effects
- My results (Wu, RECOMB 2007)
  - Generates ARGs with a provable property and works on a well-defined complex disease model
  - Focuses on parsimonious history
8. Simulation Results (Wu 2007)
- TMARG/MARGARITA: sample ARGs, decompose them into local trees, and look for association signals.
- LATAG: infer local trees at focal points.
- Average mapping error for 50 simulated datasets from Zollner and Pritchard
Comparison: TMARG (minARGs), TMARG (near-minARGs), LATAG (Z. & P.), MARGARITA (M. & D.).
TMARG (my program) and MARGARITA are much faster than LATAG.
9. Preliminary Results: GAW16 Data
- GAW16 data from the North American Rheumatoid Arthritis Consortium (NARAC): 868 cases and 1194 controls. Chromosome one: 40929 SNPs.
- Running TMARG on large-scale data
  - Break into non-overlapping windows
  - Run fastPHASE (Scheet and Stephens 2006) to obtain haplotypes
  - Run TMARG in chi-square mode
Caution: more investigation needed.
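The windowing step of the pipeline above can be sketched as follows. The slide does not state the window size, so the value 100 used here is purely a placeholder, and `non_overlapping_windows` is a hypothetical helper name.

```python
def non_overlapping_windows(n_sites, window_size):
    """Split site indices 0..n_sites-1 into consecutive half-open windows
    (start, end); the last window may be shorter than `window_size`."""
    if window_size <= 0:
        raise ValueError("window_size must be positive")
    return [(start, min(start + window_size, n_sites))
            for start in range(0, n_sites, window_size)]


# 40929 SNPs as on the slide; the window size of 100 is an assumed value.
windows = non_overlapping_windows(40929, 100)
print(len(windows), windows[0], windows[-1])
```

Each window would then be phased and analyzed on its own, which keeps the per-run problem size bounded.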
10. A Related Problem: Inference of Local Tree Topologies Directly (Wu 2008, submitted)
11. Inference of Local Tree Topologies
- Recall: an ARG represents a set of local trees.
- Question: given SNP haplotypes, infer local tree topologies (one tree for each SNP site; ignore branch lengths)
- Hein (1990, 1993)
- Song and Hein (2003, 2005): enumerate all possible tree topologies at each site
- Parsimony-based
12. Local Tree Topologies
- Key technical difficulty: enumerating all tree topologies
- Brute-force enumeration of local tree topologies is not feasible when the number of sequences is greater than 9
- Trivial solution: create a tree for a SNP containing the single split induced by the SNP.
- Always correct (assuming one mutation per site)
- But not very informative: we need more refined trees!
[Figure: example SNP with alleles A = 0, B = 0, C = 1, D = 0, E = 1, F = 0, G = 1, H = 0]
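To make the trivial solution concrete, the sketch below computes the split induced by the example SNP on this slide: sequences carrying allele 0 go on one side and sequences carrying allele 1 on the other. The function name `split_from_snp` and the dictionary representation are assumptions chosen for illustration.

```python
def split_from_snp(alleles):
    """Return the split (zero_side, one_side) induced by one binary SNP.

    `alleles` maps each sequence name to its allele (0 or 1).
    """
    zeros = frozenset(name for name, allele in alleles.items() if allele == 0)
    ones = frozenset(name for name, allele in alleles.items() if allele == 1)
    return zeros, ones


# Alleles of the example SNP shown on the slide.
snp = {"A": 0, "B": 0, "C": 1, "D": 0, "E": 1, "F": 0, "G": 1, "H": 0}
zeros, ones = split_from_snp(snp)
print(sorted(zeros), sorted(ones))
# ['A', 'B', 'D', 'F', 'H'] ['C', 'E', 'G']
```

A tree with only this one split says that C, E and G group together but gives no structure within either side, which is why more refined trees are needed.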
13. How to Do Better: Neighboring Local Trees Are Similar!
- Nearby SNP sites provide hints!
- Nearby local trees are often topologically similar
- Recombination often alters only small parts of the trees
- Key idea: reconstruct local trees by combining information from multiple nearby SNPs
14. RENT: REfining Neighboring Trees
- Maintain for each SNP site a (possibly non-binary) tree topology
- Initialize to a tree containing the split induced by the SNP
- Gradually refine the trees by adding new splits to them
- Splits are found by a set of rules (later)
- Splits added early may be more reliable
- Stop when the trees are binary or enough information is recovered
15. A Little Background: Compatibility
M:
   1 2 3
a  0 0 0
b  1 0 0
c  0 0 1
d  1 0 1
e  0 1 1
Sites 1 and 2 are compatible, but sites 1 and 3 are incompatible.
- Two sites (columns) p, q are incompatible if columns p, q contain all four ordered pairs (gametes): 00, 01, 10, 11. Otherwise, p and q are compatible.
- Easily extended to splits.
- A split s is incompatible with tree T if s is incompatible with any one split in T. Two trees are compatible if their splits are pairwise compatible.
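The pairwise test defined above (often called the four-gamete test) can be sketched in a few lines. The code uses the matrix M from this slide read row by row (a = 000, b = 100, c = 001, d = 101, e = 011); the function name `sites_compatible` and the 0-based site indices are assumptions of this sketch, and the data are assumed to be binary.

```python
# Matrix M from the slide: one row per sequence, one character per site.
M = {"a": "000", "b": "100", "c": "001", "d": "101", "e": "011"}


def sites_compatible(haplotypes, p, q):
    """Return True unless columns p and q (0-based) show all four gametes."""
    gametes = {(h[p], h[q]) for h in haplotypes}
    return len(gametes) < 4  # binary data: 4 distinct pairs means 00, 01, 10, 11


rows = list(M.values())
print(sites_compatible(rows, 0, 1))  # sites 1 and 2 on the slide: True
print(sites_compatible(rows, 0, 2))  # sites 1 and 3 on the slide: False
```

For sites 1 and 3 the rows a, b, c and d contribute the pairs 00, 10, 01 and 11, which is exactly the incompatible pattern.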
16. Fully-Compatible Region: Simple Case
- A region of consecutive SNP sites where these SNPs are pairwise compatible.
- May indicate that no topology-altering recombination occurred within the region
- Rule: for site s, add the split of any SNP in such a region to the tree at s.
- Compatibility is a very strong property and is unlikely to arise by chance.
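A fully-compatible region around a site can be found by growing an interval one site at a time and keeping a new site only if it is compatible with every site already inside. The sketch below does this greedily, trying the left neighbor before the right one in each round; that order is an assumption of this sketch (a site can belong to more than one maximal region), as are the function names and the 0-based indices. The test data are the rows of the matrix M from the compatibility slide.

```python
def sites_compatible(haplotypes, p, q):
    """Four-gamete test on binary columns p and q (0-based)."""
    return len({(h[p], h[q]) for h in haplotypes}) < 4


def compatible_region(haplotypes, s):
    """Greedily grow an interval [lo, hi] around site s in which all
    sites are pairwise compatible; returns (lo, hi), both inclusive."""
    n_sites = len(haplotypes[0])
    lo = hi = s
    grew = True
    while grew:
        grew = False
        if lo > 0 and all(sites_compatible(haplotypes, lo - 1, j)
                          for j in range(lo, hi + 1)):
            lo -= 1
            grew = True
        if hi < n_sites - 1 and all(sites_compatible(haplotypes, hi + 1, j)
                                    for j in range(lo, hi + 1)):
            hi += 1
            grew = True
    return lo, hi


# Rows a..e of matrix M: sites 1 and 3 (indices 0 and 2) are incompatible.
rows = ["000", "100", "001", "101", "011"]
print(compatible_region(rows, 0))  # (0, 1)
print(compatible_region(rows, 2))  # (1, 2)
```

Under the rule on this slide, the splits of all SNPs inside the returned interval would be candidates to add to the tree at s.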
17. Split Propagation: More General Rule
- Three consecutive sites 1, 2 and 3. Sites 1 and 2 are incompatible. Does site 3 matter for the tree at site 1?
- Trees at sites 1 and 2 are different.
- Suppose site 3 is compatible with sites 1 and 2. Then site 3 may indicate a shared subtree in both trees at sites 1 and 2.
- Rule: a split propagates in both directions until reaching an incompatible tree.
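The propagation rule can be sketched with each tree stored as a set of splits and each split stored as the set of taxa on one of its sides. This is a simplified illustration under those assumptions, not the RENT implementation; the names `splits_compatible`, `tree_accepts` and `propagate`, and the three-site toy data, are hypothetical.

```python
def splits_compatible(a, b, taxa):
    """Two splits are compatible if some side of one misses some side of the other."""
    sides_a, sides_b = (a, taxa - a), (b, taxa - b)
    return any(not (x & y) for x in sides_a for y in sides_b)


def tree_accepts(tree, split, taxa):
    """A split fits a tree if it is compatible with every split in the tree."""
    return all(splits_compatible(split, s, taxa) for s in tree)


def propagate(trees, origin, split, taxa):
    """Add `split` to neighboring trees, leftward and rightward from
    `origin`, stopping in each direction at the first incompatible tree."""
    for step in (-1, 1):
        i = origin + step
        while 0 <= i < len(trees) and tree_accepts(trees[i], split, taxa):
            trees[i].add(split)
            i += step


taxa = frozenset("abcde")
# Toy data: sites 1 and 2 are incompatible, site 3 is compatible with both.
site_splits = [frozenset("ab"), frozenset("bc"), frozenset("de")]
trees = [{s} for s in site_splits]  # each tree starts with its own SNP split

propagate(trees, 2, site_splits[2], taxa)  # site 3 reaches sites 2 and 1
propagate(trees, 0, site_splits[0], taxa)  # site 1 is blocked at site 2
print([sorted("".join(sorted(s)) for s in t) for t in trees])
# [['ab', 'de'], ['bc', 'de'], ['de']]
```

In the toy run the split from site 3 ends up in the trees at both site 1 and site 2, matching the shared-subtree intuition, while the split from site 1 never reaches site 3 because the tree at site 2 blocks it.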
18. One Subtree-Prune-Regraft (SPR) Event
- Recombination can be simulated by SPR.
- The rest of the two trees (without the pruned subtrees) remains the same
- Rule: find a compatible subtree Ts in neighboring trees T1 and T2 such that the rest of T1 and T2 (with Ts removed) are compatible. Then jointly refine T1 - Ts and T2 - Ts before adding back Ts.
More complex rules are possible.
19. Simulation
- Hudson's program MS (with known coalescent local tree topologies): 100 datasets for each setting.
- Handles much larger data, and performs better than or similarly to Song and Hein's method on small data.
- Test: local tree topology recovery, scored by Song and Hein's shared-split measure
[Figure: simulation result plots, with labels 15 and 50]
20. Acknowledgement
- More information available at http://www.engr.uconn.edu/ywu
- I want to thank
- Dan Gusfield
- Yun S. Song
- Charles Langley
- Dan Brown
- And National Science Foundation and UConn
Research Foundation