Gene Order Phylogeny

1
Gene Order Phylogeny
  • Tandy Warnow
  • The Program in Evolutionary Dynamics, Harvard
    University
  • The University of Texas at Austin

2
  • Cyber-Infrastructure for Phylogenetic RESearch
    (http://www.phylo.org)
  • Main research: large-scale phylogenetics,
    reticulate evolution, gene order phylogeny,
    complex simulations, and databases
  • Funded by an $11.6M ITR grant from NSF
  • 40 biologists, computer scientists, and
    mathematicians collaborating on the project

3
CIPRes Members
4
Limitations of DNA phylogenetics
  • Deep evolutionary histories may not be
    recoverable from DNA sequence phylogeny due to
    lack of specificity -- too much noise (homoplasy)
    and insufficient sequence length
  • The systematics community has looked to rare
    genomic changes for better sources of
    phylogenetic signal

5
Whole-Genome Phylogenetics
6
Genomes As Signed Permutations
1 5 3 4 -2 -6 or 6 2 -4 3 5 1, etc.
7
Genomes Evolve by Rearrangements
1 2 3 4 5 6 7 8 9 10
  • Inversion (Reversal)
  • Transposition
  • Inverted Transposition
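
The three rearrangement events listed above can be sketched on a signed gene order as follows. This is a minimal illustration, assuming a linear list representation and half-open index ranges; the function names and index conventions are hypothetical and not taken from any of the software discussed in this talk.

```python
from typing import List

Genome = List[int]  # signed gene order, e.g. [1, -3, 2]


def inversion(g: Genome, i: int, j: int) -> Genome:
    """Reverse the segment g[i:j] and flip the sign of every gene in it."""
    return g[:i] + [-x for x in reversed(g[i:j])] + g[j:]


def transposition(g: Genome, i: int, j: int, k: int) -> Genome:
    """Move the segment g[i:j] so it lands just before position k (k >= j)."""
    return g[:i] + g[j:k] + g[i:j] + g[k:]


def inverted_transposition(g: Genome, i: int, j: int, k: int) -> Genome:
    """Like a transposition, but the moved segment is also inverted."""
    moved = [-x for x in reversed(g[i:j])]
    return g[:i] + g[j:k] + moved + g[k:]


# The identity genome from the slide: 1 2 3 4 5 6 7 8 9 10
genome = list(range(1, 11))
print(inversion(genome, 1, 5))
print(transposition(genome, 1, 5, 8))
print(inverted_transposition(genome, 1, 5, 8))
```

Each function returns a new list, so the same starting genome can be reused to compare the effect of the three events on the segment 2 3 4 5.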

8
Other types of events
  • Duplications, Insertions, and Deletions (changes
    gene content)
  • Fissions and Fusions (for genomes with more than
    one chromosome)
  • These events change the number of copies of each
    gene in each genome (unequal gene content)

9
Genome Rearrangement Has A Huge State Space
  • DNA sequences: 4 states per site
  • Signed circular genomes with n genes:
    2^(n-1) (n-1)! states, 1 site
  • Circular genomes (1 site)
  • with 37 genes (mitochondria): about 2.56 x 10^52
    states
  • with 120 genes (chloroplasts): about 3.70 x 10^232
    states
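
As a rough check on the size of this state space, the sketch below counts signed circular gene orders. It assumes the standard counting argument: there are 2^n n! signed linear orders of n genes, and each circular genome corresponds to 2n of them (n rotations times 2 reading directions), giving 2^(n-1) (n-1)! distinct genomes. The function name is hypothetical.

```python
import math


def signed_circular_states(n: int) -> int:
    """Number of distinct signed circular gene orders on n genes.

    Assumes each gene appears exactly once, and that rotations and the
    reverse-complement reading describe the same genome.
    """
    return 2 ** (n - 1) * math.factorial(n - 1)


for n in (37, 120):
    count = signed_circular_states(n)
    print(f"n = {n}: about {count:.2e} states")
```

By comparison, a single DNA site has only 4 states, which is the contrast the slide draws.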

10
Why use gene orders?
  • Rare genomic changes: huge state space and
    relative infrequency of events (compared to site
    substitutions) could make the inference of deep
    evolution easier, or more accurate.
  • Our research shows this is true, but accurate
    analysis of gene order data is computationally
    very intensive!

11
Phylogeny reconstruction from gene orders
  • Distance-based reconstruction: estimate pairwise
    distances, and apply methods like
    Neighbor-Joining or Weighbor
  • Maximum Parsimony: find the tree with the minimum
    length (inversions, transpositions, or other edit
    distances)
  • Maximum Likelihood: find the tree and parameters of
    evolution most likely to generate the observed
    data
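
To make the distance-based route concrete, here is a minimal sketch of the first step only: building a pairwise distance matrix from gene orders. It uses the plain breakpoint distance on signed circular genomes as a stand-in; real analyses typically use corrected distance estimators, and the tree-building step (Neighbor-Joining or Weighbor) is not shown. The genome names and contents are made up for illustration.

```python
from typing import Dict, List, Set, Tuple


def adjacencies(genome: List[int]) -> Set[Tuple[int, int]]:
    """Adjacencies of a signed circular genome, stored in both reading directions."""
    adj = set()
    n = len(genome)
    for i in range(n):
        a, b = genome[i], genome[(i + 1) % n]
        adj.add((a, b))
        adj.add((-b, -a))  # the same adjacency read on the other strand
    return adj


def breakpoint_distance(g1: List[int], g2: List[int]) -> int:
    """Number of adjacencies of g1 that do not occur in g2."""
    adj2 = adjacencies(g2)
    n = len(g1)
    return sum((g1[i], g1[(i + 1) % n]) not in adj2 for i in range(n))


def distance_matrix(genomes: Dict[str, List[int]]) -> Dict[Tuple[str, str], int]:
    """All pairwise breakpoint distances, keyed by (name, name)."""
    names = sorted(genomes)
    return {
        (a, b): breakpoint_distance(genomes[a], genomes[b])
        for a in names
        for b in names
    }


# Hypothetical toy genomes: B and C each differ from A by one inversion.
toy = {
    "A": [1, 2, 3, 4, 5],
    "B": [1, -3, -2, 4, 5],
    "C": [1, 2, 3, -5, -4],
}
matrix = distance_matrix(toy)
for (a, b), d in sorted(matrix.items()):
    print(a, b, d)
```

The resulting matrix is the kind of input a distance-based method consumes; any tree-building routine would be applied to it afterwards.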

12
Maximum Parsimony on Rearranged Genomes (MPRG)
  • The leaves are rearranged genomes.
  • Find the tree that minimizes the total number of
    rearrangement events (e.g., inversion phylogeny
    minimizes the number of inversions)
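
The quantity being minimized can be illustrated by scoring one fixed tree. The sketch below assumes the tree topology and the genomes at internal nodes are already given, treats genomes as signed linear orders for simplicity, and computes exact inversion distances by breadth-first search, which is only feasible for a handful of genes. The real problem also searches over topologies and internal genomes; none of that is attempted here, and all names are hypothetical.

```python
from collections import deque
from typing import Dict, Iterator, List, Tuple

Genome = Tuple[int, ...]


def neighbors(g: Genome) -> Iterator[Genome]:
    """All genomes reachable from g by a single inversion."""
    n = len(g)
    for i in range(n):
        for j in range(i + 1, n + 1):
            yield g[:i] + tuple(-x for x in reversed(g[i:j])) + g[j:]


def inversion_distance(src: Genome, dst: Genome) -> int:
    """Exact inversion distance by BFS (exponential; toy sizes only)."""
    if src == dst:
        return 0
    seen = {src}
    queue = deque([(src, 0)])
    while queue:
        g, d = queue.popleft()
        for h in neighbors(g):
            if h == dst:
                return d + 1
            if h not in seen:
                seen.add(h)
                queue.append((h, d + 1))
    raise ValueError("genomes do not share the same gene set")


def tree_length(edges: List[Tuple[str, str]], genomes: Dict[str, Genome]) -> int:
    """Sum of inversion distances over the edges of a fully labeled tree."""
    return sum(inversion_distance(genomes[u], genomes[v]) for u, v in edges)


# Hypothetical star tree: internal node X joined to leaves A, B, C.
genomes = {
    "X": (1, 2, 3, 4),
    "A": (1, -2, 3, 4),
    "B": (1, 2, -4, -3),
    "C": (-1, 2, 3, 4),
}
edges = [("X", "A"), ("X", "B"), ("X", "C")]
print(tree_length(edges, genomes))
```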

13
Optimization problems for gene order phylogeny
  • Breakpoint phylogeny: find the phylogeny which
    minimizes the total number of breakpoints
    (NP-hard, even to find the median of three
    genomes)
  • Inversion phylogeny: find the phylogeny which
    minimizes the sum of inversion distances on the
    edges (NP-hard, even to find the median of three
    genomes)
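
The median-of-three subproblem mentioned above can be stated in a few lines of code, even though solving it at scale is hard. The sketch below is a brute-force search over every signed circular ordering of the gene set, scoring each candidate by the sum of its breakpoint distances to the three input genomes. It is only an illustration for very small gene sets, since the candidate count grows as 2^n n!, and it is not the method used by any software named in this talk. The example genomes are made up.

```python
from itertools import permutations, product
from typing import Sequence, Set, Tuple


def adjacencies(genome: Sequence[int]) -> Set[Tuple[int, int]]:
    """Adjacencies of a signed circular genome, in both reading directions."""
    adj = set()
    n = len(genome)
    for i in range(n):
        a, b = genome[i], genome[(i + 1) % n]
        adj.add((a, b))
        adj.add((-b, -a))
    return adj


def breakpoint_distance(g1: Sequence[int], g2: Sequence[int]) -> int:
    """Number of adjacencies of g1 that are absent from g2."""
    adj2 = adjacencies(g2)
    n = len(g1)
    return sum((g1[i], g1[(i + 1) % n]) not in adj2 for i in range(n))


def breakpoint_median(a, b, c):
    """Exhaustively find a genome minimizing the summed distance to a, b, c."""
    genes = sorted(abs(x) for x in a)
    best, best_score = None, None
    for order in permutations(genes):
        for signs in product((1, -1), repeat=len(genes)):
            cand = tuple(s * g for s, g in zip(signs, order))
            score = sum(breakpoint_distance(cand, g) for g in (a, b, c))
            if best_score is None or score < best_score:
                best, best_score = cand, score
    return best, best_score


# Hypothetical inputs on 5 genes (3840 candidates are examined).
a = (1, 2, 3, 4, 5)
b = (1, -3, -2, 4, 5)
c = (1, 2, 3, -5, -4)
median, score = breakpoint_median(a, b, c)
print(median, score)
```

Several genomes can tie for the best score, so the returned median depends on enumeration order; only the score is uniquely determined.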

14
Inversion phylogenies
  • When the data are close to saturated, even the
    best distance-based analyses are insufficiently
    accurate. In these cases, our initial
    investigations suggest that the inversion
    phylogeny approach may be superior.
  • Problem: finding the best trees is enormously
    hard, since even the point estimation problem
    is hard (worse than estimating branch lengths in
    ML).

[Figure: tree length over the space of phylogenetic trees, with a local optimum and the global optimum marked]
15
Observations
  • For equal gene content, heuristics for the
    inversion phylogeny problem are extremely
    accurate, even under model conditions in which
    transpositions are dominant.
  • For unequal gene content, the parsimony style
    problems are too computationally intense -- but
    NJ (neighbor joining) with a new distance
    estimator (Moret et al. 2004) works extremely
    well.

16
Software
  • BPAnalysis (Sankoff): open source, restricted to
    the breakpoint phylogeny reconstruction
  • GRAPPA (Moret et al.): open source, restricted to
    single chromosome genomes, but can handle both
    equal and unequal gene content
  • MGR (Pevzner et al.): multiple chromosome,
    limited to equal gene content, performs well if
    the dataset is small (less than 10 genomes)
  • Bayesian analysis by Bret Larget (not yet
    released).

17
[Figure: phylogenetic tree over the taxa Merciera, Wahlenbergia, Triodanis, Legousia, Asyneuma, Trachelium, Symphyandra, Campanula, Codonopsis, Tobacco, Adenophora, Cyananthus, and Platycodon]
The strict consensus of 24 trees, each with an
inversion length of 64. Finished within 40
minutes on a laptop using GRAPPA version 1.8.
18
GRAPPA (Genome Rearrangement Analysis under
Parsimony and other Phylogenetic Algorithms)
  • http://www.cs.unm.edu/~moret/GRAPPA/
  • Heuristics for maximum parsimony style problems
    for equal gene content
  • Fast polynomial time distance-based methods
  • Contributors: U. New Mexico, U. Texas at Austin,
    Università di Bologna, Italy
  • Freely available in source code at this site.
  • Project leader: Bernard Moret (UNM)
    (moret_at_cs.unm.edu)

19
Speeding up MP and ML: DCM3
  • Tandy Warnow
  • Radcliffe Institute
  • The University of Texas at Austin

20
Reconstructing the Tree of Life
Handling large datasets: millions of species
21
Methods for phylogenetic inference
  • Polynomial time methods, mostly based upon
    estimating evolutionary distances
  • Heuristics for hard optimization problems (such
    as maximum parsimony and maximum likelihood)
  • Bayesian methods

22
Main research objectives
  • Determine the best current methods available for
    MP and ML, and then improve upon them
  • Focus on performance within one day, one week, or
    one month, on large real datasets (1K to 20K
    sequences for MP)
  • Final objective is hundreds of thousands (or
    millions) of sequences.

23
Initial results
  • Very large datasets are hard for both MP and ML,
    no matter what software is used
  • Suboptimal solutions to MP yield reasonable
    estimates of the optimal MP trees - but only if
    they are within 0.01% of the optimal MP score
  • Improving upon techniques for searching treespace
    will yield improvements for both MP and ML
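
The accuracy criterion used throughout the rest of the talk can be written as a small calculation. The sketch below reads the .01 threshold as a percentage above the best known MP score, which is an assumption about the slide's notation; the function names and the example scores are hypothetical.

```python
def percent_above_optimal(score: int, best_known: int) -> float:
    """How far an MP score lies above the best known score, in percent."""
    return 100.0 * (score - best_known) / best_known


def is_acceptable(score: int, best_known: int, threshold_percent: float = 0.01) -> bool:
    """True if the score is within the threshold of the best known score."""
    return percent_above_optimal(score, best_known) <= threshold_percent


# Hypothetical scores: with a best known tree length of 100000 steps,
# a tree of 100005 steps passes the threshold, one of 100050 steps does not.
best_known = 100_000
for score in (100_005, 100_050):
    gap = percent_above_optimal(score, best_known)
    print(score, f"{gap:.3f}%", is_acceptable(score, best_known))
```

Because MP scores on large datasets run to many thousands of steps, a threshold this tight still corresponds to only a handful of extra steps.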

24
Datasets
Obtained from various researchers and online
databases
  • 1322 LSU rRNA of all organisms
  • 2000 Eukaryotic rRNA
  • 2594 rbcL DNA
  • 4583 Actinobacteria 16S rRNA
  • 6590 SSU rRNA of all Eukaryotes
  • 7180 three-domain rRNA
  • 7322 Firmicutes bacteria 16S rRNA
  • 8506 three-domain2org rRNA
  • 11361 SSU rRNA of all Bacteria
  • 13921 Proteobacteria 16S rRNA

25
Problems with current techniques for MP
[Figure: average MP scores above optimal of the best methods at 24 hours, across 10 datasets]
Best current techniques fail to reach 0.01% of
optimal at the end of 24 hours on large datasets.
26
Problems with current techniques for MP
The best current method (default TNT) fails to
reach acceptable levels of accuracy (0.01% of
optimal) within 24 hours on many large datasets.
Evidence suggests that this level will not be
reached for weeks or months (or more) of further
analysis.
[Figure: performance of TNT with time]
27
Observations
  • The best methods cannot get acceptably good
    solutions within 24 hours on most of these large
    datasets.
  • Datasets of these sizes may need months (or
    years) of further analysis to reach reasonable
    solutions.
  • Apparent convergence can be misleading.

30
Disk-Covering Methods (DCMs)
  • DCMs are divide-and-conquer methods that our
    group has developed for use in phylogeny
    reconstruction
  • DCM2 was designed for speeding up maximum
    parsimony and maximum likelihood heuristics. DCM2
    was good enough for PAUP.
  • DCM3 is a recent improvement over DCM2 which
    enables iteration (and gives smaller subproblems)
    - and is good enough for TNT.

31
Boosting MP heuristics
  • DCMs boost the performance of phylogeny
    reconstruction methods.

[Figure: a DCM applied to a base method M yields the boosted method DCM-M]
32
DCM3 technique for speeding up MP searches
33
Iterative-DCM3
[Figure: iteration loop in which a tree T is passed to DCM3 with the base method, yielding a new tree T]
34
New DCMs
  • DCM3
      1. Compute subproblems using DCM3 decomposition
      2. Apply base method to each subproblem to yield
         subtrees
      3. Merge subtrees using the Strict Consensus Merger
         technique
      4. Randomly refine to make it binary
  • Recursive-DCM3
  • Iterative DCM3
      1. Compute a DCM3 tree
      2. Perform local search and go to step 1
  • Recursive-Iterative DCM3
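
The control flow of the DCM3 and Iterative DCM3 steps listed above can be sketched as a pair of higher-order functions. This shows only the order in which the steps run: the decomposition, base method, Strict Consensus Merger, random refinement, and local search are all passed in as callables, and the stand-ins used in the demo are trivial placeholders that log their calls, not phylogenetic code. Every name here is hypothetical, and the recursive variants are not shown.

```python
from typing import Any, Callable, List


def dcm3_step(
    tree: Any,
    decompose: Callable[[Any], List[Any]],
    base_method: Callable[[Any], Any],
    merge: Callable[[List[Any]], Any],
    refine: Callable[[Any], Any],
) -> Any:
    """One DCM3 pass: decompose, solve subproblems, merge, refine to binary."""
    subproblems = decompose(tree)
    subtrees = [base_method(sub) for sub in subproblems]
    merged = merge(subtrees)
    return refine(merged)


def iterative_dcm3(
    start_tree: Any,
    iterations: int,
    decompose: Callable[[Any], List[Any]],
    base_method: Callable[[Any], Any],
    merge: Callable[[List[Any]], Any],
    refine: Callable[[Any], Any],
    local_search: Callable[[Any], Any],
) -> Any:
    """Repeat: compute a DCM3 tree, run local search, feed the result back in."""
    tree = start_tree
    for _ in range(iterations):
        tree = dcm3_step(tree, decompose, base_method, merge, refine)
        tree = local_search(tree)
    return tree


# Placeholder steps that only record the order in which they are called.
log: List[str] = []


def make_step(name: str, result: Callable[[Any], Any]) -> Callable[[Any], Any]:
    def step(arg: Any) -> Any:
        log.append(name)
        return result(arg)
    return step


result = iterative_dcm3(
    "T",
    2,
    decompose=make_step("decompose", lambda t: [t, t]),
    base_method=make_step("base_method", lambda s: s),
    merge=make_step("merge", lambda subtrees: subtrees[0]),
    refine=make_step("refine", lambda t: t),
    local_search=make_step("local_search", lambda t: t),
)
print(result, log)
```

Passing the steps in as parameters mirrors the idea that a DCM wraps an arbitrary base method; a fixed iteration count is used here in place of a time budget.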

35
Boosting MP heuristics
  • We examine DCMs using DCM2 and DCM3, and using
    recursion and/or iteration.

[Figure: a DCM applied to a base method M yields the boosted method DCM-M]
36
Performance Study
  • How well do these boosted versions of the best
    MP heuristics perform, compared to the best MP
    heuristics?
  • We examine performance with respect to optimal
    MP scores (best found so far, using any method)
    for a number of very large datasets, over 24
    hours.
  • The benchmark MP heuristic is the default TNT.

38
Rec-I-DCM3 significantly improves performance
[Figure: comparison of TNT to Rec-I-DCM3(TNT) on one large dataset, with curves labeled "Current best techniques" and "DCM boosted version of best techniques"]
39
Rec-I-DCM3(TNT) vs. TNT (comparison of scores at
24 hours)
Base method is the default TNT technique, the
current best method for MP. Rec-I-DCM3
significantly improves upon the unboosted TNT by
returning trees which are at most 0.01% above
optimal on most datasets.
40
Summary
  • Rec-I-DCM3 is a powerful technique for escaping
    local optima, and boosts the performance of the
    best heuristics for solving MP
  • The improvement increases with the difficulty of
    the dataset - Rec-I-DCM3(TNT) is 50 times faster
    than TNT on our hardest datasets, but we expect
    even bigger speedups in our next version
  • DCMs also boost the performance of Maximum
    Likelihood heuristics (not shown)

41
Acknowledgements
  • Collaborators: Bernard Moret (UNM), Usman Roshan
    (UT-Austin), and Tiffani Williams (UNM)
  • Funding: NSF, The David and Lucile Packard
    Foundation, The Radcliffe Institute for Advanced
    Study, The Institute for Cellular and Molecular
    Biology at UT-Austin, and The Program in
    Evolutionary Dynamics at Harvard University
  • Software will be part of the CIPRES Project's
    first distribution - see http://www.phylo.org

42
  • Cyber-Infrastructure for Phylogenetic RESearch
    (http://www.phylo.org)
  • Main research: large-scale phylogenetics,
    reticulate evolution, gene order phylogeny, and
    databases
  • Funded by an $11.6M ITR grant from NSF
  • 40 biologists, computer scientists, and
    mathematicians collaborating on the project