Provided by: davida204
Learn more at: http://ftp.cse.sc.edu
1
High-Performance Computing for Reconstructing
Phylogenies from Gene-Order Data
  • David A. Bader
  • Electrical & Computer Engineering
  • University of New Mexico
  • dbader_at_eece.unm.edu
  • http://hpc.eece.unm.edu/

2
Acknowledgment of Support
  • National Science Foundation
  • CAREER: High-Performance Algorithms for
    Scientific Applications (00-93039)
  • ITR: Algorithms for Irregular Discrete
    Computations on SMPs (00-81404)
  • DEB: Ecosystem Studies: Self-Organization of
    Semi-Arid Landscapes: Test of Optimality
    Principles (99-10123)
  • ITR/AP: Reconstructing Complex Evolutionary
    Histories (01-21377)
  • DEB: Comparative Chloroplast Genomics: Integrating
    Computational Methods, Molecular Evolution, and
    Phylogeny (01-20709)
  • ITR/AP(DEB): Computing Optimal Phylogenetic Trees
    under Genome Rearrangement Metrics (01-13095)
  • PACI: NCSA/Alliance, NPACI/SDSC, PSC
  • Sun Microsystems

3
Algorithms that Scale from the Blade to the Fire
4
Commercial Aspects of Phylogeny Reconstruction
  • Identification of microorganisms
  • public health entomology
  • sequence motifs for groups are patented
  • example: differentiating tuberculosis strains
  • Dynamics of microbial communities
  • pesticide exposure: identify and quantify
    microbes in soil
  • Vaccine development
  • variants of a cell wall or protein coat component
  • porcine reproductive and respiratory syndrome
    virus isolates from US and Europe were separate
    populations
  • HIV studied through DNA markers
  • Biochemical pathways
  • antibacterials and herbicides
  • Glyphosate (Roundup®, Rodeo®, and Pondmaster®):
    first herbicide targeted at a pathway not present
    in mammals
  • phylogenetic distribution of a pathway is studied
    by the pharmaceutical industry before a drug is
    developed
  • Pharmaceutical industry
  • predicting the natural ligands for cell surface
    receptors which are potential drug targets
  • a single family, G protein-coupled receptors
    (GPCRs), contains 40% of the targets of most
    pharmaceutical companies

5
GRAPPA: Genome Rearrangements Analysis
  • Genome Rearrangements Analysis under Parsimony
    and other Phylogenetic Algorithms
  • http://www.cs.unm.edu/moret/GRAPPA/
  • Open-source
  • already used by other computational phylogeny
    groups: Caprara, Pevzner, LANL, FBI, pharmaceutical companies
  • Gene-order Phylogeny Reconstruction
  • Breakpoint Median
  • Inversion Median
  • over one-million fold speedup from previous codes
  • Parallelism
  • Scales linearly with the number of processors
  • Developed using Sun Forte C

6
Molecular Data for Phylogeny
  • simple DNA sequence: nucleotides
  • low-level functionality: amino acids, etc.
  • genomic level: genes
  • (next is functional level: proteomics, etc.)
  • Biologists now have full gene sequences for
    many single-chromosome organisms and organelles
    (e.g., mitochondria, chloroplasts) and for more
    and more larger organisms

7
Gene Order Phylogeny
  • Many organelles appear to evolve mostly through
    processes that simply rearrange gene ordering
    (inversion, transposition) and perhaps alter gene
    content (duplication, loss).
  • Chloroplasts have a single, typically circular,
    chromosome and appear to evolve mostly through
    inversion

Inversion: the sequence of genes i, i+1, ..., j is inverted
and every gene is flipped.
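
A minimal sketch of the inversion just described, assuming a genome is stored as a Python list of signed integers whose sign encodes gene orientation. The name `invert` and the 0-based inclusive indices are illustrative choices, not taken from the slides or from GRAPPA.

```python
def invert(genome, i, j):
    """Return a copy of genome with positions i..j (0-based, inclusive)
    reversed in order and flipped in sign."""
    segment = [-g for g in reversed(genome[i:j + 1])]
    return genome[:i] + segment + genome[j + 1:]


if __name__ == "__main__":
    g = [1, 2, 3, 4, 5]
    # Invert the block holding genes 2, 3, 4.
    print(invert(g, 1, 3))
    # Applying the same inversion twice restores the original order.
    print(invert(invert(g, 1, 3), 1, 3) == g)
```

An inversion is its own inverse, which is why the second call undoes the first.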
8
Gene Order Phylogeny (contd)
  • The real problem
  • Reconstruct the true tree, identify the true
    ancestral genomes, and recover on each edge the
    true sequence of evolutionary changes
  • The optimization problem (parsimony)
  • Reconstruct a tree and ancestral genomes so as to
    minimize the sum, over all tree edges, of the
    inferred evolutionary distance along each edge
  • The surrogate problem
  • Do the optimization problem with a measure of
    inferred evolutionary distance that lends itself
    to analysis

9
Breakpoint Analysis: A Surrogate for Gene Order
  • Breakpoint
  • an adjacent pair of genes present in one genome,
    but absent in the other
  • Breakpoint distance
  • the total number of breakpoints between two
    genomes (a true metric, similar to Hamming
    distance)
  • Breakpoint phylogeny
  • the tree and ancestral genomes that minimize the
    sum, over all edges of the tree, of the
    breakpoint distances
  • Naturally, it is an NP-hard problem, even with
    just 3 leaves.
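
A minimal sketch of the breakpoint distance defined above, assuming circular genomes over the same gene set, each stored as a list of signed integers. It treats the adjacency "a then b" as identical to "-b then -a" (the same pair read on the opposite strand). The function names are illustrative, not GRAPPA's.

```python
def adjacency_set(genome):
    """Set of adjacencies of a circular signed genome, in canonical form."""
    n = len(genome)
    adj = set()
    for i in range(n):
        x, y = genome[i], genome[(i + 1) % n]  # wrap around: circular genome
        # (x, y) and (-y, -x) are the same adjacency read in both directions.
        adj.add(min((x, y), (-y, -x)))
    return adj


def breakpoint_distance(g1, g2):
    """Number of adjacencies present in g1 but absent from g2."""
    return len(adjacency_set(g1) - adjacency_set(g2))


if __name__ == "__main__":
    a = [1, 2, 3, 4, 5]
    b = [1, -3, -2, 4, 5]  # a with genes 2..3 inverted
    print(breakpoint_distance(a, b))
```

In this example a single inversion breaks exactly two adjacencies, one at each end of the inverted block.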

10
Breakpoint Analysis (Sankoff & Blanchette 1998)
  • For each tree topology do
  • somehow assign initial genomes to the internal
    nodes
  • repeat
  • for each internal node do
  • compute a new genome that minimizes the distances
    to its three neighbors
  • replace old genome by new if distance is reduced
  • until no change
  • Sankoff & Blanchette implemented this in a C++
    package

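Slide annotations:
  • tree topologies: (2n-5)!! = (2n-5) × (2n-7) × ... × 5 × 3 trees
  • repeat loop: unknown iterative heuristic
  • computing a new genome from three neighbors: NP-hard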
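
A short worked example of the (2n-5)!! tree count, to show how quickly the outer loop grows. The function name is an illustrative choice; the formula counts unrooted binary trees on n labeled leaves.

```python
def num_tree_topologies(n):
    """(2n-5)!! = 3 * 5 * ... * (2n-5): unrooted binary trees on n >= 3 leaves."""
    if n < 3:
        raise ValueError("need at least 3 leaves")
    count = 1
    for k in range(3, 2 * n - 4, 2):  # odd factors 3, 5, ..., 2n-5
        count *= k
    return count


if __name__ == "__main__":
    for n in (4, 5, 10, 13):
        print(n, num_tree_topologies(n))
```

For n = 13 genomes this gives 13,749,310,575 topologies, which is why the per-tree processing rate matters so much.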
11
Algorithm Engineering Works!
  • We reimplemented everything
  • the original code is too slow and not as flexible
    as we wanted.
  • Our main dataset is a collection of chloroplast
    data from the flowering plant family
    Campanulaceae (bluebells)
  • 13 genomes of 105 gene segments each
  • On our old workstation
  • BPAnalysis processes 10-12 trees/minute
  • Our implementation processes over 50,000
    trees/minute
  • Speedup ratio is over 5,000!!
  • On synthetic datasets, we see speedups from 300
    to over 50,000

12
So What did we do?!
  • Absolutely no high-level algorithmic changes (~10x)
    (Well, so I lied just a little bit)
  • Three low-level algorithmic changes (~10x)
  • better bounding
  • strong upper bound initialization
  • condensing
  • Completely different data representation (~10x)
  • Two low-level coding changes (~6x)
  • all memory is pre-allocated
  • some loops are hand-unrolled
  • Written in C instead of C++ (? speedup; convenience)

13
One high-level algorithmic change
  • (Ok, so I lied a little)
  • Avoid labeling the tree if possible
  • Use current best score as an upper bound.
  • Compute lower bound; prune tree away if lower
    bound > upper bound
  • Lower bound
  • Get circular ordering of leaves, x1, x2, ..., xn
  • Compute D = d(x1,x2) + d(x2,x3) + ... + d(xn,x1)
  • Then ½ D is a lower bound because
  • d(.) obeys the triangle inequality
  • every tree edge is used twice in a tree-based
    version of D
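
A minimal sketch of this circular lower bound and the pruning test. It assumes the leaves are already listed in the circular order induced by the tree and that `d` is any distance obeying the triangle inequality. The function names and the toy Hamming distance are illustrative stand-ins, not GRAPPA code.

```python
def circular_lower_bound(leaves, d):
    """Half the sum of distances around the circular ordering of the leaves."""
    n = len(leaves)
    total = sum(d(leaves[i], leaves[(i + 1) % n]) for i in range(n))
    return total / 2


def can_prune(leaves, d, best_score):
    """True if the tree cannot beat the current best score (the upper bound)."""
    return circular_lower_bound(leaves, d) > best_score


def hamming(a, b):
    """Toy metric used only for this demonstration."""
    return sum(x != y for x, y in zip(a, b))


if __name__ == "__main__":
    leaves = ["AAAA", "AABB", "BBBB"]
    print(circular_lower_bound(leaves, hamming))
    print(can_prune(leaves, hamming, best_score=3))
```

The bound needs only n distance computations, so a tree can be rejected without labeling any internal node.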

14
[Figure: a tree and its tree version (paths), with the paths
d(a,b), d(b,c), d(c,d), d(d,e), d(e,a) marked]
D = d(a,b) + d(b,c) + d(c,d) + d(d,e) + d(e,a)
(Same trick as in the "twice around the tree"
approximation for the TSP with triangle
inequality.)
15
Algorithmic Changes (~10x)
  • (1) Better bounding: skip edges that would cause
    degree 3 or a premature cycle
  • (2) Condensing: whenever the same gene subsequence
    appears in all genomes, it can be condensed into
    a single superfragment
  • done as static processing and on the fly before
    each TSP
  • (3) Initializing the new median with the best of the
    old one and its three neighbors.
  • Condensing is very effective on real data within
    families, but easily defeated by large
    evolutionary distances.
  • (1) and (3) cause over half of the TSP instances
    (for computing the median-of-three of updated
    internal nodes) to be pruned away instantly.
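
A minimal sketch of the condensing idea, under simplifying assumptions: genomes are linear lists of signed integers over the same gene set, and the output is only the list of blocks read off the first genome. How GRAPPA condenses in practice is not shown on the slides, and all names here are illustrative.

```python
def adjacencies(genome):
    """Canonical adjacencies of a linear signed genome."""
    # "x then y" and "-y then -x" are the same adjacency on opposite strands.
    return {min((x, y), (-y, -x)) for x, y in zip(genome, genome[1:])}


def conserved_blocks(genomes):
    """Split the first genome into maximal runs whose adjacencies
    are shared by every genome; each run could become one superfragment."""
    shared = set.intersection(*(adjacencies(g) for g in genomes))
    ref = genomes[0]
    blocks = [[ref[0]]]
    for x, y in zip(ref, ref[1:]):
        if min((x, y), (-y, -x)) in shared:
            blocks[-1].append(y)   # adjacency conserved: extend current block
        else:
            blocks.append([y])     # adjacency broken somewhere: start new block
    return blocks


if __name__ == "__main__":
    genomes = [
        [1, 2, 3, 4, 5],
        [1, 2, 3, -5, -4],
        [-3, -2, -1, 4, 5],
    ]
    print(conserved_blocks(genomes))
```

Here five genes condense to two superfragments, so the median problem shrinks accordingly. With large evolutionary distances few adjacencies are shared and little condensing is possible.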

16
Data Representation (~10x)
  • No distance matrix for the reduction of
    median-of-three to the Traveling Salesperson
    Problem (TSP): at most 4n edges can be of
    interest; the others are treated as an
    undifferentiated pool. The adjacency lists have
    length ≤ 4.
  • thus, linear time at each step and reduced
    storage.
  • Backtracking search has a small list of edges and
    only searches among edges of cost 1 and 2 (-∞ and
    0 are always included): still NP-hard, but often
    easy
  • When search runs out of edges, tour is completed
    in linear time from the pool of edges of cost 3
  • Many auxiliary arrays (à la Fortran!) to carry
    information on flags, degrees, other end of
    chains, ...
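
A minimal sketch of why no distance matrix is needed, under stated assumptions: each gene g contributes two cities, +g and -g (its two ends), and reading gene a followed by gene b in a circular genome joins city a to city -b. An edge costs the number of genomes that lack it. This city convention and all names are assumptions of the sketch, not taken from the slides.

```python
from collections import Counter, defaultdict


def median_tsp_edges(genomes):
    """Sparse edge costs: only adjacencies seen in some genome are stored."""
    seen = Counter()
    for genome in genomes:
        n = len(genome)
        for i in range(n):
            a, b = genome[i], genome[(i + 1) % n]  # circular genome
            seen[frozenset((a, -b))] += 1
    # Cost = number of genomes in which the adjacency is missing.
    return {edge: len(genomes) - k for edge, k in seen.items()}


def adjacency_lists(edges):
    """Per-city neighbor lists built from the sparse edge set."""
    lists = defaultdict(list)
    for edge, cost in edges.items():
        u, v = tuple(edge)
        lists[u].append((v, cost))
        lists[v].append((u, cost))
    return dict(lists)


if __name__ == "__main__":
    genomes = [[1, 2, 3], [1, 2, 3], [1, -2, 3]]
    edges = median_tsp_edges(genomes)
    for city, nbrs in sorted(adjacency_lists(edges).items()):
        print(city, nbrs)
```

Each city gains at most one neighbor per genome, so with three genomes the lists stay short; adding the one edge that joins the two ends of each gene gives the length-4 bound quoted on the slide. Every edge not stored belongs to the undifferentiated pool.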

17
Low-Level Coding Changes (~6x)
  • All storage allocated at start, with large numbers of
    pointers passed to subroutines (no globals, to
    allow parallel execution).
  • Avoids malloc/free overhead; improves cache
    locality
  • Avoid recomputations.
  • Use local variables for intermediate pointers
  • Hand unroll loops on adjacencies to preserve
    locality (and to avoid mod operations with
    circular genomes)
  • Speeds up addressing (never dereference!);
    improves cache locality
  • BPAnalysis uses 65MB and has a real memory
    footprint of 12MB on our real data
  • Our reimplementation uses 1.6MB with a footprint
    of 0.6MB

18
And How did we do it?
  • 3 strategies: Profile, Profile, Profile
  • (and use your engineering sense/nose/...)
  • Sun Forte 6 Analyzer
  • We began with 4 main culprits
  • preparing adjacency lists for the TSP
  • computing breakpoint distances
  • computing lower bounds in TSP
  • backtracking in TSP
  • Over 10-12 major iterations, each of which
    yielded a 1.5- to 2-fold speed-up, these four
    switched places over and over.

19
Profiling
  • And our final tally (still on the Campanulaceae
    dataset) is
  • 30% backtracking (excl. LB)
  • 20% preparing adjacency lists
  • 20% condensing & expanding
  • 15% computing LB
  • 8% computing distances
  • 7% miscellaneous overhead
  • (no obvious culprits left)

20
High-Performance Computing Techniques
  • Availability of hundreds of powerful processors
  • Standard parallel programming interfaces (Sun
    HPC)
  • Message passing interface (MPI)
  • OpenMP or POSIX threads
  • Algorithmic libraries for SMP clusters
  • SIMPLE
  • Goal: make efficient use of parallelism for
  • exploring candidate tree topologies
  • sharing of improved bounds

21
Parallelization of the Phylogeny Algorithm
  • Enumerating tree topologies is pleasantly
    parallel and allows multiple processors to
    independently search the tree space with little
    or no overhead
  • Improved bounds can be broadcast to other
    processors without interrupting work
  • Load is evenly balanced when trees are cyclically
    assigned (e.g. in a round-robin fashion) to the
    processors
  • Linear speedup
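
A minimal sketch of the cyclic (round-robin) assignment, assuming trees are identified by their index in the enumeration order. The function name is illustrative; real code would run each share on its own processor (for example via MPI or threads) and exchange improved bounds as they are found.

```python
def round_robin(num_trees, num_procs):
    """Indices of the trees handled by each processor under cyclic assignment."""
    return [range(p, num_trees, num_procs) for p in range(num_procs)]


if __name__ == "__main__":
    shares = round_robin(num_trees=10, num_procs=4)
    for p, share in enumerate(shares):
        print(p, list(share))
```

Share sizes differ by at most one tree, and no communication is needed to decide who processes which tree.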

22
Final Remarks
  • Our reimplementation led to numerous extensions
    as well as to new theoretical results
  • GRAPPA has been extended to inversion phylogeny,
    with linear-time algorithms for inversion
    distance and a new approach to exact inversion
    median-of-three.
  • Better bounding in the next version of GRAPPA
    yields two more orders of magnitude speedup.
  • These insights and improvements are made possible
    by mature development tools (Forte)
  • Algorithmic engineering techniques are widely
    applicable
  • We may not always get 6 orders of magnitude, but
    3-4 orders should be nearly routine with most
    codes. (We are starting work on TBR and exact
    parsimony solvers.)

23
Final Remarks (contd)
  • High-performance implementations enable
  • better approximations for difficult problems (MP,
    ML)
  • true optimization for larger instances
  • realistic data exploration (e.g., testing
    evolutionary scenarios, assessing answers
    obtained through other means, etc.)
  • Our analysis of the Campanulaceae dataset
    confirmed the conjecture of Robert Jansen et al.
    that inversion is the principal process of
    genome evolution in cpDNA for this group.

24
Work-In-Progress and Future Work
  • Tree enumeration using circular ordering
  • Handle unequal gene content and duplicate genes
    using exemplars
  • Parallel branch and bound techniques (optimized
    for Sun HPC Servers) for searching tree space
  • Improved SPR and TBR techniques (local searches
    around good trees)
  • Exact Algorithm for Maximum Parsimony

25
Recent publications (2001)
  • A New Implementation and Detailed Study of
    Breakpoint Analysis, B.M.E. Moret, S. Wyman, D.A.
    Bader, T. Warnow, M. Yan, Sixth Pacific Symposium
    on Biocomputing 2001, pp. 583-594, Hawaii,
    January 2001.
  • High-Performance Algorithm Engineering for
    Gene-Order Phylogenies, D.A. Bader, B. M.E.
    Moret, T. Warnow, S.K. Wyman, and M. Yan, DIMACS
    Workshop on Whole Genome Comparison, DIMACS
    Center, Rutgers University, Piscataway, NJ, March
    2001.
  • Variation in vegetation growth rates:
    Implications for the evolution of semi-arid
    landscapes, C. Restrepo, B.T. Milne, D. Bader, W.
    Pockman, and A. Kerkhoff, 16th Annual Symposium
    of the US-International Association of Landscape
    Ecology, Arizona State University, Tempe, April
    2001.
  • High-Performance Algorithm Engineering for
    Computational Phylogeny, B. M.E. Moret, D.A.
    Bader, and T. Warnow, 2001 International
    Conference on Computational Science, San
    Francisco, CA, May 2001.
  • Cluster Computing Applications, David A. Bader
    and Robert Pennington, The International Journal
    of High Performance Computing, 15(2):181-185, May
    2001.
  • New approaches for using gene order data in
    phylogeny reconstruction, R.K. Jansen, D.A.
    Bader, B. M. E. Moret, L.A. Raubeson, L.-S. Wang,
    T. Warnow, and S. Wyman. Botany 2001,
    Albuquerque, NM, August 2001.
  • GRAPPA: a high-performance computational tool for
    phylogeny reconstruction from gene-order data, B.
    M.E. Moret, D.A. Bader, T. Warnow, S.K. Wyman,
    and M. Yan. Botany 2001, Albuquerque, NM, August
    2001.
  • Inferring phylogenies of photosynthetic organisms
    from chloroplast gene orders, L.A. Raubeson, D.A.
    Bader, B. M.E. Moret, L.-S. Wang, T. Warnow, and
    S.K. Wyman. Botany 2001, Albuquerque, NM, August
    2001.
  • Industrial Applications of High-Performance
    Computing for Phylogeny Reconstruction, D.A.
    Bader, B. M.E. Moret, and L. Vawter, SPIE ITCom:
    Commercial Applications for High-Performance
    Computing, Denver, CO, SPIE Vol. 4528, pp.
    159-168, August 2001.
  • Using PRAM Algorithms on a Uniform-Memory-Access
    Shared-Memory Architecture, D.A. Bader, A.
    Illendula, B. M.E. Moret, and N.R.
    Weisse-Bernstein, Fifth Workshop on Algorithm
    Engineering, Springer-Verlag LNCS 2141, 129-144,
    Aarhus, Denmark, August 2001.
  • A Linear-Time Algorithm for Computing Inversion
    Distance Between Two Signed Permutations with an
    Experimental Study, D.A. Bader, B. M.E. Moret,
    and M. Yan, Journal of Computational Biology,
    8(5):483-491, October 2001.