Phylogenetic Networks - PowerPoint PPT Presentation

1
Phylogenetic Networks
  • Anna Tholse
  • MS Thesis Defense
  • Department of Computer Science
  • July 10, 2003

2
Outline
  • Background
  • Network generation
  • Distance measure for networks
  • Network reconstruction
  • Conclusion

3
Phylogenetic Trees
  • Mathematical model for representing evolutionary
    histories among taxa
  • Rooted or unrooted
  • Leaves: taxa for which we have sampled data
  • Internal nodes: hypothetical ancestors

4
Model Trees
  • Not enough biological datasets exist
  • Algorithms for simulating the true phylogeny have
    been developed
  • Underlying model of topology:
  • uniform random: all topologies are equally likely
  • birth-death: well-balanced topologies,
    biologically meaningful

5
Sequence Evolution
  • The model tree itself does not provide us with
    all the information; we also need sequence data
  • Mutational changes on the sequence (nucleotide
    substitutions, insertions and deletions, and
    recombination)
  • Seq-gen (Rambaut and Grassly, 1997)

6
Distance Measure
  • Assess the performance of a reconstruction method
    by computing the distance between the inferred
    phylogeny and the model phylogeny
  • Robinson-Foulds
  • Every edge e in a leaf-labeled tree T defines a
    bipartition $b_e$ of the leaves
  • T is encoded by $C(T) = \{ b_e : e \in E(T) \}$
  • $n-3$ internal edges in an unrooted binary tree
    on n leaves

7
Robinson-Foulds Cont.
  • False Positive rate: $FP = |C(T_2) \setminus C(T_1)| / (n-3)$
  • False Negative rate: $FN = |C(T_1) \setminus C(T_2)| / (n-3)$
  • RF value: $RF = (FP + FN) / 2$
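
The rates above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: trees are written as nested tuples of leaf names, and the helper names (`leaf_set`, `bipartitions`, `rf_rates`) are hypothetical, not part of any tool mentioned in the slides.

```python
def leaf_set(tree):
    """Return the set of leaf names in a nested-tuple tree."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaf_set(child) for child in tree))


def bipartitions(tree):
    """Collect the non-trivial bipartitions C(T) induced by internal edges."""
    all_leaves = leaf_set(tree)
    found = set()

    def visit(node):
        if isinstance(node, str):
            return frozenset([node])
        below = frozenset().union(*(visit(child) for child in node))
        other = all_leaves - below
        # Keep only splits with at least two leaves on each side.
        if len(below) >= 2 and len(other) >= 2:
            found.add(frozenset([below, other]))
        return below

    visit(tree)
    return found


def rf_rates(c1, c2, n):
    """Return (FP, FN, RF) for bipartition sets c1 = C(T1), c2 = C(T2)."""
    fp = len(c2 - c1) / (n - 3)
    fn = len(c1 - c2) / (n - 3)
    return fp, fn, (fp + fn) / 2


# Two hypothetical 5-taxon trees that differ in one internal edge.
t1 = ((("A", "B"), "C"), ("D", "E"))
t2 = ((("A", "C"), "B"), ("D", "E"))
print(rf_rates(bipartitions(t1), bipartitions(t2), 5))  # (0.5, 0.5, 0.5)
```

Each example tree has n - 3 = 2 internal edges, and exactly one of them is shared, which gives the rates of 0.5.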

8
Reconstruction Methods
  • Maximum parsimony: optimization method
  • Produces the tree (or trees) that need the
    fewest evolutionary changes between sequences
  • Maximum likelihood: optimization method
  • Produces the tree (or trees) most likely
    to give rise to the given sequences
  • Neighbor-joining: distance method
  • Produces the tree (or trees) that minimize the
    total branch length

9
Simulations
  • Important for testing the performance of
    phylogeny reconstruction methods
  • Can generate test sets in arbitrarily large numbers
    with different settings
  • Parameter space is large
  • Test a large range of parameters and do many runs
    for each setting to estimate the variance

10
Simulation Flow
  • Create model topology
  • Evolve sequences on the topology
  • Feed resulting leaf sequences to the studied
    reconstruction method
  • Compute distance between inferred phylogeny and
    model phylogeny

11
Outline
  • Background
  • Network generation
  • Distance measure for networks
  • Network reconstruction
  • Conclusion

12
Related Work
  • SplitsTree (Huson, 1998) and NeighborNet (Bryant
    and Moulton, 2002). Representing incompatible
    splits.
  • Ancestral Recombination Graphs (Hudson, 1983 and
    Griffiths and Marjoram, 1996). Extension of the
    coalescent model.
  • Lateral Gene Transfer (Hallett et al. 2001,
    2003). Detection and representation.

13
Non-treelike Evolutionary Events
  • Not all evolutionary events can be captured by a
    tree
  • Hybridization: two lineages (edges) combine and
    create a new lineage (edge)
  • Lateral gene transfer: genetic material from one
    lineage (edge) is transferred to another lineage
  • Definition: homolog, a member of a chromosome
    pair

14
Hybridization
  • A network (a) and its two induced trees (b,c)
  • Different kinds of hybridization
  • Diploid: same number of homologs as its 2 parents
  • Polyploid: double the number of homologs of its
    2 parents
  • Auto-polyploid: double the number of homologs of
    its parent lineage

15
Network Representation
  • Rooted directed acyclic graphs (DAGs)
  • Three kinds of nodes
  • Root node: indegree 0 and outdegree 2
  • Tree node:
  • indegree 1 and outdegree 2 (internal node)
  • indegree 1 and outdegree 0 (leaf node)
  • Network node:
  • indegree 2 and outdegree 1 (diploid or
    polyploid)
  • indegree 1 and outdegree 1 (auto-polyploid)
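
The degree rules above translate directly into a lookup table. The sketch below assumes the network is given as a list of directed `(parent, child)` edges; the function name `classify_nodes` and the example network are hypothetical.

```python
from collections import defaultdict

# (indegree, outdegree) -> kind of node, as listed on the slide.
NODE_KINDS = {
    (0, 2): "root",
    (1, 2): "tree (internal)",
    (1, 0): "tree (leaf)",
    (2, 1): "network (diploid or polyploid)",
    (1, 1): "network (auto-polyploid)",
}


def classify_nodes(edges):
    """Map every node of a DAG, given as (u, v) edges, to its kind."""
    indeg, outdeg = defaultdict(int), defaultdict(int)
    nodes = set()
    for u, v in edges:
        outdeg[u] += 1
        indeg[v] += 1
        nodes.update((u, v))
    # Any degree pair outside the table is flagged rather than guessed.
    return {x: NODE_KINDS.get((indeg[x], outdeg[x]), "invalid") for x in nodes}


# Hypothetical network: lineages a and b hybridize into h, which has leaf H.
example_edges = [("r", "a"), ("r", "b"), ("a", "A"), ("a", "h"),
                 ("b", "h"), ("b", "B"), ("h", "H")]
kinds = classify_nodes(example_edges)
print(kinds["r"], "|", kinds["h"], "|", kinds["H"])
```

Nodes whose degrees match none of the listed cases are reported as "invalid", which makes the function usable as a quick sanity check on a generated network.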

16
Network Representation Cont.
  • Figure legend: tree node, network node, tree edge,
    network edge (drawn as ---)
  • Two kinds of directed edges:
  • e = (u,v) is a tree edge iff v is a tree node
  • e = (u,v) is a network edge iff v is a network node
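
A minimal sketch of the edge rule, assuming a network node is recognized by the degree pairs (indegree 2, outdegree 1) or (indegree 1, outdegree 1). The function name `classify_edges` and the example network are hypothetical.

```python
from collections import Counter


def classify_edges(edges):
    """Label each directed edge (u, v) as 'network' or 'tree' by its head v."""
    indeg = Counter(v for _, v in edges)
    outdeg = Counter(u for u, _ in edges)

    def is_network_node(x):
        # Assumption: network nodes have degrees (2, 1) or (1, 1).
        return (indeg[x], outdeg[x]) in {(2, 1), (1, 1)}

    return {(u, v): "network" if is_network_node(v) else "tree"
            for u, v in edges}


# Hypothetical network with one hybrid node h.
net_edges = [("r", "a"), ("r", "b"), ("a", "A"), ("a", "h"),
             ("b", "h"), ("b", "B"), ("h", "H")]
edge_kinds = classify_edges(net_edges)
print(edge_kinds[("a", "h")], edge_kinds[("h", "H")])  # network tree
```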

17
Evolutionary Events
  • Extinction: A new node u is created at the end of
    a lineage, no new lineage is started from u
  • Speciation: A new node u is created at the end of
    a lineage, and two new lineages are started from
    u
  • Hybridization: A new node u is created
  • when two lineages combine (diploid or polyploid)
  • when one lineage creates u and the new lineage
    from u has double the number of homologs
    (auto-polyploid)

18
Network Generation I
  • Start with one node (the root) and two sequences
    (the homologs); set up an initial speciation that
    starts two lineages
  • Consider, at any time t, all existing lineages;
    with probability p an evolutionary event
    takes place
  • Hybridization: find a coexisting lineage that
    also seeks to hybridize (the evolutionary
    distance between the two lineages cannot be too
    large)
  • Evolve sequences at each new node created

19
Network Generation II
  • Generate a birth-death tree; let the time at the root
    be 0 and the time at the last generated leaf be $t_l$,
    and let $t_i$ be in the range $[0, t_l]$
  • Find all nodes whose time does not exceed $t_i$ but
    whose children's times do
  • Calculate evolutionary distance between the found
    nodes
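
The node-selection step can be sketched as follows. This is an illustration only: it assumes node times and child lists are stored in plain dictionaries, and it reads "whose children do" as "at least one child lies after the chosen time", which the slide does not spell out. The name `nodes_crossing_time` is hypothetical.

```python
def nodes_crossing_time(times, children, t_i):
    """Return nodes at or before time t_i that have a child after t_i.

    Assumption: one child later than t_i is enough for a node to qualify.
    """
    return sorted(
        u for u, t in times.items()
        if t <= t_i and any(times[c] > t_i for c in children.get(u, []))
    )


# Hypothetical birth-death tree with node times.
times = {"root": 0.0, "a": 1.0, "b": 2.0,
         "A": 3.0, "B": 3.5, "C": 2.5, "D": 4.0}
children = {"root": ["a", "b"], "a": ["A", "B"], "b": ["C", "D"]}

print(nodes_crossing_time(times, children, 1.5))  # ['a', 'root']
print(nodes_crossing_time(times, children, 2.2))  # ['a', 'b']
```

The selected nodes mark the lineages that exist at the chosen time and are therefore candidates for a hybridization event.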

20
Network Sequence Evolution
  • In our simpler model we evolve sequences after
    the phylogeny is created.
  • Seq-gen2 evolves the two homologs
    simultaneously. Similar to Seq-gen, but at each
    network node, the node randomly inherits one of
    the parents' two homologs.
  • Output the evolved pairs of sequences for each
    leaf in the network.

21
Outline
  • Background
  • Network generation
  • Distance measure for networks
  • Network reconstruction
  • Conclusion

22
Network Distance Measure
  • Each edge e = (u,v) in a rooted network induces a
    tripartition on the leaves
  • Example from the figure, for e = (u,v):
    $X(e) = \{A,B\}$, $Y(e) = \{C,D\}$, $Z(e) = \{E,F\}$
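
The slide does not define the three sets, so the sketch below rests on an assumption borrowed from the usual formulation of this kind of measure: X(e) holds the leaves reachable from the root only through v, Y(e) the leaves reachable both through v and along some path avoiding v, and Z(e) the leaves not reachable through v at all. The function name and the example network are hypothetical.

```python
def tripartition(edges, root, edge):
    """Return (X, Y, Z) for edge = (u, v) under the assumed definitions."""
    children = {}
    for a, b in edges:
        children.setdefault(a, []).append(b)
    all_nodes = {x for e in edges for x in e}
    leaves = {x for x in all_nodes if x not in children}

    def reachable_leaves(start, banned=None):
        # Depth-first search that never enters the banned node.
        seen, stack = set(), [start]
        while stack:
            x = stack.pop()
            if x in seen or x == banned:
                continue
            seen.add(x)
            stack.extend(children.get(x, []))
        return seen & leaves

    _, v = edge
    below_v = reachable_leaves(v)
    avoiding_v = reachable_leaves(root, banned=v)
    return (frozenset(below_v - avoiding_v),   # X: only via v
            frozenset(below_v & avoiding_v),   # Y: via v and around v
            frozenset(leaves - below_v))       # Z: never via v


# Hypothetical network: h is a hybrid of lineages a and b.
tri_edges = [("r", "a"), ("r", "b"), ("a", "A"), ("a", "h"),
             ("b", "h"), ("b", "B"), ("h", "H")]
x_set, y_set, z_set = tripartition(tri_edges, "r", ("r", "a"))
print(sorted(x_set), sorted(y_set), sorted(z_set))  # ['A'] ['H'] ['B']
```

In the example, leaf H sits below the hybrid node, so it can be reached both through a and around it, which places it in the middle set.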

23
Robinson-Foulds Extended
  • $FP(N_1,N_2) = |\{ e_2 \in E(N_2) : \nexists\, e_1 \in E(N_1),\ e_1 \equiv e_2 \}| \,/\, |E(N_2)|$
  • $FN(N_1,N_2) = |\{ e_1 \in E(N_1) : \nexists\, e_2 \in E(N_2),\ e_1 \equiv e_2 \}| \,/\, |E(N_1)|$
  • $RF(N_1,N_2) = (FN(N_1,N_2) + FP(N_1,N_2)) / 2$
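
A small sketch of the extended measure, assuming each network is already summarized as a list with one tripartition per edge and that two edges are equivalent exactly when their tripartitions are identical. Both the input format and the names `tri` and `network_rf` are assumptions made for illustration.

```python
def tri(x, y, z):
    """Build a hashable tripartition from three strings of leaf letters."""
    return tuple(frozenset(part) for part in (x, y, z))


def network_rf(trips1, trips2):
    """Return (FP, FN, RF) for two networks given per-edge tripartitions."""
    set1, set2 = set(trips1), set(trips2)
    # Edges of N1 with no equivalent edge in N2, normalized by |E(N1)|.
    fn = sum(1 for t in trips1 if t not in set2) / len(trips1)
    # Edges of N2 with no equivalent edge in N1, normalized by |E(N2)|.
    fp = sum(1 for t in trips2 if t not in set1) / len(trips2)
    return fp, fn, (fn + fp) / 2


# Hypothetical edge tripartitions for two small networks on leaves A, B, C.
n1 = [tri("A", "", "BC"), tri("B", "", "AC"), tri("C", "", "AB"), tri("AB", "", "C")]
n2 = [tri("A", "", "BC"), tri("B", "", "AC"), tri("C", "", "AB"), tri("BC", "", "A")]
print(network_rf(n1, n2))  # (0.25, 0.25, 0.25)
```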

24
Convergence
  • Convergence might cause the metric to return 0 in
    cases where N1 and N2 do NOT have the same
    topology
  • In the figure, X and Y make up a convergent set in
    both networks

25
Class I and Class II Networks
  • Class I: network does not contain a convergent
    set
  • Class II: network contains a convergent set
  • Low probability of a class II network

26
Measure is a Provable Metric
  • The pair (N,m), where N is the space of Class I
    phylogenetic networks and m(.,.) is our error
    measure, is a metric space.
  • For more details and proofs, see the thesis or
    "An Error Metric for Phylogenetic Networks",
    Linder et al., Department of Computer Science,
    Technical Report, 2003.

27
Evaluating the Metric: Experimental Setup
  • Number of network nodes: 0, 1, 2, 3, 4, and 5
  • Number of taxa: 10, 20, 40, and 80
  • Sequence lengths: 25, 50, 100, 250, and 500
  • Scaling: 0.1, 0.5, 1, and 2
  • SplitsTree, neighbor-joining, and greedy maximum
    parsimony (using PAUP)

28
Experimental Results
80 taxa, edge scaling 0.5, sequence length 500
  • SplitsTree introduces too many network nodes

29
Experimental Results Cont.
40 taxa, edge scaling 0.5, sequence length 1000
  • Error rate grows as a function of the number of
    network nodes present in model network

30
Experimental Results Cont.
80 taxa, edge scaling 0.5, 1 network node in
model network
  • MP and NJ: slow decrease in error as sequence
    length increases
  • Metric performance: performs as expected; it
    neither under- nor overemphasizes the importance
    of network nodes

31
Outline
  • Background
  • Network generation
  • Distance measure for networks
  • Network reconstruction
  • Conclusion

32
Network Reconstruction
  • Inferred phylogeny might look significantly
    different from model phylogeny due to
  • Extinction
  • Missing data in taxon sampling
  • Lineage has undergone two or more simultaneous
    hybridization events

33
Related Work
  • Most parsimonious network (Fitch, 1997)
  • Parsimony for reconstructing evolutionary
    histories when recombination is present (Hein,
    1990)

34
Parsimony on Phylogenetic Networks
  • The evolutionary history of a site i in a set S
    of sequences that evolved on a network N is
    captured by one of the trees induced by N.
  • Parsimony score of a network N leaf-labeled by a
    set of taxa S is

$NCost(N,S) = \sum_i Cost(N,i)$, where
$Cost(N,i) = \min_{T \in T(N)} TCost(T,i)$

35
Fixed-tree Maximum Parsimony on Phylogenetic
Networks (FTMPPN)
  • Input: A tree T, leaf-labeled by a set of aligned
    sequences, S, and a bound B
  • Output: A phylogenetic network N (containing T)
    with at most B network nodes, with leaves labeled
    by S and internal nodes labeled by additional
    sequences, that minimizes NCost(N,S)
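
To make the objective concrete, here is a minimal sketch of how NCost(N,S) could be evaluated once the trees induced by N are known. It assumes that TCost(T,i) is the Fitch parsimony score of site i on a rooted binary tree T, and that trees are nested tuples of leaf names; the slides do not fix either choice, and the function names are hypothetical.

```python
def fitch_site_cost(tree, site):
    """Fitch parsimony cost of one site; `site` maps leaf name -> state."""
    cost = 0

    def visit(node):
        nonlocal cost
        if isinstance(node, str):
            return {site[node]}
        left, right = (visit(child) for child in node)  # binary trees only
        if left & right:
            return left & right
        cost += 1  # disjoint state sets force one change
        return left | right

    visit(tree)
    return cost


def ncost(induced_trees, seqs):
    """Sum over sites of the minimum per-site cost over the induced trees."""
    length = len(next(iter(seqs.values())))
    total = 0
    for i in range(length):
        site = {leaf: s[i] for leaf, s in seqs.items()}
        total += min(fitch_site_cost(t, site) for t in induced_trees)
    return total


# Hypothetical example: H is a hybrid placed next to A in one tree, next to C in the other.
tree_a = ((("A", "H"), "B"), "C")
tree_c = (("A", "B"), ("H", "C"))
seqs = {"A": "AA", "H": "AC", "B": "CA", "C": "CC"}
print(ncost([tree_a, tree_c], seqs))  # 2
print(ncost([tree_a], seqs), ncost([tree_c], seqs))  # 3 3
```

Each site is best explained by a different tree, so the network score (2) is lower than the score of either tree alone (3). This is the per-site minimum at work.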

36
FTMPPN Investigated
  • Determine if the best scoring phylogenetic
    network shows a bias towards a smaller, equal,
    or larger number of network nodes than are present
    in the model network
  • Assess if the inferred topology is accurate
    compared to the model topology

37
FTMPPN Investigated Cont.
  • Take a network, N, leaf-labeled by a set of
    sequences, S
  • Find all the trees that are contained within N
  • Introduce at most B network nodes to N to
    minimize NCost
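
The step "find all the trees that are contained within N" can be sketched by keeping exactly one incoming edge for every node that has two parents. This is an illustration under assumptions: the network is a list of `(parent, child)` edges, nodes left with a single parent and a single child are not suppressed, and the function name is hypothetical.

```python
from itertools import product


def induced_tree_edge_sets(edges):
    """Return one edge list per tree obtained by choosing a parent per hybrid."""
    parents = {}
    for u, v in edges:
        parents.setdefault(v, []).append(u)
    hybrids = sorted(v for v, ps in parents.items() if len(ps) > 1)
    fixed = [(u, v) for u, v in edges if len(parents[v]) == 1]
    trees = []
    # One choice of parent for each hybrid node gives one induced tree.
    for choice in product(*(parents[h] for h in hybrids)):
        kept = fixed + [(p, h) for p, h in zip(choice, hybrids)]
        trees.append(sorted(kept))
    return trees


# Hypothetical network with a single hybrid node h.
hyb_edges = [("r", "a"), ("r", "b"), ("a", "A"), ("a", "h"),
             ("b", "h"), ("b", "B"), ("h", "H")]
for tree in induced_tree_edge_sets(hyb_edges):
    print(tree)
```

With k nodes of indegree 2 the loop produces 2^k edge sets, so this direct enumeration is only practical for the small numbers of network nodes used in the experiments.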

38
Experimental Setup
  • Used a subset of the generated data from previous
    experiments
  • Used model networks directly
  • Added at most 5 network nodes to each tree of the
    network

39
Number of Network Nodes (inferred vs. model)
80 taxa, edge scaling 0.5, sequence length 500
  • The heuristic infers too many network nodes

40
Topological Accuracy
40 taxa, edge scaling 0.5, sequence length 1000
  • Poor: the inferred network nodes do not
    correspond to the ones in the model network

41
Conclusions
  • Computational tools for phylogenetic network
    generation and sequence evolution on the networks
  • Metric for the evaluation of the performance of
    phylogenetic network reconstruction methods
  • Reconstructed network topologies might look
    different from model network

42
Conclusions Cont.
  • Reconstruction using maximum parsimony, where the
    minimum over all trees is taken for each site,
    appears ill-suited. A better way might be to take
    the average for each site over all trees?

43
Acknowledgements
  • Bernard Moret
  • UT Austin: Randy Linder, Luay Nakhleh, Anneke
    Padolina, Jerry Sun, Ruth Timme, and Tandy Warnow