Knowledge Based Phylogenetic Classification Mining. Isabelle Bichindaritz, University of Washington

1
Knowledge Based Phylogenetic Classification Mining
Isabelle Bichindaritz, University of Washington, Tacoma
Stephen Potter, University of Aberdeen
Société Française de Systématique, Muséum National d'Histoire Naturelle, Paris
2
Agenda
  • Phylogenetic classification problem
  • Methods in phyloinformatics
  • Knowledge based classification method
  • Phylsyst system
  • Conclusion

3
Phylogenetic Classification
  • Phylogenies are evolutionary trees.
  • The goal of systematics is to construct
    classifications.
  • Morphological systematics
  • Phylogenetic systematics (Willi Hennig, born 1913).
  • Phylogeny is the sequence of events involved in
    the evolutionary development of a species or
    taxonomic group of organisms.
  • Phylon is the Greek word for race, tribe, class,
    or kin.

4
Phylogenetic Classification
  • The goal of phylogenetic classification is to
    construct cladograms following Hennig principles.
  • Taxa are groups of organisms such as classes,
    orders, families, genera, and species.
  • Cladograms are rooted phylogenetic trees, where
    the root is the hypothetical common ancestor of
    the taxa (species) in the tree.

5
Phylogenetic Classification
6
Phylogenetic Classification
  • The starting point of phylogenetic classification
    is a taxon matrix.
  • A taxon is described as a list of character
    values. For example, (1, 0, 1, 2, 1, 2, 0) gives
    the values associated with seven different
    characters.
  • Characters can be morphologic, genetic, or
    molecular.
  • Several types of characters:
  • Unordered values carry no evolutionary
    information: 0, 1, 2
  • Undirected ordered values carry evolutionary
    information about the number of steps between
    them: 0, 2
  • Directed ordered values carry the same
    evolutionary information and also constrain the
    order of evolution: 0, 2
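The three character types can be illustrated by how many evolutionary steps separate two states. The sketch below is only an illustration: the function name `step_cost`, the type labels, and the rule that a directed character may only move from a lower to a higher state are assumptions, not taken from the slides.

```python
def step_cost(a, b, kind):
    """Number of steps from state a to state b, or None if forbidden.

    The three `kind` labels are hypothetical names for the slide's
    character types.
    """
    if kind not in ("unordered", "ordered_undirected", "ordered_directed"):
        raise ValueError(f"unknown character type: {kind}")
    if a == b:
        return 0
    if kind == "unordered":
        # Any change between two different states counts as one step.
        return 1
    if kind == "ordered_undirected":
        # States form a chain; steps are counted in either direction.
        return abs(a - b)
    # ordered_directed: assumed here to allow only increasing changes.
    return b - a if b > a else None


# A taxon described by seven character values, as on the slide.
taxon = (1, 0, 1, 2, 1, 2, 0)

print(step_cost(0, 2, "unordered"))           # 1
print(step_cost(0, 2, "ordered_undirected"))  # 2
print(step_cost(2, 0, "ordered_directed"))    # None
```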

7
Phylogenetic Classification
8
Phylogenetic Classification
  • Hennig Principles
  • First principle. Characters have states called
    plesiomorphic (primitive, generally 0) and
    apomorphic (evolved, generally > 0).

9
Phylogenetic Classification
  • Hennig Principles
  • Second principle. A group is monophyletic if its
    taxa share apomorphies. These traits are
    synapomorphic. Monophyletic refers to a group
    that is descended from a single common ancestor.
    The sharing of plesiomorphic characters, called
    symplesiomorphic traits, points to a common
    ancestor more distant than the one defined by
    synapomorphies, so this ancestor is not exclusive
    to the group.

10
Phylogenetic Classification
  • Hennig Principles
  • Third principle. The more synapomorphies a group
    shares, the more likely it is to be
    monophyletic.
  • Fourth principle. The assumption by default is
    that each modified character shared by the
    members of a group has appeared only once in this
    group.
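As a rough illustration of the second and third principles, the sketch below lists the characters that are apomorphic (1) in every member of a candidate group and plesiomorphic (0) everywhere else. The taxon matrix is hypothetical, and the function name `exclusive_synapomorphies` is an assumption made for this example; it assumes binary characters with 0 as the primitive state.

```python
# Hypothetical binary taxon matrix: taxon name -> values of C1..C4.
MATRIX = {
    "T1": (1, 0, 0, 1),
    "T2": (0, 1, 0, 1),
    "T3": (1, 1, 1, 0),
    "T4": (0, 0, 1, 0),
    "T5": (0, 1, 1, 0),
}


def exclusive_synapomorphies(matrix, group):
    """Indices of characters equal to 1 in all of `group` and 0 elsewhere."""
    group = set(group)
    n_chars = len(next(iter(matrix.values())))
    shared = []
    for c in range(n_chars):
        inside = all(matrix[t][c] == 1 for t in group)
        outside = all(row[c] == 0 for t, row in matrix.items() if t not in group)
        if inside and outside:
            shared.append(c)
    return shared


# The more indices returned, the stronger the support for monophyly.
print(exclusive_synapomorphies(MATRIX, ["T1", "T2"]))        # [3]
print(exclusive_synapomorphies(MATRIX, ["T3", "T4", "T5"]))  # [2]
```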

11
Methods in Phyloinformatics
  • Methods in phyloinformatics aim at constructing
    phylogenetic classifications based on Hennig
    principles.
  • The most widespread methods are:
  • Parsimony-based.
  • Compatibility-based.

12
Methods in Phyloinformatics
  • Parsimony
  • Minimize the number of character state changes
    among the taxa (the simplest evolutionary
    hypothesis).
  • PAUP system (Phylogenetic Analysis Using
    Parsimony)

13
Methods in Phyloinformatics
  • Parsimony
  • Parsimony is the minimization of homoplasies.
  • Homoplasies are evolutionary mistakes such as:
  • parallelism - appearance of the same derived
    character independently in two groups
  • convergence - state obtained by the independent
    transformation of two characters
  • reversion - evolution of one character from a
    more derived state to a more primitive one

14
Methods in Phyloinformatics
  • Parsimony
  • FITCH method. For unordered characters.
  • WAGNER method. For ordered undirected characters.
  • CAMIN-SOKAL method. For ordered directed
    characters, it prevents reversion, but allows
    convergence and parallelism.
  • DOLLO method. For ordered directed characters, it
    prevents convergence and parallelism, but not
    reversion.
  • Polymorphic method. For chromosome inversion
    data, it allows hypothetical ancestors to have
    polymorphic characters, that is, characters with
    several values.
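For unordered characters, the Fitch method counts the minimum number of state changes a given tree requires. The sketch below shows the standard bottom-up pass for a single character on a fixed binary tree. The nested-tuple tree encoding and the example states are assumptions chosen for illustration.

```python
def fitch(tree, states):
    """Return (possible ancestral states, minimum changes) for a subtree.

    `tree` is either a leaf name (str) or a pair (left, right).
    `states` maps each leaf name to its observed character state.
    """
    if isinstance(tree, str):
        return {states[tree]}, 0
    left, right = tree
    left_set, left_changes = fitch(left, states)
    right_set, right_changes = fitch(right, states)
    common = left_set & right_set
    if common:
        # The children agree on at least one state: no extra change.
        return common, left_changes + right_changes
    # The children disagree: one change is needed on this node's branches.
    return left_set | right_set, left_changes + right_changes + 1


# Hypothetical cladogram over five taxa.
TREE = (("T1", "T2"), ("T3", ("T4", "T5")))

clean = {"T1": 1, "T2": 1, "T3": 0, "T4": 0, "T5": 0}
noisy = {"T1": 1, "T2": 0, "T3": 1, "T4": 0, "T5": 0}

print(fitch(TREE, clean)[1])  # 1
print(fitch(TREE, noisy)[1])  # 2
```

Summing the change counts over all characters gives the parsimony score of the tree; a full parsimony search would then compare this score across candidate trees.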

15
Methods in Phyloinformatics
  • Compatibility
  • Maximize the number of characters mutually
    compatible on a cladogram.
  • Compatible characters are ones that present no
    homoplasy: no reversion, no parallelism, and no
    convergence.
  • Find the largest clique of compatible characters,
    a clique being a set of characters presenting no
    homoplasy.
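A minimal sketch of the clique idea for binary directed characters follows. It assumes the hypothetical ancestor carries state 0 for every character, so two characters are treated as incompatible when the patterns 01, 10, and 11 all occur among the taxa. The matrix is hypothetical, and the brute-force search is only practical for small numbers of characters.

```python
from itertools import combinations

# Hypothetical binary taxon matrix: taxon name -> values of C1..C4.
MATRIX = {
    "T1": (1, 0, 0, 1),
    "T2": (0, 1, 0, 1),
    "T3": (1, 1, 1, 0),
    "T4": (0, 0, 1, 0),
    "T5": (0, 1, 1, 0),
}


def compatible(matrix, i, j):
    """True if characters i and j can both evolve without homoplasy."""
    patterns = {(row[i], row[j]) for row in matrix.values()}
    patterns.add((0, 0))  # assumed all-plesiomorphic ancestor
    return len(patterns) < 4


def largest_clique(matrix):
    """Largest set of mutually compatible characters (brute force)."""
    n_chars = len(next(iter(matrix.values())))
    for size in range(n_chars, 0, -1):
        for subset in combinations(range(n_chars), size):
            if all(compatible(matrix, i, j) for i, j in combinations(subset, 2)):
                return subset
    return ()


print(largest_clique(MATRIX))  # (2, 3), i.e. characters C3 and C4
```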

16
Methods in Phyloinformatics
  • Parsimony and compatibility are equivalent.
  • Simplifications of Hennig Principles.
  • Approximation by numerical methods.

17
Knowledge Based Classification
  • Phylogenetic Analysis Not Using Parsimony (PANUP)
    (Bernard Sigwalt, SFS)
  • Goal: to mine taxon matrices for phylogenetic
    classifications based more closely on Hennig's
    principles.
  • Find a network of synapomorphies as a basis to
    organize the taxa in monophyletic sub-trees.
  • Systematize the actual Hennig principles, rather
    than some of their consequences, as methods such
    as parsimony, maximum likelihood, or
    compatibility do.

18
Knowledge Based Classification
  • Mines matrices of character values that are
    binary and directed ordered.
  • 0 means plesiomorphic, and 1 means apomorphic

19
Knowledge Based Classification
  • The PANUP algorithm is done by hand by an expert
    phylogeneticist.
  • Reasoning process:
  • Find reciprocal synapomorphies: look for doublons
    10, 01.
  • Continue recursively in submatrices until no
    doublon can be found.
  • Find triplons that do not lead to parallelism:
    look for triplons 00, 01, 10 or 00, 11, 10
    (triplons 00, 11, 01 or 11, 01, 10 lead to
    parallelism).
  • Use various other heuristics to complete the
    cladogram.

20
Knowledge Based Classification
  • Example
  • Find doublon 10, 01: C3 and C4 form this 10,
    01 doublon, and thus an exclusive synapomorphy.
    This is the only such doublon in this matrix.

21
Knowledge Based Classification
  • This leads to two monophyletic groups: (T1, T2),
    sharing exclusive synapomorphy C4, and (T3, T4,
    T5), sharing exclusive synapomorphy C3.
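The doublon search in this example can be sketched as follows. The slide's matrix is not reproduced in the transcript, so the matrix below is hypothetical, built only so that C3 and C4 form its single 10, 01 doublon. The reading of a doublon as a pair of characters whose value pairs across all taxa are exactly 10 and 01 is also an assumption.

```python
from itertools import combinations

# Hypothetical binary taxon matrix: taxon name -> values of C1..C4.
MATRIX = {
    "T1": (1, 0, 0, 1),
    "T2": (0, 1, 0, 1),
    "T3": (1, 1, 1, 0),
    "T4": (0, 0, 1, 0),
    "T5": (0, 1, 1, 0),
}


def find_doublons(matrix):
    """Return (i, j, taxa with i apomorphic, taxa with j apomorphic)."""
    n_chars = len(next(iter(matrix.values())))
    found = []
    for i, j in combinations(range(n_chars), 2):
        patterns = {(row[i], row[j]) for row in matrix.values()}
        if patterns == {(1, 0), (0, 1)}:
            # Every taxon has exactly one of the two characters evolved,
            # so the pair splits the taxa into two exclusive groups.
            group_i = sorted(t for t, row in matrix.items() if row[i] == 1)
            group_j = sorted(t for t, row in matrix.items() if row[j] == 1)
            found.append((i, j, group_i, group_j))
    return found


print(find_doublons(MATRIX))
# [(2, 3, ['T3', 'T4', 'T5'], ['T1', 'T2'])]
```

Applying the same function to the sub-matrix of each group would give the recursive step described in the reasoning process.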

22
Knowledge Based Classification
  • Find triplons: 00, 01, 10 or 00, 11, 10
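One possible way to encode the triplon lists is to classify each ordered pair of characters by the set of value pairs it shows across the taxa. The sketch below follows the pattern lists from the slides literally; the matrix, the function name `classify_pair`, and the labels it returns are assumptions made for illustration.

```python
# Pattern sets taken from the slides, written as two-digit strings.
DOUBLON = {"10", "01"}
ACCEPTED_TRIPLONS = [{"00", "01", "10"}, {"00", "11", "10"}]
PARALLELISM_TRIPLONS = [{"00", "11", "01"}, {"11", "01", "10"}]

# Hypothetical binary taxon matrix: taxon name -> values of C1..C3.
MATRIX = {
    "T1": (1, 0, 1),
    "T2": (1, 0, 0),
    "T3": (0, 1, 0),
    "T4": (0, 0, 0),
}


def classify_pair(matrix, i, j):
    """Label the ordered character pair (i, j) by its observed patterns."""
    patterns = {f"{row[i]}{row[j]}" for row in matrix.values()}
    if patterns == DOUBLON:
        return "doublon"
    if patterns in ACCEPTED_TRIPLONS:
        return "triplon"
    if patterns in PARALLELISM_TRIPLONS:
        return "parallelism"
    return "other"


print(classify_pair(MATRIX, 0, 1))  # triplon (patterns 00, 01, 10)
print(classify_pair(MATRIX, 0, 2))  # triplon (patterns 00, 11, 10)
print(classify_pair(MATRIX, 2, 0))  # parallelism (patterns 00, 11, 01)
```

Taken literally, the lists are sensitive to the order in which the two characters are read, as the last two calls show; the sketch keeps that behaviour rather than guessing at the intended rule.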

23
Phylsyst System
  • PhylSyst is an intelligent system that models
    expert systematicians' reasoning to build
    phylogenetic classifications.
  • It mines character matrices by reproducing how
    human phylogeneticists reason, representing their
    knowledge.
  • Only when the mining process leads to several
    plausible trees does it use a scoring function to
    determine the best tree. This scoring function
    evaluates how well the competing trees fit
    Hennig's original principles.

24
Phylsyst System
  • The Phylsyst algorithm proceeds through several
    steps:
  • 1.  Search for 01, 10 doublons (reciprocal
    synapomorphies), then recursively in
    sub-matrices.
  • 2.  Search for 00, 11 doublons (several evolved
    characters exclusively shared by one branch of
    the tree).
  • 3.  Hypothesize a 01, 10 doublon (find a 01,
    10 doublon by modifying less than a certain
    percentage of characters, for instance 25%).
  • 4.  Search for triplons.
  • 5.  Search for quadruplons.
  • 6.  Evaluation of the best cladogram.
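Step 3 can be read as a tolerance on the doublon test. The sketch below is one such reading and not the Phylsyst implementation: it counts, for each pair of characters, the taxa whose values would have to be modified to obtain a pure 01, 10 pattern, and keeps the pair when that fraction of taxa is below the threshold. The matrix, the function name, and the choice of taxa as the denominator are all assumptions.

```python
from itertools import combinations

# Hypothetical binary taxon matrix: taxon name -> values of C1..C3.
MATRIX = {
    "T1": (1, 0, 1),
    "T2": (1, 0, 0),
    "T3": (0, 1, 1),
    "T4": (0, 1, 0),
    "T5": (1, 1, 0),
}


def hypothesize_doublons(matrix, max_fraction=0.25):
    """Return (i, j, mismatches) for near-doublon character pairs."""
    n_chars = len(next(iter(matrix.values())))
    n_taxa = len(matrix)
    candidates = []
    for i, j in combinations(range(n_chars), 2):
        pairs = [(row[i], row[j]) for row in matrix.values()]
        # A taxon showing 00 or 11 needs one value changed to fit.
        mismatches = sum(1 for p in pairs if p not in {(1, 0), (0, 1)})
        both_sides = (1, 0) in pairs and (0, 1) in pairs
        if both_sides and 0 < mismatches and mismatches / n_taxa < max_fraction:
            candidates.append((i, j, mismatches))
    return candidates


print(hypothesize_doublons(MATRIX))  # [(0, 1, 1)]
```

Here only T5 (pattern 11) breaks the C1, C2 doublon, which is 20% of the taxa and therefore under the 25% threshold.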

25
Phylsyst System
  • Scoring function:
  • Maximize the number of 01, 10 doublons.
  • Maximize the number of synapomorphies closer to
    the root by minimizing score2.
  • Maximize the number of branching nodes that split
    the tree over a large number of taxa.

26
Phylsyst System
  • Example: Berberidaceae family

27
Phylsyst System
  • Example: cladogram built by Phylsyst

28
Phylsyst System
  • Summary
  • Tree mining that closely represents Hennig's
    principles and the expert phylogeneticist's
    classification method.
  • Related to tree mining:
  • Decision tree induction
  • Conceptual clustering
  • Different because a phylogenetic tree represents
    an evolution in time
  • Case based reasoning to memorize and reuse search
    trees

29
Conclusion
  • Phylsyst is distributed with a Biosystema special
    issue
  • Improvement ideas:
  • Graphical user interface
  • Add genetic characters to morphological characters
  • Explanation of the cladograms built
  • Model phylogeneticists' reasoning more closely
  • Scale up to large datasets

30
Conclusion
  • Informatics: novel and efficient algorithms and
    tools are needed for the worldwide effort to
    Assemble the Tree of Life.
  • Computational intelligence has an important role
    to play.