1Knowledge Based Phylogenetic Classification Mining
Isabelle Bichindaritz, University of Washington, Tacoma
Stephen Potter, University of Aberdeen
Société Française de Systématique, Muséum National d'Histoire Naturelle, Paris
2Agenda
- Phylogenetic classification problem
- Methods in phyloinformatics
- Knowledge based classification method
- Phylsyst system
- Conclusion
3Phylogenetic Classification
- Phylogenies are evolutionary trees.
- The goal of systematics is to construct
classifications:
- Morphological systematics
- Phylogenetic systematics (Willi Hennig, b. 1913).
- Phylogeny is the sequence of events involved in
the evolutionary development of a species or
taxonomic group of organisms.
- Phylon is the Greek word for race, tribe, class,
or kin.
4Phylogenetic Classification
- The goal of phylogenetic classification is to
construct cladograms following Hennig's principles.
- Taxa are groups of organisms such as classes,
orders, families, genera, and species.
- Cladograms are rooted phylogenetic trees, where
the root is the hypothetical common ancestor of
the taxa (species) in the tree.
5Phylogenetic Classification
6Phylogenetic Classification
- The starting point of phylogenetic classification
is a taxon matrix.
- A taxon is described as a list of character
values, such as (1, 0, 1, 2, 1, 2, 0), the values
associated with seven different characters.
- Characters can be morphological, genetic, or
molecular.
- Several types of characters:
- Unordered: values carry no evolutionary
information (e.g. 0, 1, 2).
- Undirected ordered: values carry evolutionary
information about the number of steps between
them (e.g. between 0 and 2).
- Directed ordered: values carry the same
information and also constrain the order of
evolution (e.g. from 0 to 2).
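The three character types can be sketched as a small data model. This is only an illustration: the names `CharType` and `step_cost` are hypothetical, taxon T2 is invented for the example, and the rule that a directed ordered character may only increase is an assumption about what "constrain the order of evolution" means.

```python
from enum import Enum
from typing import Optional


class CharType(Enum):
    UNORDERED = "unordered"
    UNDIRECTED_ORDERED = "undirected ordered"
    DIRECTED_ORDERED = "directed ordered"


def step_cost(kind: CharType, a: int, b: int) -> Optional[int]:
    """Steps needed to go from state a to state b, or None if forbidden."""
    if kind is CharType.UNORDERED:
        # Any change counts as one step; the values carry no order.
        return 0 if a == b else 1
    if kind is CharType.UNDIRECTED_ORDERED:
        # The distance between states counts, in either direction.
        return abs(a - b)
    # Directed ordered (assumption): only changes toward higher states.
    return b - a if b >= a else None


# A taxon matrix: one tuple of character values per taxon.
taxon_matrix = {
    "T1": (1, 0, 1, 2, 1, 2, 0),  # the example values from the slide
    "T2": (0, 0, 1, 0, 1, 2, 1),  # hypothetical second taxon
}
```

For instance, going from 0 to 2 costs one step for an unordered character and two steps for an ordered one, while going from 2 back to 0 is rejected for a directed ordered character under the assumption above.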
7Phylogenetic Classification
8Phylogenetic Classification
- Hennig Principles
- First principle. Characters have states called
plesiomorphic (primitive, generally 0) and
apomorphic (evolved, generally > 0).
9Phylogenetic Classification
- Hennig Principles
- Second principle. A group is monophyletic if its
taxa share apomorphies. These traits are
synapomorphic. Monophyletic refers to a group
that is descended from a single common ancestor.
Shared plesiomorphic character states, called
symplesiomorphies, point to a common ancestor
more distant than the one defined by
synapomorphies, so this ancestor is not exclusive
to the group.
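A minimal sketch of the second principle, assuming a taxon matrix stored as a dictionary of tuples where 0 is plesiomorphic and any value above 0 is apomorphic. The function names and the three-taxon matrix are hypothetical.

```python
def shared_apomorphies(matrix, group):
    """Indices of characters that are apomorphic (> 0) in every taxon of group."""
    n_chars = len(next(iter(matrix.values())))
    return [c for c in range(n_chars) if all(matrix[t][c] > 0 for t in group)]


def exclusive_synapomorphies(matrix, group):
    """Shared apomorphies that stay plesiomorphic (0) in every taxon outside group."""
    outside = [t for t in matrix if t not in group]
    return [
        c
        for c in shared_apomorphies(matrix, group)
        if all(matrix[t][c] == 0 for t in outside)
    ]


# Hypothetical matrix: three taxa, three characters.
matrix = {
    "T1": (1, 1, 0),
    "T2": (1, 1, 0),
    "T3": (1, 0, 1),
}
group = {"T1", "T2"}
shared = shared_apomorphies(matrix, group)           # characters 0 and 1
exclusive = exclusive_synapomorphies(matrix, group)  # character 1 only
```

Character 0 is shared by the group but also present in T3, so only character 1 supports the group as exclusive evidence.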
10Phylogenetic Classification
- Hennig Principles
- Third principle. The more synapomorphies a group
shares, the more likely it is to be
monophyletic.
- Fourth principle. The default assumption is
that each modified character shared by the
members of a group has appeared only once in this
group.
11Methods in Phyloinformatics
- Methods in phyloinformatics aim at constructing
phylogenetic classifications based on Hennig's
principles.
- The most widely used methods are:
- Parsimony-based.
- Compatibility-based.
12Methods in Phyloinformatics
- Parsimony
- Minimize the number of character state changes
among the taxa (the simplest evolutionary
hypothesis).
- PAUP system (Phylogenetic Analysis Using
Parsimony)
13Methods in Phyloinformatics
- Parsimony
- Parsimony is the minimization of homoplasies.
- Homoplasies are evolutionary mistakes such as:
- parallelism: appearance of the same derived
character independently in two groups
- convergence: state obtained by the independent
transformation of two characters
- reversion: evolution of one character from a
more derived state to a more primitive one
14Methods in Phyloinformatics
- Parsimony
- FITCH method. For unordered characters.
- WAGNER method. For ordered undirected characters.
- CAMIN-SOKAL method. For ordered directed
characters, it prevents reversion, but allows
convergence and parallelism.
- DOLLO method. For ordered directed characters, it
prevents convergence and parallelism, but not
reversion.
- Polymorphic method. In chromosome inversion, it
allows hypothetical ancestors to have polymorphic
characters, which means that they can have
several values.
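As an illustration of the first method in the list, here is a minimal sketch of Fitch's counting pass for one unordered character on a fixed rooted binary tree. The tree encoding (nested 2-tuples with taxon names as leaves), the function name `fitch_changes`, and the example states are assumptions made for this sketch; searching over tree shapes is not shown.

```python
def fitch_changes(tree, leaf_states):
    """Minimum number of state changes for one unordered character.

    tree: nested 2-tuples whose leaves are taxon names (strings).
    leaf_states: dict mapping each taxon name to its observed state.
    """
    changes = 0

    def state_set(node):
        nonlocal changes
        if isinstance(node, str):
            return {leaf_states[node]}
        left, right = (state_set(child) for child in node)
        common = left & right
        if common:
            return common  # children agree: no change needed here
        changes += 1       # children disagree: one change on this node
        return left | right

    state_set(tree)
    return changes


# Hypothetical five-taxon tree and two hypothetical characters.
tree = (("T1", "T2"), ("T3", ("T4", "T5")))
clean = {"T1": 1, "T2": 1, "T3": 0, "T4": 0, "T5": 0}
noisy = {"T1": 1, "T2": 0, "T3": 1, "T4": 0, "T5": 0}
```

On this tree the first character needs a single change, while the second needs two, which a parsimony method would read as one homoplasy.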
15Methods in Phyloinformatics
- Compatibility
- Maximize the number of characters mutually
compatible on a cladogram.
- Compatible characters are ones that present no
homoplasy: neither reversion, nor parallelism,
nor convergence.
- Find the largest clique of compatible characters,
a clique being a set of characters presenting no
homoplasy.
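The clique search can be sketched for the simplest case, binary directed characters where 0 is the ancestral state. Under that assumption two characters are compatible unless the state pairs 01, 10 and 11 all occur among the taxa, and for binary characters a pairwise compatible set can be placed together on one tree. The brute-force search below is only practical for a handful of characters; the function names and the four columns are hypothetical.

```python
from itertools import combinations


def compatible(col_a, col_b):
    """True unless the pairs 01, 10 and 11 all occur (0 assumed ancestral)."""
    patterns = set(zip(col_a, col_b))
    return not {(0, 1), (1, 0), (1, 1)} <= patterns


def largest_clique(columns):
    """Largest set of pairwise compatible characters, by exhaustive search."""
    names = list(columns)
    for size in range(len(names), 0, -1):
        for subset in combinations(names, size):
            if all(
                compatible(columns[a], columns[b])
                for a, b in combinations(subset, 2)
            ):
                return set(subset)
    return set()


# Hypothetical characters, one column of states over five taxa each.
columns = {
    "C1": (1, 1, 0, 0, 0),
    "C2": (1, 0, 0, 0, 0),
    "C3": (0, 1, 1, 0, 0),
    "C4": (0, 0, 0, 1, 1),
}
clique = largest_clique(columns)
```

Here C1 and C3 conflict, so the largest clique has three characters; two cliques of that size exist and the search returns the first one it meets.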
16Methods in Phyloinformatics
- Parsimony and compatibility are equivalent.
- Simplifications of Hennig Principles.
- Approximation by numerical methods.
17Knowledge Based Classification
- Phylogenetic Analysis Not Using Parsimony (PANUP)
(Bernard Sigwalt, SFS)
- Goal: to mine taxon matrices for phylogenetic
classifications based more closely on Hennig's
principles.
- Find a network of synapomorphies as a basis to
organize the taxa in monophyletic sub-trees.
- Systematize the actual Hennig principles, rather
than some of their consequences, as methods such
as parsimony, maximum likelihood, or compatibility
do.
18Knowledge Based Classification
- Mines matrices of character values that are
binary and directed ordered.
- 0 means plesiomorphic, and 1 means apomorphic.
19Knowledge Based Classification
- PANUP algorithm: done by hand by an expert
phylogeneticist.
- Reasoning process:
- Find reciprocal synapomorphies: look for doublons
10, 01.
- Continue recursively in submatrices until no
doublon can be found.
- Find triplons that do not lead to parallelism:
look for triplons 00, 01, 10 or 00, 11, 10
(triplons 00, 11, 01 or 11, 01, 10 lead to
parallelism).
- Use various other heuristics to complete the
cladogram.
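The pattern search in this reasoning process can be sketched as a lookup on pairs of character columns. The sketch assumes that a two-digit code such as 10 denotes the states of a pair of characters in one taxon, and that a doublon or triplon is the set of such codes observed across all taxa; this reading, and the names `classify_pair` and `scan`, are assumptions rather than PANUP's actual definitions.

```python
from itertools import combinations

DOUBLON = frozenset({(1, 0), (0, 1)})
# Triplons listed on the slide as not leading to parallelism.
TRIPLONS_OK = (
    frozenset({(0, 0), (0, 1), (1, 0)}),
    frozenset({(0, 0), (1, 1), (1, 0)}),
)
# Triplons listed on the slide as leading to parallelism.
TRIPLONS_PARALLEL = (
    frozenset({(0, 0), (1, 1), (0, 1)}),
    frozenset({(1, 1), (0, 1), (1, 0)}),
)


def classify_pair(col_a, col_b):
    """Label the set of state pairs that two binary characters show."""
    patterns = frozenset(zip(col_a, col_b))
    if patterns == DOUBLON:
        return "doublon"
    if patterns in TRIPLONS_OK:
        return "triplon"
    if patterns in TRIPLONS_PARALLEL:
        return "parallelism"
    return "other"


def scan(columns):
    """Classify every ordered-by-name pair of character columns."""
    return {
        (a, b): classify_pair(columns[a], columns[b])
        for a, b in combinations(columns, 2)
    }
```

Note that the order of the two characters matters for the triplon lists as given, so `scan` only reports each pair in the order the columns were supplied.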
20Knowledge Based Classification
- Example
- Find a 10, 01 doublon: C3 and C4 form such a
doublon, and thus exclusive synapomorphies.
This is the only such doublon in this matrix.
21Knowledge Based Classification
- Leads to two monophyletic groups: (T1, T2),
sharing exclusive synapomorphy C4, and (T3, T4,
T5), sharing exclusive synapomorphy C3.
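The split described in this example can be sketched as a recursive procedure. The matrix below is hypothetical (the slide's own matrix is not reproduced here); it was only chosen so that C3 and C4 form the single 10, 01 doublon. The function names are illustrative.

```python
from itertools import combinations


def find_doublon(matrix, taxa):
    """First pair of characters showing only 10 and 01 over taxa, else None."""
    n_chars = len(matrix[taxa[0]])
    for i, j in combinations(range(n_chars), 2):
        patterns = {(matrix[t][i], matrix[t][j]) for t in taxa}
        if patterns == {(1, 0), (0, 1)}:
            return i, j
    return None


def split_recursively(matrix, taxa=None):
    """Nested grouping of taxa obtained by splitting on doublons."""
    taxa = list(matrix) if taxa is None else taxa
    hit = find_doublon(matrix, taxa) if len(taxa) > 1 else None
    if hit is None:
        return taxa  # no doublon left: keep the group unresolved
    i, j = hit
    with_i = [t for t in taxa if matrix[t][i] == 1]
    with_j = [t for t in taxa if matrix[t][j] == 1]
    return (split_recursively(matrix, with_i), split_recursively(matrix, with_j))


# Hypothetical matrix; columns are C1, C2, C3, C4.
matrix = {
    "T1": (1, 0, 0, 1),
    "T2": (0, 0, 0, 1),
    "T3": (0, 1, 1, 0),
    "T4": (0, 1, 1, 0),
    "T5": (0, 0, 1, 0),
}
groups = split_recursively(matrix)
```

On this matrix the only doublon is (C3, C4), so the result is the group sharing C3 followed by the group sharing C4, with no further split inside either one.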
22Knowledge Based Classification
- Find triplons 00, 01, 10 or 00, 11, 10
23Phylsyst System
- PhylSyst is an intelligent system modeling the
expert systematician's reasoning to build
phylogenetic classifications.
- It mines character matrices by reproducing how
human phylogeneticists reason, representing their
knowledge.
- Only when the mining process leads to several
plausible trees does it use a scoring function to
determine the best tree. This scoring function
evaluates how well the competing trees fit
Hennig's original principles.
24Phylsyst System
- The Phylsyst algorithm proceeds through several
steps:
- 1. Search for 01, 10 doublons (reciprocal
synapomorphies), then recursively in
sub-matrices.
- 2. Search for 00, 11 doublons (several evolved
characters exclusively shared by one branch of
the tree).
- 3. Hypothesize a 01, 10 doublon (find a 01,
10 doublon by modifying less than a certain
percentage of characters, for instance 25%).
- 4. Search for triplons.
- 5. Search for quadruplons.
- 6. Evaluation of the best cladogram.
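Step 3 can be sketched as a tolerance test on pairs of characters. The sketch assumes that the percentage is measured as the fraction of taxa whose entries for the pair would have to be modified to obtain a clean 01, 10 doublon; the slide does not say how Phylsyst actually measures it. The names `doublon_mismatch` and `hypothesized_doublons` are hypothetical.

```python
from itertools import combinations


def doublon_mismatch(col_a, col_b):
    """Fraction of taxa whose state pair is neither 10 nor 01."""
    bad = sum(1 for pair in zip(col_a, col_b) if pair not in {(1, 0), (0, 1)})
    return bad / len(col_a)


def hypothesized_doublons(columns, threshold=0.25):
    """Pairs that are not doublons but fall under the mismatch threshold."""
    found = []
    for a, b in combinations(columns, 2):
        mismatch = doublon_mismatch(columns[a], columns[b])
        if 0 < mismatch < threshold:
            found.append((a, b, mismatch))
    return found


# Hypothetical columns over eight taxa: one taxon breaks the C1/C2 doublon.
columns = {
    "C1": (1, 1, 1, 0, 0, 0, 0, 0),
    "C2": (0, 0, 0, 1, 1, 1, 1, 0),
    "C3": (1, 0, 1, 0, 1, 0, 1, 1),
}
candidates = hypothesized_doublons(columns)
```

Only the pair (C1, C2) qualifies here: one taxon out of eight (12.5%) would need to be modified, which is below the 25% threshold.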
25Phylsyst System
- Scoring function
- Maximize the number of 01, 10 doublons.
- Maximize the number of synapomorphies closer to
the root by minimizing score2.
- Maximize the branching nodes splitting the tree
on a large number of taxa.
26Phylsyst System
- Example: Berberidaceae family
27Phylsyst System
- Example: cladogram built by Phylsyst
28Phylsyst System
- Summary
- Tree mining that closely represents Hennig's
principles and the expert phylogeneticist's
classification method.
- Related to tree mining:
- Decision tree induction
- Conceptual clustering
- Different because a phylogenetic tree represents
an evolution in time.
- Case based reasoning to memorize and reuse search
trees.
29Conclusion
- Phylsyst is distributed with a Biosystema special
issue.
- Improvement ideas:
- Graphical user interface
- Add genetic characters to morphological characters
- Explanation of cladograms built
- Model more closely phylogeneticists' reasoning
- Scale up to large datasets
30Conclusion
- Informatics: novel and efficient algorithms and
tools are needed for the worldwide effort to
Assemble the Tree of Life.
- Computational intelligence has an important role
to play.