Title: BCB 444/544
1 BCB 444/544
- Lecture 31
- Phylogenetics: Character-Based Methods
- 31_Nov05
2 Required Reading (before lecture)
- Fri Oct 30 - Lecture 30
- Phylogenetic Distance-Based Methods
- Chp 11 - pp 142-169
- Mon Nov 5 - Lecture 31
- Phylogenetics: Parsimony and ML
- Chp 11 - pp 142-169
- Wed Nov 7 - Lecture 32
- Machine Learning
- Fri Nov 9 - Lecture 33
- Functional and Comparative Genomics
- Chp 17 and Chp 18
3 Assignments & Announcements
- Mon Oct 29 - HW5
- HW5: Hands-on exercises with phylogenetics and tree-building software
- Due: Mon Nov 5 (not Fri Nov 1 as previously posted)
4 BCB 544 Only: New Homework Assignment
- 544 Extra2
- Due:
- PART 1 - ASAP
- PART 2 - meeting prior to 5 PM Fri Nov 2
- Part 1 - Brief outline of project; email to Drena & Michael. After response/approval, then:
- Part 2 - More detailed outline of project
- Read a few papers and summarize status of problem
- Schedule meeting with Drena & Michael to discuss ideas
5 Seminars this Week
- BCB List of URLs for Seminars related to Bioinformatics
- http://www.bcb.iastate.edu/seminars/index.html
- Nov 7 Wed - BBMB Seminar, 4:10 in 1414 MBB
- Sharon Roth Dent, MD Anderson Cancer Center
- Role of chromatin and chromatin modifying proteins in regulating gene expression
- Nov 8 Thurs - BBMB Seminar, 4:10 in 1414 MBB
- Jianzhi George Zhang, U. Michigan
- Evolution of new functions for proteins
- Nov 9 Fri - BCB Faculty Seminar, 2:10 in 102 SciI
- Amy Andreotti, ISU
- Something about NMR
6 Chp 11: Phylogenetic Tree Construction Methods and Programs
- SECTION IV: MOLECULAR PHYLOGENETICS
- Xiong Chp 11: Phylogenetic Tree Construction Methods and Programs
- Distance-Based Methods
- Character-Based Methods
- Phylogenetic Tree Evaluation
- Phylogenetic Programs
7 Tree Construction
- Two main categories of tree building methods
- Distance-based
- Overall similarity between sequences
- Character-based
- Consider the entire MSA
8 Summary of Distance-Based Methods
- Clustering-based methods
- Computationally very fast and can handle large datasets that other methods cannot
- Not guaranteed to find the best tree
- Optimality-based methods
- Better overall accuracies
- Computationally slow
- All distance-based methods reduce the alignment to pairwise distances, so site-level sequence information is lost and the most likely state at an internal node cannot be inferred
9 Character-Based Methods
- Based directly on the sequence characters in the MSA rather than overall distances
- Count mutational events accumulated on sequences
- Evolutionary dynamics of each character can be studied and ancestral sequences inferred
- Two popular approaches
- Parsimony
- Maximum Likelihood (ML)
10 Parsimony
- Parsimony is based on Occam's Razor: the simplest explanation is most likely correct
- Goal: Find the tree that allows evolution of the sequences with the fewest changes
11 Parsimony
- Parsimony score of a tree: The smallest (weighted) number of steps required by the tree
- Two parsimony problems
- Large Parsimony problem: Find the tree with the lowest parsimony score
- Small Parsimony problem: Given a tree, find its parsimony score
- Use the small parsimony problem to solve the large parsimony problem
12 Algorithms for Small Parsimony
- Fitch's algorithm
- Based on set operations
- Evolutionary steps have the same weight
- Sankoff's algorithm
- Based on dynamic programming
- Allows steps to have different weights
- Both algorithms compute the minimum (weighted) number of steps a tree requires at a given site
13 Fitch's Algorithm Example
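The slide's worked figure is not in the transcript. As a stand-in, here is a minimal sketch of Fitch's set-based pass for a single site. The nested-tuple tree encoding, the taxon names, and the example column are assumptions made for illustration, not the lecture's example.

```python
def fitch(tree, states):
    """Return (state_set, steps) for one site on a rooted binary tree.

    tree   -- a leaf name (str) or a pair (left, right) of subtrees
    states -- dict mapping leaf name -> observed character at this site
    """
    if isinstance(tree, str):              # leaf: its set is the observed state
        return {states[tree]}, 0
    left_set, left_steps = fitch(tree[0], states)
    right_set, right_steps = fitch(tree[1], states)
    common = left_set & right_set
    if common:                             # non-empty intersection: no new step
        return common, left_steps + right_steps
    # empty intersection: take the union and count one change
    return left_set | right_set, left_steps + right_steps + 1


# Hypothetical four-taxon tree and one alignment column
tree = (("human", "chimp"), ("mouse", "rat"))
site = {"human": "A", "chimp": "A", "mouse": "G", "rat": "T"}
root_set, steps = fitch(tree, site)
print(sorted(root_set), steps)
```

Every change counts as one step here, which is the equal-weight assumption stated for Fitch's algorithm. Summing `steps` over all columns would give the parsimony score of the tree.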
14 Sankoff's Algorithm
- Allows for different weights for different evolutionary steps
- Transitions (A <-> G or C <-> T) are more probable than transversions, so give a lower weight to transitions
15 Sankoff's Algorithm Example
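The slide's example figure is missing from the transcript. The sketch below shows one way the dynamic programming could look for a single site. The weights (1 for a transition, 2 for a transversion), the tree encoding, and the example column are all assumptions chosen for illustration.

```python
import math

BASES = "ACGT"
PURINES = {"A", "G"}

def cost(a, b, ts=1, tv=2):
    """Step cost: 0 for no change, ts for a transition, tv for a transversion.
    The default weights are hypothetical, not values from the lecture."""
    if a == b:
        return 0
    same_class = (a in PURINES) == (b in PURINES)
    return ts if same_class else tv

def sankoff(tree, states):
    """Return {base: minimum weighted cost of the subtree if its root is base}."""
    if isinstance(tree, str):              # leaf: only the observed base is free
        return {b: (0 if b == states[tree] else math.inf) for b in BASES}
    child_tables = [sankoff(subtree, states) for subtree in tree]
    table = {}
    for parent in BASES:
        # for each child, pick the child base that is cheapest given `parent`
        table[parent] = sum(
            min(cost(parent, base) + child[base] for base in BASES)
            for child in child_tables
        )
    return table

# Hypothetical four-taxon tree and one alignment column
tree = (("human", "chimp"), ("mouse", "rat"))
site = {"human": "A", "chimp": "A", "mouse": "G", "rat": "T"}
root_table = sankoff(tree, site)
print(root_table, min(root_table.values()))
```

The minimum over the root table is the weighted parsimony score for the site. A traceback would record which child base achieved each minimum so the ancestral states can be recovered.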
16 Sankoff's Algorithm Traceback
17 Searching for a Most Parsimonious Tree
- Solving the large parsimony problem requires searching all possible trees (or does it?)
- Exhaustive search (exact)
- Branch-and-Bound (exact)
- Heuristic search methods (not exact)
18 Exhaustive Search
- Build the only possible unrooted tree for three taxa (the three taxa can be chosen at random)
- Try all possible places to add the fourth taxon and score each tree
- Try all places to add the fifth taxon to the trees and score again
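The enumeration described above can be sketched as follows. This is only an illustration: the encoding (an unrooted tree written as the first taxon plus a nested-tuple tree of the rest) and the function names are assumptions, and scoring is left out so that only the growth in the number of trees is shown.

```python
def insertions(tree, taxon):
    """Yield every tree obtained by attaching `taxon` to one branch of `tree`.

    A tree is a leaf name (str) or a pair (left, right). Pairing the new
    taxon with a node corresponds to splitting the branch above that node.
    """
    yield (tree, taxon)                    # branch above this node
    if not isinstance(tree, str):
        left, right = tree
        for new_left in insertions(left, taxon):
            yield (new_left, right)
        for new_right in insertions(right, taxon):
            yield (left, new_right)

def all_unrooted_trees(taxa):
    """Enumerate unrooted trees, each written as (first_taxon, rest)."""
    first, second, third, *others = taxa
    rests = [(second, third)]              # with `first`: the single 3-taxon tree
    for taxon in others:
        rests = [new for rest in rests for new in insertions(rest, taxon)]
    return [(first, rest) for rest in rests]

for n in range(3, 8):
    taxa = [f"t{i}" for i in range(n)]
    print(n, len(all_unrooted_trees(taxa)))
```

An unrooted tree with k taxa has 2k - 3 branches, so each tree gives 2k - 3 new trees when the next taxon is added. An exhaustive search would score every tree in the final list and keep the best.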
19 Why Finding a True Tree is Difficult
- The number of possible trees grows faster than exponentially with the number of species (or sequences) $n$
- Number of rooted trees: $N_r = \frac{(2n-3)!}{2^{n-2}\,(n-2)!}$
- Number of unrooted trees: $N_u = \frac{(2n-5)!}{2^{n-3}\,(n-3)!}$
- To find the best tree, you must explore all possibilities (or must you?)
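To get a feel for these counts, the two formulas can be evaluated directly. This is a small sketch; the function names are made up for illustration.

```python
from math import factorial

def num_rooted(n):
    """Rooted binary trees on n taxa: (2n-3)! / (2^(n-2) (n-2)!), n >= 2."""
    return factorial(2 * n - 3) // (2 ** (n - 2) * factorial(n - 2))

def num_unrooted(n):
    """Unrooted binary trees on n taxa: (2n-5)! / (2^(n-3) (n-3)!), n >= 3."""
    return factorial(2 * n - 5) // (2 ** (n - 3) * factorial(n - 3))

for n in (3, 4, 5, 10, 20):
    print(n, num_rooted(n), num_unrooted(n))
```

With only 10 taxa there are already about 2 million unrooted and 34 million rooted trees, which is why exhaustive search is limited to small datasets.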
20 Adding the Fourth Taxon
21 Adding the Fifth Taxon
23 Branch and Bound
- Similar to exhaustive search except that we maintain the score of the best tree obtained so far
- If the score of the current tree exceeds the current best score, backtrack and take the next available path
- Main idea: The parsimony score of a tree can only increase or stay the same as we add another taxon
24 Branch and Bound
- When a tip of the search tree is reached, the tree is either optimal (and retained) or suboptimal (and rejected)
- When all paths leading from the initial 3-taxon tree have been explored, the algorithm terminates, and all most parsimonious trees will have been identified
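A compact sketch of this search is given below, under several assumptions: trees are nested tuples, every step has equal weight (Fitch scoring), taxa are added in the order they appear in the input, and the toy alignment is invented. It illustrates the pruning idea and is not the procedure of any particular program.

```python
import math

def fitch(tree, site):
    """Fitch pass for one site: return (state set, number of changes)."""
    if isinstance(tree, str):                      # leaf
        return {site[tree]}, 0
    left_set, left_steps = fitch(tree[0], site)
    right_set, right_steps = fitch(tree[1], site)
    common = left_set & right_set
    if common:
        return common, left_steps + right_steps
    return left_set | right_set, left_steps + right_steps + 1

def parsimony(tree, seqs):
    """Total Fitch steps over all alignment columns for the taxa in `tree`."""
    n_sites = len(next(iter(seqs.values())))
    return sum(
        fitch(tree, {name: seq[i] for name, seq in seqs.items()})[1]
        for i in range(n_sites)
    )

def insertions(tree, taxon):
    """Yield each tree made by attaching `taxon` to one branch of `tree`."""
    yield (tree, taxon)
    if not isinstance(tree, str):
        left, right = tree
        for new_left in insertions(left, taxon):
            yield (new_left, right)
        for new_right in insertions(right, taxon):
            yield (left, new_right)

def branch_and_bound(seqs):
    """Return (best score, list of all most parsimonious trees)."""
    taxa = list(seqs)
    best = {"score": math.inf, "trees": []}

    def extend(rest, remaining):
        tree = (taxa[0], rest)
        score = parsimony(tree, seqs)
        if score > best["score"]:
            return          # bound: adding taxa can never lower the score
        if not remaining:   # a tip of the search tree: complete tree
            if score < best["score"]:
                best["score"], best["trees"] = score, [tree]
            else:
                best["trees"].append(tree)
            return
        for new_rest in insertions(rest, remaining[0]):
            extend(new_rest, remaining[1:])

    extend((taxa[1], taxa[2]), taxa[3:])
    return best["score"], best["trees"]

# Hypothetical toy alignment with four taxa and two sites
seqs = {"a": "AA", "b": "AA", "c": "GG", "d": "GG"}
score, trees = branch_and_bound(seqs)
print(score, trees)
```

Pruning only when the partial score is strictly greater than the best score keeps ties, so all most parsimonious trees are returned. Seeding `best["score"]` with the score of a quickly built tree would let pruning start earlier.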
25 Branch and Bound
26 Branch and Bound
- One way to find a reasonable initial bound quickly
- Use UPGMA or NJ to build a complete tree
- Calculate the parsimony score of this tree and use it as the initial best score (an upper bound on the optimal score) in our search
27 Heuristic Search
- Shortcuts have been designed to reduce the search space
- Idea: Build a tree quickly (by NJ or some other fast method) and rearrange parts of it to explore some of the possible trees
- Branch swapping
- Nearest neighbor interchange
- Subtree pruning and regrafting
- Tree bisection and reconnection
28 Nearest-Neighbor Interchange
29 Subtree Pruning and Regrafting
30 Tree Bisection and Reconnection
31 Stepwise Addition: Another Heuristic
- A greedy method
- Start with 3 taxon tree
- Add one taxon at a time
- Keep only the best tree found so far
- No guarantee of optimality, but may provide a
good starting point for a search
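The greedy procedure above might be sketched like this, assuming equal-weight (Fitch) scoring, nested-tuple trees, and taxa added in input order. The helper names and the toy alignment are hypothetical.

```python
def fitch_steps(tree, site):
    """Return (state set, changes) for one site; leaves are taxon names."""
    if isinstance(tree, str):
        return {site[tree]}, 0
    left_set, left_steps = fitch_steps(tree[0], site)
    right_set, right_steps = fitch_steps(tree[1], site)
    common = left_set & right_set
    if common:
        return common, left_steps + right_steps
    return left_set | right_set, left_steps + right_steps + 1

def parsimony(tree, seqs):
    """Total number of changes over all alignment columns."""
    n_sites = len(next(iter(seqs.values())))
    return sum(
        fitch_steps(tree, {name: seq[i] for name, seq in seqs.items()})[1]
        for i in range(n_sites)
    )

def insertions(tree, taxon):
    """Yield each tree made by attaching `taxon` to one branch of `tree`."""
    yield (tree, taxon)
    if not isinstance(tree, str):
        left, right = tree
        for new_left in insertions(left, taxon):
            yield (new_left, right)
        for new_right in insertions(right, taxon):
            yield (left, new_right)

def stepwise_addition(seqs):
    """Greedy search: add one taxon at a time, keeping only the best tree."""
    taxa = list(seqs)
    rest = (taxa[1], taxa[2])              # start with the 3-taxon tree
    for taxon in taxa[3:]:
        rest = min(
            insertions(rest, taxon),
            key=lambda candidate: parsimony((taxa[0], candidate), seqs),
        )
    tree = (taxa[0], rest)
    return parsimony(tree, seqs), tree

# Hypothetical toy alignment
seqs = {"a": "AA", "b": "AA", "c": "GG", "d": "GG"}
print(stepwise_addition(seqs))
```

Because only one tree survives each round, the result can depend on the order in which taxa are added. This is why the output is usually treated as a starting point for branch swapping.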
32 Maximum Likelihood Method
- ML is based on a Markov model of evolution
- Observed: The species labeling the leaves
- Hidden: The ancestral states
- Transition probabilities: The mutation probabilities
- Assumptions
- Only mutations are allowed
- Sites are independent
33 Models of Evolution at a Site
- Transition probability matrix
- $M = [m_{ij}]$, $i, j \in \{A, C, T, G\}$
- Where
- $m_{ij} = \mathrm{Prob}(i \to j \text{ mutation in 1 time unit})$
- Branches may have different lengths
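34 The Probability of an Assignment
[Tree figure: root T with internal children G and T; the G node has leaves A and G, the T node has leaves C and T]
Probability $= m_{TG}\, m_{GA}\, m_{GG}\, m_{TT}\, m_{TC}\, m_{TT}$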
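As a numeric illustration of this product, the sketch below multiplies one matrix entry per branch. The matrix values (0.7 to stay, 0.1 for each change) are invented for the example, and every branch is treated as one time unit.

```python
BASES = "ACGT"

# Hypothetical one-step matrix: stay with prob 0.7, each change with prob 0.1
M = {i: {j: (0.7 if i == j else 0.1) for j in BASES} for i in BASES}

def assignment_probability(edges, matrix):
    """Multiply m[parent][child] over all (parent, child) branches."""
    prob = 1.0
    for parent, child in edges:
        prob *= matrix[parent][child]
    return prob

# Branches of the slide's tree: root T, internal nodes G and T
edges = [("T", "G"), ("G", "A"), ("G", "G"),
         ("T", "T"), ("T", "C"), ("T", "T")]
print(assignment_probability(edges, M))
```

Three of the six branches carry a change and three do not, so under these assumed values the product is 0.1^3 x 0.7^3.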
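35 Ancestral Reconstruction: Most Likely Assignment
[Tree figure: root X with internal children Y and Z; Y has leaves A and G, Z has leaves C and T]
$L = \max_{X,Y,Z}\; m_{XY}\, m_{YA}\, m_{YG}\, m_{XZ}\, m_{ZC}\, m_{ZT}$
Compute using Viterbi algorithm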
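36 Likelihood of a Tree
[Tree figure: root X with internal children Y and Z; Y has leaves A and G, Z has leaves C and T]
$L = \sum_{X,Y,Z}\; m_{XY}\, m_{YA}\, m_{YG}\, m_{XZ}\, m_{ZC}\, m_{ZT}$
Compute using forward algorithm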
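For a tree this small, both the maximum over ancestral states (most likely assignment) and the sum over them (likelihood of the tree) can be checked by brute force over all 64 choices of X, Y, Z. The sketch below does that instead of the Viterbi and forward recursions named on the slides. The matrix values are invented, and, like the slide formulas, it uses no prior on the root state.

```python
from itertools import product

BASES = "ACGT"

# Hypothetical one-step matrix: stay with prob 0.7, each change with prob 0.1
M = {i: {j: (0.7 if i == j else 0.1) for j in BASES} for i in BASES}

def joint(x, y, z):
    """Probability of ancestral states X, Y, Z with leaves A, G under Y
    and leaves C, T under Z (the tree on the slide)."""
    return (M[x][y] * M[y]["A"] * M[y]["G"] *
            M[x][z] * M[z]["C"] * M[z]["T"])

assignments = list(product(BASES, repeat=3))

# Likelihood of the tree: sum over all ancestral assignments
likelihood = sum(joint(x, y, z) for x, y, z in assignments)

# Most likely assignment: the term with the largest probability
best = max(assignments, key=lambda xyz: joint(*xyz))

print(likelihood)
print(best, joint(*best))
```

With this symmetric matrix several assignments tie for the maximum, and `max` returns the first one it meets. On larger trees the enumeration grows as 4 to the power of the number of internal nodes, which is what the dynamic programming recursions avoid.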
37 Maximum Likelihood Comments
- ML is robust
- ML converges to the correct answer as more data is added
- Can put in a Bayesian statistical framework to obtain a distribution of possible phylogenies
- ML can be slow
38 Phylogenetic Tree Evaluation
- Bootstrapping
- Jackknifing
- Bayesian Simulation
- Statistical difference tests (are two trees significantly different?)
- Kishino-Hasegawa Test (paired t-test)
- Shimodaira-Hasegawa Test ($\chi^2$ test)
39 Bootstrapping
- A bootstrap sample is obtained by sampling sites randomly with replacement
- Obtain a data matrix with same number of taxa and number of characters as original one
- Construct trees for samples
- For each branch in original tree, compute fraction of bootstrap samples in which that branch appears
- Assigns a bootstrap support value to each branch
- Idea: If a grouping has a lot of support, it will be supported by at least some positions in most of the bootstrap samples
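The resampling step can be sketched in a few lines. The alignment below is a made-up toy example, and tree building on each replicate is left to whatever method is being evaluated.

```python
import random

def bootstrap_sample(alignment, rng):
    """Resample alignment columns with replacement.

    The replicate has the same taxa and the same number of sites as the
    original; some columns appear more than once and others not at all.
    """
    n_sites = len(next(iter(alignment.values())))
    columns = [rng.randrange(n_sites) for _ in range(n_sites)]
    return {name: "".join(seq[i] for i in columns)
            for name, seq in alignment.items()}

# Hypothetical toy alignment (equal-length sequences)
alignment = {"human": "ACGTAC", "chimp": "ACGTAT", "mouse": "ATGCAT"}
rng = random.Random(0)                     # fixed seed for repeatability
replicate = bootstrap_sample(alignment, rng)
print(replicate)
```

Note that whole columns are resampled, so each taxon's characters stay aligned with one another. The support value of a branch is then the fraction of replicate trees that contain it.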
40 Bootstrapping Comments
- Bootstrapping doesn't really assess the accuracy of a tree; it only indicates the consistency of the data
- To get reliable statistics, bootstrapping needs to be done on your tree 500-1000 times; this is a big problem if your tree took a few days to construct
41 Jackknifing
- Another resampling technique
- Randomly delete half of the sites in the dataset
- Construct new tree with this smaller dataset, see how often taxa are grouped
- Advantage: sites aren't duplicated
- Disadvantage: again really only measuring consistency of the data
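A jackknife replicate differs from a bootstrap replicate only in how columns are chosen. Here is a minimal sketch with an invented toy alignment; the `keep_fraction` parameter is an assumption added for illustration, with 0.5 matching the "delete half" rule above.

```python
import random

def jackknife_sample(alignment, rng, keep_fraction=0.5):
    """Keep a random subset of columns, without replacement.

    No site is duplicated; the kept columns stay in their original order.
    """
    n_sites = len(next(iter(alignment.values())))
    n_keep = int(n_sites * keep_fraction)
    kept = sorted(rng.sample(range(n_sites), n_keep))
    return {name: "".join(seq[i] for i in kept)
            for name, seq in alignment.items()}

# Hypothetical toy alignment (equal-length sequences)
alignment = {"human": "ACGTAC", "chimp": "ACGTAT", "mouse": "ATGCAT"}
rng = random.Random(0)                     # fixed seed for repeatability
half = jackknife_sample(alignment, rng)
print(half)
```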
42 Bayesian Simulation
- Using a Bayesian ML method to produce a tree automatically calculates the probability of many trees during the search
- Most trees sampled in the Bayesian ML search are near an optimal tree
43 Phylogenetic Programs
- Huge list at
- http://evolution.genetics.washington.edu/phylip/software.html
- PAUP - one of the most popular programs, commercial, Mac and Unix only, nice user interface
- PHYLIP - free, multiplatform, a bit difficult to use but web servers make it easier
- WebPhylip - another interface for PHYLIP online
44 Phylogenetic Programs
- TREE-PUZZLE - uses a heuristic to allow ML on large datasets, also available as a web server
- PHYML - web based, uses genetic algorithm
- MrBayes - Bayesian program, fast and can handle large datasets, multiplatform download
- BAMBE - web based Bayesian program
45 Final Comments on Phylogenetics
- No method is perfect
- Different methods make very different assumptions
- If multiple methods using different assumptions
come up with similar results, we should trust the
results more than any single method