1
Phylogenetic Analysis
  • Introduction to bioinformatics
  • Stinus Lindgreen
  • stinus_at_binf.ku.dk
  • Bioinformatics Centre, University of Copenhagen

2
Outline of the lecture
  • What is a phylogeny?
  • Why and how to interpret them
  • Programs: PHYLIP, PAUP and BioEdit
  • Building a tree 1: Multiple alignment
  • Building a tree 2: The model
  • Building a tree 3: Construction
  • Building a tree 4: Evaluation

3
  • Nothing in Biology Makes Sense Except in the
    Light of Evolution
  • Theodosius Dobzhansky (1900-1975)

4
Phylogeny
  • Phylogenetic inference predicts a tree based on
    characters (of some sort)
  • Some variation needed
  • Group together similar species/genes
  • Connect to most common ancestor
  • Unrooted tree: Just shows connections
  • Rooted tree: Direction of evolution
  • Branch lengths can show divergence

5
Before sequences
  • Phylogenetic trees show evolutionary
    relationships
  • Have existed longer than sequencing methods
  • Previously based on morphological characters
  • Still partly so today, at least for checking
  • Now mainly based on biological sequences
  • DNA or protein
  • Base phylogeny on mutations

6
Morphological tree
7
Modern tree
  • [Figure: tree built from the sequence characters A, A, G, C, G, with two changes marked X]
8
Some pitfalls
  • Determining phylogeny is important for
    understanding biology
  • But also a very difficult problem
  • Beware of incorrect trees
  • Important to understand models and methods
  • The programs are helpful tools
  • The result is only as good as the alignment

9
Assumptions
  • Basic concepts of evolutionary theory
  • Relation to common ancestor
  • Phylogeny represented by a bifurcating tree
  • Mutations occur over evolutionary time
  • Necessary to make phylogenetic inference possible

10
Tree of Life
11
Interpretation
  • Know your model
  • Both the evolutionary model and the tree construction model
  • Know the assumptions of the model
  • Do sites evolve independently? Identically?
    Is evolution the same for all sequences?
  • Are the sequences correct?
  • And are they representative?
  • And are they homologous?
  • Is the multiple alignment correct?
  • What you get out is no better than what you put in

12
Some biological pitfalls
  • Don't make hasty conclusions!
  • Does your tree contradict common sense?
  • Then it's probably wrong!
  • Differentiate between the homologs
  • Orthologs
  • Speciation, common ancestor, similar function
  • Paralogs
  • Gene duplication, within one organism, differing
    functions
  • Xenologs
  • Horizontal gene transfer, hard to tell, similar
    function

13
Software
  • Today we'll look at the programs before the
    methods
  • Some programs for phylogenetic analysis
  • A multiple alignment program
  • Clustal, T-Coffee, MAFFT, Muscle
  • A phylogenetic program
  • Phylip, PAUP, MacClade, BioEdit
  • Visualizing the tree
  • TreeView, NJplot

14
PAUP
  • Commercial package
  • Apparently good
  • Many different tree-building and analysis methods
  • But we don't own a copy
  • Similarly MacClade only works on Macintosh

15
PHYLIP
  • Free package
  • Many programs
  • Both distance and character based
  • Bootstrapping possible
  • But
  • It can be a little difficult
  • No graphical user interface
  • And you will need to run many programs

16
BioEdit
  • Has phylogeny methods built in
  • Can call Phylip routines
  • No need for you to learn the command line
  • But no bootstrapping (as far as I know)
  • Point and click
  • Select the sequences in the alignment
  • Choose the wanted phylogeny
  • Voila!

17
PhyloWin
  • Another free program
  • Simple, not many possibilities
  • But you can do bootstrapping

18
Getting the software
  • Install BioEdit, PHYLIP, PhyloWin and NJplot
  • Links on the wiki

19
Constructing a tree
  • To make a phylogenetic tree, four steps are
    needed
  • Perform multiple alignment
  • Choose your model
  • Build the tree
  • Evaluate the quality
  • A brief note
  • Ideally: Alignment and phylogenetic inference
    done in parallel
  • Very difficult, but it has been pursued

20
1) The multiple alignment
  • Already discussed
  • Some notes
  • Recall that MA programs are not exact
  • Some manual editing often necessary
  • Consider the algorithm used
  • Does it consider the phylogeny of the data?
  • Clustal's guide tree: Not a correct phylogeny
  • What parameters are used?
  • Resolve ambiguities, remove near-identical
    sequences
  • Gappy regions and identical sequences can bias
    the result

21
2) The model
  • The model describes the data
  • Evolutionary events
  • Overall mutability
  • Evolutionary model?
  • Crucial both for alignment and tree building
  • Are you looking at nucleotides or amino acids?
  • Where do we get most information?
  • Know the basis for the chosen model

22
Nucleotide models
  • Create a 4×4 matrix
  • Either fixed cost
  • Character state
  • Or rate matrices
  • Probabilities
  • Used for different kinds of tree estimations
  • Include site specific information
  • Third codon position more variable

23
Nucleotide model 1
  • Fixed cost for transitions and transversions
  • E.g. transversions are twice as costly as
    transitions
  • For a tree: Count the number of
    transitions/transversions
  • Calculate cost
  • Tends to minimize the number of
    transversions
  • Clusters transitions

       A  C  G  T
    A  -  2  1  2
    C  2  -  2  1
    G  1  2  -  2
    T  2  1  2  -
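
The cost table above can be expressed as a small function. This is only a sketch: the names `substitution_cost` and `path_cost` are made up for illustration, and the costs 1 and 2 are the example values from the slide.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}

def substitution_cost(x, y, transition=1, transversion=2):
    """Cost of changing nucleotide x into y under the fixed-cost model."""
    if x == y:
        return 0
    # A transition stays within purines or within pyrimidines
    same_class = {x, y} <= PURINES or {x, y} <= PYRIMIDINES
    return transition if same_class else transversion

def path_cost(states):
    """Total cost along a chain of states, e.g. ancestor to descendant."""
    return sum(substitution_cost(a, b) for a, b in zip(states, states[1:]))

print(substitution_cost("A", "G"))  # transition
print(substitution_cost("A", "C"))  # transversion
```
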
24
Nucleotide model 2
  • Simple substitution rate matrix
  • Assume same rates A→B and B→A
  • Assume all mutations equally likely: Rate a
  • The Jukes-Cantor model

       A    C    G    T
    A  -3a  a    a    a
    C  a    -3a  a    a
    G  a    a    -3a  a
    T  a    a    a    -3a
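
For this rate matrix the substitution probabilities over a time t have a known closed form. A minimal sketch, assuming the rate a from the slide and a hypothetical function name `jc_probability`:

```python
import math

def jc_probability(x, y, a, t):
    """P(state y after time t | state x at time 0) under Jukes-Cantor.

    a is the rate of each specific substitution (the off-diagonal entry).
    """
    decay = math.exp(-4.0 * a * t)
    if x == y:
        return 0.25 + 0.75 * decay
    return 0.25 - 0.25 * decay

# Each row of probabilities sums to one, and for long times every
# nucleotide becomes equally likely (probability 1/4).
row = [jc_probability("A", y, a=1.0, t=0.1) for y in "ACGT"]
print(sum(row))
```
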
25
Nucleotide model 3
  • More advanced rate matrix
  • Include transitions/transversions
  • Rates a1 (transitions) and a2 (transversions)
  • The Kimura 2-parameter model

       A           C           G           T
    A  -(a1+2a2)   a2          a1          a2
    C  a2          -(a1+2a2)   a2          a1
    G  a1          a2          -(a1+2a2)   a2
    T  a2          a1          a2          -(a1+2a2)
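
A short sketch of how such a rate matrix can be built so that every row sums to zero. The function name `k2p_rate_matrix` and the nested-dict layout are illustrative choices, not part of the lecture.

```python
NUCLEOTIDES = "ACGT"
TRANSITIONS = {frozenset("AG"), frozenset("CT")}

def k2p_rate_matrix(a1, a2):
    """Rate matrix as nested dicts: a1 for transitions, a2 for transversions."""
    q = {}
    for x in NUCLEOTIDES:
        row = {y: (a1 if frozenset((x, y)) in TRANSITIONS else a2)
               for y in NUCLEOTIDES if y != x}
        row[x] = -sum(row.values())  # diagonal makes the row sum to zero
        q[x] = row
    return q

q = k2p_rate_matrix(a1=0.2, a2=0.1)
print(q["A"]["G"], q["A"]["C"], q["A"]["A"])
```

Setting a1 equal to a2 gives back the Jukes-Cantor matrix with diagonal -3a.
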
26
Amino acid models
  • A 20×20 substitution matrix
  • The BLOSUM matrices
  • Fixed cost matrices
  • Or the PAM matrices
  • Rate matrices
  • Described last week

27
3) Building the tree
  • We have the sequences, the alignment and the
    model
  • Find the best tree
  • What is the best tree?
  • Two main strategies
  • Distance based
  • Look at dissimilarities (distances)
  • Character based
  • Look at the data

28
Problems with trees
  • The number of possible trees grows faster than
    exponentially
  • For 15 taxa: 2.13×10^14 possible rooted trees
  • How to search?
  • Branch and Bound
  • Branch swapping
  • Rooting the tree
  • Not a simple problem
  • All the following methods produce unrooted trees
  • Use an outgroup
  • Midpoint of longest branch
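
The tree count quoted for 15 taxa can be checked with the standard formulas for bifurcating trees: (2n-3)!! rooted and (2n-5)!! unrooted trees for n taxa. The function names below are illustrative.

```python
def double_factorial(k):
    """k!! = k * (k-2) * (k-4) * ... down to 1 (or 2)."""
    result = 1
    while k > 1:
        result *= k
        k -= 2
    return result

def rooted_trees(n):
    """Number of rooted bifurcating trees for n >= 2 labelled taxa."""
    return double_factorial(2 * n - 3)

def unrooted_trees(n):
    """Number of unrooted bifurcating trees for n >= 3 labelled taxa."""
    return double_factorial(2 * n - 5)

print(rooted_trees(15))    # 213458046676875, about 2.13 x 10^14
print(unrooted_trees(15))  # 7905853580625, about 7.9 x 10^12
```
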

29
Distance methods
  • Some sequences more similar than others
  • Closely related sequences should be close in the
    tree
  • Abstract view on the data
  • Loss of information is usually a bad sign
  • Only use the distances between sequences
  • Recall Clustal
  • All methods start with a distance matrix

30
Distance methods
  • Can we get the correct answer?
  • Yes, if all mutation events were observed
  • But: After one mutation, the site is saturated
  • Additional mutations do not give additional info
  • A → B → C: Two mutations (distance 2)
  • A vs. C: Observed distance 1
  • And back mutations will fool the method
  • A → B → A: Two mutations (distance 2)
  • A vs. A: Observed distance 0
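
One common way to compensate for hidden multiple mutations is the Jukes-Cantor distance correction, d = -3/4 ln(1 - 4p/3), where p is the observed fraction of differing sites. The lecture does not say which correction the exercises use, so the sketch below is only an illustration with made-up function names.

```python
import math

def p_distance(s1, s2):
    """Observed fraction of aligned sites that differ."""
    return sum(a != b for a, b in zip(s1, s2)) / len(s1)

def jc_distance(p):
    """Jukes-Cantor estimate of substitutions per site from the observed p."""
    if p >= 0.75:
        raise ValueError("sites are saturated; the distance is undefined")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)

p = p_distance("ACGTACGTAC", "ACGTTCGTAA")
print(p, jc_distance(p))  # the corrected distance is larger than p
```
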

31
UPGMA
  • Unweighted Pair Group Method with Arithmetic Mean
  • Unweighted: The distances are used as they are
  • Pair: Find the two closest elements
  • Group: Put them together in a new group
  • Arithmetic Mean: Gives distances from the new
    group
  • Gives the correct tree assuming a molecular clock
  • Evolutionary divergence time can be found from
    mutations
  • Mutation rates are constant

32
UPGMA illustrated
  • Find two closest: A and D
  • Create a new group AD
  • Update distances

        A  B  C  D  E
     A  -  8  3  2  5
     B  -  -  5  6  6
     C  -  -  -  7  5
     D  -  -  -  -  3
     E  -  -  -  -  -

        AD  B  C  E
     AD  -  7  5  4
     B   -  -  5  6
     C   -  -  -  5
     E   -  -  -  -

  • Repeat for all sequences
  • Next time: Connect AD with E
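
The two steps shown above can be reproduced with a short sketch of UPGMA. The function name `upgma` and the data layout are illustrative. The averaging is weighted by group size, which is the usual UPGMA rule and coincides with a plain mean when two single sequences are joined, as in the first step here.

```python
from itertools import combinations

def upgma(names, pair_distances):
    """Return the merges in order as (cluster_1, cluster_2, distance)."""
    clusters = [(name,) for name in names]  # each cluster is a tuple of leaves
    d = {frozenset(((a,), (b,))): v for (a, b), v in pair_distances.items()}
    merges = []
    while len(clusters) > 1:
        # Pair: find the two closest clusters
        a, b = min(combinations(clusters, 2), key=lambda pair: d[frozenset(pair)])
        merges.append((a, b, d[frozenset((a, b))]))
        # Group: join them. Arithmetic mean: average over all member pairs
        new = a + b
        rest = [c for c in clusters if c not in (a, b)]
        for c in rest:
            d[frozenset((new, c))] = (
                len(a) * d[frozenset((a, c))] + len(b) * d[frozenset((b, c))]
            ) / len(new)
        clusters = rest + [new]
    return merges

slide_distances = {
    ("A", "B"): 8, ("A", "C"): 3, ("A", "D"): 2, ("A", "E"): 5,
    ("B", "C"): 5, ("B", "D"): 6, ("B", "E"): 6,
    ("C", "D"): 7, ("C", "E"): 5, ("D", "E"): 3,
}
merges = upgma("ABCDE", slide_distances)
print(merges[0])  # A and D are joined first, at distance 2
print(merges[1])  # then E joins AD, at distance 4.0
```

Later merges can involve ties, and how ties are broken depends on the implementation.
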

33
Trying UPGMA
  • Go to the wiki and do the UPGMA exercise

34
Neighbour joining
  • A little like UPGMA
  • Difference: NJ does not assume a molecular clock
  • But it assumes an additive tree
  • Distance between two leaves is the sum of the
    edges
  • Find the closest pair that is most apart from the
    rest of the tree
  • Connect pair and update distances
  • A little advanced: Take the overall distance to
    the rest of the tree into account
  • Corrects for varying mutation rates
  • Fast and can give good results
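
The pair selection step can be sketched with the standard neighbour joining criterion Q(i,j) = (n-2)·d(i,j) - Σ d(i,k) - Σ d(j,k). The example distances are made up: they come from a hypothetical tree ((A,B),(C,D)) with branch lengths A=1, B=3, C=1, D=3 and an internal branch of 1, so B and D evolve faster than A and C.

```python
from itertools import combinations

def nj_pick_pair(names, d):
    """Return the pair of taxa that neighbour joining would connect first."""
    n = len(names)
    # Overall distance from each taxon to the rest of the tree
    total = {i: sum(d[frozenset((i, k))] for k in names if k != i) for i in names}

    def q(pair):
        i, j = pair
        return (n - 2) * d[frozenset(pair)] - total[i] - total[j]

    return min(combinations(names, 2), key=q)

example = {
    frozenset("AB"): 4, frozenset("AC"): 3, frozenset("AD"): 5,
    frozenset("BC"): 5, frozenset("BD"): 7, frozenset("CD"): 4,
}
closest = min(combinations("ABCD", 2), key=lambda p: example[frozenset(p)])
print(closest)                        # ('A', 'C'): smallest raw distance
print(nj_pick_pair("ABCD", example))  # ('A', 'B'): a true neighbour pair
```

Joining the closest pair directly, as UPGMA does, would group A with C here. The correction for the distance to the rest of the tree recovers a true neighbour pair.
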

35
Fitch-Margoliash
  • FM method
  • We have the pairwise distances
  • Each branch in the tree has a length
  • The length of all paths can be found
  • Optimize tree by moving internal nodes around
  • The best fit minimizes the overall error
  • The minimum squared deviation

36
Minimum Evolution
  • The ME method
  • Find the shortest tree
  • Count number of changes
  • Similar to FM but only looks at branches
  • [Figure: FM and ME compared on the connection between A and B]
37
Trying NJ
  • Go to the wiki and do the NJ exercise

38
Character methods
  • Use the data (the actual characters)
  • All information at hand
  • More advanced, slower, but also more accurate
  • Maximum Parsimony (MP)
  • Occam's razor: Simplest explanation
  • Maximum Likelihood (ML)
  • Advanced statistical method
  • Most probable tree given the data and the model

39
Maximum parsimony
  • How does evolution work?
  • Assumption: Path of least resistance
  • True evolution gives rise to fewest changes
  • The tree we want
  • Describe the given sequences by fewest changes
  • The ancestral nodes must be as similar as
    possible
  • Predict a tree
  • Count the number of changes needed

40
MP illustrated
  • [Figure: tree with leaves A, C, G, G, C. Internal nodes carry the state sets {A,C}, {G} and {A,C,G}. The root is C]

41
MP illustrated
  • [Figure: the same tree with leaves A, C, G, G, C and node sets {A,C}, {G}, {A,C,G}, root C. The two changes are marked X]
  • Cost: 2 changes

42
MP illustrated
  • [Figure: the same tree with the internal nodes resolved to C, G, C and root C]
  • Changes: C→A and C→G

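
The counting in the MP illustration can be sketched with Fitch's small parsimony algorithm: take the intersection of the child sets if it is non-empty, otherwise take the union and count one change. The tree shape (((A,C),(G,G)),C) is an assumption read off the node sets on the slide.

```python
def fitch(tree):
    """Return (state_set, changes) for a nested-tuple tree of leaf characters."""
    if isinstance(tree, str):
        return {tree}, 0
    left, right = tree
    left_states, left_changes = fitch(left)
    right_states, right_changes = fitch(right)
    common = left_states & right_states
    if common:
        # The children agree on at least one state: no extra change
        return common, left_changes + right_changes
    # No shared state: keep the union and count one change
    return left_states | right_states, left_changes + right_changes + 1

slide_tree = ((("A", "C"), ("G", "G")), "C")  # assumed topology
states, changes = fitch(slide_tree)
print(states, changes)  # {'C'} 2
```
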
43
Maximum Likelihood
  • Given the data, predict the most probable model
  • Can optimize both tree and substitution model
  • We know the sequences
  • What are the most likely substitution rates?
  • Estimate from the alignment (and the phylogeny)
  • And what is the most likely tree?
  • Estimate from alignment and substitution rates
  • Computationally heavy and rather slow
  • Normally good results

44
Maximum Likelihood
  • General practice: Optimize model, then tree
  • Calculate probability for each alignment column
  • Combine to probability for entire alignment
  • Averages over low and high probability sites
  • Likelihood of column given tree

  • [Figure: alignment of three sequences (AACAA, AACCA, AACGA). The likelihood L combines the column probabilities P]

45
Maximum likelihood
  • Then repeat this for all possible tree topologies
  • And all possible assignments to internal nodes
  • And then choose the combination that gives the
    highest probability
  • Clearly very difficult
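
A toy version of the column calculation, under strong assumptions that are not from the lecture: a star tree with one internal node, the same branch length t to every leaf, and the Jukes-Cantor model with rate a. It sums over all assignments to the single internal node and combines the columns by adding log probabilities.

```python
import math
from itertools import product

def jc_probability(x, y, a, t):
    """P(y after time t | x) under Jukes-Cantor with substitution rate a."""
    decay = math.exp(-4.0 * a * t)
    return 0.25 + 0.75 * decay if x == y else 0.25 - 0.25 * decay

def column_likelihood(column, a=1.0, t=0.1):
    """Likelihood of one alignment column: sum over the ancestral state."""
    return sum(
        0.25 * math.prod(jc_probability(x, leaf, a, t) for leaf in column)
        for x in "ACGT"
    )

def log_likelihood(alignment, a=1.0, t=0.1):
    """Combine the columns: the product of probabilities is a sum of logs."""
    return sum(math.log(column_likelihood(col, a, t)) for col in zip(*alignment))

alignment = ["AACAA", "AACCA", "AACGA"]
print(log_likelihood(alignment))
print(column_likelihood("AAA") > column_likelihood("ACG"))  # True
```

A real program repeats this for every candidate topology and set of branch lengths, which is what makes the method slow.
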

46
MP and ML exercise
  • Go to the wiki and do the MP and ML exercises

47
Summary of methods
                          Distance based                     Character based
    Clustering            UPGMA, Neighbour Joining
    Optimality criterion  Least Squares, Minimum evolution   Maximum parsimony, Maximum likelihood (Bayesian statistics)
48
The differences
  • Sometimes the differences can seem minimal
  • They affect the tree but the same result is
    possible
  • UPGMA and NJ
  • Minimize the overall length of the tree
  • Maximum parsimony
  • Finds tree with fewest changes
  • Maximum likelihood
  • Finds the tree that maximizes the probability of
    the data (the likelihood)

49
4) Evaluating trees
  • How good is the predicted tree?
  • Some sequence variation needed
  • Is the signal strong enough?
  • There are so many possible trees
  • Are there many trees similar to the prediction?
  • Which one to choose?
  • Is the tree robust?
  • Does it change much when e.g. removing a sequence?

50
Randomization
  • Is it possible that the tree is just random?
  • Permute the columns of the alignment
  • i.e. shuffle the characters in a column
  • Build a new tree
  • Is it (partly) identical?
  • If the tree is just as likely to be random, then
    don't put too much faith in it
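
The permutation step might look like the sketch below. The function name `shuffle_columns` is hypothetical, and the alignment is assumed to be a list of equal-length strings.

```python
import random

def shuffle_columns(alignment, rng=None):
    """Shuffle the characters within each column of an alignment."""
    rng = rng or random.Random()
    columns = [list(col) for col in zip(*alignment)]
    for col in columns:
        rng.shuffle(col)  # same characters, random assignment to sequences
    return ["".join(row) for row in zip(*columns)]

alignment = ["AGGCUCCAAA", "AGGUUCGAAA", "AGCCCCGAAA", "AUUUCCGAAC"]
randomized = shuffle_columns(alignment, random.Random(0))
print(randomized)
```

Each column keeps its composition, but the link between a character and its sequence is broken, so a tree built from the result reflects noise only.
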

51
Bootstrapping
  • The story of Baron von Münchausen
  • He pulled himself out of a swamp by his
    bootstraps
  • The idea: Evaluate the quality of the result
    using the same data all over again
  • Make a large number of new datasets
  • Create phylogenetic tree
  • Observe the number of times clades are made

52
Bootstrapping
  • The datasets should be similar
  • Thereby: The trees are comparable
  • Alignments of same size (length and sequences)
  • Non-parametric: Sample with replacement
  • Choose a random column and add it to the new
    alignment
  • Parametric: Simulate new datasets
  • Use a model that looks like your data
  • Characteristics are preserved (unlike
    randomization)
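
Non-parametric resampling can be sketched in a few lines. The function name `bootstrap_sample` is hypothetical, and the alignment is assumed to be a list of equal-length strings.

```python
import random

def bootstrap_sample(alignment, rng=None):
    """Draw columns with replacement until the new alignment has the same length."""
    rng = rng or random.Random()
    length = len(alignment[0])
    picks = [rng.randrange(length) for _ in range(length)]
    # The same column indices are used for every sequence, so columns stay intact
    return ["".join(seq[i] for i in picks) for seq in alignment]

alignment = ["AGGCUCCAAA", "AGGUUCGAAA", "AGCCCCGAAA", "AUUUCCGAAC"]
for _ in range(3):
    print(bootstrap_sample(alignment))
```

Because whole columns are copied, the characteristics of the data are preserved, unlike in the randomization test.
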

53
Bootstrap example
  • Non-parametric bootstrapping
  • We have an alignment
  • A: A G G C U C C A A A
  • B: A G G U U C G A A A
  • C: A G C C C C G A A A
  • D: A U U U C C G A A C
  • Times each column is sampled: 0 1 2 0 3 0 1 2 0 1
  • Sample columns
  • A: G G G U U U C A A A
  • B: G G G U U U G A A A
  • C: G C C C C C G A A A
  • D: U U U C C C G A A C

  Distances in the sample (number of differing sites):

        A  B  C  D
     A  -  -  -  -
     B  1  -  -  -
     C  6  5  -  -
     D  8  7  4  -

  [Figure: tree for A, B, C, D built from this matrix]
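
The distances for a sample can be recomputed by counting mismatches site by site. A small sketch using the first sampled alignment from the slide (the function names are illustrative):

```python
from itertools import combinations

def hamming(s1, s2):
    """Number of aligned sites at which two sequences differ."""
    return sum(a != b for a, b in zip(s1, s2))

def distance_matrix(seqs):
    """Pairwise distances for a dict mapping sequence name to sequence."""
    return {(x, y): hamming(seqs[x], seqs[y]) for x, y in combinations(seqs, 2)}

sample1 = {
    "A": "GGGUUUCAAA",
    "B": "GGGUUUGAAA",
    "C": "GCCCCCGAAA",
    "D": "UUUCCCGAAC",
}
for pair, dist in distance_matrix(sample1).items():
    print(pair, dist)
```

Counting position by position, A and C differ at six sites (positions 2 to 7).
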
54
Bootstrap example
  • Sample 2
  • A: A U U C C C C A A A
  • B: A U U C C G G A A A
  • C: A C C C C G G A A A
  • D: A C C C C G G C C C

        A  B  C  D
     A  -  -  -  -
     B  2  -  -  -
     C  4  2  -  -
     D  7  5  3  -

  [Figure: tree for A, B, C, D built from this matrix]
55
Bootstrap example
  • Sample 3
  • A: A C C C A A G G C C
  • B: A C C G A A G G U U
  • C: A C C G A A C C C C
  • D: A C C G C C U U U U

        A  B  C  D
     A  -  -  -  -
     B  3  -  -  -
     C  3  4  -  -
     D  7  4  6  -

  [Figure: tree for A, B, C, D built from this matrix]
56
Bootstrap example
  • Calculate consensus tree
  • Can be done in many ways
  • Put the bootstrap number at each branch point
  • The proportion of times this branch is observed
  • Of course, more than three samples are needed

  [Figure: consensus tree for A, B, C, D with bootstrap values 1.0 and 0.66 at the branch points]
57
Bootstrapping exercise
  • Do the bootstrapping exercise on the wiki

58
Summary
  • What is phylogenetic inference?
  • What can a phylogenetic tree be used for?
  • Be aware of the multiple alignment
  • The different models
  • Tree building methods: NJ, UPGMA, ML and MP
  • Evaluating trees: Bootstrapping
  • Programs: PHYLIP, PAUP, PhyloWin and BioEdit
  • Next time: Gene finding (with Anders Krogh)
  • Then: RNA structure prediction with me again