IE68 - Biological databases: Phylogenetic analysis

Phylogenetic analysis

- Phylogeny
- a reconstruction of the evolutionary (genealogical) history of a group of organisms, genes or proteins from biological data
- organisms: populations, species, genera, ... => taxa => operational taxonomic units (OTUs)
- data: molecular, morphological, archaeological, ... => characters
- Phylogenetic tree
- the graphical reconstruction of a phylogeny
- tree structure: phylogram, cladogram

Phylogenetic tree

A tree consists of nodes connected by branches.

- terminal nodes A, B, C, D, E => OTUs for which we have data
- root (placed by outgroup or midpoint) => ancestor of all the taxa that comprise the tree
- polytomy: a node with more than two descendant branches
- notation: ((A,B),(C,D,E))
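
The parenthesised notation can be read mechanically. The following is a minimal sketch, assuming a simplified form of the notation (no spaces, no branch lengths, well-formed input); the function names `parse_tree`, `leaves` and `polytomies` are illustrative and not part of the lecture.

```python
def parse_tree(text):
    """Parse nested-parenthesis notation such as "((A,B),(C,D,E))" into nested lists."""
    pos = 0

    def node():
        nonlocal pos
        if text[pos] == "(":
            pos += 1                       # consume "("
            children = [node()]
            while text[pos] == ",":
                pos += 1                   # consume ","
                children.append(node())
            pos += 1                       # consume ")"
            return children
        start = pos
        while pos < len(text) and text[pos] not in ",()":
            pos += 1
        return text[start:pos]             # a leaf label (an OTU)

    return node()


def leaves(tree):
    """Return the OTU labels at the tips of the tree, left to right."""
    if isinstance(tree, str):
        return [tree]
    return [label for child in tree for label in leaves(child)]


def polytomies(tree):
    """Return the leaf sets of internal nodes that have more than two children."""
    if isinstance(tree, str):
        return []
    found = [leaves(tree)] if len(tree) > 2 else []
    for child in tree:
        found.extend(polytomies(child))
    return found


tree = parse_tree("((A,B),(C,D,E))")
print(leaves(tree))       # the OTUs
print(polytomies(tree))   # nodes with more than two descendants
```

Nested lists are used here only because they mirror the notation one to one; a real analysis would normally rely on an existing tree library.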

Phylogenetics <=> Phenetics

- Phenetics: method of grouping taxa that is based on overall (dis)similarities of characters => with no reference to evolution!
- Phylogenetics: method of grouping taxa that is based on shared derived characters (synapomorphies) or a model of evolution

Why do we need phylogenies?

- Intrinsic interest in the tree => tree of life
- origin of organisms

Why do we need phylogenies?

- Phylogenies can also be used as tools for investigating other problems
- e.g. biogeography
- phylogeny reflects the order of separation of the areas the different taxa occupy

Why do we need phylogenies?

- Phylogenies can also be used as tools for investigating other problems
- e.g. forensic science

Phylogenetic analysis

- Molecular phylogenetics
- reconstruction of the evolutionary (genealogical) history of a group of organisms from molecular data, i.e. DNA or protein sequences
- In this lecture, we will focus on phylogenetic analysis of organisms based on DNA sequence data

Molecular phylogenetics approach

- Step 1: PCR with primers that target cytoplasmic DNA or nuclear loci of taxa, followed by DNA sequence analysis
- Step 2: Multiple DNA sequence alignment
- Step 3: Phylogenetic analysis

PCR and DNA sequencing

- Which loci?
- DNA sequence information, primers, variability, single or low-copy, orthologous, neutral, recombination, ...
- Gene trees versus organismal trees
- phylogenies for genes do not always match those of their corresponding organisms => analyse more than one gene

Confounding influence of gene duplication

2 types of homology: orthology (speciation) and paralogy (gene duplication)

Lineage sorting and coalescence

[Figure labels: species, alleles]

Multiple DNA sequence alignment

- Problem: alternative alignments
- possible to align any two sequences by postulating some combination of gaps (insertions/deletions: indels) and substitutions
- => which one to choose?
- Basic task of sequence alignment is to find the alignment with the highest similarity, smallest distance, or lowest overall cost

Multiple DNA sequence alignment

- 2 sequences + scoring scheme => optimal alignment
- Scoring scheme
- scoring matrix: distance weights or similarity scores for each pair of aligned bases, e.g. transition/transversion matrix

       A  T  G  C
    A  0  5  1  5
    T  5  0  5  1
    G  1  5  0  5
    C  5  1  5  0

- gap weight, cost or penalty

Multiple DNA sequence alignment

- Cost of an alignment: $D = s + wg$
- $s$ = number of substitutions, $g$ = total length of gaps
- $w$ = gap penalty: cost of a gap relative to a substitution
- Gap penalty $w$ makes implicit assumptions about how the sequences have evolved
- if indels are thought to be rare, then $w$ should be large (and vice versa)
- => have to use knowledge of biology, e.g. translation (3 bp indel, position), transition <=> transversion, ...
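
The cost formula can be tried out on two alternative alignments of the same pair of sequences. This is a minimal sketch, assuming every substitution costs 1 and gaps are written as "-"; the sequences and the function name `alignment_cost` are made up for illustration.

```python
def alignment_cost(seq_a, seq_b, w):
    """Cost D = s + w*g of a pairwise alignment given as two gapped strings."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    s = 0  # number of substitutions (mismatched bases)
    g = 0  # total length of gaps (columns containing "-")
    for a, b in zip(seq_a, seq_b):
        if a == "-" or b == "-":
            g += 1
        elif a != b:
            s += 1
    return s + w * g


# Two hypothetical alignments of ACGTTA and CGTTAC.
ungapped = ("ACGTTA", "CGTTAC")      # 5 substitutions, no gaps
gapped = ("ACGTTA-", "-CGTTAC")      # 0 substitutions, 2 gap positions

for w in (1, 3):
    print(w, alignment_cost(*ungapped, w), alignment_cost(*gapped, w))
```

With a small gap penalty the gapped alignment is cheaper; with a large one the ungapped alignment wins, which is the sense in which the choice of $w$ encodes an assumption about how rare indels are.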

Multiple DNA sequence alignment

- Software programs, e.g. CLUSTALW (global alignment)
- http://www.ebi.ac.uk/clustalw/index.html
- The optimal alignment is not always the true alignment => new developments: phylogenetic analysis without the multiple DNA sequence alignment step

Inferring phylogenies from DNA sequences

Sequence alignment
A ..AGCGTCT..
B ..AGCGTGT..
C ..AGGAGT..

[Figure: tree connecting A, B and C]

Phylogenetic methods

[Figure: a data matrix of taxa (A, B, C) by characters is turned into either an unrooted tree or a rooted tree]

Phylogenetic methods

                                                    | Character-based methods    | Non character-based methods
Methods based on an explicit model of evolution     | Maximum-likelihood methods | Pairwise distance methods
Methods not based on an explicit model of evolution | Maximum parsimony methods  |

Pairwise distance methods

3 taxa, 3 sequences

- Dissimilarity matrix: count the number of differences between all possible pairs of sequences
- Convert dissimilarity to evolutionary distance by correcting for multiple events per site according to a certain model of evolution
- Infer tree topology on the basis of the evolutionary distances by using a clustering algorithm or optimality criterion

Dissimilarity matrix:

       1     2     3
    1
    2  0.26
    3  0.20  0.33

Evolutionary distance matrix:

       1     2     3
    1
    2  0.32
    3  0.23  0.44

=> tree
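
The first step, counting differences between all pairs of sequences, might look as follows. This is a minimal sketch: the three aligned sequences are invented (they are not the ones behind the matrices above), columns containing a gap are simply skipped, and the function names are illustrative.

```python
from itertools import combinations


def p_distance(seq_a, seq_b):
    """Proportion of aligned sites at which two sequences differ (gap columns skipped)."""
    pairs = [(a, b) for a, b in zip(seq_a, seq_b) if a != "-" and b != "-"]
    if not pairs:
        raise ValueError("no comparable sites")
    return sum(a != b for a, b in pairs) / len(pairs)


def dissimilarity_matrix(alignment):
    """Map each unordered pair of taxon names to its proportion of differing sites."""
    return {(x, y): p_distance(alignment[x], alignment[y])
            for x, y in combinations(alignment, 2)}


# Hypothetical aligned sequences of 10 sites each.
alignment = {
    "1": "ACGTACGTAC",
    "2": "ACGTACGTTT",
    "3": "ACGAACGTAC",
}
matrix = dissimilarity_matrix(alignment)
for pair, p in matrix.items():
    print(pair, round(p, 2))
```

Only the lower triangle is stored because the dissimilarity between 1 and 2 equals that between 2 and 1.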

Models of sequence evolution

- expected difference (linear) ≠ observed difference (not linear) => correction
- Apply a substitution model that tries to estimate the correct number of substitutions

Models of sequence evolution

- Distance correction methods convert observed distances into measures that correspond to ACTUAL distances
- Several methods have been proposed, all with different assumptions about the nature of the evolutionary process
- Essentially they differ by the number of parameters they include
- We can use a general framework to show how these models are inter-related

Substitution models: general framework

e.g. Jukes-Cantor model (JC)

- One of the first proposed, and perhaps the simplest, model of evolution
- Assumes that all four bases have equal frequency and that all substitutions are equally likely
- Under this model, the distance between any two sequences is given by $d = -\frac{3}{4}\ln\left(1 - \frac{4}{3}p\right)$, where $p$ is the proportion of nucleotides that are different in the two sequences
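
A short numerical check of the Jukes-Cantor correction, as a sketch only; the function name `jukes_cantor` and the chosen values of p are illustrative.

```python
import math


def jukes_cantor(p):
    """Jukes-Cantor distance d = -3/4 * ln(1 - 4/3 * p) for an observed proportion p."""
    if not 0 <= p < 0.75:
        raise ValueError("p must lie in [0, 0.75) for the correction to be defined")
    return -0.75 * math.log(1 - 4 * p / 3)


for p in (0.05, 0.20, 0.26, 0.50):
    print(p, round(jukes_cantor(p), 3))
```

The corrected distance is always at least as large as p, and the gap widens as p grows, because more and more substitutions are hidden by multiple changes at the same site. The correction is undefined once p reaches 0.75, the level of difference expected between two unrelated random sequences.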

e.g. Kimura 2-parameter model (K2P)

- incorporates the observation that transitions accumulate more rapidly than transversions
- assumes all four bases have equal frequencies but that there are 2 rate classes for substitutions
- Under this model, the distance between any two sequences is given by $d = \frac{1}{2}\ln\frac{1}{1-2P-Q} + \frac{1}{4}\ln\frac{1}{1-2Q}$, where $P$ and $Q$ are the proportional differences between the two sequences due to transitions and transversions, respectively
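
To apply the K2P formula, the differences first have to be split into transitions (purine to purine, pyrimidine to pyrimidine) and transversions. A minimal sketch, assuming the sequences contain only A, C, G, T and "-" and share at least one ungapped site; the example sequences and function names are made up.

```python
import math

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def transition_transversion_proportions(seq_a, seq_b):
    """Return (P, Q): proportions of sites differing by a transition / a transversion."""
    n = transitions = transversions = 0
    for a, b in zip(seq_a, seq_b):
        if a == "-" or b == "-":
            continue                      # skip gap columns
        n += 1
        if a == b:
            continue
        if {a, b} <= PURINES or {a, b} <= PYRIMIDINES:
            transitions += 1              # A<->G or C<->T
        else:
            transversions += 1            # purine <-> pyrimidine
    return transitions / n, transversions / n


def kimura_2p(P, Q):
    """K2P distance d = 1/2 ln[1/(1-2P-Q)] + 1/4 ln[1/(1-2Q)]."""
    return 0.5 * math.log(1 / (1 - 2 * P - Q)) + 0.25 * math.log(1 / (1 - 2 * Q))


# Hypothetical pair: one transition (A/G, site 1) and one transversion (G/T, site 10).
P, Q = transition_transversion_proportions("AGCTAGCTAG", "GGCTAGCTAT")
print(P, Q, round(kimura_2p(P, Q), 3))
```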

Substitution models

- Other models: adding more parameters
- Felsenstein model (F81)
- variation in base composition => base frequencies f = [πA, πC, πG, πT] may vary
- Hasegawa, Kishino, Yano (HKY) model
- unequal base frequency, transition/transversion
- General reversible model (REV)
- unequal base frequency, all six pairs of substitutions have different rates
- => ideally, we want the simplest model we can get away with that still yields a reasonable estimate

Substitution models

- Assumptions of these models
- all nucleotide sites change independently
- base composition equilibrium
- substitution rate is constant over time and in different lineages
- each site in a sequence is equally likely to undergo substitution => relaxed by a gamma distribution, which has a parameter that specifies the range of rate variation among sites: model + Γ

Clustering methods

- Clustering methods follow a set of steps (an algorithm) and arrive at a tree
- UPGMA (Unweighted Pair Group Method using Arithmetic Averages) results in a rooted and additive tree with molecular clock
- Neighbor-joining results in an unrooted and additive tree
- Other approaches: least-squares, Fitch, Kitsch, ...

UPGMA clustering

Distance matrix:

       A  B  C
    B  2          <= least differences
    C  4  4
    D  6  6  6

Join A and B; tree so far, with branch lengths: (A:1,B:1)

Compute new distances between (AB) and other OTUs:
d(AB)C = (dAC + dBC)/2 = 4
d(AB)D = (dAD + dBD)/2 = 6

UPGMA clustering

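Distance matrix:

       AB  C
    C  4
    D  6   6

Join (AB) and C; tree so far, with branch lengths: ((A:1,B:1):1,C:2)

Compute new distances between (ABC) and other OTUs, averaging over all members of the cluster:
d(ABC)D = (2·d(AB)D + dCD)/3 = (dAD + dBD + dCD)/3 = 6

Final tree, with branch lengths: (((A:1,B:1):1,C:2):1,D:3)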
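
The two clustering rounds above can be reproduced with a short program. This is a minimal sketch of UPGMA, assuming distances are supplied as a dictionary keyed by pairs of OTU names; the function name `upgma` and the bracketed output format with branch lengths after a colon are choices made for this example.

```python
def upgma(names, dist):
    """Cluster OTUs by UPGMA; return (tree_string, root_height)."""
    # Each cluster: key -> (tree text, number of OTUs, height above the tips)
    clusters = {n: (n, 1, 0.0) for n in names}
    d = {frozenset(k): v for k, v in dist.items()}
    while len(clusters) > 1:
        pair = min(d, key=d.get)                   # closest pair of clusters
        a, b = sorted(pair)
        text_a, size_a, height_a = clusters.pop(a)
        text_b, size_b, height_b = clusters.pop(b)
        height = d.pop(pair) / 2                   # new node lies halfway between a and b
        text = (f"({text_a}:{height - height_a:g},"
                f"{text_b}:{height - height_b:g})")
        key = a + b
        # Distance to every remaining cluster: size-weighted mean, which equals
        # the arithmetic average over all pairs of original OTUs.
        for other in clusters:
            d_a = d.pop(frozenset((a, other)))
            d_b = d.pop(frozenset((b, other)))
            d[frozenset((key, other))] = (size_a * d_a + size_b * d_b) / (size_a + size_b)
        clusters[key] = (text, size_a + size_b, height)
    (text, _, height), = clusters.values()
    return text + ";", height


# The distance matrix from the UPGMA slides.
distances = {("A", "B"): 2, ("A", "C"): 4, ("B", "C"): 4,
             ("A", "D"): 6, ("B", "D"): 6, ("C", "D"): 6}
tree, height = upgma(["A", "B", "C", "D"], distances)
print(tree, height)
```

Every tip ends up at the same distance from the root (3 in this example), which is the ultrametric property that follows from assuming a molecular clock.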

Clustering methods

- UPGMA: additive and ultrametric distances => assumes a molecular clock => very sensitive to unequal rates of evolution! => relative-rate test
- Use other clustering methods for phylogeny, e.g. Neighbor-joining
- Goodness-of-fit statistics to select the metric tree that best accounts for the observed distances

Minimum evolution

- Distance matrix => unrooted metric trees
- Each tree has a length L, which is the sum of all the branch lengths
- Optimality criterion: the minimum evolution tree (ME) is the tree which minimizes L

Pairwise distance method

- Advantages
- very fast
- based on a model of evolution
- Disadvantages
- sequence information is reduced to one number
- branch lengths may not be biologically interpretable
- most methods provide only one tree topology
- dependent on the model of evolution used

Character-based methods

- Character-based (discrete) methods operate directly on sequences, rather than on pairwise distances
- Two major discrete methods
- Maximum parsimony (MP): chooses the tree(s) that require the fewest evolutionary changes
- Maximum likelihood (ML): chooses the tree(s) most likely to have produced the observed data

Maximum parsimony

- Maximum parsimony infers a phylogenetic tree by minimizing the total number of evolutionary steps
- Principle
- Investigate all possible tree topologies
- Reconstruct ancestral sequences
- Choose topology with smallest number of steps
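
The three steps of the principle can be sketched for four taxa. The slides do not say how the steps are counted; the sketch below uses Fitch's algorithm with equal weights as one common choice. The alignment is invented, and each tree is written as nested tuples with an arbitrary root, which does not affect the step count.

```python
def fitch(tree, states):
    """Return (possible ancestral states, steps) for one site on a binary tree."""
    if isinstance(tree, str):                       # leaf: the observed base
        return {states[tree]}, 0
    left, right = tree
    set_l, steps_l = fitch(left, states)
    set_r, steps_r = fitch(right, states)
    common = set_l & set_r
    if common:                                      # children agree: no extra change
        return common, steps_l + steps_r
    return set_l | set_r, steps_l + steps_r + 1     # disagreement costs one step


def parsimony_length(tree, alignment):
    """Total number of steps over all sites of the alignment."""
    n_sites = len(next(iter(alignment.values())))
    return sum(fitch(tree, {taxon: seq[i] for taxon, seq in alignment.items()})[1]
               for i in range(n_sites))


# Hypothetical alignment of four taxa, four sites.
alignment = {"1": "ACAT", "2": "ACGT", "3": "GTAT", "4": "GTGT"}

# The three possible topologies for four taxa.
topologies = {
    "((1,2),(3,4))": (("1", "2"), ("3", "4")),
    "((1,3),(2,4))": (("1", "3"), ("2", "4")),
    "((1,4),(2,3))": (("1", "4"), ("2", "3")),
}
lengths = {name: parsimony_length(t, alignment) for name, t in topologies.items()}
print(lengths)
print(min(lengths, key=lengths.get))   # the most parsimonious topology
```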

Maximum parsimony - principle

[Figure: the possible tree topologies A, B and C for four taxa (1, 2, 3, 4)]

Maximum parsimony - generalized

- In the previous example, the cost of each substitution was one step => equal weight
- Instead, we can use different costs for different types of change (e.g. transitions vs transversions) to better match our assumptions about evolutionary processes => weighted parsimony
- parsimony according to Dollo, Wagner, Fitch, ...

Maximum parsimony - characters

Maximum parsimony search methods

- Number of unrooted tree topologies: $N_u = \frac{(2n-5)!}{2^{n-3}(n-3)!}$, i.e. 3 sequences: 1 tree, 4 sequences: 3 trees, 5 sequences: 15 trees, 6 sequences: 105 trees => the more sequences (= taxa), the more trees => computationally expensive
- Finding optimal trees
- Exhaustive search: limited number of taxa (<10); find the minimum tree of all possible trees
- Branch and bound: small number of taxa (<18); find the minimum tree without evaluating all trees, by discarding families of trees during tree construction that cannot be shorter than the shortest tree found so far
- Heuristic search: large number of taxa
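
The growth in the number of topologies is easy to tabulate. A minimal sketch of the counting formula for unrooted bifurcating trees; the function name `n_unrooted_trees` is illustrative.

```python
from math import factorial


def n_unrooted_trees(n):
    """Number of unrooted bifurcating topologies for n >= 3 taxa."""
    if n < 3:
        raise ValueError("need at least 3 taxa")
    return factorial(2 * n - 5) // (2 ** (n - 3) * factorial(n - 3))


for n in (3, 4, 5, 6, 10, 20):
    print(n, n_unrooted_trees(n))
```

Already at 10 taxa there are over two million topologies, which is why an exhaustive search is only practical for very small data sets.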

Maximum parsimony search methods

- Heuristic search: explore a subset of all possible trees, by using stepwise addition of taxa plus a rearrangement process (branch swapping), but not guaranteed to find the minimal tree
- the search can end in a local optimum instead of the global optimum

Maximum parsimony - output

- Consensus tree: MP can yield multiple equally most parsimonious (optimal) trees => relationships common to all the optimal trees are summarized with a consensus tree
- Strict consensus: includes splits found in all trees
- Majority-rule consensus: includes splits found in the majority of the trees (> 50%)

Maximum parsimony - output

- Consistency index (CI)
- Retention index (RI)
- measures of the parsimony fit of a character to a tree, or of the average fit of all characters to a tree
- more specifically: index of how much homoplasy the constructed tree has
- Value from 0 to 1
- higher value => less homoplasy

Parsimony branch support and tree stability

- Bootstrap analysis
- is a resampling technique used to measure sampling error
- gives an idea about the reliability of branches and clusters
- original dataset => resample => construct trees => compare trees to original trees
- > 70%: quite confident of tree topology
- Decay index (Bremer support)
- gives us a sense of how many steps would be required before a grouping collapses
- higher value => better branch support
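
The bootstrap loop (resample, rebuild, compare) can be sketched as follows. Columns of the alignment are drawn with replacement until the replicate has as many sites as the original. The tree builder `closest_pair` is a deliberately crude stand-in, assumed here only so the example runs; in practice it would be a parsimony, distance or likelihood analysis. All names and the alignment are hypothetical.

```python
import random


def bootstrap_alignment(alignment, rng):
    """Resample alignment columns with replacement, keeping the original number of sites."""
    n_sites = len(next(iter(alignment.values())))
    cols = [rng.randrange(n_sites) for _ in range(n_sites)]
    return {taxon: "".join(seq[i] for i in cols) for taxon, seq in alignment.items()}


def closest_pair(alignment):
    """Toy tree builder: the single clade formed by the two most similar sequences."""
    taxa = sorted(alignment)
    pairs = ((a, b) for i, a in enumerate(taxa) for b in taxa[i + 1:])
    best = min(pairs, key=lambda ab: sum(x != y for x, y in
                                         zip(alignment[ab[0]], alignment[ab[1]])))
    return {frozenset(best)}


def bootstrap_support(alignment, build_tree, clade, replicates=100, seed=0):
    """Percentage of replicate trees in which `clade` (a frozenset of taxa) appears."""
    rng = random.Random(seed)
    hits = sum(clade in build_tree(bootstrap_alignment(alignment, rng))
               for _ in range(replicates))
    return 100 * hits / replicates


alignment = {"A": "ACGTACGTAC", "B": "ACGTACGTTC", "C": "TCGAACCTAG"}
print(bootstrap_support(alignment, closest_pair, frozenset({"A", "B"})))
```

A fixed seed is used so that repeated runs give the same support value; the percentage returned is what the > 70% rule of thumb refers to.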

Maximum parsimony

- Advantages
- based on shared derived characters
- evaluates different tree topologies
- does not reduce the information
- Disadvantages
- computationally intensive for large datasets
- no correction for multiple mutations
- sensitive to unequal rates of evolution (long branch attraction)

Maximum likelihood

- Statistical method
- If given some data D and a hypothesis H, the likelihood of that data is given by $L_D = \Pr(D \mid H)$
- which is the probability of D given H

Maximum likelihood

- In the context of molecular phylogenetics
- D is the set of sequences being compared
- H is a phylogenetic tree
- We want to find the likelihood of obtaining the observed data given the tree
- The tree that makes the data the most probable evolutionary outcome is the maximum likelihood estimate of the phylogeny

Maximum likelihood

- In other words: which tree is most likely to have yielded these sequences (observed data) under a given model of evolution (JC, K2P, ...)?
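
The idea can be illustrated on the smallest possible case: two sequences joined by a single branch, so that the hypothesis H reduces to the branch length d. This is a sketch under the JC model with made-up counts (100 sites, 26 differences); a real analysis searches over tree topologies as well as branch lengths, and the coarse grid search below stands in for a proper optimiser.

```python
import math


def jc_log_likelihood(d, n_sites, n_diff):
    """Log-likelihood of two aligned sequences separated by branch length d under JC."""
    e = math.exp(-4 * d / 3)
    p_same = 0.25 + 0.75 * e      # probability that a site shows the same base
    p_diff = 0.25 - 0.25 * e      # probability of one specific different base
    return (n_sites * math.log(0.25)               # base frequency 1/4 at each site
            + (n_sites - n_diff) * math.log(p_same)
            + n_diff * math.log(p_diff))


n_sites, n_diff = 100, 26
grid = [i / 1000 for i in range(1, 2001)]           # candidate branch lengths 0.001 .. 2.0
best = max(grid, key=lambda d: jc_log_likelihood(d, n_sites, n_diff))
print(best)
```

The branch length that maximises the likelihood agrees, up to the grid spacing, with the JC distance formula evaluated at p = 0.26, which shows that the distance correction is itself a maximum likelihood estimate for a pair of sequences.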

Maximum likelihood

- Advantages
- Statistically well founded
- Based on a model of evolution
- Evaluates different topologies
- Uses all sequence information
- Often yields estimates that have lower variance than other methods
- Disadvantages
- Very slow (computationally intensive)
- Dependent on the model of evolution used

Software programs for phylogenetic analysis

- Overview: http://evolution.genetics.washington.edu/phylip/software.html
- Most widely used software programs
- PHYLIP: freely available (downloadable, or online at http://bioweb.pasteur.fr/seqanal/phylogeny/phylip-uk.html)
- PAUP: user-friendly but not freely available

Phylogenetic information on the internet

- http://tolweb.org/tree/phylogeny.html
- http://www.treebase.org/treebase/
- ....

If you need more information

- Jacqueline Vander Stappen
- K.U.Leuven
- Laboratory of Gene Technology
- Kasteelpark Arenberg 21
- B-3001 Leuven
- Jacqueline.vanderstappen_at_agr.kuleuven.ac.be