Inferring phylogenetic trees from unaligned molecular sequences - PowerPoint PPT Presentation

1
Inferring phylogenetic trees from un-aligned
molecular sequences
Presented by Mark Ragan, based on work by Michael
Höhl
ARC Centre of Excellence in Bioinformatics and
Institute for Molecular Bioscience The
University of Queensland
2
Accepted approaches to phylogenetic inference
involve multiple sequence alignment
Un-aligned homologous sequences → MSA → Distances → Tree(s)   [distance methods]
Un-aligned homologous sequences → MSA → Tree(s)   [maximum parsimony, maximum likelihood, Bayesian inference]
3
A multiple sequence alignment is a hypothesis of
homology at each and every sequence position
(alignment column)
4
But multiple sequence alignment is hard
It is NP-hard, in fact; hence the diversity of heuristic
approaches (local vs global, anchored, tree-based, etc.)
How should the parameters be estimated?
And the available datasets are getting much bigger
5
Can we skip multiple sequence alignment?
Un-aligned sequences → ? → Tree(s)   [alignment-free methods]
6
What advantages might we hope to realise by
taking an alignment-free approach?
  • At minimum, we could eliminate the NP-hard MSA
    step (and perhaps replace it with something more
    computationally tractable) for analysis of
    nucleotide and protein datasets
  • We should be able to make inferences based
    directly on rearranged, recombined, shuffled,
    or permuted sequences
  • Perhaps we can also make better inferences from
    incomplete and/or noisy sequences
  • If results are promising, there may be scope for
    further improvement (unlike MSA, which has been
    so well-studied that most future improvements
    will probably be marginal)

7
Proteins with circular permutations in ProDom / Pfam
From J. Weiner III & E. Bornberg-Bauer, MBE
23:734-743 (2006)
8
Two key steps in alignment-free approaches (lots
more detail to follow!)
1. Extract homology information from a set of
homologous but un-aligned sequences
2. Then use this information to infer trees
9
Types of information we can extract
  • Shared words (k-mers)
  • Shared patterns
  • Features (e.g. lengths) of shared sub-strings
  • Complexity measures (from information theory)
  • (Non-sequence information, e.g. structure)

To what extent do these constitute homology information?
10
Step 1 Extracting homology information from a
set of homologous but un-aligned sequences
Un-aligned homologous sequences → Extracted homology information → Distances → Tree(s)   [distance methods]
Un-aligned homologous sequences → Extracted homology information → Tree(s)   [direct alignment-free methods]
11
Words in sequences
  • Alphabet A with c characters
  • For example, c = 20 for amino acids in proteins
  • w = c^k different words (k-mers) of length k
  • 20^1 = 20 different 1-mers (A, R, ..., V)
  • 20^2 = 400 different 2-mers (AA, AR, ..., RA, ..., VV)
  • 20^3 = 8000 different 3-mers (AAA, AAR, ..., VVV)

Words (unlike patterns, introduced next) are
non-degenerate
12
Finding words in sequences
  • Example: 3-mers of M A C A D A M I A
    MAC, ACA, CAD, ADA, DAM, AMI, MIA
  • There are L - k + 1 words (occurrences)
  • L = length of sequence
  • In the example above: 9 - 3 + 1 = 7 words
  • Finding them is fast: O(L)
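The sliding-window extraction above is a one-liner in practice. A minimal Python sketch (the function name `kmers` is illustrative, not from the decafpy package):

```python
def kmers(seq, k):
    """Return all overlapping k-mers of seq, left to right (L - k + 1 of them)."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

words = kmers("MACADAMIA", 3)
# 9 - 3 + 1 = 7 words: MAC, ACA, CAD, ADA, DAM, AMI, MIA
```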

13
Generalising words → patterns (1): Elementary
patterns
L matching residues in a window of width W
Example: L = 3, W = 3 gives pattern ILM; L = 2, W = 3 gives pattern I.M
X: ..ILM....  ..IVM....
Y: ....ILM..  ...ILM..
The pattern must occur in at least K = 2 sequences
We discover patterns using the Teiresias algorithm (Rigoutsos &
Floratos 1998): O(W^L m log m + W m),
where m = size of the input set
14
Generalising words patterns (2) Maximal patterns
  • Example: G.RE + REA. + EA.T → G.REA.T
    (L = 3, W = 4) for X: ...GrREAaT  Y:
    GgREAtT.....

15
False positive (non-homologous) patterns
  • Short degenerate patterns can occur by chance
    (i.e. non-homologously) multiple times in a
    sequence
  • Response: filter patterns
  • Remove patterns with > 1 instance in a sequence
  • Example: X: ..BATH...BATH...  Y: .....BOTH.......
  • Reduces false positives (and some true
    positives!)
  • A variant of this filtering will be introduced
    later
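The filtering rule above (discard any pattern with more than one instance in a single sequence) can be sketched by reading a pattern's `.` wildcards directly as regex wildcards. These helpers are hypothetical, not the Teiresias implementation:

```python
import re

def count_instances(pattern, seq):
    """Count (possibly overlapping) matches of a Teiresias-style pattern
    such as 'BATH' or 'B.TH', where '.' matches any residue.
    A lookahead regex finds overlapping occurrences."""
    return len(re.findall(f"(?={pattern})", seq))

def filter_patterns(patterns, seqs):
    """Keep only patterns with at most one instance in every sequence."""
    return [p for p in patterns
            if all(count_instances(p, s) <= 1 for s in seqs)]

X = "QQBATHQQQBATHQQQ"
Y = "QQQQQBOTHQQQQQQQ"
# 'B.TH' occurs twice in X, so it is filtered out
filter_patterns(["B.TH"], [X, Y])
```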

16
Converting sets of words (k-mers) or patterns to
distances and trees
Un-aligned homologous sequences → Extracted homology information → Distances → Tree(s)   [distance methods]
Un-aligned homologous sequences → Extracted homology information → Tree(s)   [direct alignment-free methods]
17
(Squared) Euclidean distance
  • dE: Blaisdell (1986)
  • Based on words
  • Example: points p1 = (x1, y1), p2 = (x2, y2)
  • len(p1, p2) = √((x1 - x2)² + (y1 - y2)²)
  • dE(p1, p2) = (x1 - x2)² + (y1 - y2)²
  • Sequences instead of points:
    k-mer counts are the coordinates (*)
  • X = (AAA: 0, ..., MAC: 1, ADA: 1, MIA: 1, ..., VVV: 0)
  • (*) if sequences are modelled as Markov chains,
    the k-mer counts are generalisable as the
    corresponding transition matrices
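With k-mer counts as coordinates, the squared Euclidean distance is a few lines of code. A minimal sketch (illustrative, not the decafpy implementation):

```python
from collections import Counter

def kmer_counts(seq, k):
    """k-mer count vector as a Counter (missing words count as 0)."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def d_euclid_sq(x, y, k):
    """Squared Euclidean distance between k-mer count vectors (Blaisdell 1986)."""
    cx, cy = kmer_counts(x, k), kmer_counts(y, k)
    # sum over the union of observed words; absent words contribute 0
    return sum((cx[w] - cy[w]) ** 2 for w in cx.keys() | cy.keys())
```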

18
Standardised Euclidean distance
  • dS: Wu, Burke & Davison (1997)

  • where si are standard deviations
  • Problem: words may overlap (affects variances /
    probabilities)
  • Example: overlapping 4-mer counts
  • AAAAAAAA  LALALALA  RACQRACQ
  • AAAA: 5   LALA: 3   RACQ: 2
  • To account for variance, we need equilibrium
    frequencies
  • Variance: Gentleman & Mullin (1989)

19
Fraction of common k-mer counts
  • dF: Edgar (2004)
  • Distance based on the fraction of common k-mer counts
  • Idea: similar sequences share k-mers

                 100% ID          50% ID
X                RACQ             RACQ
Y                RACQ             RAKI
Common 2-mers
X                RA, AC, CQ       RA, AC, CQ
Y                RA, AC, CQ       RA, AK, KI

(e is often set to 0.1)
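A sketch of this distance. The min-count form and the −ln(e + F) transform follow Edgar (2004), but treat the exact normalisation here as an assumption rather than the published formula:

```python
import math
from collections import Counter

def d_fraction(x, y, k, e=0.1):
    """Distance from the fraction F of common k-mer counts (after Edgar 2004).
    F sums the smaller of the two counts for each shared k-mer, normalised
    by the number of k-mers in the shorter sequence."""
    cx = Counter(x[i:i + k] for i in range(len(x) - k + 1))
    cy = Counter(y[i:i + k] for i in range(len(y) - k + 1))
    common = sum(min(cx[w], cy[w]) for w in cx.keys() & cy.keys())
    F = common / (min(len(x), len(y)) - k + 1)
    return -math.log(e + F)   # e > 0 keeps the distance defined when F = 0
```

The pseudocount e is what makes all pairwise distances defined even when two sequences share no k-mers at all.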
20
Probabilities of common k-mer counts
  • dP: adapted from van Helden (2004)
  • Distance based on probabilities of common k-mer
    counts under a multiplicative Poisson model
  • Idea: weight common words by their probabilities
    under a Poisson distribution; self-overlapping
    words are removed
  • Equilibrium frequencies are required

21
Composition distance
  • dC: Hao & Qi (2004)
  • Idea: predict k-mer counts from shorter words
  • Expected count under a Markov model of order k-2:
    E(RACQ) = c(RAC) · c(ACQ) / c(AC),
    where c(·) are observed counts
  • How different are the observed and expected k-mer
    counts? v(RACQ) = (c - E) / E
  • Measure the angle between the two composition
    vectors: v(X) = (v-values for words in X), v(Y) =
    (v-values for words in Y)

(Derivation of the composition vectors v isn't
shown, but goes pretty much as expected)
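The composition-distance idea can be sketched end to end: expected counts from (k−1)-mer and (k−2)-mer counts, v-values, then a cosine comparison of the v-vectors. The final (1 − cos θ)/2 scaling follows the convention used for such composition vectors; helper names are illustrative:

```python
import math
from collections import Counter

def counts(seq, k):
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def v_vector(seq, k, alphabet):
    """v(w) = (c(w) - E(w)) / E(w), with
    E(w) = c(w[:-1]) * c(w[1:]) / c(w[1:-1])  (Markov model of order k-2)."""
    ck, ck1, ck2 = counts(seq, k), counts(seq, k - 1), counts(seq, k - 2)
    v = {}
    for w in (a + m + b for m in ck2 for a in alphabet for b in alphabet):
        e = ck1[w[:-1]] * ck1[w[1:]] / ck2[w[1:-1]]  # expected count
        if e > 0:
            v[w] = (ck[w] - e) / e
    return v

def d_composition(x, y, k, alphabet):
    """Composition distance: (1 - cos angle) / 2 between v(X) and v(Y)."""
    vx, vy = v_vector(x, k, alphabet), v_vector(y, k, alphabet)
    dot = sum(vx.get(w, 0) * vy.get(w, 0) for w in vx.keys() | vy.keys())
    nx = math.sqrt(sum(t * t for t in vx.values()))
    ny = math.sqrt(sum(t * t for t in vy.values()))
    return (1 - dot / (nx * ny)) / 2 if nx and ny else 0.5
```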
22
W-metric
  • dW: Vinga, Gouveia-Oliveira & Almeida (2004)
  • Based on frequencies f of words of length 1 (i.e.
    letters)
  • Idea: weight 1-mers according to a similarity
    matrix
  • Here, we base the pairwise weights on BLOSUM62
  • Thus the approach incorporates a model of
    sequence change

23
Pattern-based approach
  • Find all instances of patterns meeting criteria
    (L, W, K)
  • Extract and concatenate these for each pair of
    sequences
  • (this necessarily yields a pair of strings of
    identical length)
  • X: GrREAaTPATTeRNiNSTaNCES
    Y: GgREAtTPATTiRNaNSTeNCES
  • Variant dPB-ML: estimate pairwise distances by ML
    under a defined model of sequence change (we use JTT)
  • Variant dPB-SIM: transform the similarity matrix S
    into a distance matrix D (Dij = Sii + Sjj - 2Sij),
    then compute distances using the BLOSUM62 matrix
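The dPB-SIM transform can be sketched directly from the similarity-to-distance formula; the `score` argument below is a stand-in for a BLOSUM62 lookup (an assumption, not the actual implementation):

```python
def sim_to_dist(S, i, j):
    """D_ij = S_ii + S_jj - 2 * S_ij: turn a similarity matrix into a
    distance-like quantity (zero for identical items)."""
    return S[i][i] + S[j][j] - 2 * S[i][j]

def pair_distance(x, y, score):
    """Distance between two equal-length pattern-instance strings, where
    score(a, b) is a substitution score (e.g. a BLOSUM62 lookup)."""
    assert len(x) == len(y)
    s_xy = sum(score(a, b) for a, b in zip(x, y))
    s_xx = sum(score(a, a) for a in x)
    s_yy = sum(score(b, b) for b in y)
    return s_xx + s_yy - 2 * s_xy
```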

24
Patterns can conflict
  • For two sequences X: ...ACADEMIA...
    Y: .AHEM.MACADAMIA.
  • Patterns: A.EM, ACAD.MIA
  • The two patterns 'align' residue E in X
    differently
  • (probably contradicting the notion of
    homology)
  • Position matters!

Extension to three sequences:
X: ...ACADEMIA...
Y: .AHEM.MACADAMIA.
Z: .........ADAM.....
Patterns: A.EM, ACAD.MIA, AD.M
X residue E aligns to residues E, A, A;
the majority consensus is A
This is the approach taken by variant dPB-MC
25
Average Common Substrings
  • dACS: Ulitsky, Burstein, Tuller & Chor (2006)
  • Idea: sum over lengths of (maximal) common
    substrings
  • Example:
  • X: MACADAMIA.....
  • Y: ...ACADEMIA..
  • Step 1: M matches (length 1)
  • Step 2: ACAD matches (length 4)
  • Finding substrings is fast (suffix arrays)

where m and n are the lengths of X and Y
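The average-common-substring idea can be sketched naively; suffix arrays make this fast in practice, whereas the quadratic version below is only illustrative, and the final log-normalisation of Ulitsky et al. is omitted:

```python
def longest_match_from(x, i, y):
    """Length of the longest substring of x starting at position i
    that also occurs somewhere in y."""
    l = 0
    while i + l < len(x) and x[i:i + l + 1] in y:
        l += 1
    return l

def avg_common_substring(x, y):
    """Average, over all start positions in x, of the longest common
    substring with y (the core quantity of Ulitsky et al. 2006,
    before their normalisation)."""
    return sum(longest_match_from(x, i, y) for i in range(len(x))) / len(x)
```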
26
Lempel-Ziv complexity
  • dLZ: Otu & Sayood (2003)
  • Not based on words or patterns
  • Idea:
  • AAAAAAAAA is simple, MACADAMIA is complex
  • Related: algorithmic complexity; pkzip, gzip etc.

where (XY) refers to a concatenation of X and
Y, and c(X) is the number of components required
to produce X
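Counting the components c(X) of a greedy Lempel-Ziv (1976-style) parse can be sketched compactly; the distance shown is one published Otu & Sayood variant, so treat the exact normalisation as an assumption:

```python
def lz_complexity(s):
    """Number of components in a greedy Lempel-Ziv parse: each component
    is extended while it can still be copied from the already-seen prefix."""
    i, c = 0, 0
    while i < len(s):
        l = 1
        while i + l <= len(s) and s[i:i + l] in s[:i + l - 1]:
            l += 1
        c += 1
        i += l
    return c

def d_lz(x, y):
    """One Otu & Sayood (2003) variant (an assumption here):
    d = max{c(XY) - c(X), c(YX) - c(Y)} / max{c(X), c(Y)}."""
    cx, cy = lz_complexity(x), lz_complexity(y)
    return max(lz_complexity(x + y) - cx,
               lz_complexity(y + x) - cy) / max(cx, cy)

lz_complexity("AAAAAAAAA")  # 2: simple
lz_complexity("MACADAMIA")  # 7: complex
```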
27
BAliBASE 2.0 (Thompson, Plewniak & Poch 1999 ff.)
Reference set 1: equidistant sequences at
various degrees of divergence
Reference set 2: families aligned with a highly
divergent orphan sequence
Reference set 3: families with low (<25%)
sequence identity
Reference set 4: sequences with N- and C-terminal
extensions
Reference set 5: sequences with internal insertions
(Reference set 8: families with circular permutations)
BAliBASE 3.0 is now available
28
Simple k-mer distances
dE (Euclidean distance) is sequence-length
dependent
dS (standardised Euclidean distance) and dW
(W-metric) likewise show little linearity with
sequence divergence
(plots of dE, dS and dW vs divergence)
29
Simple k-mer distances (cont.)
dC (composition distance) is somewhat linear
over a narrow divergence range for AA, but
this is lost with the CE alphabet
dLZ shows limited linearity with both AA (shown) and CE
(plots of dC on AA, dC on CE, and dLZ)
30
Simple k-mer distances (cont.)
dACS is linear over a wider divergence range
than is dC
dF behaves much like dC, but with greater
dynamic range; however, 25% of distances are
undefined unless e > 0
dP saturates; 30% of distances are 1.0
(numerical instability from multiplying small
probabilities)
(plots of dACS, dF and dP)
31
Do patterns perform better than words (k-mers)?
First, we must parameterise Teiresias
  • Remember: L matches in a window of width W, in at
    least K sequences
  • Parameterisation using BAliBASE families
  • (all sequences ≥ 49 amino acids in length)

For 1 < L < 4, there are too many false
positives
For L = 4, we examined W = 8, 9, ..., 16,
i.e. identity 50% to 25%
All pairwise distances are defined only for W ≥ 16

W    % defined      W    % defined      W    % defined
8    97.36          11   99.63          14   99.94
9    98.53          12   99.73          15   99.98
10   99.27          13   99.91          16   100.0
32
Parameterising Teiresias (cont.)
If we could make the sequences more similar, all
distances should be defined at smaller W
  • Encode using chemical equivalences (CE):
  • AG, DE, FY, KR, ILMV, QN, ST
  • X: DDELPMKEGDCMTI → DDDIPIKDADCISI
  • Y: EEDIDLHLGDILTV → DDDIDIHIADIISI

With BAliBASE, all distances are now defined at W = 8
For comparison purposes, base distances on the
original AA data
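The CE recoding is a simple character map. A sketch reproducing the slide's two examples (using each group's first letter as the class representative, an arbitrary but convenient choice):

```python
# CE groups from the slide: A/G, D/E, F/Y, K/R, I/L/M/V, Q/N, S/T;
# any other residue is left unchanged
CE_GROUPS = ["AG", "DE", "FY", "KR", "ILMV", "QN", "ST"]
CE_MAP = {aa: group[0] for group in CE_GROUPS for aa in group}

def ce_encode(seq):
    """Recode an amino-acid sequence into the reduced CE alphabet."""
    return "".join(CE_MAP.get(aa, aa) for aa in seq)

ce_encode("DDELPMKEGDCMTI")  # -> 'DDDIPIKDADCISI'
ce_encode("EEDIDLHLGDILTV")  # -> 'DDDIDIHIADIISI'
```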
33
Pattern-based distances
AA alphabet: linear up to 2.5 substitutions/site;
acceptable dynamic range
CE alphabet: similar linearity with greater
dynamic range
Variant dPB-MC (CE): high variance at higher
sequence divergence
(plots: AA, dPB-MC, CE)
34
Step 2 Using extracted homology information to
infer trees
35
One alternative: having computed all pairwise
distances from the homology information, we can
generate a distance tree using e.g.
neighbour-joining or Fitch-Margoliash

Un-aligned homologous sequences → Extracted homology information → Distances → Tree(s)   [distance methods]
36
Other alternatives: there are also ways to infer
trees from words and patterns that don't involve
computing a distance

Un-aligned homologous sequences → Extracted homology information → Tree(s)   [direct alignment-free methods]
37
One non-distance approach to inference from
extracted homology information
Every k-mer or pattern can be represented as a
distinct character with state 1 (present) or 0
(absent):

k-mers:       AA AC AD AE AF . . . ZZ
Sequence A:    1  1  0  0  0 . . .  0
Sequence B:    1  0  0  1  1 . . .  0
Sequence Y:    0  1  1  0  1 . . .  0

A tree can then be inferred from this
character-state matrix using parsimony or
(better) a Bayesian approach
Other such strategies can be imagined
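Building such a character-state matrix is a small exercise; a sketch (restricting the columns to k-mers observed in at least one sequence is an implementation choice):

```python
def binary_matrix(seqs, k):
    """0/1 character-state matrix: one row per sequence, one column per
    k-mer observed in any sequence (sorted), 1 if the k-mer occurs
    in that sequence."""
    kmer_sets = [{s[i:i + k] for i in range(len(s) - k + 1)} for s in seqs]
    columns = sorted(set().union(*kmer_sets))
    matrix = [[1 if w in ks else 0 for w in columns] for ks in kmer_sets]
    return columns, matrix

cols, M = binary_matrix(["MACADAMIA", "ACADEMIA"], 3)
```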
38
Comparing and evaluating alignment-free approaches
  • We want to examine:
  • Alignment-free distances: are they linear?
    metric?
  • Pattern-based methods: how to estimate
    parameters?
  • Influence of alphabet (full amino acid vs
    restricted)
  • Is the correct topology returned in standard
    cases?
  • Is the correct topology returned in difficult
    cases, e.g. shuffled domains?
  • Computational complexity
  • Implementation in software

39
Synthetic datasets
  • Synthetic trees generated using PhyloGen
    (Rambaut)
  • Seven 8-taxon tree distributions with 100 trees
    each
  • Pairwise distances, medians (± quartiles):
    sets 1-7: 0.75 ... 3.42 ± 0.38
    substitutions/site
  • Sequences evolved along the trees under the JTT
    model using SEQ-GEN (Rambaut & Grassly)
  • Control (no ASRV): 1000 AA, no ASRV
  • ASRV: 1000 AA, high ASRV (α = 0.5)

40
How accurate are phylogenetic trees inferred
using alignment-free distances?
Synthetic datasets → [alignment-free approach: compute distances] → Tree(s)
Synthetic datasets → [classical approach: MSA] → Tree(s)
The two sets of trees are compared using tree-comparison metrics
41
Robinson-Foulds topology-comparison metric
  • RF(T1, T2) = 2
  • RF(T1, T3) = 6 (the maximum)

(trees T1, T2 and T3 shown as figures on the slide)
42
Tree reconstruction accuracy
Control data (no ASRV)
43
Tree reconstruction accuracy
ASRV data
44
Statistical assessment (1)
All pattern-based (PB) variants are significantly
more-accurate than any other alignment-free
method
45
Statistical assessment (2)
All pattern-based (PB) variants are significantly
more-accurate than any other alignment-free
method Most alignment-free methods are
statistically indistinguishable from each other
-- including Bayesian binary (B-bin)
-- all are better than dC and dW
46
Statistical assessment (3)
All pattern-based (PB) variants are significantly
more-accurate than any other alignment-free
method Most alignment-free methods are
statistically indistinguishable from each other
-- including Bayesian binary (B-bin)
-- all are better than dC and dW dML is
significantly more-accurate than any
alignment-free method
47
Empirical (putative ortholog) datasets
  • 22,432 optimised MSAs and the corresponding
    protein-family trees (N ≥ 4 taxa) from 144
    prokaryote genomes (Beiko, Harlow & Ragan)
  • We further require:
  • -- all sequences from taxa with ≥ 4
    representative genomes in the dataset
  • -- strong support for the clade (deep branch
    with BPP ≥ 0.95)
  • -- alignment length ≥ 200 AA with ≤ 10 weak
    columns
  • Sort trees by number and divergence of sequences:
  • -- number: few (4-8) vs many (12-20)
  • -- divergence: short (0.5-1.0) vs long
    (2.5-3.0) mean subst/site, SD ≤ 0.5
  • Thus four subsets:
  • -- few-short (50 alignments)
  • -- few-long (52 alignments)
  • -- many-short (80 alignments)
  • -- many-long (38 alignments)

48
False negative distances (x10) for
empirical (144-genome) dataset, ordered by
ranksum
49
Another way of looking at performance counts of
unrecovered deep branches, ordered by
50
Selection of optimal k (k-mer size)
  • Data generated with/without across-sites rate
    variation (ASRV)
  • Calculate distances varying k (k-mer word length)
  • Trees via neighbour-joining (NJ) and
    Fitch-Margoliash (FM)
  • Robinson-Foulds (RF) and Quartet (Q)
    tree-comparison metrics
  • 4 combinations: RF-NJ, RF-FM, Q-NJ, Q-FM

Synthetic data → [k-mer distance approach, varying k, with alphabet reduction] → Tree(s)
Synthetic data → [MSA approach] → Tree(s)
Trees compared via the tree-comparison metrics
51
  • dE with sets 1, 4 & 7
  • Alphabets: AA & CE
  • Distance minimised at k = 3-5
  • Representative of most word-based AF methods

(plots: set 1, set 4, set 7)
52
  • dC with sets 1, 4 & 7
  • Alphabets: AA & CE
  • Distance minimised at k = 4-5
  • Rough parameter space

(plots: set 1, set 4, set 7)
53
B-bin for set 2: control vs ASRV data
  • Alphabets: AA & CE
  • Distance minimised at k = 3-4
  • Representative of most word-based AF methods

(plots: control, ASRV)
54
B-bin for set 4: control vs ASRV data
  • Distances minimised at k = 3-5
  • RF distance halved with ASRV

(plots: control, ASRV)
55
B-bin for set 6: control vs ASRV data
  • Distance minimised at k = 3-4
  • Much smaller distance with ASRV
  • Marked optimum with ASRV

(plots: control, ASRV)
56
A broadly optimal k (k-mer size) ?
The topological difference between alignment-free
and MSA trees is minimised, for all approaches and
datasets, within a narrow range of k
For the full AA alphabet, k = 3-5
For the reduced (CE) alphabet, k = 4-6
A pleasantly surprising result (but no
theory to indicate why this might be the case)
57
Summary
  • It is indeed possible to compute reasonably
    accurate phylogenetic trees from un-aligned
    molecular sequences
  • These trees are less accurate than those inferred
    from aligned sequences using the best methods
  • Properly parameterised patterns extract homology
    information more fully than do non-degenerate
    k-mers (words)
  • Word length (k) shows an unexpectedly tight range
    of optimality across the datasets we examined
  • A reduced alphabet (CE) yields more-accurate
    trees when distances are relatively large
  • A Bayesian alternative yields good (but not the
    best) trees
  • We introduce an order-of-magnitude faster
    implementation (PB-SIM) at a small cost
    in accuracy

58
Perspective & outlook
Unlike MSA algorithms, which have been under
development for decades, alignment-free methods
are newer, suggesting that there may be
substantial scope for improvement
However, we should be cautious about introducing
computationally intensive refinements, as these
could undermine one of the major motivations for
alignment-free methods in the first place
Although the Bayesian approach didn't yield
more-accurate trees, it may nonetheless offer
other advantages, e.g. based on posterior
probabilities or ML estimates of branch lengths
Alignment-free methods might find other
applications, e.g. in phylogenetic inference
based on metagenomic data, low-coverage genomic
sequence, ESTs, or unalignable (e.g.
intergenic) regions
59
Software available
  • Package decafpy
  • (DistancE Calculation using Alignment-Free
    methods in Python)
  • Available from http://www.bioinformatics.org.au/
  • Free (under GPL)
  • Suite of commandline tools, complete with
    description of all options, an object-oriented
    library, and programmatic access (API)

60
Acknowledgements
Michael Höhl !!!
  • Isidore Rigoutsos, IBM TJ Watson Research Center
  • Rob Beiko, IMB / Dalhousie University
  • Denis Baurain, Université de Liège
  • Tamir Tuller, Tel-Aviv University

61
Special thanks to Australian Research
Council ARC Centre of Excellence in
Bioinformatics Institute for Molecular Bioscience