Bioinformatics Algorithms and Data Structures - PowerPoint PPT Presentation

Slides: 73
Provided by: john244
Learn more at: https://cse.sc.edu
Transcript and Presenter's Notes

1
Bioinformatics Algorithms and Data Structures
  • Chapter 14.6-8 Multiple Alignment
  • Lecturer Dr. Rose
  • Slides by Dr. Rose
  • March 5, 2007

2
Sum-of-Pairs
  • Defn. The sum-of-pairs (SP) score of a multiple
    alignment M is the sum of the scores of all pairwise
    global alignments induced by M.
  • From the previous example
  • 1 A A T - G G T T T
  • 2 A A - C G T T A T
  • 3 T A T C G - A A T
  • SP = 4 + 5 + 4 = 13
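
A minimal sketch of the SP computation for the example above. It assumes unit costs (match 0, mismatch 1, space 1), which reproduce the 4 + 5 + 4 on the slide; the function names are illustrative, not from the lecture.

```python
from itertools import combinations

def pair_cost(a, b, mismatch=1, space=1):
    """Cost of one column of an induced pairwise alignment (assumed unit costs)."""
    if a == "-" and b == "-":
        return 0  # a column of two spaces vanishes in the induced alignment
    if a == "-" or b == "-":
        return space
    return 0 if a == b else mismatch

def induced_pair_score(row_a, row_b):
    """Score of the pairwise alignment induced by two rows of the multiple alignment."""
    return sum(pair_cost(a, b) for a, b in zip(row_a, row_b))

def sp_score(rows):
    """Sum-of-pairs score: sum of all induced pairwise scores."""
    return sum(induced_pair_score(x, y) for x, y in combinations(rows, 2))

rows = ["AAT-GGTTT", "AA-CGTTAT", "TATCG-AAT"]
pair_scores = [induced_pair_score(x, y) for x, y in combinations(rows, 2)]
total = sp_score(rows)
print(pair_scores, total)
```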

3
Sum-of-Pairs
  • Q What theoretical justification is there for
    adopting the SP score?
  • Wait for response..
  • A None. Or rather none more than for any other
    multiple alignment scoring scheme.
  • In practice it is a good heuristic and is
    popular.
  • Q How can we compute a global alignment M with
    minimum sum-of-pairs score?
  • A Why, dynamic programming of course!

4
Sum-of-Pairs
  • Assuming that we want to align k strings
  • Q What time complexity for the DP solution?
  • A $\Theta(n^k)$; exact SP alignment has been shown to be
    NP-complete.
  • Q So what should we do?
  • A Choose a small k.
  • In practice, the NP-completeness of a problem
    often does not mean that the sky is falling.

5
Sum-of-Pairs
  • Q How will k affect the recurrence relation?
  • The recurrence relation for k = 3 is
  • $D(i, j, k) = \min$ of the following seven terms
  • $D(i-1, j-1, k-1) + {?}$,
  • $D(i-1, j-1, k) + {?}$,
  • $D(i-1, j, k-1) + {?}$,
  • $D(i, j-1, k-1) + {?}$,
  • $D(i-1, j, k) + {?}$,
  • $D(i, j-1, k) + {?}$,
  • $D(i, j, k-1) + {?}$

6
Sum-of-Pairs
  • Let's consider each term of the recurrence in
    turn
  • $D(i-1, j-1, k-1)$ is the diagonal cell in
    all three dimensions.
  • Q What should be the SP transition cost for
    $D(i-1, j-1, k-1) \to D(i, j, k)$?
  • Recall for k = 2, if $S_1(i) = S_2(j)$ the cost is
    the match cost, o/w $S_1(i) \ne S_2(j)$ and we incur
    the mismatch cost.
  • A the sum of pairwise match comparisons, i.e.,
    ij, jk, ik.

7
Sum-of-Pairs
  • Let m(i, j) denote the pairwise character match
    function defined as
  • $m(i, j) = \text{matchCost}$ if the characters match
  • $m(i, j) = \text{mismatchCost}$ if the characters mismatch
  • Then the SP transition cost for
    $D(i-1, j-1, k-1) \to D(i, j, k)$ is $m(i, j) + m(j, k) + m(i, k)$
  • Hence the term cost is
  • $D(i-1, j-1, k-1) + m(i, j) + m(j, k) + m(i, k)$

8
Sum-of-Pairs
  • The next term
  • $D(i-1, j-1, k)$ is the diagonal cell in the
    first two dimensions.
  • Q What should be the SP transition cost for
    $D(i-1, j-1, k) \to D(i, j, k)$?
  • We have two types of cases to consider
  • The pairwise diagonal case $(i-1, j-1) \to (i, j)$
  • The two pairwise space insertion cases
  • $(i-1, k) \to (i, k)$ and $(j-1, k) \to (j, k)$

9
Sum-of-Pairs
  • The cost will be the sum of the pairwise match
    and space insertion costs.
  • $m(i, j)$ for $(i-1, j-1) \to (i, j)$ and
  • spacecost for $(i-1, k) \to (i, k)$ and spacecost for
    $(j-1, k) \to (j, k)$
  • Then the SP transition cost for
    $D(i-1, j-1, k) \to D(i, j, k)$ is $m(i, j) + 2 \cdot \text{spacecost}$
  • Hence the term cost is
  • $D(i-1, j-1, k) + m(i, j) + 2 \cdot \text{spacecost}$

10
Sum-of-Pairs
  • Similarly, the third and fourth term costs are
  • $D(i-1, j, k-1) + m(i, k) + 2 \cdot \text{spacecost}$,
  • $D(i, j-1, k-1) + m(j, k) + 2 \cdot \text{spacecost}$
  • Note the similarity in the fifth, sixth, and
    seventh terms
  • $D(i-1, j, k) + {?}$
  • $D(i, j-1, k) + {?}$
  • $D(i, j, k-1) + {?}$
  • Q What should be the cost for transitions from
    them?

11
Sum-of-Pairs
  • For $D(i-1, j, k)$ we have two types of cases to
    consider
  • The pairwise no change case $(j, k) \to (j, k)$
  • The two pairwise space insertion cases
  • $(i-1, j) \to (i, j)$ and $(i-1, k) \to (i, k)$
  • Then the SP transition cost for
    $D(i-1, j, k) \to D(i, j, k)$ is $0 + 2 \cdot \text{spacecost}$
  • Hence the term cost is
  • $D(i-1, j, k) + 2 \cdot \text{spacecost}$

12
Sum-of-Pairs
  • Similarly, the sixth and seventh term costs are
  • $D(i, j-1, k) + 2 \cdot \text{spacecost}$,
  • $D(i, j, k-1) + 2 \cdot \text{spacecost}$
  • Hence $D(i, j, k) = \min$ of
  • $D(i-1, j-1, k-1) + m(i, j) + m(j, k) + m(i, k)$,
  • $D(i-1, j-1, k) + m(i, j) + 2 \cdot \text{spacecost}$,
  • $D(i-1, j, k-1) + m(i, k) + 2 \cdot \text{spacecost}$,
  • $D(i, j-1, k-1) + m(j, k) + 2 \cdot \text{spacecost}$,
  • $D(i-1, j, k) + 2 \cdot \text{spacecost}$,
  • $D(i, j-1, k) + 2 \cdot \text{spacecost}$,
  • $D(i, j, k-1) + 2 \cdot \text{spacecost}$
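
The seven-term recurrence can be sketched as a backward DP over a three-dimensional table. This is an illustrative sketch, not the lecture's code: it assumes matchCost 0, mismatchCost 1, spacecost 1, and it generates the seven predecessor moves generically instead of writing each term out. A pair of two spaces in a column costs 0, which gives the "2 spacecost" terms above.

```python
from itertools import product

def column_cost(chars, match_cost=0, mismatch_cost=1, spacecost=1):
    """SP cost of one column; None stands for a space."""
    total = 0
    for x in range(3):
        for y in range(x + 1, 3):
            a, b = chars[x], chars[y]
            if a is None and b is None:
                continue  # the "no change" pair contributes 0
            if a is None or b is None:
                total += spacecost
            else:
                total += match_cost if a == b else mismatch_cost
    return total

def sp3_distance(s1, s2, s3):
    """Optimal SP distance of three strings by backward dynamic programming."""
    n1, n2, n3 = len(s1), len(s2), len(s3)
    inf = float("inf")
    D = [[[inf] * (n3 + 1) for _ in range(n2 + 1)] for _ in range(n1 + 1)]
    D[0][0][0] = 0
    # The seven predecessor offsets (di, dj, dk), excluding (0, 0, 0).
    moves = [mv for mv in product((0, 1), repeat=3) if any(mv)]
    for i in range(n1 + 1):
        for j in range(n2 + 1):
            for k in range(n3 + 1):
                for di, dj, dk in moves:
                    pi, pj, pk = i - di, j - dj, k - dk
                    if pi < 0 or pj < 0 or pk < 0:
                        continue  # predecessor lies outside the table
                    chars = (s1[i - 1] if di else None,
                             s2[j - 1] if dj else None,
                             s3[k - 1] if dk else None)
                    cand = D[pi][pj][pk] + column_cost(chars)
                    if cand < D[i][j][k]:
                        D[i][j][k] = cand
    return D[n1][n2][n3]

print(sp3_distance("AATGGTTT", "AACGTTAT", "TATCGAAT"))
```

Because out-of-range predecessors are simply skipped, the same loop also fills the boundary faces of the table.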

13
Sum-of-Pairs
  • Q What about the boundary cells on the 3 faces
    of the table?
  • D(i, j, 0),
  • D(i, 0, k),
  • D(0, j, k)
  • Observation Each case degenerates into the
    familiar two-string alignment distance plus space
    costs for the empty string argument.
  • Approach represent these cases in terms of
    pairwise distances plus space costs.

14
Sum-of-Pairs
  • Let $D_{1,2}(i, j)$ denote the pairwise distance
    between $S_1[1..i]$ and $S_2[1..j]$. $D_{1,3}(i, k)$ and
    $D_{2,3}(j, k)$ are analogously defined.
  • Consider $D(i, j, 0)$
  • $D(i, j, 0) = D_{1,2}(i, j) + {?} \cdot \text{spaceCost}$
  • Q What is the space cost, i.e., how many spaces?
  • A i for S1 and j for S2 hence
  • $D(i, j, 0) = D_{1,2}(i, j) + (i + j) \cdot \text{spaceCost}$

15
Sum-of-Pairs
  • By this argument, the boundary cells are given
    by
  • $D(i, j, 0) = D_{1,2}(i, j) + (i + j) \cdot \text{spaceCost}$,
  • $D(i, 0, k) = D_{1,3}(i, k) + (i + k) \cdot \text{spaceCost}$,
  • $D(0, j, k) = D_{2,3}(j, k) + (j + k) \cdot \text{spaceCost}$,
  • $D(0, 0, 0) = 0$
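
As a small illustration of the boundary formulas, the sketch below fills one face of the table from a pairwise prefix-distance table. The helper names and the unit costs (mismatch 1, space 1) are assumptions made for the example.

```python
def prefix_table(a, b, mismatch=1, space=1):
    """P[i][j] = edit distance between the prefixes a[:i] and b[:j]."""
    P = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        for j in range(len(b) + 1):
            if i == 0 or j == 0:
                P[i][j] = (i + j) * space  # one prefix is empty
            else:
                sub = 0 if a[i - 1] == b[j - 1] else mismatch
                P[i][j] = min(P[i - 1][j - 1] + sub,
                              P[i - 1][j] + space,
                              P[i][j - 1] + space)
    return P

def boundary_face(a, b, space=1):
    """Face D(i, j, 0): pairwise distance plus (i + j) spaces against the empty third string."""
    P = prefix_table(a, b, space=space)
    return [[P[i][j] + (i + j) * space for j in range(len(b) + 1)]
            for i in range(len(a) + 1)]

face = boundary_face("AC", "AG")
print(face[2][2])
```

For "AC" and "AG" the two columns A/A/- and C/G/- cost 2 and 3, which is what the formula gives for the cell (2, 2, 0).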

16
Sum-of-Pairs Speedup
  • Q How can we speed up our DP approach?
  • A Use forward dynamic programming.
  • Note so far we have used backward dynamic
    programming, i.e., cell (i, j, k) looks back to
    the seven cells that can influence its value.
  • In contrast forward DP sends the result of cell
    (i, j, k) forward to the seven cells whose value
    it could influence.

17
Sum-of-Pairs Speedup
  • Q How does this speed things up?
  • A It doesn't, if we always send cell (i, j, k)'s
    value forward.
  • The only significant way to speed up the $\Theta(n^k)$
    computation is to avoid computing all $n^k$ cells in the DP table.
  • We will use forward DP to reduce the number of
    cells that we compute in the DP table.

18
Sum-of-Pairs Speedup
  • Let's rethink this problem
  • View the optimal alignment problem as the
    shortest path through the weighted edit distance
    graph.
  • We are looking for the shortest path from (0,0,0)
    to (n,n,n).
  • When node (i, j, k) is computed, we have the
    shortest path from (0,0,0) to (i, j, k).
  • The value of node (i, j, k) is sent forward to
    the seven neighboring nodes that it can influence

19
Sum-of-Pairs Speedup
  • Let w be reached by an outgoing edge from (i, j,
    k)
  • The true shortest distance from (0,0,0) to w is
    the value computed after it has been updated by
    every node with an edge into w.
  • A queue is used to order the nodes for
    processing.
  • The final shortest distance for the node v at the
    head of the queue is set and node v is removed.
  • Every neighbor w of v is then updated, w is
    placed in the queue if it is not already there.

20
Sum-of-Pairs Speedup
  • At this point we borrow an A*-like idea
  • If (i, j, k) is not on the shortest path from
    (0,0,0) to (n,n,n) then avoid passing its value
    forward.
  • More importantly, avoid putting its neighbors,
    not already in the queue, into the queue.
  • The trick is deciding whether (i, j, k) is on the
    shortest path from (0,0,0) to (n,n,n).
  • Q How do we pull this rabbit out of our hat?

21
Sum-of-Pairs Speedup
  • Define $d_{1,2}(i, j)$ to be the edit distance between
    suffixes $S_1[i..n]$ and $S_2[j..n]$. Define $d_{1,3}(i, k)$ and
    $d_{2,3}(j, k)$ analogously.
  • Note these edit distances can be computed in
    $O(n^2)$ via DP on the reversed strings.
  • Observation any shortest path from (i, j, k) to
    (n,n,n) must have distance at least
    $d_{1,2}(i, j) + d_{1,3}(i, k) + d_{2,3}(j, k)$
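
A sketch of the suffix-distance tables and the resulting lower bound. Assumptions for the example: unit costs, hypothetical helper names, and 0-based indexing in which index i means "i characters already consumed", so the entry refers to the suffix `a[i:]` (the slide's notation is 1-based).

```python
def prefix_table(a, b, mismatch=1, space=1):
    """P[i][j] = edit distance between the prefixes a[:i] and b[:j]."""
    P = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        for j in range(len(b) + 1):
            if i == 0 or j == 0:
                P[i][j] = (i + j) * space
            else:
                sub = 0 if a[i - 1] == b[j - 1] else mismatch
                P[i][j] = min(P[i - 1][j - 1] + sub,
                              P[i - 1][j] + space,
                              P[i][j - 1] + space)
    return P

def suffix_table(a, b):
    """T[i][j] = edit distance between the suffixes a[i:] and b[j:].

    Computed by running the ordinary prefix DP on the reversed strings.
    """
    P = prefix_table(a[::-1], b[::-1])
    n, m = len(a), len(b)
    return [[P[n - i][m - j] for j in range(m + 1)] for i in range(n + 1)]

def lower_bound(d12, d13, d23, i, j, k):
    """Lower bound on the SP cost of any path from (i, j, k) to the end."""
    return d12[i][j] + d13[i][k] + d23[j][k]

s1, s2, s3 = "ACGT", "AGT", "ACT"
d12, d13, d23 = suffix_table(s1, s2), suffix_table(s1, s3), suffix_table(s2, s3)
print(lower_bound(d12, d13, d23, 0, 0, 0))
```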

22
Sum-of-Pairs Speedup
  • Suppose we have an alignment (from somewhere)
    with an SP distance score z.
  • Core idea
  • if $D(i, j, k) + d_{1,2}(i, j) + d_{1,3}(i, k) + d_{2,3}(j, k) > z$,
    then node (i, j, k) cannot be on any
    shortest path.
  • Do not pass its value forward.
  • Do not put its neighbors reached by outgoing
    edges onto the queue.

23
Sum-of-Pairs Speedup
  • Benefits of being able to prune cell (i, j, k)
  • We automatically prune many of its descendants.
  • We don't process all $n^k$ cells in a k-string
    problem. Big win!!!!
  • The computation is still exact: it will find the
    optimal alignment.
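
The pruned forward DP can be sketched as follows for three strings. This is an illustration under stated assumptions, not the MSA program: unit costs, hypothetical function names, and a priority queue keyed on the cell coordinates, so cells leave the queue in lexicographic order (every predecessor of a cell is lexicographically smaller, so a cell's value is final when it is removed). Passing `z = infinity` disables pruning.

```python
import heapq
from itertools import product

def suffix_table(a, b, mismatch=1, space=1):
    """T[i][j] = edit distance between the suffixes a[i:] and b[j:]."""
    n, m = len(a), len(b)
    T = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n, -1, -1):
        for j in range(m, -1, -1):
            if i == n and j == m:
                continue
            best = float("inf")
            if i < n and j < m:
                best = T[i + 1][j + 1] + (0 if a[i] == b[j] else mismatch)
            if i < n:
                best = min(best, T[i + 1][j] + space)
            if j < m:
                best = min(best, T[i][j + 1] + space)
            T[i][j] = best
    return T

def column_cost(chars, mismatch=1, space=1):
    """SP cost of one column; None stands for a space."""
    total = 0
    for x in range(3):
        for y in range(x + 1, 3):
            a, b = chars[x], chars[y]
            if a is None and b is None:
                continue
            total += space if (a is None or b is None) else (0 if a == b else mismatch)
    return total

def forward_sp3(s, z=float("inf")):
    """Forward DP over three strings s; returns (SP distance, cells expanded)."""
    n = tuple(len(x) for x in s)
    d12 = suffix_table(s[0], s[1])
    d13 = suffix_table(s[0], s[2])
    d23 = suffix_table(s[1], s[2])
    moves = [mv for mv in product((0, 1), repeat=3) if any(mv)]
    D = {(0, 0, 0): 0}
    heap = [(0, 0, 0)]
    expanded = 0
    while heap:
        v = heapq.heappop(heap)  # D[v] is final here
        if v == n:
            break
        i, j, k = v
        if D[v] + d12[i][j] + d13[i][k] + d23[j][k] > z:
            continue  # pruned: do not send the value forward
        expanded += 1
        for mv in moves:
            w = (i + mv[0], j + mv[1], k + mv[2])
            if any(w[t] > n[t] for t in range(3)):
                continue
            chars = [s[t][v[t]] if mv[t] else None for t in range(3)]
            cand = D[v] + column_cost(chars)
            if w not in D:
                D[w] = cand
                heapq.heappush(heap, w)  # enqueue w only once
            elif cand < D[w]:
                D[w] = cand
    return D.get(n), expanded

strings = ("AATGGTTT", "AACGTTAT", "TATCGAAT")
full, cells_full = forward_sp3(strings)
pruned, cells_pruned = forward_sp3(strings, z=full)
print(full, cells_full, cells_pruned)
```

Here the bound z is taken from the unpruned run purely to make the comparison easy; in practice z would come from some cheaper alignment, and any z at least as large as the optimum keeps the result exact.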

24
Sum-of-Pairs Speedup
  • The program called MSA implements the speedup we
    are discussing.
  • Cold shower
  • MSA can align 6 strings with n = 200
  • Unlikely to be able to align tens or hundreds of
    strings.
  • Still, $200^6$ cells ($= 6.4 \times 10^{13}$ cells), otherwise
    impossible.

25
Bounded-Error Approximation for SP-Alignment
  • Q Where do we get z from?
  • A We will use a bounded-error approximation
    method.
  • Properties of the specific method we will
    discuss
  • Polynomial worst-case time complexity
  • The SP-score is less than twice the optimal value.

26
Bounded-Error Approximation for SP-Alignment
  • Idea focus on alignments consistent with a tree.
  • Q What do we mean by consistent with a tree?
  • Informal explanation
  • A graph edge denotes a relation between two
    nodes.
  • Recall that D(Si, Sj) is the optimal weighted
    distance between Si and Sj.
  • We could let D(Si, Sj) be the edge relation.

27
Bounded-Error Approximation for SP-Alignment
  • Informal explanation
  • A graph edge denotes a relation between two
    nodes.
  • Recall that D(Si, Sj) is the optimal weighted
    edit distance between Si and Sj.
  • We could let D(Si, Sj) be the edge relation
    between the node labeled Si and the node labeled
    Sj.

28
Bounded-Error Approximation for SP-Alignment
  • Informal explanation continued
  • Suppose we have a multiple alignment M.
  • Suppose we construct an unrooted tree from a
    subset of such edges* between nodes labeled with
    strings from M.
  • We call the alignment of the strings represented
    in the tree consistent with the tree.
  • * recall D(Si, Sj) is the edge relation.

29
Bounded-Error Approximation for SP-Alignment
  • Example from text
  • A X X _ Z
  • A X _ _ Z
  • A _ X _ Z
  • A Y _ _ Z
  • A Y X X Z

30
Bounded-Error Approximation for SP-Alignment
  • Defn. More formally, let
  • S be a set of distinct strings.
  • T be an unrooted tree comprised of nodes labeled
    with strings from set S.
  • M be a multiple alignment of the strings in S.
  • M is consistent with T if the induced pairwise
    alignment of Si and Sj has score D(Si, Sj) for
    each pair of strings (Si, Sj) that label adjacent
    nodes in T.

31
Bounded-Error Approximation for SP-Alignment
  • Thm. For any set of strings S and for any tree T
    whose nodes are labeled by distinct strings from
    set S, we can efficiently find a multiple
    alignment M(T) of S that is consistent with T.
  • Proof sketch construct M(T) of S one string at a
    time.
  • Base case
  • Pick two strings Si and Sj labeling nodes
    adjacent in T.
  • Create $M_2(T)$, a two-string alignment with
    distance D(Si, Sj).

32
Bounded-Error Approximation for SP-Alignment
  • Inductive Hypothesis Assume the theorem holds
    for $2 \le k$ strings, i.e., $M_k(T)$ is consistent with
    T.
  • Inductive Step show that the theorem holds for
    k + 1 strings.
  • Pick a string Sj not in $M_k(T)$ such that it labels
    a node adjacent to a node labeled Si already in
    $M_k(T)$.
  • Optimally align Sj with $S_i'$ (Si with spaces in
    $M_k(T)$).
  • Add $S_j'$ (Sj with spaces) to $M_k(T)$ creating
    $M_{k+1}(T)$.
  • Look at detailed proof (pg. 348) to see how the
    issue of inserted spaces is handled.

33
Bounded-Error Approximation for SP-Alignment
  • By construction
  • Sj and Si have distance D(Si, Sj)
  • $M_{k+1}(T)$ is consistent with T.
  • By induction, M(T) of S is consistent with T and
    is efficiently computed.

34
Bounded-Error Approximation for SP-Alignment
  • We need some more definitions at this point
  • Defn. the center string $S_c \in S$, where S is a set of k
    strings, is the string that minimizes
    $M = \sum_{S_j \in S} D(S_c, S_j)$.
  • Defn. the center star is a star tree of k nodes,
    with the center node labeled Sc and each of the k-1
    remaining nodes labeled by a distinct string in
    $S \setminus \{S_c\}$.
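
Choosing the center string is a direct minimization over the input set. A minimal sketch, assuming unit-cost edit distance for D and using illustrative function names:

```python
def edit_distance(a, b, mismatch=1, space=1):
    """Weighted edit distance D(a, b) with assumed unit costs."""
    prev = [j * space for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [i * space]
        for j in range(1, len(b) + 1):
            sub = 0 if a[i - 1] == b[j - 1] else mismatch
            cur.append(min(prev[j - 1] + sub, prev[j] + space, cur[j - 1] + space))
        prev = cur
    return prev[-1]

def star_cost(candidate, strings):
    """M for a candidate center: sum of distances to every string in the set."""
    return sum(edit_distance(candidate, s) for s in strings)

def center_string(strings):
    """The string of the set that minimizes the summed distance to all others."""
    return min(strings, key=lambda c: star_cost(c, strings))

strings = ["AAA", "AAT", "TTT"]
print(center_string(strings), star_cost(center_string(strings), strings))
```

This costs one pairwise DP per pair of strings, so it is polynomial in the total input length.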

35
Bounded-Error Approximation for SP-Alignment
  • Defn. the multiple alignment Mc of strings in S
    is the multiple alignment consistent with the
    center star.
  • Defn. let d(Si, Sj) denote the score of the
    pairwise alignment of strings Sj and Si induced
    by Mc.
  • Defn. let d(M) denote the score of the alignment
    M.
  • Observations
  • $d(S_i, S_j) \ge D(S_i, S_j)$
  • $d(M_c) = \sum_{i<j} d(S_i, S_j)$.
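
One way to build an alignment consistent with the center star is to align each string optimally with the center and merge the results, padding earlier rows whenever a new space appears in the center. The sketch below assumes the center string has already been chosen, uses unit costs, and assumes the input strings contain no "-" character; all names are illustrative.

```python
def align_pair(a, b, mismatch=1, space=1):
    """Optimal global alignment of a and b: returns (a_row, b_row, distance)."""
    n, m = len(a), len(b)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = i * space
    for j in range(1, m + 1):
        D[0][j] = j * space
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if a[i - 1] == b[j - 1] else mismatch
            D[i][j] = min(D[i - 1][j - 1] + sub, D[i - 1][j] + space, D[i][j - 1] + space)
    ra, rb, i, j = [], [], n, m
    while i > 0 or j > 0:  # traceback
        sub = 0 if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and D[i][j] == D[i - 1][j - 1] + sub:
            ra.append(a[i - 1]); rb.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and D[i][j] == D[i - 1][j] + space:
            ra.append(a[i - 1]); rb.append("-"); i -= 1
        else:
            ra.append("-"); rb.append(b[j - 1]); j -= 1
    return "".join(reversed(ra)), "".join(reversed(rb)), D[n][m]

def induced_score(x, y, mismatch=1, space=1):
    """Score of the pairwise alignment induced by two rows (double spaces cost 0)."""
    total = 0
    for a, b in zip(x, y):
        if a == "-" and b == "-":
            continue
        total += space if "-" in (a, b) else (0 if a == b else mismatch)
    return total

def center_star(center, others):
    """Rows of a multiple alignment; rows[0] is the center with spaces."""
    rows = [center]
    for s in others:
        c_row, s_row, _ = align_pair(center, s)
        new, new_s, p, q = [[] for _ in rows], [], 0, 0
        while p < len(rows[0]) or q < len(c_row):
            cur = rows[0][p] if p < len(rows[0]) else None
            nxt = c_row[q] if q < len(c_row) else None
            if cur is not None and cur == nxt:   # same center column in both
                for r, row in zip(new, rows):
                    r.append(row[p])
                new_s.append(s_row[q]); p += 1; q += 1
            elif cur == "-":                     # old space in center: pad new row
                for r, row in zip(new, rows):
                    r.append(row[p])
                new_s.append("-"); p += 1
            else:                                # new space in center: pad old rows
                for r in new:
                    r.append("-")
                new_s.append(s_row[q]); q += 1
        rows = ["".join(r) for r in new] + ["".join(new_s)]
    return rows

center, others = "AXZ", ["AXXZ", "AYZ", "AYXXZ"]
rows = center_star(center, others)
print("\n".join(rows))
```

Each row keeps its optimal pairwise alignment with the center, which is the consistency property required of the star tree; the pairs of non-center strings are not optimized.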

36
Bounded-Error Approximation for SP-Alignment
  • Defn. the triangle inequality wrt a scoring
    scheme is defined as the relation
    $s(x, z) \le s(x, y) + s(y, z)$ for any three characters x, y, and
    z.
  • We can extend the triangle inequality from the
    scoring scheme for characters to string alignment.

37
Bounded-Error Approximation for SP-Alignment
  • Lemma. If a 2-string scoring scheme that
    satisfies the triangle inequality is used, then
    for any Si, Sj in S
  • $d(S_i, S_j) \le d(S_i, S_c) + d(S_c, S_j) = D(S_i, S_c) + D(S_c, S_j)$
  • Proof sketch Notice that for each column we
    have
  • $s(x, z) \le s(x, y) + s(y, z)$
  • The inequality in the lemma follows immediately.
  • The equality holds since all strings are
    optimally aligned with Sc.

38
Bounded-Error Approximation for SP-Alignment
  • We can now establish the bounded-error
    approximation
  • Defn. Let $M^*$ denote the optimal alignment of the
    k strings of S.
  • Defn. Let $d^*(S_i, S_j)$ denote the pairwise
    alignment score of the strings Si and Sj induced
    by $M^*$.

39
Bounded-Error Approximation for SP-Alignment
  • Thm. $d(M_c)/d(M^*) \le 2(k-1)/k < 2$
  • See proof on page 350 for details. (basically
    depends on the previous lemma)
  • Corollary
  • $\frac{k}{2} M \le \sum_{i<j} D(S_i, S_j) \le d(M^*) \le d(M_c) \le \frac{2(k-1)}{k} \sum_{i<j} D(S_i, S_j)$
  • Recall that $M = \sum_{S_j \in S} D(S_c, S_j)$
  • The alignment score D(Si, Sj) is not based on Mc
    or $M^*$
  • Observation $d(M_c) / \sum_{i<j} D(S_i, S_j)$ gives a measure
    of the goodness of Mc and is guaranteed to be
    less than 2.

40
Consensus Objective Functions
  • First fact of consensus representations
  • There is no consensus as to how to define
    consensus.
  • Consequently, we will look at several
    definitions.
  • Steiner consensus strings
  • Defn. Given a set of strings S and a string $\bar{S}$,
    the consensus error of $\bar{S}$ relative to S is
    $E(\bar{S}) = \sum_{S_j \in S} D(\bar{S}, S_j)$.
  • $\bar{S}$ is not required to be a member of S.

41
Consensus Objective Functions
  • Defn. Given a set of strings S, an optimal
    Steiner string $S^*$ for S minimizes the consensus
    error $E(S^*)$.
  • $S^*$ is not required to be a member of S.
  • Observations
  • in $S^*$ we are trying to capture the essential
    common features in S.
  • Computing $E(S^*)$ appears to be a hard problem.

42
Consensus Objective Functions
  • No known efficient method for finding $S^*$.
  • We will consider an approximate method.
  • Lemma Assume that S contains k strings and that
    the scoring scheme satisfies the triangle
    inequality. There exists a string $\bar{S} \in S$ such that
    $E(\bar{S})/E(S^*) \le 2$.
  • Q What does this lemma say?
  • (Proof sketch next slide)

43
Consensus Objective Functions
  • Proof sketch
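  • For any i, $D(\bar{S}, S_i) \le D(\bar{S}, S^*) + D(S^*, S_i)$ so,
  • $E(\bar{S}) = \sum_{S_j \ne \bar{S}} D(\bar{S}, S_j)$ (the term for
    $S_j = \bar{S}$ is zero) and
  • $\sum_{S_j \ne \bar{S}} D(\bar{S}, S_j) \le \sum_{S_j \ne \bar{S}} [D(\bar{S}, S^*) + D(S^*, S_j)]$
  • But $\sum_{S_j \ne \bar{S}} [D(\bar{S}, S^*) + D(S^*, S_j)] = (k-2) D(\bar{S}, S^*) + E(S^*)$
  • Therefore $E(\bar{S}) \le (k-2) D(\bar{S}, S^*) + E(S^*)$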

44
Consensus Objective Functions
  • Q Where do we find a good candidate for $\bar{S}$?
  • A Sc, the center string.
  • Recall Sc minimizes $\sum_{S_j \in S} D(S_c, S_j)$.
  • Thm. $E(S_c)/E(S^*) \le 2 - 2/k$, assuming the scoring
    scheme satisfies the triangle inequality.
  • Proof. Follows immediately from the previous
    lemma and the observation that $E(S_c) \le E(\bar{S})$
    for every $\bar{S} \in S$
45
Consensus Objective Functions
  • Consensus strings from multiple alignment
  • Defn. Let M be a multiple alignment of strings S,
    the consensus character of column i of M is the
    character that minimizes the summed distance to
    all the characters in column i.
  • Note
  • the summed distance depends on the pairwise
    scoring scheme.
  • The plurality character is the consensus
    character for some scoring schemes.

46
Consensus Objective Functions
  • Defn. Let d(i) denote the minimum sum in column
    i.
  • Defn. The consensus string SM derived from
    alignment M is the concatenation of consensus
    characters for each column of M.
  • Q How can we evaluate the goodness of SM ?
  • A One possibility is $\text{Goodness}(S_M) = \sum_i D(S_M, S_i)$,
    i.e., see how good a Steiner string SM
    is.
  • Consider a different approach..

47
Consensus Objective Functions
  • Defn. The alignment error of SM, a consensus
    string containing q characters, is $\sum_{i=1}^{q} d(i)$.
  • Defn. The alignment error of M is defined as the
    alignment error of SM, its consensus string.
  • Example
  • 1 A A T - G - T T T
  • 2 A A - C G T T A T
  • 3 T A T C G - A A T
  • Consensus A A T C G - T A T (alignment error of ?)

48
Consensus Objective Functions
  • Defn. The optimal consensus multiple alignment is
    a multiple alignment M whose consensus string SM
    has the smallest alignment error over all
    possible multiple alignments of S.
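
The consensus character, the column sums d(i), and the alignment error can be sketched as below for a given alignment. The sketch assumes unit costs (mismatch 1, character versus space 1, space versus space 0) and only considers characters that occur in the column, plus the space, as candidates; both are assumptions made for illustration.

```python
def char_dist(a, b, mismatch=1, space=1):
    """Assumed pairwise character cost; '-' denotes a space."""
    if a == b:
        return 0
    if a == "-" or b == "-":
        return space
    return mismatch

def consensus(rows):
    """Return (consensus string, list of column sums d(i)) for an alignment."""
    chars, column_sums = [], []
    for col in zip(*rows):
        candidates = sorted(set(col) | {"-"})
        # Consensus character: minimizes the summed distance to the column.
        best = min(candidates, key=lambda c: sum(char_dist(c, x) for x in col))
        chars.append(best)
        column_sums.append(sum(char_dist(best, x) for x in col))
    return "".join(chars), column_sums

def alignment_error(rows):
    """Alignment error of M: the sum of d(i) over all columns."""
    return sum(consensus(rows)[1])

rows = ["AAT-G-TTT", "AA-CGTTAT", "TATCG-AAT"]
s_m, d = consensus(rows)
print(s_m, d, alignment_error(rows))
```

With unit costs the consensus character is simply the plurality character of each column; a different scoring scheme could pick a different one.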

49
Consensus Objective Functions
  • The 3 notions of consensus we have discussed are
  • The Steiner string S* defined from S.
  • The consensus string SM derived from M, with
    goodness related to its function as a Steiner
    string.
  • The consensus string SM derived from M, with
    goodness related to its ability to reflect the
    column-wise properties of M.
  • Surprisingly (or not) they lead to the same
    multiple alignment.

50
Consensus Objective Functions
  • Let's investigate the assertion that these concepts
    result in the same multiple alignment.
  • Let S be a set of k strings.
  • Let T be the star tree with Steiner string S* at
    the root and each of the k strings of S at
    distinct leaves of T, then
  • Defn. the multiple alignment consistent with S*
    is the multiple alignment of $S \cup \{S^*\}$ consistent
    with T.

51
Consensus Objective Functions
  • Thm. Let S' denote the consensus string of the
    optimal consensus multiple alignment.
  • Removing the spaces from S' results in the optimal
    Steiner string S*.
  • Removal of S* from the multiple alignment
    consistent with S* results in the optimal
    consensus multiple alignment of S.
  • Proof on page 353.

52
Consensus Objective Functions
  • Q Why should we care about this theorem?
  • A The theorem stating $E(S_c)/E(S^*) \le 2 - 2/k$
    plus this theorem can be used to approximate the
    optimal consensus alignment
  • Find the center string Sc. Recall the center
    string $S_c \in S$, where S is a set of k strings, is the string
    that minimizes $M = \sum_{S_j \in S} D(S_c, S_j)$.
  • Place Sc at the center of a k node star.

53
Consensus Objective Functions
  • Label each leaf with a string from S.
  • Construct the multiple alignment M consistent
    with this tree T.
  • Recall M is consistent with T if the induced
    pairwise alignment of Si and Sj has score D(Si,
    Sj) for each pair of strings (Si, Sj) that label
    adjacent nodes in T.

54
Consensus Objective Functions
  • Revelation The multiple alignment M is the same
    as Mc used to approximate the SP objective
    function.
  • Thm. The multiple alignment Mc created by the
    center star method has
  • An SP score $\le (2 - 2/k)$ times the score of the optimal SP
    alignment.
  • A consensus alignment error $\le (2 - 2/k)$ times the
    alignment error of the optimal consensus multiple
    alignment.

55
Phylogenetic trees Multiple alignment
  • Phylogenetic tree a depiction of the
    evolutionary history of a set of taxa. The leaves
    of the tree are labeled by taxa names.
  • Convention
  • Each edge (u,v) denotes an ancestor-descendant
    relation.
  • This relation may be on the basis of
    morphological attributes or sequence similarity.
  • The internal nodes represent extinct taxa.
  • The leaves represent currently existing taxa.

56
Phylogenetic trees Multiple alignment
  • Two related problems
  • Problem find a multiple alignment for a tree
  • Given a phylogenetic tree, deduce sequences for
    the internal nodes to optimize some objective
    function.
  • Find the multiple alignment consistent with the
    tree.
  • Delete the deduced sequences (internal node
    labels)
  • Problem find a tree from a set of leaf sequences.
    (Chapter 17)

57
Phylogenetic trees Multiple alignment
  • Let T be a tree with leaf nodes labeled with
    distinct strings from a set S.
  • Defn. a phylogenetic alignment for T is an
    assignment of one string to each internal node.
  • Note strings labeling internal nodes need not
    come from S.

58
Phylogenetic trees Multiple alignment
  • Recall that D(S1, S2) denotes the edit distance
    between strings S1 and S2.
  • Defn. The edge distance of edge (i, j) is D(Si,
    Sj) where Si and Sj are the strings labeling
    nodes i and j, respectively.
  • Defn. Path distance is the sum of edge distances
    along the path.
  • Defn. Phylogenetic alignment distance is the sum
    of all edge distances in the tree.
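
These definitions translate directly into a sum over the tree's edges. A minimal sketch, assuming unit-cost edit distance for D, a tree given as a list of (parent, child) edges, and a hypothetical labeling of all nodes:

```python
def edit_distance(a, b, mismatch=1, space=1):
    """Weighted edit distance D(a, b) with assumed unit costs."""
    prev = [j * space for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [i * space]
        for j in range(1, len(b) + 1):
            sub = 0 if a[i - 1] == b[j - 1] else mismatch
            cur.append(min(prev[j - 1] + sub, prev[j] + space, cur[j - 1] + space))
        prev = cur
    return prev[-1]

def edge_distance(edge, label):
    """D(Si, Sj) for the strings labeling the two endpoints of an edge."""
    u, v = edge
    return edit_distance(label[u], label[v])

def phylogenetic_alignment_distance(edges, label):
    """Sum of all edge distances in the labeled tree."""
    return sum(edge_distance(e, label) for e in edges)

# Hypothetical tree: root r with children a and c; a has leaves x and y.
edges = [("r", "a"), ("r", "c"), ("a", "x"), ("a", "y")]
label = {"x": "AAA", "y": "AAT", "c": "TTT", "a": "AAT", "r": "AAT"}
print(phylogenetic_alignment_distance(edges, label))
```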

59
Phylogenetic trees Multiple alignment
  • Phylogenetic alignment problem for T
  • Find an assignment of strings to internal nodes
    of T that minimizes the distance of the alignment.

60
Phylogenetic trees Multiple alignment
  • Phylogenetic alignment problem for T
  • The general problem is too hard (NP-complete).
  • We will consider a heuristic approximate
    solution.
  • The solution is within twice the minimal
    distance.
  • The approach has polynomial time complexity.

61
Phylogenetic trees Multiple alignment
  • Defn. A lifted alignment is a phylogenetic
    alignment in which the string assigned to each
    internal node is also assigned to one of its
    children.
  • Example

62
Phylogenetic trees Multiple alignment
  • Lifted Alignment Observation
  • Each internal node v is labeled by a leaf label
    appearing in the subtree rooted at v.

63
Phylogenetic trees Multiple alignment
  • Plan
  • Construct a lifted alignment TL.
  • Initial approach conceptually transform the
    optimal phylogenetic alignment.
  • Q Why do we say conceptually?
  • A Because we don't have T*, the optimal
    phylogenetic alignment.
  • Demonstrate property of TL total distance at most
    twice optimal phylogenetic alignment distance.
  • Next show how to compute TL efficiently using DP.

64
Phylogenetic trees Multiple alignment
  • Creating TL
  • Start with input tree T, with leaves labeled by
    distinct strings.
  • Let T* denote the optimal phylogenetic alignment
    for T. (This is the assignment of strings to
    internal nodes of T that minimizes the total of
    all edge distances.)
  • Successively lift each internal node.
  • An internal node can only be lifted if all of its
    children have been lifted.
  • Leaf nodes are defined to be lifted.

65
Phylogenetic trees Multiple alignment
  • Q How do we lift a node?
  • Let $S_v^*$ denote the label of node v in T*.
  • Assume that v's children have been lifted.
  • WLOG let the labels of v's children be S1,
    S2, ..., Sk from S.

66
Phylogenetic trees Multiple alignment
  • Find the string Sj among the children that is
    closest to $S_v^*$, i.e., the string Sj such that
    $D(S_v^*, S_j) \le D(S_v^*, S_i)$ for all i from 1 to k.
  • Replace $S_v^*$ with Sj.

67
Phylogenetic trees Multiple alignment
  • Claim The lifted alignment TL has total distance
    less than or equal to twice that of the optimal
    phylogenetic alignment T* of T.
  • Sketch of proof
  • Suppose e = (v, w) (v the parent of w) is a
    nonzero-length edge in TL.
  • Suppose v is labeled $S_j \in S$, and w is labeled
    $S_i \in S$.
  • If $S_j \ne S_i$ then the distance of e in TL is
    $D(S_j, S_i) \le D(S_j, S_v^*) + D(S_v^*, S_i)$.
  • But $D(S_j, S_v^*) + D(S_v^*, S_i) \le 2 D(S_v^*, S_i)$
  • Q Why is this true?
  • A because $D(S_j, S_v^*) \le D(S_v^*, S_i)$

68
Phylogenetic trees Multiple alignment
  • Sketch of proof (continued)
  • What about paths?
  • Let Pe denote the path from v to the leaf labeled
    Si in T*. By the triangle inequality, $D(S_v^*, S_i)$
    is at most the sum of the edge distances along Pe.
  • In TL, if e is a nonzero-length edge, then e
    has distance at most twice the length of Pe.

69
Phylogenetic trees Multiple alignment
  • The lifted alignment can be computed with DP.
  • Let Tv be the subtree of T rooted at node v.
  • Defn. d(v, S) denotes the distance of the best
    lifted alignment of Tv where v is labeled with S.
  • Obviously, S must be the label of a leaf in Tv.

70
Phylogenetic trees Multiple alignment
  • d(v, S) is computed from the leaves up.
  • The leaves are already considered lifted.
  • d(v, S) for a parent of leaves is computed by
    $d(v, S) = \sum_{S'} D(S, S')$ where S' is the label of a
    child of v.
  • The general recurrence for an internal node is
    $d(v, S) = \sum_{v'} \min_{S'} [D(S, S') + d(v', S')]$,
    where v' is a child of v and S' labels a leaf in
    $T_{v'}$.
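
The recurrence can be sketched as a bottom-up computation over the tree. Assumptions for this illustration: unit-cost edit distance for D, a tree given as a dictionary from each internal node to its children, and hypothetical names throughout. One deliberate reading that the slide does not spell out: for the child whose subtree contains the leaf labeled S, the sketch forces that child to carry S as well, so the result is always a lifted alignment.

```python
def edit_distance(a, b, mismatch=1, space=1):
    """Weighted edit distance D(a, b) with assumed unit costs."""
    prev = [j * space for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [i * space]
        for j in range(1, len(b) + 1):
            sub = 0 if a[i - 1] == b[j - 1] else mismatch
            cur.append(min(prev[j - 1] + sub, prev[j] + space, cur[j - 1] + space))
        prev = cur
    return prev[-1]

def lifted_dp(tree, root, label):
    """Return d with d[v][leaf] = d(v, S), where S = label[leaf] is a leaf label in T_v."""
    d = {}

    def solve(v):
        kids = tree.get(v, [])
        if not kids:
            d[v] = {v: 0}  # leaves are already lifted
            return
        for c in kids:
            solve(c)
        d[v] = {}
        for c in kids:
            for leaf in d[c]:
                # v takes the label S of `leaf`; child c is forced to carry S too.
                total = d[c][leaf]
                for other in kids:
                    if other == c:
                        continue
                    total += min(edit_distance(label[leaf], label[l2]) + d[other][l2]
                                 for l2 in d[other])
                d[v][leaf] = total

    solve(root)
    return d

# Hypothetical tree: root r with children a and c; a has leaves x and y.
tree = {"r": ["a", "c"], "a": ["x", "y"]}
label = {"x": "AAA", "y": "AAT", "c": "TTT"}
d = lifted_dp(tree, "r", label)
print(d["r"], min(d["r"].values()))
```

The distance of the best lifted alignment is the minimum of d(root, S) over all leaf labels S; recording the minimizing choices would recover the labels themselves.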

71
Phylogenetic trees Multiple alignment
  • Time analysis
  • Assume that T has k leaves.
  • Assume that all pairwise distances have been
    computed.

  • Q How long does this take?
  • A $O(N^2)$ where N is the total length of all the k
    strings.
  • Why is this true? How can we explain it?

72
Phylogenetic trees Multiple alignment
  • Time analysis
  • The processing at an internal node is $O(k^2)$. Why
    is this true?
  • Then the total time is $O(N^2 + k^3)$.
  • Why $O(N^2 + k^3)$ and not $O(N^2 + k^2)$?
  • Bottom line we can compute the optimal lifted
    alignment in time that is polynomial in the
    length of the strings and size of the tree.