Pairwise alignment algorithms Haverford College, Fall 2001 - PowerPoint PPT Presentation

1 / 54
About This Presentation
Title:

Pairwise alignment algorithms Haverford College, Fall 2001

Description:

... Statistical Methods in Bioinformatics: An Introduction, Springer-Verlag, ... Most textbooks on Computational Biology/Bioinformatics cover these topics ... – PowerPoint PPT presentation

Number of Views:25
Avg rating:3.0/5.0
Slides: 55
Provided by: Elisabett70
Category:

less

Transcript and Presenter's Notes

Title: Pairwise alignment algorithms Haverford College, Fall 2001


1
Pairwise alignment algorithms(Haverford
College, Fall 2001)
  • Elisabetta Manduchi
  • manduchi_at_pcbi.upenn.edu
  • phone 215-573-4408

2
References
  • Ewens and Grant, Statistical Methods in
    Bioinformatics An Introduction, Springer-Verlag,
    New York, 2001
  • Sections 6.2 and 6.4 (all subsections)
  • See also problems 6.1-6.5
  • Background appendix B, sections B.1-B.9
  • Most textbooks on Computational
    Biology/Bioinformatics cover these topics

3
Algorithms definition
  • Websters definition
  • a procedure for solving a mathematical problem
    in a finite number of steps that frequently
    involves a repetition of an operation or
    broadly a step-by-step procedure for solving a
    problem or accomplishing some end

4
Time and Space Complexity
  • Time Complexity or Running Time of an algorithm
    number of steps.
  • Space Complexity amount of temporary storage
    required to run the algorithm.
  • These are functions of the input size.
  • They refer to the worst case scenario.

5
Example
  • Given a list of numbers x1,x2,,xn, where each xi
    is 0 or 1, find the first 0, if any, in this list
    and change it to a 1.
  • For i1,2,,n, if xi0, set xi1. Stop.
  • Input size is n.
  • Total number of steps? Depends on the input, e.g.
  • 1 for 0,1,1,1
  • 3 for 1,1,0,0
  • In the worst case scenario, the total number of
    steps is n. This is the time complexity of this
    algorithm.

6
The O notation
  • Ignore constant factors when trying to evaluate
    time or space complexity.
  • Definition Let f and g be positive-valued
    functions of n. We say that f(n)O(g(n)) as n??
    if there exist constants c?0 and N such that, for
    all n?N, we have f(n)?cg(n).
  • That is, if for sufficiently large values of n,
    f(n) is no bigger than some fixed non-negative
    constant times g(n).

7
Examples
  • 2n3O(n3)
  • 10n415O(n4)O(n5)
  • a polynomial of degree s is O of any polynomial
    of degree t, for t?s. In particular it is O(nt).
  • If agt1 and if g is a polynomial, then f(n)an is
    NOT O(g(n))

8
Substitutions, Insertions, Deletions
  • Mutation one of
  • switch from one nucleotide to another
  • insertion
  • deletion
  • Substitution a switch in nucleotides which
    spreads throughout most of a species.
  • Substitutions, insertions and deletions passed
    along two independent lines of descent cause a
    divergence of the two sequences from the original
    (and from each other)

  • cgggtatccaa
  • cggtatgcca

  • ccctaggtccca

9
Questions
  • Given a collection of sequences. Do they descend
    from the same ancestor sequence?
  • If so, how did they diverge from their ancestor?
    I.e. how should they be aligned?

10
Example
  • For the previous example
  • cggtatgcca? cgggtatccaa , ccctaggtccca, the two
    descendent sequences align as follows
  • c g g g t a - - t - c
    c a a
  • c c c - t a g g t c c
    c - a
  • - (indel) represents an insertion or deletion.
  • A sequence of ? consecutive indels is called a
    gap of length ? .

11
Alignments
  • A means of comparing sequences to answer the
    questions stated above.
  • Types of alignments
  • global/local
  • gapped/ungapped
  • pairwise/multiple
  • In what follows we will focus on pairwise
    alignments.

12
Alignments (cont.)
  • Given two sequences, find an optimal alignment
    between them and use it to answer the questions
    stated above.
  • What is an optimal alignment?
  • Need a way to compare alignments.
  • Attach a score to each alignment.
  • This should reflect the likelihood that this
    alignment was produced as a consequence of
    divergence from a common ancestor.

13
Scoring schemes
  • Given a scoring scheme,
  • an optimal alignment between two sequences is one
    with the best score (there might be more than one
    optimal alignment).
  • the score of the sequence pair is such a best
    score.
  • Using the scores of sequence pairs one can
  • investigate the hypothesis that two sequences
    diverged from a common ancestor
  • use the alignment of a pair of sequences that are
    judged to be related in order to discover common
    patterns.
  • by comparing scores among different species, get
    information to help reconstruct the phylogenetic
    tree that relates them all.

14
Types of scores
  • Similarity Scores the higher the score, the more
    closely related are the two aligned sequences.
  • Distance scores (or distance measures) the lower
    the score, the more closely related the
    sequences.
  • In what follows we will use similarity score.

15
Using substitution matrices and gap penalties
  • Score of an alignment sum of individual scores,
    one for each aligned pair of residues, and a
    score (penalty) for each gap.
  • A substitution matrix s is a (symmetric) matrix
    where s(X,Y) is the score assigned to the aligned
    pair of residues X and Y.
  • A gap penalty is usually a function of the length
    ? of a gap ?(?). The simplest example is the
    linear gap model ?(?) -?d, for some
    non-negative linear gap penalty d.

16
Substitution Matrices
  • A 4?4 (NA) or a 20?20 (AA) symmetric matrix.
  • Examples
  • see table 6.7.
  • s(X,Y)1 if XY, -1 otherwise.
  • In what follows we will assume that a scoring
    scheme, consisting of a substitution matrix and a
    gap penalty function, is given.
  • Substitution matrices (such as BLOSUM and PAM
    matrices) are discussed in detail in section 6.5
    of the Ewens and Grant book and will also be
    discussed next time.

17
Example
  • Let s(X,Y)1 if XY, -1 otherwise and use a
    linear gap score with d2. Then the score of the
    alignment
  • c t t a g - g - -
  • c a t - g a g a a
  • is 1 1 1 -2 1 -2 1 -2 ? 2 -5

18
Gapped Global Pairwise Alignments(linear gap
model)
  • Given
  • a scoring scheme described by a substitution
    matrix and a linear gap penalty
  • two sequences of lengths m and n xX1X2Xm and
    yY1Y2Yn
  • Goal
  • find the score of this sequence pair (i.e. the
    score of an optimal gapped global alignment
    between x and y)
  • find an optimal alignment of this type.

19
Naïve approach
  • Exhaustive search
  • List all possible global gapped alignments of x
    and y.
  • For each such alignment, compute its score using
    the given scoring scheme.
  • Find the maximum of the scores and the
    corresponding alignment(s).

20
Time complexity of the naïve approach
  • Let c(m,n) be the number of global gapped
    alignments between two sequences of lengths m and
    n respectively.
  • How does c grow with m and n?
  • Let g(m,n) be the number of groups obtained by
    grouping together those alignments that have the
    same combination of aligned residue pairs,
    ignoring indels.
  • e.g. X1 X2 X3 - X4 and X1 X2 - X3 X4
  • - Y1 - Y2 Y3 - Y1 Y2
    - Y3

21
Time complexity of the naïve approach (cont.)
  • Then g(m,n)?c(m,n).
  • But
  • The RHS equals (see problem 6.1).
  • This number grows quite fast with m and n.

22
Time complexity of the naïve approach (cont.)
  • To get a sense of how fast this number grows,
    assume mn.
  • Then (use
    Stirlings
  • approximation B.5).
  • For example

23
Needleman-Wunsch algorithm (1970)Gotohs version
(1982)
  • This is an example of dynamic programming
    algorithm
  • break the problem into sub-problems of the same
    kind
  • build the final solution using the solutions for
    the sub-problems.

24
Notation
  • s substitution matrix (for brevity, we will
    write s(i,j) for s(Xi,Yj)), d linear gap
    penalty
  • xX1X2Xm and yY1Y2Yn sequences to align
  • For i1,2,,m and j1,2,,n, let
  • x1,iX1X2Xi, y1,jY1Y2Yj
  • B(i,j) score of the pair (x1i,,y1,j)
  • B(i,0)-id, B(0,j)-jd, B(0,0)0
  • Get an (m1)?(n1) matrix B, whose entry B(m,n)
    is the desired score of the pair (x,y)

25
The algorithm
  • Fill in the elements of B recursively, proceeding
    from top left to bottom right.
  • A highest scoring alignment between x1,i and y1,j
    could terminate in one of 3 ways
  • Xi Xi
    -
  • Yj - Yj
  • In the first case, B(i,j) B(i-1,j-1)s(i,j)
  • In the second case, B(i,j) B(i-1,j)-d
  • In the third case, B(i,j) B(i,j-1)-d

26
Recursive formula
  • B(i,j) maxB(i-1,j-1)s(i,j), B(i-1,j)-d,
    B(i,j-1)-d
  • Thus, to compute B(m,n), the running time is
    O(mn).
  • To find an alignment that has this score, we must
    keep track, at each step of the recursion, of the
    choice made to get the max.

27
Example
  • xgaatct, ycatt (m6 and n4)
  • s(X,Y)1 if XY, -1 otherwise
  • d2

28
Example (cont.)
  • 3 optimal alignments
  • gaatct gaatct gaatct
  • c-at-t ca-t-t -cat-t

29
Fitting one sequence into another(linear gap
model)
  • Given
  • a scoring scheme described by a substitution
    matrix and a linear gap penalty
  • two sequences of lengths m and n xX1X2Xm and
    yY1Y2Yn, with m? n.
  • Goal
  • find the score of an optimal (gapped) alignment
    of x with a subsequence of y.
  • find an optimal alignment of this type.

30
Restating the goal
  • Notation
  • s substitution matrix (for brevity, we will
    write s(i,j) for s(Xi,Yj)), d linear gap
    penalty
  • for 1? i? m and 1? k? j? n, let x1,iX1X2Xi, and
    yk,jYkY2Yj
  • for two sequences u and v, let B(u,v) be the
    score of the pair (u,v)
  • Goal
  • find maxB(x,yk,j) 1? k? j? n

31
Naïve approach
  • For each choice of k and j, with 1? k? j?n, apply
    the NW algorithm to the sequences x and yk,j to
    find B(x,yk,j). Then find the max.
  • Running time to find the optimal score is
    therefore O(mn3)
  • there are choices for
    k and j
  • for each such choice, the running time of NW to
    find B(x,yk,j) is O(m(j-k)).

32
A better approach
  • For i1,2,,m and j1,2,,n, let
    F(i,j)maxB(x1,i,yk,j) k1,2,,j
  • Then maxB(x,yk,j) 1? k? j? n equals
    maxF(m,j) j1,2,,n, since
  • maxB(x,yk,j) 1? k? j? n
  • maxB(x1,m,yk,j) 1? k? j? n
  • max maxB(x1,m,yk,j) k1,2,,j j1,2,,n
  • maxF(m,j) j1,2,,n

33
Computing the matrix F
  • Initialize
  • F(i,0)-id, for i1,2,,m
  • F(0,j)0, for j0,1,,n, since deletions of the
    beginning of y should be without penalty
  • Fill in the elements of F recursively, proceeding
    from top left to bottom right.
  • Look at the last row of the (m1)?(n1) matrix F.
    The max value on this row is the desired score.
  • Note that there might be more than one cell on
    this row at which the max is attained.

34
Recursive formula
  • F(i,j)maxF(i-1,j-1)s(i,j), F(i-1,j)-d,
    F(i,j-1)-d
  • Thus, to compute maxF(m,j) j1,2,,n, the
    running time is O(mn).
  • To find an alignment that has this score, we must
    keep track, at each step of the recursion, of the
    choice made to get the max.

35
Example
  • xatg, ygaatct (m3 and n6)
  • s(X,Y)1 if XY, -1 otherwise
  • d2

36
Example (cont.)
  • One optimal fit
  • a t g
  • g a a t c t

37
Local alignments(linear gap model)
  • Given
  • a scoring scheme described by a substitution
    matrix and a linear gap penalty
  • two sequences of lengths m and n xX1X2Xm and
    yY1Y2Yn.
  • Goal
  • find the score of an optimal (gapped) alignment
    of a subsequence of x with a subsequence of y.
  • find an optimal alignment of this type.

38
Assumption
  • In what follows we make the assumption that the
    scoring scheme we use is such that the expected
    score for a random alignment is negative.
  • If this assumption did not hold, then long
    matches between subsequences could score highly
    just because of their lengths, so that two long
    unrelated subsequences could give a highest
    scoring alignment. We do not want this to occur.

39
Restating the goal
  • Notation
  • s substitution matrix (for brevity, we will
    write s(i,j) for s(Xi,Yj)), d linear gap
    penalty
  • for 1?h? i?m and 1? k? j? n, let xh,iXhX2Xi,
    and yk,jYkY2Yj
  • for two sequences u and v, let B(u,v) be the
    score of the pair (u,v)
  • Goal
  • find maxB(xh,i,yk,j) 1? h? i? m, 1? k? j?
    n, when this is non-negative.

40
Naïve approach
  • For each choice of h, i, k and j, with 1? h? i? m
    and 1? k? j? n, apply the NW algorithm to the
    sequences xh,i and yk,j to find B(xh,i,yk,j).
    Then find the max.
  • Running time to find the optimal score is
    therefore O(m3n3)
  • there are choices
    for h and i and
  • choices
    for k and j
  • for each such choice, the running time of NW to
    find B(xh,i,yk,j) is O((i-h)(j-k)).

41
A better approachthe Smith-Waterman algorithm
(1981)
  • For i1,2,,m and j1,2,,n, let
  • L(i,j)max0, B(xh,i,yk,j) h1,2,, i,
    k1,2,,j
  • Then
  • max0, B(xh,i yk,j) 1? h? i? m, 1? k? j? n
  • equals maxL(i,j) i1,2,,m, j1,2,,n, since
  • max0, B(xh,i,yk,j) 1? h? i? m, 1? k? j? n
  • max max0, B(xh,i,yk,j) h1,2,,i, k1,2,,j
    i1,2,,m, j1,2,,n
  • maxL(i,j) i1,2,,m, j1,2,,n

42
Computing the matrix L
  • Initialize
  • L(i,0)0, for i1,2,,m
  • L(0,j)0, for j0,1,,n,
  • since deletions at the beginning or end of
    our two sequences should be without penalty
  • Fill in the elements of L recursively, proceeding
    from top left to bottom right.
  • Look at the cell of the (m1)? (n1) matrix L.
    The max value is the desired score.
  • Note that there might be more than one cell in L
    at which this max is attained.

43
Recursive formula
  • L(i,j)max0, L(i-1,j-1)s(i,j), L(i-1,j)-d,
    L(i,j-1)-d
  • Thus, to compute
  • maxL(i,j) i1,2,,m, j1,2,,n
  • the running time is O(mn).
  • To find an alignment that has this score, we must
    keep track, at each step of the recursion, of the
    choice made to get the max.

44
Example
  • X2 X3 - X4 X5
  • Y5 Y6 Y7 Y8 Y9

45
Other gap models
  • Linear gap model
  • simple
  • but often not appropriate for biological
    sequences often it is harder for a gap to open
    than it is for it to extend
  • penalize additional gap steps a little less than
    the first one
  • use a more complicated gap cost ?(?)
  • need to adjust the recurrence relations in the
    dynamic programming algorithms

46
Other gap models (cont.)
  • E.g., for global alignments, need to distinguish
    among alignments of x1,i and y1,j that end with
    Xi aligned to an indel
  • the score of such an alignment will also depend
    on how many symbols in x immediately preceding Xi
    are aligned to indels
  • if the last symbol not aligned to an indel
    preceding Xi is Xk, then the i-k symbols from
    Xk1 to Xi are aligned to indels, and the score
    of a highest scoring alignment is B(k, j)?(i-k)

47
Recursive formula for a general gap model(global
alignment)
  • B(0,0)0, B(i,0)?(i), B(0,j)?(j)
  • B(i,j) maxB(i-1,j-1)s(i,j),
  • B(k,j)?(i-k) k0,1,,i-1,
  • B(i,k)?(j-k) k0,1,,j-1
  • Finding B(m,n) has a running time of O(m2nmn2)
    as, for each cell, one must consider ij1
    previous ones, not just 3.

48
Special case affine gap model
  • ?(?)-d-(?-1)e, 0lteltd dgap open penalty, egap
    extension penalty
  • Trick to get a running time of O(mn) use 3
    matrices
  • S(i,j) holds the score of a highest scoring
    alignment between x1,i and y1,j , given that the
    alignment ends with Xi aligned to Yj
  • Ix(i,j) holds the score of a highest scoring
    alignment between x1,i and y1,j , given that the
    alignment ends with Xi aligned to an indel.
  • Iy(i,j) holds the score of a highest scoring
    alignment between x1,i and y1,j , given that the
    alignment ends with and indel aligned to Yj.

49
Initialization
  • S(0,0) Ix(0,0) Iy(0,0)0
  • S(0,j) Ix(0,j) -d-(j-1)e
  • S(i,0) Iy(i,0) -d-(i-1)e
  • We assume that a deletion will not be followed
    directly by an insertion

50
Recursive formulae
  • S(i,j) maxS(i-1,j-1)s(i,j ),
  • Ix(i-1,j-1)s(i,j),
  • Iy(i-1,j-1)s(i,j)
  • Ix(i,j) maxS(i-1,j)-d, Ix(i-1,j)-e
  • Iy(i,j) maxS(i,j-1)-d, Iy(i,j-1)-e
  • The score of the pair (x,y) is then
  • maxS(m,n), Ix(m,n), Iy(m,n)

51
Limitation of the dynamic programming alignment
algorithms
  • When dealing with very long sequences a time
    complexity as that of the algorithms discussed so
    far might not be good enough
  • e.g. when trying to align a query sequence to all
    sequences in a database
  • Other algorithms, like BLAST, have been
    developed.
  • These use heuristic techniques to limit the
    search to a fraction of the possible alignments
    between two sequences, with an attempt not to
    miss the high-scoring alignments.
  • Trade-off dynamic programming always finds the
    best score, BLAST might miss the best scoring
    alignment in some cases.

52
BLAST(Basic Local Alignment Search Tool)
  • For simplicity here we sketch the ideas behind
    the original (ungapped) algorithm
  • There are variants of this algorithm, also
    accomodating for gapped alignments and
    incorporating dynamic programming ideas
  • The statistical aspects of BLAST will be
    discussed in a subsequent lecture
  • GOAL seek for locally maximal segment pairs with
    scores above some cutoff (HSPs)
  • segment paira pair of subsequences of the same
    length, one from each sequence
  • locally maximalscore cannot be improved either
    by extending or shortening both segments
  • a maximal segment pair (MSP) is also a locally
    maximal one
  • statistical results can be used to estimate the
    highest MSP score S at which chance similarities
    are likely to appear

53
BLAST algorithm outline
  • Confine the attention to segment pairs that
    contain a word pair with a score of at least T,
    for some appropriate choice of T and w
  • word pair a segment pair of fixed length w
  • Compile a list of high-scoring words
  • Search for hits each hit gives a seed
  • Extend seeds
  • Note 1 and (the typical implementation of) 4 are
    the steps that might introduce departure from the
    ideal finding guaranteed MSPs

54
An implementation for protein sequences
  • w fixed at 4 (more recently 3), T chosen to
    balance sensitivity and running time
  • List all w-mers that score at least T with some
    w-mer in the query sequence
  • Scan the database in search for hits (perfect
    matches) in the list constructed in the previous
    step (e.g. using a hash)
  • Extend each hit to economize time the process of
    extending in one direction is terminated when a
    segment pair whose score falls a certain distance
    below the best score found for shorter extensions
    is reached
Write a Comment
User Comments (0)
About PowerShow.com