LCS and Extensions to Global and Local Alignment - PowerPoint PPT Presentation

1 / 22
About This Presentation
Title:

LCS and Extensions to Global and Local Alignment

Description:

Smith-Waterman Algorithm. Extensions to LCS ... Smith-Waterman (1 of 3) Algorithm. The two molecular sequences will be A=a1a2. ... Smith-Waterman (2 of 3) ... – PowerPoint PPT presentation

Number of Views:174
Avg rating:3.0/5.0
Slides: 23
Provided by: lanct
Category:

less

Transcript and Presenter's Notes

Title: LCS and Extensions to Global and Local Alignment


1
LCS and Extensions to Global and Local Alignment
  • Dr. Nancy Warter-Perez
  • June 26, 2003

2
Overview
  • Recursion
  • Recursive solution to hydrophobicity sliding
    window problem
  • LCS
  • Smith-Waterman Algorithm
  • Extensions to LCS
  • Global Alignment
  • Local Alignment
  • Affine Gap Penalties
  • Programming Workshop 6

3
Project References
  • http//www.sbc.su.se/arne/kurser/swell/pairwise_a
    lignments.html
  • Computational Molecular Biology An Algorithmic
    Approach, Pavel Pevzner
  • Introduction to Computational Biology Maps,
    sequences, and genomes, Michael Waterman
  • Algorithms on Strings, Trees, and Sequences
    Computer Science and Computational Biology, Dan
    Gusfield

4
Recursion
  • Problems can be solved iteratively or recursively
  • Recursion is useful in cases where you are
    building upon a partial solution
  • Consider the hydrophobicity problem

5
  • Main.cpp
  • include ltiostreamgt
  • include ltstringgt
  • using namespace std
  • include "hydro.h"
  • double hydro25 1.8,0,2.5,-3.5,-3.5,2.8,-0.4,-
    3.2,4.5,0,-3.9,3.8,1.9,-3.5,0,
  • -1.6,-3.5,-4.5,-0.8,-0.7,0,4.2,-0.9,0,-1.3
  • void main ()
  • string seq int ws, i
  • cout ltlt "This program will compute the
    hydrophobicity of an sequence of amino acids.\n"
  • cout ltlt "Please enter the sequence "ltlt
    flush cin gtgt seq
  • for(i 0 i lt seq.size() i)
  • if((seq.data()i gt 'a') (seq.data()i
    lt 'z'))
  • seq.at(i) seq.data()i - 32
  • cout ltlt "Please enter the window size "ltlt
    flush cin gtgt ws

6
  • Hydro.cpp
  • include ltiostreamgt
  • include ltstringgt
  • using namespace std
  • include "hydro.h"
  • void print_hydro(string seq, int ws, int i,
    double sum)
  • void compute_hydro(string seq, int ws)
  • cout ltlt "\n\nThe hydrophocity values are" ltlt
    endl
  • print_hydro(seq, ws, seq.size()-1, 0)
  • void print_hydro(string seq, int ws, int i,
    double sum)
  • if(i -1)
  • return
  • if(i gt seq.size() - ws)
  • sum hydroseq.data()i - 'A'
  • else
  • sum sum - hydroseq.data()iws - 'A'
    hydroseq.data()i - 'A'
  • print_hydro(seq, ws, i-1, sum)

7
hydro.h extern double hydro25 void
compute_hydro(string seq, int ws)
8
Dynamic Programming
  • Applied to optimization problems
  • Useful when
  • Problem can be recursively divided into
    sub-problems
  • Sub-problems are not independent

9
Longest Common Subsequence (LCS) Problem
  • Reference Pevzner
  • Can have insertion and deletions but no
    substitutions (no mismatches)
  • Ex V ATCTGAT
  • W TGCATA
  • LCS TCTA

10
LCS Problem (cont.)
  • Similarity score
  • si-1,j
  • si,j max si,j-1
  • si-1,j-1 1, if vi wj
  • On board example Pevzner Fig 6.1

11
Indels insertions and deletions (e.g., gaps)
  • alignment of V and W
  • V rows of similarity matrix (vertical axis)
  • W columns of similarity matrix (horizontal
    axis)
  • Space (gap) in W ? (UP)
  • insertion
  • Space (gap) in V ? (LEFT)
  • deletion
  • Match (no mismatch in LCS) (DIAG)

12
LCS(V,W) Algorithm
  • for i 0 to n
  • si,0 0
  • for j 1 to m
  • s0,j 0
  • for i 1 to n
  • for j 1 to m
  • if vi wj
  • si,j si-1,j-1 1 bi,j DIAG
  • else if si-1,j gt si,j-1
  • si,j si-1,j bi,j UP
  • else
  • si,j si,j-1 bi,j LEFT

13
Print-LCS(V,i,j)
  • if i 0 or j 0
  • return
  • if bi,j DIAG
  • PRINT-LCS(V, i-1, j-1)
  • print vi
  • else if bi,j UP
  • PRINT-LCS(V, i-1, j)
  • else
  • PRINT-LCS(V, I, j-1)

14
Classic Papers
  • Needleman, S.B. and Wunsch, C.D. A General Method
    Applicable to the Search for Similarities in
    Amino Acid Sequence of Two Proteins. J. Mol.
    Biol., 48, pp. 443-453, 1970.(http//poweredge.sta
    nford.edu/BioinformaticsArchive/ClassicArticlesArc
    hive/needlemanandwunsch1970.pdf)
  • Smith, T.F. and Waterman, M.S. Identification of
    Common Molecular Subsequences. J. Mol. Biol.,
    147, pp. 195-197, 1981.(http//poweredge.stanford.
    edu/BioinformaticsArchive/ClassicArticlesArchive/s
    mithandwaterman1981.pdf)
  • Smith, T.F. The History of the Genetic Sequence
    Databases. Genomics, 6, pp. 701-707, 1990.
    (http//poweredge.stanford.edu/BioinformaticsArchi
    ve/ClassicArticlesArchive/smith1990.pdf)

15
Smith-Waterman (1 of 3)
Algorithm The two molecular sequences will be
Aa1a2 . . . an, and Bb1b2 . . . bm. A
similarity s(a,b) is given between sequence
elements a and b. Deletions of length k are given
weight Wk. To find pairs of segments with high
degrees of similarity, we set up a matrix H .
First set Hk0 Hol 0 for 0 lt k lt n and 0 lt
l lt m. Preliminary values of H have the
interpretation that H i j is the maximum
similarity of two segments ending in ai and bj.
respectively. These values are obtained from the
relationship HijmaxHi-1,j-1 s(ai,bj), max
Hi-k,j Wk, maxHi,j-l - Wl , 0 ( 1 )
k
gt 1 l gt 1 1 lt i lt n and 1 lt j
lt m.
16
Smith-Waterman (2 of 3)
  • The formula for Hij follows by considering the
    possibilities for ending the segments at any ai
    and bj.
  • If ai and bj are associated, the similarity is
  • Hi-l,j-l s(ai,bj).
  • (2) If ai is at the end of a deletion of length
    k, the similarity is
  • Hi k, j - Wk .
  • (3) If bj is at the end of a deletion of length
    1, the similarity is
  • Hi,j-l - Wl. (typo in paper)
  • (4) Finally, a zero is included to prevent
    calculated negative similarity, indicating no
    similarity up to ai and bj.

17
Smith-Waterman (3 of 3)
The pair of segments with maximum similarity is
found by first locating the maximum element of H.
The other matrix elements leading to this maximum
value are than sequentially determined with a
traceback procedure ending with an element of H
equal to zero. This procedure identifies the
segments as well as produces the corresponding
alignment. The pair of segments with the next
best similarity is found by applying the
traceback procedure to the second largest element
of H not associated with the first traceback.
18
Extend LCS to Global Alignment
  • si-1,j ?(vi, -)
  • si,j max si,j-1 ?(-, wj)
  • si-1,j-1 ?(vi, wj)
  • ?(vi, -) ?(-, wj) -? fixed gap penalty
  • ?(vi, wj) score for match or mismatch can be
    fixed, from PAM or BLOSUM

19
Extend to Local Alignment
  • 0 (no negative scores)
  • si-1,j ?(vi, -)
  • si,j max si,j-1 ?(-, wj)
  • si-1,j-1 ?(vi, wj)
  • ?(vi, -) ?(-, wj) -? fixed gap penalty
  • ?(vi, wj) score for match or mismatch can be
    fixed, from PAM or BLOSUM

20
Discussion on adding affine gap penalties
  • Affine gap penalty
  • Score for a gap of length x
  • -(? ?x)
  • Where
  • ? gt 0 is the insert gap penalty
  • ? gt 0 is the extend gap penalty

21
Alignment with Gap Penalties Can apply to global
or local (w/ zero) algorithms
  • ?si,j max ?si-1,j - ?
  • si-1,j - (? ?)
  • ?si,j max ?si1,j-1 - ?
  • si,j-1 - (? ?)
  • si-1,j-1 ?(vi, wj)
  • si,j max ?si,j
  • ?si,j
  • Note keeping with traversal order in Figure 6.1,
    ? is replaced by ?, and ? is replaced by ?

22
Programming Workshop 6
  • Implement LCS
  • LCS(V,W)
  • b and s are global matrices
  • Print-LCS(V,i,j)
  • Write a program that uses LCS and Print-LCS. The
    program should prompt the user for 2 sequences
    and print the longest common sequence.
Write a Comment
User Comments (0)
About PowerShow.com