Naive method Boyer-Moore Algorithm/Knuth-Morris Pratt - PowerPoint PPT Presentation

Provided by: dsmiUnisi

1
Algorithms for biological sequence comparison and alignment
Sara Brunetti, Dipartimento di Scienze Matematiche e Informatica,
University of Siena, Italy, sara.brunetti_at_unisi.it





2
History
  • 1953: DNA structure, Watson and Crick
  • 1975: development of the sequencing technique,
    Sanger, Maxam and Gilbert
  • 1980: the American Supreme Court rules that
    genetically modified bacteria can be patented
  • 1990: beginning of the Genome Project
  • Goals:
  • sequence the entire human genome, producing the
    complete DNA transcript
  • produce maps of the genome showing locations of
    expressed sites
  • 2000: Tony Blair and Bill Clinton announce the
    completion of the human genome sequencing
  • Cost: 3 × 10^9 euros

3
  • Amounts of data
  • Human genome: 3 × 10^9 bp, a word 5000 km long
    (Rome to North Cape)
  • Contained in 10^15 cells
  • Macromolecular structures: 35,460 entries (14 March
    2006, PDB)
  • Bioinformatics
  • study of problems of storage, organization and
    distribution of large amounts of genomic data

4
Problems
  • Reconstructing long sequences of DNA from
    overlapping sequence fragments.
  • Storing and retrieving DNA sequences in
    Databases.
  • Searching databases for related sequences and
    subsequences.

5
  • The human genome project is an international
    multi-component effort to read out the
    instruction book for human biology that is our
    genome, and to understand what it is telling us
  • Biological problems
  • Determine genetic maps from probe data
  • Determine the functionality of a protein
  • Characterize a protein
  • Prediction of protein structure
  • Protein interaction

6
Computational biology
  • study of mathematical and combinatorial problems
    of modeling biological processes in the cell,
    interpreting the data and providing theories
    about their biological relations
  • Data representation
  • Problem formulation
  • (Efficient) algorithm design

7
Data representation
  • Alphabet
  • Italian
  • A B C D E F G H I L M N O P Q R S T U V Z
  • English
  • A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
  • DNA
  • A C G T
  • Protein
  • A Q W E R T Y I P L K H G F D S C V N M
  • Binary
  • 0 1

8
Data representation strings
  • DNA
  • String ACCGTATATAAAAGGCCGGGTT
  • Length 22

(Figure: a prefix, a substring and a suffix of the string are marked.)
9
Similarities and differences?
  • Differences between the human genome and the
    chimpanzee genome: 2%
  • Differences between human and worm: 50%
  • Similarity between two humans: 99.9%
  • But the genome length is 3 × 10^9 bp
  • They can differ in 3 × 10^6 positions

10
Similarities and differences
  • The resemblance of two DNA sequences taken from
    different organisms can be explained by the
    theory that all contemporary genetic material has
    a single ancient ancestral DNA.
  • According to this theory, mutations occurred
    during the course of evolution, creating
    differences between families of contemporary
    species.
  • Most of these changes are due to local mutations,
    each modifying the DNA sequence in a specific
    manner.

11
Biological motivation
  • Learning about the functionality or structure of
    a protein without performing any experiments
  • Basic idea
  • In biomolecular sequences (DNA, RNA,
    Aminoacid sequences) high similarity usually
    implies significant functional or structural
    similarity.
  • Usually 25% sequence identity suffices for two
    proteins to have the same 3-D structure and almost
    identical function

12
  • WARNING: sequence similarity implies functional
    similarity, but the reverse is not necessarily
    true!
  • Besides sequences, there are other levels to
    investigate: 3D protein structure, cellular
    biochemistry, morphology, etc., but sequences are
    easier to study.

13
DNA Sequence Comparison First Success Story
  • Finding sequence similarities with genes of known
    function is a common approach to infer a newly
    sequenced gene's function
  • In 1984 Russell Doolittle and colleagues found
    similarities between a cancer-causing gene and the
    normal growth factor (PDGF) gene

14
Cystic Fibrosis
  • Cystic fibrosis (CF) is a chronic and frequently
    fatal genetic disease of the body's mucus glands
    (abnormally high level of mucus in glands). CF
    primarily affects the respiratory systems in
    children.
  • Mucus is a slimy material that coats many
    epithelial surfaces and is secreted into fluids
    such as saliva

15
Cystic Fibrosis Inheritance
  • In the early 1980s biologists hypothesized that CF is
    an autosomal recessive disorder caused by
    mutations in a gene that remained unknown until
    1989
  • Heterozygous carriers are asymptomatic
  • A person must be homozygous recessive in this gene
    to be diagnosed with CF

16
Cystic Fibrosis Finding the Gene
17
Finding Similarities between the Cystic Fibrosis
Gene and ATP binding proteins
  • ATP binding proteins are present on the cell membrane
    and act as transport channels
  • In 1989 biologists found similarity between the
    cystic fibrosis gene and ATP binding proteins
  • This is a plausible function for the cystic fibrosis
    gene, given that CF involves sweat secretion
    with abnormally high sodium levels

18
Cystic Fibrosis Mutation Analysis
  • If a high percentage of cystic fibrosis (CF) patients
    have a certain mutation in the gene and normal
    patients do not, then that could be an indicator
    of a mutation that is related to CF
  • A certain mutation was found in 70% of CF
    patients, convincing evidence that it is a
    predominant genetic diagnostic marker for CF

19
Cystic Fibrosis and CFTR Gene
20
Cystic Fibrosis and the CFTR Protein
  • CFTR (Cystic Fibrosis Transmembrane conductance
    Regulator) protein is acting in the cell membrane
    of epithelial cells that secrete mucus
  • These cells line the airways of the nose, lungs,
    the stomach wall, etc.

21
Mechanism of Cystic Fibrosis
  • The CFTR protein (1480 amino acids) regulates a
    chloride ion channel
  • It adjusts the wateriness of fluids secreted by the cell
  • Those with cystic fibrosis are missing one single
    amino acid in their CFTR
  • Mucus ends up being too thick, affecting many organs
22
Bring in the Bioinformaticians
  • Gene similarities between two genes with known
    and unknown function alert biologists to some
    possibilities
  • Computing a similarity score between two genes
    tells how likely it is that they have similar
    functions

23
Sequence Comparison Problems
  • Informally: find which parts of the sequences are
    alike and which parts are different.
  • 1) Given two sequences over the same alphabet, of
    about the same length (≈10,000 characters) and almost
    equal, find the places where differences occur.
  • Problem 1) arises when the same gene is sequenced
    by two laboratories and they want to compare the
    results.
  • 2) Given two sequences of a few hundred
    characters, find two similar sub-strings (one from
    each sequence).
  • 3) Same as Problem 2), but one sequence is
    compared with thousands of others.
  • Problems 2) and 3) arise when searching for local
    similarities in large databases of bio-sequences.

24
Sequence Comparison Problems
  • 4) Given two sequences of a few hundred
    characters, find a prefix of one similar to a suffix
    of the other.
  • Problem 4) arises in the fragment assembly procedure
    in large-scale DNA sequencing.
  • We introduce a single basic algorithmic idea to
    solve all the above problems.

25
Pairwise alignment
  • How to compare two sequences?
  • Alignment
  • Similarity

26
Sequence alignment an example
s ATGCAGCTGAGCATCG
t ATACAGCGAGTATCG
27
Sequence alignment an example
s ATGCAGCTGAGCATCG
t ATACAGCGAGTATCG
28
Sequence alignment
  • Sequence 1: $s = (s_1, \dots, s_m)$ of size $m$
  • Sequence 2: $t = (t_1, \dots, t_n)$ of size $n$
  • An alignment $(s', t')$ between $s$ and $t$ is obtained
    by insertion of spaces in arbitrary positions
    along the sequences so that they end up with the
    same size $l$
  • $s'_1\ s'_2\ \dots\ s'_l$
  • $t'_1\ t'_2\ \dots\ t'_l$
  • $(s'_i, t'_i)$: a pair of characters of $s$ and $t$, or a space (-)
  • Not allowed: (-,-)

29
Longest path in a network
  • The nodes of the network are (i,j) where each
    (i,j) node denotes an aligned pair (si,tj).
  • There are edges between nodes (i,j) and (k,l) if
    i < k and j < l (order preserving)
  • We add a source and a sink
  • (The weights on the edges are given by function p)

(Figure: the alignment network for two example sequences, with a longest path from source to sink.)
30
The Alignment Grid
  • Every alignment path is from source to sink

31
Edit Distance vs Hamming Distance
Hamming distance always compares the i-th letter
of v with the i-th letter of w
V = ATATATAT
W = TATATATA
Hamming distance d(v, w) = 8
Computing Hamming distance is a trivial task
Edit distance may compare the i-th letter of v
with the j-th letter of w
V = -ATATATAT
W = TATATATA-
Just one shift makes it all line up
Edit distance d(v, w) = 2
Computing edit distance is a non-trivial task
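The contrast between the two distances can be checked on the slide's own strings. The following is a minimal sketch, assuming unit costs for insertion, deletion and substitution; the function names are illustrative and not part of the original slides.

```python
def hamming_distance(v: str, w: str) -> int:
    """Count the positions where v and w differ (equal lengths required)."""
    if len(v) != len(w):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(a != b for a, b in zip(v, w))


def edit_distance(v: str, w: str) -> int:
    """Minimum number of insertions, deletions and substitutions turning v into w."""
    prev = list(range(len(w) + 1))            # row for the empty prefix of v
    for i, a in enumerate(v, start=1):
        curr = [i]                             # deleting the first i letters of v
        for j, b in enumerate(w, start=1):
            curr.append(min(prev[j] + 1,               # delete a
                            curr[j - 1] + 1,           # insert b
                            prev[j - 1] + (a != b)))   # match or substitute
        prev = curr
    return prev[-1]


print(hamming_distance("ATATATAT", "TATATATA"))  # 8
print(edit_distance("ATATATAT", "TATATATA"))     # 2
```

The edit distance needs a table over all prefix pairs, which is why it is the non-trivial one of the two.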
32
Edit Distance Example
  • TGCATAT → ATCCGAT in 5 steps
  • TGCATAT → (delete last T)
  • TGCATA → (delete last A)
  • TGCAT → (insert A at front)
  • ATGCAT → (substitute C for 3rd G)
  • ATCCAT → (insert G before last A)
  • ATCCGAT (Done)
  • What is the edit distance? 5?


33
Edit Distance Example (contd)
TGCATAT → ATCCGAT in 4 steps
TGCATAT → (insert A at front)
ATGCATAT → (delete 6th T)
ATGCAAT → (substitute G for 5th A)
ATGCGAT → (substitute C for 3rd G)
ATCCGAT (Done)
Can it be done in 3 steps???
34
Alignment as a Path in the Edit Graph
   0 1 2 2 3 4 5 6 7 7
v    A T _ G T T A T _
w    A T C G T _ A _ C
   0 1 2 3 4 5 5 6 6 7
(0,0), (1,1), (2,2), (2,3), (3,4), (4,5), (5,5), (6,6), (7,6), (7,7)
- Corresponding path -
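The correspondence between an alignment and its path can be sketched in a few lines: each column advances the row index when v contributes a letter and the column index when w does. The helper name and the use of "_" as the space symbol follow the slide's notation but are otherwise assumptions.

```python
def alignment_to_path(v_row: str, w_row: str, gap: str = "_"):
    """Convert two aligned rows into the list of edit-graph nodes they visit."""
    if len(v_row) != len(w_row):
        raise ValueError("aligned rows must have the same length")
    i = j = 0
    path = [(0, 0)]
    for a, b in zip(v_row, w_row):
        if a == gap and b == gap:
            raise ValueError("a column of two spaces is not allowed")
        if a != gap:
            i += 1          # a letter of v was consumed
        if b != gap:
            j += 1          # a letter of w was consumed
        path.append((i, j))
    return path


print(alignment_to_path("AT_GTTAT_", "ATCGT_A_C"))
```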
35
Alignments in Edit Graph (contd)
  • Horizontal and vertical edges represent indels in v and w, with
    score 0.
  • Diagonal edges represent matches, with score 1.
  • The score of the alignment path is 5.

36
Alignment as a Path in the Edit Graph
Every path in the edit graph corresponds to an
alignment
37
Alignment as a Path in the Edit Graph
  0122345677
v  AT_GTTAT_
w  ATCGT_A_C
  0123455667
(0,0), (1,1), (2,2), (2,3), (3,4), (4,5), (5,5), (6,6), (7,6), (7,7)
38
Alignment as a Path in the Edit Graph
Old Alignment
  0122345677
v  AT_GTTAT_
w  ATCGT_A_C
  0123455667
New Alignment
  0122345677
v  AT_GTTAT_
w  ATCG_TA_C
  0123445667
39
Sequence similarity an example
ATGCAGCTGAGCATCG
ATACAGC-GAGTATCG
  • We assign a score to each alignment
  • $p(a,a) = 1$
  • $p(a,b) = -1$ for $a \neq b$
  • $p(a,-) = p(-,a) = g = -2$
  • Score of the example: $13 \cdot 1 + 2 \cdot (-1) + 1 \cdot (-2) = 9$
  • The best alignment between s and t is the one
    with maximal total score (similarity).
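The column-by-column scoring can be reproduced with a short sketch. It assumes the scoring values of the slide (match 1, mismatch -1, space -2) and that the space shown in the second row of the example stands for a single "-"; the function name is illustrative.

```python
def alignment_score(s_row: str, t_row: str, match=1, mismatch=-1, gap=-2) -> int:
    """Sum the column scores of an alignment given as two rows of equal length."""
    if len(s_row) != len(t_row):
        raise ValueError("aligned rows must have the same length")
    total = 0
    for a, b in zip(s_row, t_row):
        if a == "-" and b == "-":
            raise ValueError("a column of two spaces is not allowed")
        if a == "-" or b == "-":
            total += gap            # p(a,-) = p(-,a) = g
        elif a == b:
            total += match          # p(a,a)
        else:
            total += mismatch       # p(a,b), a != b
    return total


print(alignment_score("ATGCAGCTGAGCATCG", "ATACAGC-GAGTATCG"))  # 9
```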

40
Global alignment
41
Global alignment
  • Given two sequences s and t of roughly the same
    length, determine the alignment of s and t with
    maximal score
  • AC-GCTTTG
  • -CATGTAT-
  • (Needleman-Wunsch algorithm)
  • Motivation: the same gene is sequenced by two
    laboratories and they want to compare the results

42
Number of alignments
  • In how many ways can s be aligned with t?
  • $s'_1\ s'_2\ \dots\ s'_l$
  • $t'_1\ t'_2\ \dots\ t'_l$
  • $\max(n,m) \le l \le n+m$
  • s1 .. sm -  ..  -
  • -  ..  -  t1 .. tn
  • - s1 .. sm -
  • t1   ....   tn
  • $f(i,j)$ = number of alignments of one sequence of $i$ letters
    with another of $j$ letters
  • $f(n,m) = f(n-1,m) + f(n-1,m-1) + f(n,m-1)$ and
    $f(n,n) \sim (1+\sqrt{2})^{2n+1}\, n^{-1/2}$ as $n \to \infty$

43
  • E.g., two sequences of length 1000 have the
    following number of possible alignments:
  • $f(1000,1000) \approx (1+\sqrt{2})^{2001} \cdot 1000^{-1/2} \approx 10^{767.4}$ !!!
  • (there are $10^{80}$ elementary particles in the
    universe)
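The recurrence for the number of alignments can be evaluated directly. This is a minimal sketch that assumes the boundary values f(i,0) = f(0,j) = 1 (one way to align a sequence with the empty sequence), which the slides do not state explicitly.

```python
def count_alignments(n: int, m: int) -> int:
    """Number of alignments of a sequence of n letters with one of m letters."""
    # f[i][0] = f[0][j] = 1: only one way to align against the empty sequence
    f = [[1] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            f[i][j] = f[i - 1][j] + f[i - 1][j - 1] + f[i][j - 1]
    return f[n][m]


for n in (1, 2, 3, 10):
    print(n, count_alignments(n, n))
```

Python integers are unbounded, so `len(str(count_alignments(1000, 1000)))` gives the number of decimal digits for the length-1000 case without overflow.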

44
Dynamic Programming Algorithm
  • Basic Idea of Dynamic Programming a problem is
    solved taking advantage of the already solved
    sub-problems.
  • Each optimal alignment contains optimal
    alignments of the subproblems (example)
  • - GCTGATATAGCT
  • GGGTGAT -TAGCT
  • Additivity of the penalty function
  • Three essential components
  • Recurrence relation
  • Tabular computation
  • Traceback

45
Dynamic Programming Algorithm Recurrence
relation
  • Sequence 1: $s$ of size $m$
  • Sequence 2: $t$ of size $n$
  • $s_{i..j}$: sub-string from char $i$ to
    char $j$ of $s$.
  • $M(i,j)$ is the score of the best alignment between
    $s_{1..i}$ and $t_{1..j}$
  • $M(j,0) = M(0,j) = -2j$
  • $M(m,n)$ is computed by solving the more general
    problem of computing $M(i,j)$ for all $i,j$
  • No top-down approach, but bottom-up
  • The computation is arranged in an $(m+1) \times (n+1)$
    array $M$

$$M(i,j) = \max \begin{cases} M(i,j-1) - 2 \\ M(i-1,j-1) + p(i,j) \\ M(i-1,j) - 2 \end{cases} \qquad p(i,j) = \begin{cases} 1 & \text{if } s_i = t_j \\ -1 & \text{if } s_i \neq t_j \end{cases}$$
46
Dynamic Programming Algorithm Tabular computation
Example: s = AAAC, t = AGC.

         t    A    G    C
  s      0   -2   -4   -6
  A     -2    1   -1   -3
  A     -4   -1    0   -2
  A     -6   -3   -2   -1
  C     -8   -5   -4   -1

Row 0: comparison between t and an empty sequence.
Column 0: comparison between s and an empty sequence.
$M(i,j)$ is computed by observing the 3 previous
entries $M(i-1,j-1)$, $M(i,j-1)$ and $M(i-1,j)$.
  • $M(i-1,j-1)$: a new char of s and a new char of t
    are considered; +1 is added in case of a match and
    -1 in case of a mismatch. Align $s_{1..i-1}$ with
    $t_{1..j-1}$ and match $s_i$ with $t_j$.
  • $M(i,j-1)$: a new char of the sequence t is considered,
    corresponding to a space in s (-2). Align $s_{1..i}$
    with $t_{1..j-1}$ and match a space with $t_j$.
  • $M(i-1,j)$: a new char of the sequence s is considered,
    corresponding to a space in t (-2). Align $s_{1..i-1}$
    with $t_{1..j}$ and match a space with $s_i$.
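The tabular computation can be reproduced with a short sketch, assuming the scoring used on these slides (match +1, mismatch -1, space -2). Strings are 0-indexed in Python, so `s[i - 1]` plays the role of $s_i$; the function name is illustrative.

```python
def similarity_matrix(s: str, t: str, match=1, mismatch=-1, g=-2):
    """Fill the (m+1) x (n+1) matrix M of best global alignment scores."""
    m, n = len(s), len(t)
    M = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        M[i][0] = i * g                     # s[1..i] against the empty sequence
    for j in range(1, n + 1):
        M[0][j] = j * g                     # t[1..j] against the empty sequence
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            p = match if s[i - 1] == t[j - 1] else mismatch
            M[i][j] = max(M[i][j - 1] + g,      # space in s
                          M[i - 1][j - 1] + p,  # s_i aligned with t_j
                          M[i - 1][j] + g)      # space in t
    return M


for row in similarity_matrix("AAAC", "AGC"):
    print(row)
```

For s = AAAC and t = AGC the bottom-right entry is -1, the best score of the example.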
47
Dynamic Programming Algorithm- Traceback
Trace back to find the best alignment(s).
(Figure: the matrix M for s = AAAC and t = AGC with the traceback of solution 2 marked; best score -1.)
48
Algorithm Similarity
  • input: S, T, m, n
  • output: M
  • for i ← 1 to m do
  •     M(i,0) ← i·g
  • for j ← 0 to n do
  •     M(0,j) ← j·g
  • for i ← 1 to m do
  •     for j ← 1 to n do
  •         M(i,j) ← max( M(i-1,j) + g,
  •                       M(i-1,j-1) + p(i,j),
  •                       M(i,j-1) + g )
  • return M
  • Complexity: O(nm)

49
Align(i, j, len)
input: i, j, array M obtained by the Similarity algorithm
output: alignment in align-s, align-t, vectors of length len
if i = 0 and j = 0 then
    len ← 0
else if i > 0 and M(i,j) = M(i-1,j) + g then
    Align(i-1, j, len)
    len ← len + 1
    align-s[len] ← s_i
    align-t[len] ← space
else if i > 0 and j > 0 and M(i,j) = M(i-1,j-1) + p(i,j) then
    Align(i-1, j-1, len)
    len ← len + 1
    align-s[len] ← s_i
    align-t[len] ← t_j
else
    Align(i, j-1, len)
    len ← len + 1
    align-s[len] ← space
    align-t[len] ← t_j
  • First call: Align(m, n, len)
  • max(|s|,|t|) ≤ len ≤ m + n
  • Algorithm Align finds solution 1.
  • By inverting the order of the if statements it
    is possible to find the other solutions.
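The traceback can be sketched iteratively instead of recursively. This is an illustration under the same scoring assumptions (match +1, mismatch -1, space -2), testing the three cases in the same order as Align; which of the co-optimal alignments comes out depends on that order.

```python
def global_alignment(s: str, t: str, match=1, mismatch=-1, g=-2):
    """Return (aligned s, aligned t, score) for one optimal global alignment."""
    m, n = len(s), len(t)
    M = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        M[i][0] = i * g
    for j in range(1, n + 1):
        M[0][j] = j * g
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            p = match if s[i - 1] == t[j - 1] else mismatch
            M[i][j] = max(M[i][j - 1] + g, M[i - 1][j - 1] + p, M[i - 1][j] + g)

    row_s, row_t = [], []
    i, j = m, n
    while i > 0 or j > 0:                      # walk back from (m, n) to (0, 0)
        if i > 0 and M[i][j] == M[i - 1][j] + g:
            row_s.append(s[i - 1]); row_t.append("-"); i -= 1
        elif i > 0 and j > 0 and M[i][j] == M[i - 1][j - 1] + (
                match if s[i - 1] == t[j - 1] else mismatch):
            row_s.append(s[i - 1]); row_t.append(t[j - 1]); i -= 1; j -= 1
        else:
            row_s.append("-"); row_t.append(t[j - 1]); j -= 1
    return "".join(reversed(row_s)), "".join(reversed(row_t)), M[m][n]


print(global_alignment("AAAC", "AGC"))
```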

50
Complexity
  • Algorithm Similarity takes O(mn) time and
    space
  • Algorithm Align takes O(m+n) time
  • Let h = m+n
  • T(h) = k for h ≤ 2
  • T(h) = T(h-1) + k for h > 2 (k
    constant)
  • T(h) = O(h) = O(m+n)
  • Algorithm Similarity can be refined to run in
    O(m+n) space:
    in a row-by-row computation, store the last
    and the current row only.
  • Algorithm Align can be designed to run in
    O(m+n) space with a
    divide-and-conquer strategy.
    It is not a trivial task!

The basic algorithm Similarity can be modified
to solve a variety of different problems!
51
More about similarity and distance
52
Similarity and distance
  • Two approaches to comparing strings
  • Similarity measures how much the strings are
    alike
  • Its definition derives from the concept of a single
    ancient ancestral DNA
  • An alignment $(s', t')$ of the strings $s$ and $t$ is
    obtained by inserting space characters in them in
    such a way that
  • $|s'| = |t'|$
  • Removal of - from $s'$ gives $s$
  • Removal of - from $t'$ gives $t$
  • For every $i$, either $s'_i$ or $t'_i$ is not -
  • A scoring system $(p, g)$ has members
  • $p: A \times A \to \mathbb{R}$, $g < 0$
  • additive scoring
  • $sim(s,t) = \max\, score(s', t')$

53
Similarity and distance
  • Distance measures how much the strings differ
  • Its definition derives from the concept of
    mutations
  • A distance $d$ on $E$ is a function $d: E \times E \to \mathbb{R}$ such that
  • $d(x,x) = 0$ for all $x$ in $E$ and $d(x,y) > 0$ for $x \neq y$
  • $d(x,y) = d(y,x)$ for all $x, y$ in $E$
  • $d(x,y) \le d(x,z) + d(y,z)$ for all $x, y, z$ in $E$
  • An alignment is obtained by successive
    applications of a number of admissible operations
    transforming $s$ into $t$
  • substitution $a \to b$
  • insertion or deletion of any character (indel)
  • A cost measure $(c, h)$ has members
  • $c: A \times A \to \mathbb{R}$, $h > 0$

54
  • When are similarity and distance algorithms
    equivalent?
  • When sequences are aligned by distance in global
    alignment, there is a similarity algorithm that
    gives the same set of optimal alignments, and
    vice versa
  • The measures are related by the formulas
  • $p(a,b) = M - c(a,b)$, $g = -h + M/2$
  • $dist(s,t) + sim(s,t) = \frac{M}{2}(|s| + |t|)$
  • E.g.
  • Edit distance, $M = 0 \Rightarrow p(a,a) = 0$, $p(a,b) = -1$, $g = -1$
  • $M = 2 \Rightarrow p(a,a) = 2$, $p(a,b) = 1$,
    $g = 0$
  • Same set of optimal solutions, different scores.
  • Usually $0 < M < \max c(a,b)$

55
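  • aligned letters ↔ substitutions: $l$
  • spaces ↔ indel operations: $r$
  • $score(s',t') = \sum_{i=1}^{l} p(a_i,b_i) + r\,g$
  • $cost(s \to t) = \sum_{i=1}^{l} c(a_i,b_i) + r\,h$
  • $score(s',t') + cost(s \to t) = l\,M + r\,M/2$
  • in a global alignment $|s| + |t| = 2l + r$, so
  • $score(s',t') + cost(s \to t) = \frac{M}{2}(|s| + |t|)$
  • $dist(s,t) = \min\left(\frac{M}{2}(|s|+|t|) - score(s',t')\right)$
  • $= \frac{M}{2}(|s|+|t|) - sim(s,t)$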
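The relation between distance and similarity can be checked numerically. The sketch below assumes the edit-distance cost measure (c(a,a) = 0, c(a,b) = 1, h = 1) and derives the scoring system from it with $p = M - c$ and $g = -h + M/2$; the names `optimum`, `dist_and_sim` and `big_m` are illustrative.

```python
def optimum(s, t, pair, space, choose):
    """Generic global DP: `choose` is min for distance, max for similarity."""
    prev = [j * space for j in range(len(t) + 1)]
    for i, a in enumerate(s, start=1):
        curr = [i * space]
        for j, b in enumerate(t, start=1):
            curr.append(choose(prev[j] + space,
                               curr[j - 1] + space,
                               prev[j - 1] + pair(a, b)))
        prev = curr
    return prev[-1]


def dist_and_sim(s, t, big_m=2, h=1):
    """Edit distance and the similarity induced by p = M - c, g = -h + M/2."""
    def cost(a, b):
        return 0 if a == b else 1
    dist = optimum(s, t, cost, h, min)
    sim = optimum(s, t, lambda a, b: big_m - cost(a, b), -h + big_m / 2, max)
    return dist, sim


d, sm = dist_and_sim("TGCATAT", "ATCCGAT", big_m=2)
print(d, sm, d + sm)   # the sum equals (M/2) * (|s| + |t|) = 14
```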

56
Semi-global alignment
57
Semi-global Comparison
  • Find the best fit of a short sequence t of size n
    into a larger sequence s of size m
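  • $s_1 \dots s_k \dots s_l \dots s_m$
  • (Figure: t fitted against the substring $s_k \dots s_l$ of s.)
  • The solution to this problem as formulated above
    will take time proportional to
  • $\sum_{k=1}^{m} \sum_{l=k}^{m} n(l-k) = O(nm^3)$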

58
(Exact matching)
  • Problem given a pattern p and a larger string s,
    find all the occurrences of the pattern p in s
  • Is there an occurrence?
  • How many times does p occur in s?
  • Naive method
  • Boyer-Moore algorithm / Knuth-Morris-Pratt algorithm
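The naive method tries every starting position of s and compares the pattern there. A minimal sketch, with an illustrative function name; it takes O(|p|·|s|) time in the worst case, which is what Boyer-Moore and Knuth-Morris-Pratt improve on.

```python
def naive_search(p: str, s: str):
    """Return the 0-based start positions of all occurrences of p in s."""
    return [k for k in range(len(s) - len(p) + 1) if s[k:k + len(p)] == p]


occurrences = naive_search("TAT", "ATATATAT")
print(occurrences)            # [1, 3, 5]
print(len(occurrences) > 0)   # is there an occurrence?
print(len(occurrences))       # how many times p occurs in s
```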

59
Semi-global Comparison
Ignore the spaces at the beginning and at the end
of a sequence.
Problem: find the highest-scoring semi-global alignment
between t and a substring (a prefix of a suffix) of s.
s  CAGCA-CTTGGATTCTCGG
t  ---CAGCTTGG--------
1. Ignore final spaces: find the best score between
t and a prefix of s.
$M(i,j)$ of Problem 1 contains the best score
between $s_{1..i}$ and $t_{1..j}$, hence take the
maximum value $M(i,n)$ in the last column $n$.
There is no need to reach the last row.
60
Semi-global Comparison
s  CAGCA-CTTGGATTCTCGG
t  ---CAGCTTGG--------
2. Ignore initial spaces: find the best alignment
between t and a suffix of s.
$M(i,j)$ now contains the best score between
$t_{1..j}$ and a suffix of $s_{1..i}$, hence in the
first column we have all zeroes.
(Figure: the first column of M, one entry per initial char of s, filled with zeroes.)
Join solutions 1 and 2 to solve semi-global comparison!
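Joining the two modifications gives a small change to the global algorithm: the first column is all zeroes and the answer is the maximum of the last column. The sketch below assumes rows are indexed by s and columns by t, with the same scoring as before; the function name and the returned end position are illustrative.

```python
def semiglobal_score(s: str, t: str, match=1, mismatch=-1, g=-2):
    """Best score of t against any substring of s, and the row where it ends."""
    n = len(t)
    prev = [j * g for j in range(n + 1)]   # row 0: t[1..j] against nothing
    best, best_i = prev[n], 0
    for i, a in enumerate(s, start=1):
        curr = [0]                         # first column zero: skip a prefix of s for free
        for j, b in enumerate(t, start=1):
            p = match if a == b else mismatch
            curr.append(max(curr[j - 1] + g, prev[j - 1] + p, prev[j] + g))
        if curr[n] > best:                 # maximum over the last column
            best, best_i = curr[n], i
        prev = curr
    return best, best_i


print(semiglobal_score("AAACAGCTTGGAAA", "CAGCTTGG"))  # (8, 11)
```

In the usage line t occurs exactly inside s, so every character of t is matched and the score equals the length of t.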
61
Local Alignment
62
Local Alignment
  • Find the best fit between a sub-string of s and
    a sub-string of t.
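  • $s_1 \dots s_k \dots s_i \dots s_m$
  • $t_1 \dots t_h \dots t_j \dots t_n$
  • (Figure: the substring $s_k \dots s_i$ of s aligned with the substring $t_h \dots t_j$ of t.)
  • Motivation: ignore stretches of non-coding DNA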

63
Local Alignment Algorithm: Smith-Waterman
Global alignment:
--T-CC-C-AGT-TATGT-CAGGGGACACGA-GCATGCAGA-GAC
AATTGCCGCC-GTCGT-T-TTCAG----CA-GTTATGT-CAGAT--C
Local alignment (better alignment to find the conserved segment):
tccCAGTTATGTCAGgggacacgagcatgcagagac
aattgccgccgtcgttttcagCAGTTATGTCAGatc
64
Local Alignment Example
Local alignment
Global alignment
65
Local Alignment Algorithm: Smith-Waterman
  • The local alignment (LA) problem is still solved by computing M.
  • $M(i,j)$ holds the value of the best alignment
    between a suffix of $s_{1..i}$ and a suffix of $t_{1..j}$.
  • The first row and the first column are
    initialized with zeros.

66
Local Alignment
$$M(i,j) = \max \begin{cases} M(i,j-1) - 2 \\ M(i-1,j-1) + p(i,j) \\ M(i-1,j) - 2 \\ 0 \end{cases} \qquad p(i,j) = \begin{cases} 1 & \text{if } s_i = t_j \\ -1 & \text{if } s_i \neq t_j \end{cases}$$
For any entry $M(i,j)$ there always exists the
alignment between the empty suffixes of $s_{1..i}$
and of $t_{1..j}$, with score 0.
At the end choose the entry $M(i,j)$ with maximal score in any position.
Start the alignment by tracing back, as before, from there
until you find a value 0.
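The local recurrence differs from the global one only by the extra 0 and by the zero initialization. A minimal sketch of the score computation (traceback omitted), assuming match +1, mismatch -1 and space -2; the function name is illustrative and the comparison is case-sensitive.

```python
def local_alignment_score(s: str, t: str, match=1, mismatch=-1, g=-2):
    """Best local alignment score and the cell (i, j) where it is reached."""
    prev = [0] * (len(t) + 1)              # first row initialized with zeros
    best, best_pos = 0, (0, 0)
    for i, a in enumerate(s, start=1):
        curr = [0]                         # first column initialized with zeros
        for j, b in enumerate(t, start=1):
            p = match if a == b else mismatch
            value = max(0, curr[j - 1] + g, prev[j - 1] + p, prev[j] + g)
            curr.append(value)
            if value > best:               # remember the maximal entry anywhere
                best, best_pos = value, (i, j)
        prev = curr
    return best, best_pos


print(local_alignment_score("AAACCC", "TTTCCC"))  # (3, (6, 6))
```

Tracing back from the returned cell until a 0 is met would recover the aligned substrings.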
67
Example
HEGAWGHEE PAW HEAE
68
End free-space alignment -Motivation
  • Find the best fit of substrings of s and t, where
    at least one of these substrings must be a prefix
    of the original string and one must be a suffix?
  • Motivation in the shotgun sequence assembly
    procedure, one has a large set of partially
    overlapping substrings that come from many copies
    of one original but unknown DNA sequences.
  • The problem is to use comparisons of pairs of
    substrings to infer the correct original string.

69
End free-space alignment
70
Example
HEGAWGHEE PAW HEAE
71
Kinds of Alignment
72
Complexity of Alignments
The space complexity could be a critical
bottleneck. How can we improve such a
complexity?
Linear-space alignment: Hirschberg's
algorithm, Miller and Myers' algorithm
73
Extensions to the basic algorithm
  • Hirschberg's linear-space method for alignment
    uses a divide-and-conquer strategy

74
  • The computation of sim(s,t) can be easily done in
    linear space
  • Each row (or column) depends only on the
    preceding one
  • Optimal alignment in linear space?
  • Divide-and-conquer strategy: divide the problem
    into smaller subproblems and combine their
    solutions into the solution of the whole problem

(Figure: row-by-row computation of M for s = AAAC and t = AGC, keeping only the rows "old" and "temp"; BestScore($s_{1..m}$, $t_{1..n}$, a) leaves the last row in a.)
75
  • The idea is to consider the possible
    configurations of an alignment
  • 1.
  • ( $s_{1..i-1}$   $s_i$   $s_{i+1..m}$
  •   -   -   $t_{1..n}$
  •   for $j = 0$ )
  • $s_{1..i-1}$   $s_i$   $s_{i+1..m}$
  • $t_{1..j-1}$   $t_j$   $t_{j+1..n}$
  • for $1 \le j \le n-1$
  • ( $s_{1..i-1}$   $s_i$   $s_{i+1..m}$
  •   $t_{1..n}$   -   -
  •   for $j = n$ )
  • 2.
  • $s_{1..i-1}$   $s_i$   $s_{i+1..m}$
  • $t_{1..j}$   -   $t_{j+1..n}$
  • for $0 \le j \le n$

76
  • An optimal alignment must also satisfy
  • Optimal($s_{1..i-1}$, $t_{1..j-1}$) + $(s_i, t_j)$ +
    Optimal($s_{i+1..m}$, $t_{j+1..n}$)
  • for $0 \le j \le n$, in case 1
  • or
  • Optimal($s_{1..i-1}$, $t_{1..j}$) + $(s_i, -)$ +
    Optimal($s_{i+1..m}$, $t_{j+1..n}$)
  • for $0 \le j \le n$, in case 2
  • For a fixed $i$,
  • align $s_{1..i-1}$ with a prefix of $t$, and $s_{i+1..m}$
    with a suffix of $t$, for $0 \le j \le n$
  • Take $i = m/2$: $M(m,n) = \max_j \left( M(m/2,j) + M^r(m/2,n-j) \right)$

77
Recursive algorithm
  • Compute BestScore($s_{1..m/2}$, $t_{1..n}$, a)
    → $M(m/2, n)$
  • Compute BestScore($s_{m/2..m}$, $t_{n..1}$, b)
    → $M^r(m/2, n)$
  • Find the column $k$ so that
  • $M(m/2,k) + M^r(m/2,n-k) = M(m,n)$
  • Recursively partition the problem into two
    subproblems
  • Find the path from (0,0) to (m/2,k)
  • Find the path from (m,n) to (m/2,n-k)
  • Space complexity: O(min(m,n))
  • At each step save a and b, and the traceback
    pointers for the cells in a and b, and then for
    (m/2,k) only
  • Time complexity: T(m,n) = number of times a best
    score is computed
  • $T(m,n) = O(mn) + T(m/2,k) + T(m/2,n-k) = O(mn)$
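The split step of the recursive algorithm can be sketched with two rows of storage. This is only an illustration of finding the column k, not a full linear-space aligner; `last_row` plays the role of BestScore, the reverse pass runs on both strings reversed, and the scoring (match +1, mismatch -1, space -2) is the one assumed throughout.

```python
def last_row(s: str, t: str, match=1, mismatch=-1, g=-2):
    """Last row of the similarity matrix of s and t, storing two rows only."""
    prev = [j * g for j in range(len(t) + 1)]
    for i, a in enumerate(s, start=1):
        curr = [i * g]
        for j, b in enumerate(t, start=1):
            p = match if a == b else mismatch
            curr.append(max(curr[j - 1] + g, prev[j - 1] + p, prev[j] + g))
        prev = curr
    return prev


def split_column(s: str, t: str):
    """Return (k, score): a column where an optimal path crosses row len(s)//2."""
    mid, n = len(s) // 2, len(t)
    forward = last_row(s[:mid], t)                 # M(m/2, j) for all j
    backward = last_row(s[mid:][::-1], t[::-1])    # Mr(m/2, n-j), on reversed strings
    totals = [forward[j] + backward[n - j] for j in range(n + 1)]
    k = max(range(n + 1), key=totals.__getitem__)
    return k, totals[k]


print(last_row("AAAC", "AGC"))      # [-8, -5, -4, -1]
print(split_column("AAAC", "AGC"))  # (1, -1)
```

The score returned with k equals the last entry of the full last row, which is M(m,n); recursing on the two halves around (m/2, k) would rebuild the alignment.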

78
(Figure: the matrix split at row m/2 and column k into two subproblems, of sizes m/2 × k and m/2 × (n-k).)
79
Gap penalty
80
Gap penalty function
Gap: a consecutive number ($k \ge 1$) of spaces.
From biology we know that when mutations are involved,
a gap of k spaces is more probable than k isolated
spaces. One concrete example is given by
cDNA matching.
In the previous problems the cost
w(k) of k internal consecutive spaces was
proportional to k: $w(k) = kg$.
Now $w(k) = h + kg$, where $h + g$ is the cost of the first space
of a gap and $g$ the cost of each of the following ones, $k \ge 1$.
Example: CA-----CTTGG contains a gap of 5 spaces; the first
space costs $h + g$ and each of the other four costs $g$,
so $w(5) = h + 5g$.
81
  • Attention!
  • The scoring system is no longer additive, i.e. we
    cannot break an alignment into two parts and expect
    the total score to be the sum of the partial
    scores
  • AAC---AATTCCGACTAC
  • ACTACCT------CGC--
  • The scoring of an alignment is done at the block
    level

82
Similarities with gap
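We need three matrices a, b, c, with the
following meaning:
  • $a(i,j)$: maximum score of an alignment between
    $s_{1..i}$ and $t_{1..j}$ where $s_i$ is matched with $t_j$.
  • $b(i,j)$: maximum score of an alignment between
    $s_{1..i}$ and $t_{1..j}$ that ends in a - aligned with $t_j$.
  • $c(i,j)$: maximum score of an alignment between
    $s_{1..i}$ and $t_{1..j}$ that ends in $s_i$ aligned with a -.
where
$$a(i,j) = p(i,j) + \max \{\, a(i-1,j-1),\ b(i-1,j-1),\ c(i-1,j-1) \,\}$$
$$b(i,j) = \max \{\, a(i,j-1) - (h+g),\ b(i,j-1) - g,\ c(i,j-1) - (h+g) \,\}$$
$$c(i,j) = \max \{\, a(i-1,j) - (h+g),\ b(i-1,j) - (h+g),\ c(i-1,j) - g \,\}$$
The terms with $-(h+g)$ open a new gap (first space); the terms with $-g$ extend an existing gap.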
83
Initialization:
  • $a(0,0) = 0$, $a(i,0) = -\infty$ for $1 \le i \le m$,
    $a(0,j) = -\infty$ for $1 \le j \le n$
  • $b(i,0) = -\infty$ for $0 \le i \le m$,
    $b(0,j) = -(h + gj)$ for $1 \le j \le n$
  • $c(i,0) = -(h + gi)$ for $1 \le i \le m$,
    $c(0,j) = -\infty$ for $0 \le j \le n$
Final result: the maximum among $a(m,n)$, $b(m,n)$ and $c(m,n)$.
Trace back to obtain the optimal alignment,
remembering the current position and which array
it belongs to.
Time: O(mn). Space: 3(mn) = O(mn).
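The three-matrix recurrence can be sketched as follows. The values h = 2 and g = 1 (positive costs) and the match/mismatch scores +1/-1 are assumptions chosen for illustration; only the score is returned, without traceback.

```python
NEG_INF = float("-inf")


def affine_similarity(s: str, t: str, h=2, g=1, match=1, mismatch=-1):
    """Best global score when a gap of k spaces costs h + k*g."""
    m, n = len(s), len(t)
    a = [[NEG_INF] * (n + 1) for _ in range(m + 1)]  # s_i matched with t_j
    b = [[NEG_INF] * (n + 1) for _ in range(m + 1)]  # ends with a space above t_j
    c = [[NEG_INF] * (n + 1) for _ in range(m + 1)]  # ends with s_i above a space
    a[0][0] = 0
    for j in range(1, n + 1):
        b[0][j] = -(h + g * j)
    for i in range(1, m + 1):
        c[i][0] = -(h + g * i)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            p = match if s[i - 1] == t[j - 1] else mismatch
            a[i][j] = p + max(a[i - 1][j - 1], b[i - 1][j - 1], c[i - 1][j - 1])
            b[i][j] = max(a[i][j - 1] - (h + g),   # first space of a new gap
                          b[i][j - 1] - g,         # extend the gap
                          c[i][j - 1] - (h + g))
            c[i][j] = max(a[i - 1][j] - (h + g),
                          b[i - 1][j] - (h + g),
                          c[i - 1][j] - g)
    return max(a[m][n], b[m][n], c[m][n])


print(affine_similarity("AATT", "AAGGGTT"))  # -1: four matches minus one gap of cost 2 + 3*1
```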
84
Other gap penalty models
  • Constant.
  • Affine.
  • Convex
  • each additional space in a gap contributes less
    to the gap weight than the previous space (e.g.
    log(q))
  • the problem is solvable in O(nm log(m)) time
  • Arbitrary
  • Any gap weight function is acceptable
  • the problem is solvable in O(nm(m+n)) time

85
Multiple alignment
86
Multiple Alignment
  • Compare multiple sequences s1, s2, . . . , sk
    over the same alphabet
  • Insert spaces to make them of the same size.
  • MQPILLL
  • MLR-LL-
  • MK-ILLL
  • MPPVLIL
  • no column is made exclusively of spaces
  • SCORING IS MORE COMPLEX!

87
Motivations
  • Fact: evolutionarily and functionally related
    molecular strings can differ significantly and
    yet preserve the same 3-D and 2-D structure
  • It is a very important tool to extract and
    represent biologically important similarities
    (critical common patterns) from a set of several
    strings.
  • Critical patterns may suggest evolutionary
    history, a common structure of the protein
    product, or a common biological function.
  • Since critical patterns can be widely dispersed
    or faint, they may be not apparent in the
    comparison of two strings (ex. hemoglobin).
  • Such patterns are used to characterize families
    of proteins (family representations)
  • These representations can be used to identify
    other potential members of the family

88
SP(Sum of Pairs)-measure
  • Possible Scoring Function
  • SP-function: sum of the pair-wise scores of all
    pairs of symbols in the column
  • SP-score(I, -, I, V) = p(I,-) + p(I,I) + p(-,V) +
    p(I,V) + p(-,I) + p(I,V)
  • p(a, b): pair-wise score of chars a and b; a
    and/or b can be a space.
  • p(-,-) = 0 (common practice).
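The SP-measure can be sketched column by column. The pair scores used here (match +1, mismatch -1, space -2, and p(-,-) = 0) are assumptions for illustration; the rows are the four-sequence example from the Multiple Alignment slide.

```python
from itertools import combinations


def pair_score(a: str, b: str, match=1, mismatch=-1, space=-2) -> int:
    """Pair-wise score p(a, b); two spaces score 0 by the common convention."""
    if a == "-" and b == "-":
        return 0
    if a == "-" or b == "-":
        return space
    return match if a == b else mismatch


def sp_column(column) -> int:
    """Sum of the scores of all unordered pairs of symbols in one column."""
    return sum(pair_score(a, b) for a, b in combinations(column, 2))


def sp_score(rows) -> int:
    """SP-score of a multiple alignment given as equal-length rows."""
    return sum(sp_column(column) for column in zip(*rows))


print(sp_column("I-IV"))                                    # -7
print(sp_score(["MQPILLL", "MLR-LL-", "MK-ILLL", "MPPVLIL"]))
```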

89
From a multiple alignment we can also derive a
pairwise alignment by just removing the columns made
of all spaces.
1. PEAALYGRFT---IKSDUW
2. PEAALYGRFT---IKSDUW
3. PESLAYNKF---SIKSDUW
4. PEALNYGRY---SSESDUW
5. PEALNYGWY---SSESDUW
6. PEVIRMQDDNPFSFQSDYY
Projection on sequences 2 and 4:
2. PEAALYGRFT-IKSDUW
4. PEALNYGRY-SSESDUW
Property: if $p(-,-) = 0$, then for a multiple alignment $\alpha$ we have
$$\text{SP-score}(\alpha) = \sum_{i<j} score(\alpha_{ij})$$
where $\alpha_{ij}$ is the pairwise alignment induced by $\alpha$ on $s_i$ and $s_j$.
90
Combining Optimal Pairwise Alignments into
Multiple Alignment
In an optimal multiple alignment it is not necessarily
true that the induced pairwise alignments $\alpha_{ij}$ are
optimal for all i, j
91
Compute the Optimal Alignment (O.A.) according to
SP-score
Dynamic programming algorithm on a k-dimensional
array M of size $n \times n \times \dots \times n = n^k$ for k sequences
  • $M(i_1, i_2, \dots, i_k)$ holds the score of the O.A. of
    $s^1_{1..i_1}, \dots, s^k_{1..i_k}$.
  • Each entry of M depends on $2^k - 1$ other entries,
    one for each possible configuration of the
    current column.
  • Configurations of the current column for four sequences (one per column below):
  •   A  A  A  A  ..  -
  •   B  B  B  B  ..  -
  •   C  C  -  -  ..  -
  •   D  -  D  -  ..  -   (the all-space column is the forbidden configuration)
  • To compute the SP-score, $O(k^2)$ steps are required to
    sum the scores of the $k(k-1)/2$ pair-wise alignments.
  • In total $O(n^k\, 2^k\, k^2)$. O.A. is
    NP-complete (Wang, Jiang)

92
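We develop the exact algorithm only for k = 3.
Consider 3 sequences $s_1, s_2, s_3$ of size $n_1, n_2, n_3$.
Initialization:
  • $M(0,0,0) = 0$
  • $M(i,j,0) = \text{sp-score}(i,j) + (i+j)\,p(x,-)$
  • $M(i,0,k) = \text{sp-score}(i,k) + (i+k)\,p(x,-)$
  • $M(0,j,k) = \text{sp-score}(j,k) + (j+k)\,p(x,-)$
Example, for the sequences MQPILLL, MLR-LL- and MK-ILLL, with the first three letters of the first two sequences and nothing from the third:
  MQP
  MLR
  ---
$M(3,3,0) = c(M,M) + c(Q,L) + c(P,R) + c(M,-) + c(Q,-) + c(P,-) + c(M,-) + c(L,-) + c(R,-)$.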
93
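$M(i,j,k)$ is the maximum value among 7 values, one for each
way to reach the cell (i,j,k).
(Figure: a cell of the 3-D array and its 7 predecessors, with example columns over the characters V, S, N and A.)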
94
Multiple Alignment Projections
A 3-D alignment can be projected onto the 2-D
plane to represent an alignment between a pair of
sequences.
All 3 Pairwise Projections of the Multiple
Alignment
95
Algorithm Optimal Alignment of 3 sequences
for i ← 1 to n1
  for j ← 1 to n2
    for k ← 1 to n3
      if s1(i) = s2(j) then c(i,j) ← cmatch else c(i,j) ← cmis
      if s1(i) = s3(k) then c(i,k) ← cmatch else c(i,k) ← cmis
      if s2(j) = s3(k) then c(j,k) ← cmatch else c(j,k) ← cmis
      m1 ← M(i-1,j-1,k-1) + c(i,j) + c(i,k) + c(j,k)
      m2 ← M(i-1,j-1,k) + c(i,j) + 2·csp      (1 space on one sequence)
      m3 ← M(i-1,j,k-1) + c(i,k) + 2·csp
      m4 ← M(i,j-1,k-1) + c(j,k) + 2·csp
      m5 ← M(i-1,j,k) + 2·csp                 (1 space on two sequences)
      m6 ← M(i,j-1,k) + 2·csp
      m7 ← M(i,j,k-1) + 2·csp
      M(i,j,k) ← max(m1, ..., m7)
Complexity: O(n1·n2·n3)
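The three-sequence recurrence can be sketched compactly by enumerating the 7 allowed column configurations. Instead of the explicit boundary initialization, this sketch lets every cell use only the moves that stay inside the array, which yields the same boundary values; the scores cmatch = 1, cmis = -1, csp = -2 are assumptions, and only the optimal score is returned.

```python
from itertools import product


def align3_score(s1: str, s2: str, s3: str, cmatch=1, cmis=-1, csp=-2):
    """Optimal SP-score of a multiple alignment of three sequences."""
    def c(x, y):
        if x == "-" and y == "-":
            return 0                       # p(-,-) = 0
        if x == "-" or y == "-":
            return csp
        return cmatch if x == y else cmis

    n1, n2, n3 = len(s1), len(s2), len(s3)
    neg_inf = float("-inf")
    M = [[[neg_inf] * (n3 + 1) for _ in range(n2 + 1)] for _ in range(n1 + 1)]
    M[0][0][0] = 0
    # the 7 configurations: each sequence contributes a letter (1) or a space (0)
    moves = [d for d in product((0, 1), repeat=3) if any(d)]
    for i in range(n1 + 1):
        for j in range(n2 + 1):
            for k in range(n3 + 1):
                if i == j == k == 0:
                    continue
                best = neg_inf
                for di, dj, dk in moves:
                    if di > i or dj > j or dk > k:
                        continue           # move would leave the array
                    x = s1[i - 1] if di else "-"
                    y = s2[j - 1] if dj else "-"
                    z = s3[k - 1] if dk else "-"
                    column = c(x, y) + c(x, z) + c(y, z)
                    best = max(best, M[i - di][j - dj][k - dk] + column)
                M[i][j][k] = best
    return M[n1][n2][n3]


print(align3_score("AC", "AC", "AC"))  # 6: two columns, three matching pairs each
```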

96
Heuristic Algorithms
  • Star alignment (heuristics do not yield any guarantee
    on the quality of the result)

97
Star Alignment-Gusfield
  • Build a multiple alignment $\alpha$ based upon the
    pairwise alignments between a fixed sequence, the
    center of the star, and the others
  • $\alpha$ is such that its projections $\alpha_{ij}$ are optimal when
    either i or j is the center
  • Select one of the sequences as center, say sc
  • Compute an optimal alignment between sc and si,
    for all i
  • Merge the alignments by inserting spaces in the
    sequences

98
(Figure: a star with the center sc connected to s2, s3, s4, s5, ..., sk.)
  • How to select the center sc?
  • Try all sequences and then pick the best
  • Compute all the pairwise alignments and take the
    best
  • How to insert the spaces? Once a gap in sc,
    always a gap
  • Example
  • Let s1 be the center
  • s1 = ATTGCCATT
  • ATGGCCATT
  • ATCCAATTTT
  • ATCTTCTT
  • ACTGACC

99
  • Pairwise alignments of the center s1 with each other sequence:
  • ATTGCCATT
  • ATGGCCATT
  • and
  • ATTGCCATT--
  • ATC-CAATTTT
  • and
  • ATTGCCATT--
  • ATCTTC-TT--
  • and
  • ATTGCCATT--
  • ACTGACC----
  • Merged multiple alignment:
  • ATTGCCATT--
  • ATGGCCATT--
  • ATC-CAATTTT
  • ATCTTC-TT--
  • ACTGACC----
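The merging rule "once a gap, always a gap" can be sketched as follows. For every position of the center, the merged alignment keeps the largest number of spaces that any pairwise alignment inserted there, and the other rows are padded accordingly. The function name is illustrative, and the pairwise alignments in the usage lines are a reading of the example above (the blank inside "ATCTTC TT" is taken to be a single space symbol).

```python
def merge_star(center: str, pairwise):
    """Merge (center_row, other_row) pairwise alignments into one multiple alignment."""
    n = len(center)

    def gap_counts(center_row):
        # counts[p] = number of spaces placed before center letter p (p = n: at the end)
        if center_row.replace("-", "") != center:
            raise ValueError("row does not spell the center sequence")
        counts, p = [0] * (n + 1), 0
        for ch in center_row:
            if ch == "-":
                counts[p] += 1
            else:
                p += 1
        return counts

    need = [0] * (n + 1)
    for center_row, _ in pairwise:
        need = [max(x, y) for x, y in zip(need, gap_counts(center_row))]

    def expand(center_row, other_row):
        have, out, col = gap_counts(center_row), [], 0
        for p in range(n + 1):
            out.append(other_row[col:col + have[p]])   # letters opposite existing gaps
            out.append("-" * (need[p] - have[p]))      # gaps required by other alignments
            col += have[p]
            if p < n:
                out.append(other_row[col])             # symbol aligned with center[p]
                col += 1
        return "".join(out)

    return [expand(center, center)] + [expand(c, o) for c, o in pairwise]


rows = merge_star("ATTGCCATT", [
    ("ATTGCCATT", "ATGGCCATT"),
    ("ATTGCCATT--", "ATC-CAATTTT"),
    ("ATTGCCATT--", "ATCTTC-TT--"),
    ("ATTGCCATT--", "ACTGACC----"),
])
print("\n".join(rows))
```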

100
  • Define the score in terms of distance d instead
    of similarity: the triangle inequality holds
  • Complexity
  • Computing the center as the sequence
    minimizing $\sum_i d(S_i, S_c)$ takes $O(k^2 n^2)$
  • Computing and merging the pairwise alignments
    takes
  • $\sum_{i=1}^{k-1} O((i\,n)\,n) = O(k^2 n^2)$

101
  • Let $M^*$ denote the optimal multiple alignment
    and $M$ the computed
    multiple alignment
  • $d(S_i,S_j)$ is the score of the pairwise alignment
    of $S_i, S_j$ induced by $M$
  • $d(M)$ is the sum of $d(S_i,S_j)$ for $i < j$
  • Theorem: $d(M) \le (2 - 2/k)\, d(M^*) < 2\, d(M^*)$
  • Computational experiments show that this bound
    is overly pessimistic
  • The solutions found are within 15% of the optimum
    on average

102
The profile
  • Given a multiple sequence alignment M involving N
    sequences of length l, a profile P is an $l \times |\Sigma \cup \{-\}|$
    matrix whose columns are probability
    vectors denoting the frequencies of each symbol
    in the corresponding alignment column.
  • Example
  • A-GTTTA
  • AC-TTA--
  • ACGTAAG-
  • -C-TA---
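Building a profile amounts to counting symbols column by column. The sketch below assumes rows of equal length, so it uses a small made-up alignment rather than the rows above; the function name and the default alphabet are illustrative.

```python
from collections import Counter


def build_profile(rows, alphabet="ACGT-"):
    """Return one dict per alignment column mapping each symbol to its frequency."""
    if len({len(row) for row in rows}) != 1:
        raise ValueError("all aligned rows must have the same length")
    profile = []
    for column in zip(*rows):
        counts = Counter(column)
        profile.append({ch: counts[ch] / len(rows) for ch in alphabet})
    return profile


# hypothetical 4-row alignment, used only to exercise the function
for column in build_profile(["AC-", "ACG", "A-G", "TCG"]):
    print(column)
```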

106
Aligning a string to a profile
  • Given a profile P and a string S, how well does S
    fit P?
  • P can be considered as a string of profile
    column positions, and aligning S to P consists
    in inserting spaces into S and P.
  • Scoring: a char of S is scored against every char
    in a column of P
  • Example: (a,a) → 2; (a,-) and (a,b) → -1; (a,c) → -3
  • S = aabbc, P given by the table below
  • $s(a,1) = 0.75 \times 2 + 0.25 \times (-3)$
  • is the contribution of the first column of P
  • A dynamic programming algorithm makes it possible to
    compute the optimal alignment.

Profile P (nonzero frequencies listed per symbol):
A: .75 .25 .50
B: .75 .75
C: .25 .25 .50 .25
-: .25 .25 .25
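The score of a character against a profile column is the frequency-weighted sum of its pair scores. A minimal sketch, assuming the first column holds a with frequency 0.75 and c with frequency 0.25 (as the worked term above suggests) and using the example pair scores; the names are illustrative.

```python
# example pair scores from the slide: (a,a) -> 2, (a,-) and (a,b) -> -1, (a,c) -> -3
PAIR_SCORES = {("a", "a"): 2, ("a", "b"): -1, ("a", "-"): -1, ("a", "c"): -3}


def column_score(char: str, column: dict, pair_scores=PAIR_SCORES) -> float:
    """Weighted score of `char` against one profile column {symbol: frequency}."""
    return sum(freq * pair_scores[(char, symbol)] for symbol, freq in column.items())


first_column = {"a": 0.75, "c": 0.25}      # assumed content of column 1 of P
print(column_score("a", first_column))     # 0.75 * 2 + 0.25 * (-3) = 0.75
```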
107
Align a sequence S vs. a profile P
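  • Let $P = (p_{ij})$, for $i = 1..l$ and $j = 1..|\Sigma|+1$, be a profile
    and let $S = s_1 \dots s_n$ be a sequence.
  • The following scoring function can be defined:
  • $\sigma_P: (\Sigma \cup \{-\}) \times \{1, 2, \dots, l\} \to \mathbb{R}$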