Bioinformatics Algorithms and Data Structures - PowerPoint PPT Presentation

About This Presentation

Title:

Bioinformatics Algorithms and Data Structures

Description:

This chapter introduces inexact matching. Inexact matching is used to compute similarity. ... We will consider a dynamic programming approach to inexact matching. ...

Slides: 37

Provided by: john244

Transcript and Presenter's Notes

1
Bioinformatics Algorithms and Data Structures

Chapter 11: Core String Edits
Lecturer: Dr. Rose
Slides by: Dr. Rose
January 30 and February 4, 2003

2
Core String Edits

This chapter introduces inexact matching.
Inexact matching is used to compute similarity.
Sequence similarity is a key concept.
Sequence similarity implies:
  Structural similarity
  Functional similarity
We will consider a dynamic programming approach
to inexact matching.

3
Edit Distance

One measure of similarity between two strings is
their edit distance.
This is the number of operations required to
transform the first string into the second.
Single-character operations:
  Deletion of a character from the first string
  Insertion of a character into the first string
  Substitution of a character in the first string
  with a character from the second string
  Match of a character in the first string with a
  character of the second

4
Edit Distance

Example from the textbook: transform vintner into
writers.
vintner: replace v with w → wintner
wintner: insert r after w → wrintner
wrintner: match i → wrintner
wrintner: delete n → writner
writner: match t → writner
writner: delete n → writer
writer: match e → writer
writer: match r → writer
writer: insert s → writers

5
Edit Distance

Let Σ = {I, D, R, M} be the edit alphabet
(Insert, Delete, Replace, Match).
Defn. An edit transcript of two strings is a
string over Σ describing a transformation of one
string into the other.
Defn. The edit distance between two strings is
the minimum number of edit operations needed to
transform the first into the second.
Matches are not included in the count.
Edit distance is also called Levenshtein distance.
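
To make the definitions concrete, here is a minimal Python sketch that applies an edit transcript to a string and counts the non-match operations. The function name apply_transcript and its argument order are illustrative assumptions, not part of the slides.

```python
def apply_transcript(s1, s2, transcript):
    """Apply a transcript over {I, D, R, M} to s1, drawing new characters from s2.

    Returns (resulting string, number of non-match operations).
    The name and signature are hypothetical, chosen for this sketch.
    """
    out = []
    i = j = 0      # next unread position in s1 and in s2
    cost = 0       # matches are free; every other operation costs 1
    for op in transcript:
        if op == "M":                       # match: copy the shared character
            if s1[i] != s2[j]:
                raise ValueError("M used where the characters differ")
            out.append(s1[i])
            i += 1
            j += 1
        elif op == "R":                     # replace s1[i] with s2[j]
            out.append(s2[j])
            i += 1
            j += 1
            cost += 1
        elif op == "D":                     # delete s1[i]
            i += 1
            cost += 1
        elif op == "I":                     # insert s2[j]
            out.append(s2[j])
            j += 1
            cost += 1
        else:
            raise ValueError("unknown operation: " + op)
    if i != len(s1) or j != len(s2):
        raise ValueError("transcript does not consume both strings")
    return "".join(out), cost


result, cost = apply_transcript("vintner", "writers", "RIMDMDMMI")
print(result, cost)  # writers 5
```

The transcript RIMDMDMMI is the vintner-to-writers example read off as a string over the edit alphabet.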

6
Edit Distance

Defn. An optimal transcript is an edit transcript
with the minimal number of edit operations for
transforming one string into another.
Note: optimal transcripts may not be unique.
Defn. The edit distance problem entails computing
the edit distance between two strings along with
an optimal transcript.

7
String Alignment

Defn. A global alignment of strings S1 and S2 is
obtained by:
  Inserting dashes/spaces into or at the ends of S1
  and S2.
  Aligning the two strings such that each
  character/space in either string is opposite a
  unique character/space in the other string.
Example 1: S1 = qacdbd, S2 = qawxb
q a c - d b d
q a w x - b -

8
String Alignment

Example 2: S1 = vintner, S2 = writers
v - i n t n e r -
w r i - t - e r s
Mathematically, string alignments and edit
transcripts are equivalent.
From a modeling perspective they are not
equivalent.
Edit transcripts express the idea of mutational
changes, whereas an alignment only displays the
relationship between the two strings.
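
The equivalence can be illustrated by converting a transcript into the alignment it describes. The following is a small sketch; the function name transcript_to_alignment is an assumption made for this example.

```python
def transcript_to_alignment(s1, s2, transcript):
    """Turn an edit transcript over {I, D, R, M} into two aligned rows.

    Hypothetical helper: M and R put a character of s1 opposite a
    character of s2, D puts a dash in the s2 row, I puts a dash in
    the s1 row.
    """
    row1, row2 = [], []
    i = j = 0
    for op in transcript:
        if op in ("M", "R"):
            row1.append(s1[i])
            row2.append(s2[j])
            i += 1
            j += 1
        elif op == "D":
            row1.append(s1[i])
            row2.append("-")
            i += 1
        elif op == "I":
            row1.append("-")
            row2.append(s2[j])
            j += 1
        else:
            raise ValueError("unknown operation: " + op)
    return " ".join(row1), " ".join(row2)


top, bottom = transcript_to_alignment("vintner", "writers", "RIMDMDMMI")
print(top)     # v - i n t n e r -
print(bottom)  # w r i - t - e r s
```

Reading the columns back (equal characters as M, unequal as R, a dash in the lower row as D, a dash in the upper row as I) recovers the transcript, which is the sense in which the two representations carry the same information.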

9
Dynamic Programming

Observation 1: There are many possible ways to
transform one string into another.
Observation 2: This is like the knapsack problem.
Recall: dynamic programming is used to solve
knapsack-like problems.
Defn. Let D(i,j) denote the edit distance of
S1[1..i] and S2[1..j].
That is, D(i,j) is the minimum number of edit ops
needed to transform the first i characters of S1
into the first j characters of S2.

10
Dynamic Programming

Notice that we can solve D(i,j) for all
combinations of prefix lengths of S1 and S2.
Examples: D(0,0), ..., D(0,j), D(1,0), ..., D(1,j),
..., D(i,j)
Dynamic programming is a divide-and-conquer-style
method: it builds the solution from solutions to
smaller subproblems.
The three parts of dynamic programming are:
  The recurrence relation
  Tabular computation
  Traceback

11
Dynamic Programming

The recurrence relation expresses the recursive
relation between a problem and smaller instances
of the problem.
For any recursive relation, the base condition(s)
must be specified.
Base conditions for D(i,j) are:
D(i,0) = i
Q: Why is this true? What does it mean in terms
of edit ops?
D(0,j) = j
Q: Why is this true? What does it mean in terms
of edit ops?

12
Dynamic Programming

The general recurrence is given by
D(i,j) = min{ D(i - 1, j) + 1, D(i, j - 1) + 1,
D(i - 1, j - 1) + t(i,j) }
Here t(i,j) = 1 if S1(i) ≠ S2(j); otherwise
t(i,j) = 0.
Proof of correctness: see pages 218-219.
Basic argument: D(i,j) must be one of
  D(i - 1, j) + 1
  D(i, j - 1) + 1
  D(i - 1, j - 1) + t(i,j)
There are NO other ways of creating S2[1..j] from
S1[1..i].
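
As a sketch, the recurrence and its base conditions can be transcribed directly into a recursive Python function. The name edit_distance_recursive is an assumption for this example. This version recomputes subproblems many times, so it is only suitable for checking the recurrence on short strings.

```python
def edit_distance_recursive(s1, s2, i=None, j=None):
    """D(i, j) computed literally from the recurrence (hypothetical helper).

    i and j are prefix lengths; Python strings are 0-indexed, so S1(i)
    in the slides is s1[i - 1] here. Running time grows exponentially
    with the string lengths.
    """
    if i is None:
        i, j = len(s1), len(s2)
    if j == 0:          # base condition D(i, 0) = i
        return i
    if i == 0:          # base condition D(0, j) = j
        return j
    t = 0 if s1[i - 1] == s2[j - 1] else 1
    return min(
        edit_distance_recursive(s1, s2, i - 1, j) + 1,      # delete S1(i)
        edit_distance_recursive(s1, s2, i, j - 1) + 1,      # insert S2(j)
        edit_distance_recursive(s1, s2, i - 1, j - 1) + t,  # match or replace
    )


print(edit_distance_recursive("vintner", "writers"))  # 5
```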

13
Dynamic Programming

Q: How do we use the recurrence relation to
efficiently compute D(i,j)?
Wrong answer: simply use recursion.
Q: Why is this the wrong answer?
A: Recursion results in inefficient duplication
of computations for subproblems.
Q: How much duplication?
A: Exponential duplication!
Example: Fibonacci numbers

14
Dynamic Programming

Example: Fibonacci numbers
f(n) = f(n - 1) + f(n - 2)
Base conditions: f(0) = 0, f(1) = 1
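
The duplication can be made visible by counting calls in a naive recursive implementation. This is a small sketch; fib_naive and the one-element counter list are assumptions made for illustration.

```python
def fib_naive(n, counter):
    """Top-down Fibonacci with no caching; counter[0] tallies every call."""
    counter[0] += 1
    if n < 2:                 # base conditions f(0) = 0, f(1) = 1
        return n
    return fib_naive(n - 1, counter) + fib_naive(n - 2, counter)


calls = [0]
value = fib_naive(20, calls)
print(value, calls[0])  # 6765 21891
```

Only 21 distinct subproblems f(0), ..., f(20) exist, yet the function is entered 21891 times.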

15
Dynamic Programming

Note: In calculating D(n,m), there are only
(n + 1) × (m + 1) unique combinations of i and j.
Clearly an exponential number of computations is
NOT required.
Soln: instead of going top-down with recursion,
go bottom-up. Compute each combination only once.
  Decide on a data structure to hold intermediate
  results.
  Start from base conditions. These are the
  smallest D(i,j) values and are already defined.
  Compute D(i,j) for larger values of i and j.

16
Dynamic Programming

Example: Fibonacci numbers

Decide on a data structure: a simple array.

Start from base conditions: f(0) = 0, f(1) = 1.

Compute f(i) for larger values of i, from the
bottom up.

Each f(i) is computed only once!
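
A minimal bottom-up sketch of these steps in Python, with fib_bottom_up as an assumed name:

```python
def fib_bottom_up(n):
    """Bottom-up Fibonacci: fill a simple array from the base conditions."""
    if n < 0:
        raise ValueError("n must be non-negative")
    f = [0] * (max(n, 1) + 1)   # data structure: one slot per subproblem
    f[1] = 1                    # base conditions f(0) = 0, f(1) = 1
    for i in range(2, n + 1):   # each f(i) is computed exactly once
        f[i] = f[i - 1] + f[i - 2]
    return f[n]


print(fib_bottom_up(20))  # 6765
```

The loop performs n - 1 additions, compared with the exponential number of calls made by plain recursion.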

17
Dynamic Programming

Q: What kind of data structure should we use for
edit distance?
It has to be a random-access data structure.
It has to support the dimensionality of the
problem.
D(i,j) is two-dimensional: one dimension for S1
and one for S2.
We will use a two-dimensional array, i.e., a
table.

18
Dynamic Programming
Example: edit distance from vintner to
writers. Fill in the base condition values.
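
As a sketch of this first step, the table can be allocated as a list of lists with only row 0 and column 0 filled in. The helper name init_table and the use of None for not-yet-computed cells are assumptions for this example.

```python
def init_table(s1, s2):
    """Allocate the (n + 1) x (m + 1) table and fill in the base conditions."""
    n, m = len(s1), len(s2)
    D = [[None] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        D[i][0] = i        # D(i, 0) = i: delete the first i characters of S1
    for j in range(m + 1):
        D[0][j] = j        # D(0, j) = j: insert the first j characters of S2
    return D


for row in init_table("vintner", "writers"):
    print(" ".join("." if cell is None else str(cell) for cell in row))
```

Row 0 and column 0 both read 0 through 7; the remaining cells, printed as dots, are still to be computed.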
19
Dynamic Programming

Q: How do we fill in the other values?
A: Use the recurrence
D(i,j) = min{ D(i - 1, j) + 1, D(i, j - 1) + 1,
D(i - 1, j - 1) + t(i,j) }
where t(i,j) = 1 if S1(i) ≠ S2(j); otherwise
t(i,j) = 0.
We can first compute D(1,1) because we have
D(0,0), D(0,1), and D(1,0):
D(1,1) = min{1 + 1, 1 + 1, 0 + 1} = 1
Then we have all the values needed to compute in
turn D(1,2), D(1,3), ..., D(1,m).

20
Dynamic Programming
First compute D(1,1) because we have D(0,0),
D(0,1), and D(1,0). Then compute in turn D(1,2),
D(1,3), ..., D(1,m).
21
Dynamic Programming
Fill in subsequent values, row by row, from left
to right.
22
Dynamic Programming
Alternatively, first compute D(1,1) from D(0,0),
D(0,1), and D(1,0). Then compute in turn D(2,1),
D(3,1), ..., D(n,1).
23
Dynamic Programming
Fill in subsequent values, column by column, from
top to bottom.
24
Dynamic Programming

Filling each cell entails a constant number of
operations.
Cell (i,j) depends only on characters S1(i) and
S2(j) and cells (i - 1, j - 1), (i, j - 1), and
(i - 1, j).
There are (n + 1)(m + 1) = O(nm) cells in the
table.
Consequently, filling the whole table takes O(nm)
time, and the edit distance D(n, m) is the value
in its last cell.
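
The tabular computation can be sketched in Python as follows, filling the table row by row. The function name edit_distance_table is an assumption for this example; the body is a direct application of the recurrence with unit costs.

```python
def edit_distance_table(s1, s2):
    """Fill the full D table bottom-up in O(nm) time and space."""
    n, m = len(s1), len(s2)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = i                      # base condition D(i, 0) = i
    for j in range(1, m + 1):
        D[0][j] = j                      # base condition D(0, j) = j
    for i in range(1, n + 1):            # row by row, left to right
        for j in range(1, m + 1):
            t = 0 if s1[i - 1] == s2[j - 1] else 1
            D[i][j] = min(D[i - 1][j] + 1,       # cell above
                          D[i][j - 1] + 1,       # cell to the left
                          D[i - 1][j - 1] + t)   # diagonal cell
    return D


table = edit_distance_table("vintner", "writers")
print(table[-1][-1])  # 5
```

Swapping the two loops fills the table column by column instead and gives the same values, since each cell only needs its upper, left, and diagonal neighbours.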

25
Dynamic Programming

Having computed the table, we know the cost of an
optimal edit transcript, i.e., the edit distance.
Q: How do we extract an optimal edit transcript
from the table?
A: One way is to establish pointers from each
cell to the predecessor cell(s) from which its
value was derived, i.e.,
  If D(i,j) = D(i - 1, j) + 1, add a pointer from
  (i,j) to (i - 1, j).
  If D(i,j) = D(i, j - 1) + 1, add a pointer from
  (i,j) to (i, j - 1).
  If D(i,j) = D(i - 1, j - 1) + t(i,j), add a
  pointer from (i,j) to (i - 1, j - 1).

26
Dynamic Programming
27
Dynamic Programming

We can recover an optimal edit sequence simply by
following any path of pointers from (n,m) to (0,0).
The path links are interpreted as follows:
  A horizontal link, (i,j) → (i,j-1), corresponds
  to an insertion of character S2(j) into S1.
  A vertical link, (i,j) → (i-1,j), corresponds to
  a deletion of S1(i) from S1.
  A diagonal link, (i,j) → (i-1,j-1), corresponds
  to a match if S1(i) = S2(j) and to a substitution
  if S1(i) ≠ S2(j).
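
A traceback can be sketched without storing pointers, by re-testing the three conditions at each cell. In the Python sketch below, the function names and the tie-breaking order (diagonal, then vertical, then horizontal) are assumptions; a different order may return a different, equally optimal transcript.

```python
def edit_distance_table(s1, s2):
    """Fill the D table bottom-up (unit costs, matches free)."""
    n, m = len(s1), len(s2)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = i
    for j in range(1, m + 1):
        D[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            t = 0 if s1[i - 1] == s2[j - 1] else 1
            D[i][j] = min(D[i - 1][j] + 1, D[i][j - 1] + 1, D[i - 1][j - 1] + t)
    return D


def one_optimal_transcript(s1, s2):
    """Walk from (n, m) back to (0, 0) and return one optimal transcript."""
    D = edit_distance_table(s1, s2)
    i, j = len(s1), len(s2)
    ops = []
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            t = 0 if s1[i - 1] == s2[j - 1] else 1
            if D[i][j] == D[i - 1][j - 1] + t:      # diagonal link
                ops.append("M" if t == 0 else "R")
                i -= 1
                j -= 1
                continue
        if i > 0 and D[i][j] == D[i - 1][j] + 1:    # vertical link: delete S1(i)
            ops.append("D")
            i -= 1
            continue
        ops.append("I")                             # horizontal link: insert S2(j)
        j -= 1
    return "".join(reversed(ops))                   # ops were collected back to front


print(one_optimal_transcript("vintner", "writers"))  # RRRMDMMI
```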

28
Dynamic Programming
29
Dynamic Programming
An optimal edit path. What edit transcript
does this path correspond to?
R,R,R,M,D,M,M,I
30
Dynamic Programming
Another optimal edit path. What edit transcript
does this path correspond to?
I,R,M,D,M,D,M,M,I
31
Dynamic Programming
The third possible optimal edit path. What edit
transcript does this path correspond to?
R,I,M,D,M,D,M,M,I
32
Dynamic Programming

Alternatively, we can interpret any path from
(n,m) to (0,0) as an alignment of S1 and S2.
The path links are interpreted as follows:
  A horizontal link, (i,j) → (i,j-1), corresponds
  to an insertion of a space/dash into S1.
  A vertical link, (i,j) → (i-1,j), corresponds to
  an insertion of a space/dash into S2.
  A diagonal link, (i,j) → (i-1,j-1), corresponds
  to a match if S1(i) = S2(j) or a mismatch if
  S1(i) ≠ S2(j).

33
Dynamic Programming
Possible optimal path. What alignment does
this optimal path correspond to?
w r i t - e r s
v i n t n e r -
34
Dynamic Programming
A second possible optimal path. What alignment
does this optimal path correspond to?
w r i - t - e r s
v - i n t n e r -
35
Dynamic Programming
A third possible optimal path. What alignment
does this optimal path correspond to?
w r i - t - e r s
- v i n t n e r -
36
Summary

Any path that follows the traceback pointers from
(n,m) to (0,0) corresponds to an optimal edit
sequence and an optimal alignment.
We can recover all optimal edit sequences and
alignments simply by extracting all such paths
from (n,m) to (0,0).
The correspondence between paths and edit
sequences is one-to-one.
The correspondence between paths and alignments
is one-to-one.
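
As an illustration of extracting all paths, the following Python sketch enumerates every optimal transcript by following every valid link back from (n,m). The function names are assumptions for this example. The number of optimal paths can grow exponentially with string length, so this is meant for short inputs only.

```python
def edit_distance_table(s1, s2):
    """Fill the D table bottom-up (unit costs, matches free)."""
    n, m = len(s1), len(s2)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = i
    for j in range(1, m + 1):
        D[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            t = 0 if s1[i - 1] == s2[j - 1] else 1
            D[i][j] = min(D[i - 1][j] + 1, D[i][j - 1] + 1, D[i - 1][j - 1] + t)
    return D


def all_optimal_transcripts(s1, s2):
    """Return every optimal edit transcript, one per traceback path."""
    D = edit_distance_table(s1, s2)

    def walk(i, j):
        # Transcripts that transform s1[:i] into s2[:j] optimally.
        if i == 0 and j == 0:
            return [""]
        found = []
        if i > 0 and D[i][j] == D[i - 1][j] + 1:          # vertical link
            found += [p + "D" for p in walk(i - 1, j)]
        if j > 0 and D[i][j] == D[i][j - 1] + 1:          # horizontal link
            found += [p + "I" for p in walk(i, j - 1)]
        if i > 0 and j > 0:
            t = 0 if s1[i - 1] == s2[j - 1] else 1
            if D[i][j] == D[i - 1][j - 1] + t:            # diagonal link
                op = "M" if t == 0 else "R"
                found += [p + op for p in walk(i - 1, j - 1)]
        return found

    return walk(len(s1), len(s2))


for transcript in sorted(all_optimal_transcripts("vintner", "writers")):
    print(transcript)
```

For vintner and writers this prints three transcripts, each with five non-match operations, in agreement with the three optimal paths shown on the earlier slides.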
