Introduction to Bioinformatics: Lecture IV Sequence Similarity and Dynamic Programming - PowerPoint PPT Presentation

About This Presentation
Title:

Introduction to Bioinformatics: Lecture IV Sequence Similarity and Dynamic Programming

Description:

conserved features that may reveal fingerprints of structure and function ... Note that there are two types of mismatches: Due to point mutations ... – PowerPoint PPT presentation

Number of Views:46
Avg rating:3.0/5.0

less

Transcript and Presenter's Notes

Title: Introduction to Bioinformatics: Lecture IV Sequence Similarity and Dynamic Programming


1
Introduction to Bioinformatics Lecture
IVSequence Similarity and Dynamic Programming
  • Jarek Meller
  • Division of Biomedical Informatics,
  • Childrens Hospital Research Foundation
  • Department of Biomedical Engineering, UC

2
Outline of the lecture
  • Wrapping up the previous lecture a quick look at
    the NCBI Map Viewer and suffix trees by way of an
    example
  • Inexact string matching from generalizations of
    suffix trees to dynamic programming
  • The dynamic programming algorithm for sequence
    alignment how it works
  • The dynamic programming algorithm for sequence
    alignment why it works
  • Limitations and faster heuristic approaches

3
Web watch NCBI Map Viewer
With the knowledge about STSs and physical
maps (hopefully) acquired last week we can have
another look at the NCBI Map Viewer
http//www.ncbi.nlm.nih.gov/mapview http//www.nc
bi.nlm.nih.gov/genome/guide/human/
4
Computationally efficient and elegant solutions
for the exact string matching problem
http//www-igm.univ-mlv.fr/lecroq/string/index.ht
ml Christian Charras and Thierry Lecroq
5
The idea of the suffix tree method
Phase 1 Preprocessing of the text A string
with m characters has m suffixes, which can be
represented as m leaves of a rooted directed
tree. Consider for example Tcabca
c
b

a
4
c
b
a
a
c
b

a

5

c
3
1
a

2
For simplicity one leaf, due to the terminal
character , is not included. Problem What is the
reason for adding the terminal character?
6
Suffix tree based matching why does it work?
Phase II Search A substring of a string is a
prefix of a suffix in that string. For example, a
substring Pab is a prefix of the suffix abca in
Tcabca. Thus, if P occurs in T there is a leaf
in the suffix tree that has a label starting with
P.
c
b

a
4
c
b
a
a
c
b

a

5

c
3
1
a

2
Problem Does the size of the alphabet matter (and
if so, how)? Hint how many edges may originate
in a node, given that label of each edge out of a
node has to start with a different character?
7
Generalized suffix tree for a set of strings and
the longest common substring problem
Consider for example two strings Tcabca and
Ubbcb.
U4
U3

c


b
b
b
U2
a

T4
c
a
b
a
c

b
b
a

T5

c
c
T3
T1
b
a


U1
T2
Remark By building the generalized suffix tree
for a set of k strings of the total length m one
can find the longest prefix-suffix match for all
pairs of strings in O(mk2) time (an additional
trick is required for that).
8
Assembling DNA from fragment and the
suffix-prefix matching problem
Hierarchical sequencing physical maps, clone
libraries and shotgun (see Chapter 2 in A Primer
on Genome Science by Gibson and
Muse) Definition The algorithmic problem of
shotgun sequence assembly is to deduce the
sequence of the DNA string from a set of
sequenced and partially overlapping short
substrings derived from that string. Analogy to
physical map assembly DNA sequence of a
substring may be viewed as a precise ordered
fingerprint (in analogy to STSs) and
the suffix-prefix match determines if two
substrings would be assembled together. In
general, the shortest superstring problem (find
the shortest string that contains each string
from a certain set of strings as its substring)
is NP-hard and heuristics are being developed to
address the problem.
9
Inexact or approximate string matching
  • Two major reasons for the importance of
    approximate matching in
  • computational molecular biology are
  • Measurement (e.g. sequencing) errors and fuzzy
    nature of underlying
  • molecular processes (e.g. hybridization may
    occur despite some
  • mismatches)
  • ii) Redundancy in biology with evolutionary
    processes resulting in closely
  • related, yet, different sequences that
    require approximate matching in
  • order to detect their relatedness and
    identify variable as well as
  • conserved features that may reveal
    fingerprints of structure and function
  • Either generalizations of exact string matching
    methods, such as suffix trees,
  • or dynamic programming (or their heuristic
    combinations) are being used to
  • solve this problem.

10
Redundancy in biological systems
An example two globin-like sequences
--MSEGEWQLVLHVWAKVEADVAGHGQDILIRLFKSHPETLEKFDRFKHL
KTEAEMKASE M LSGEWQLVLVW KVEAD GHGQLIRLFK
HPETLEKFDFKHLKE EMKASE MGLSDGEWQLVLNVWGKVEADIPG
HGQEVLIRLFKGHPETLEKFDKFKHLKSEDEMKASE
DLKKHGVTVLTALGAILKKKGHHEAELKPFAQSHATKHKIPIKYLEFI-
-AIIHVLHSRH DLKKHG TVLTALG ILKKKGHHEAE KP
AQSHATKHKIPKYLEFI I VL SH DLKKHGATVLTALGGILK
KKGHHEAE-KPLAQSHATKHKIPVKYLEFISEC-IQVLQSKH PGNFGA
DAQGAMNKALELFRKDIAAKYKELGYQG PGFGADAQGAMNKALELFRK
DA YKE PGDFGADAQGAMNKALELFRKDMASNYKE-----
  • Note that there are two types of mismatches
  • Due to point mutations
  • Due to insertions and deletions (gaps)

11
Gap penalties evolutionary and computational
considerations
  • Linear gap penalties
  • g(g) - g d
  • for a gap of length g and constant d
  • Affine gap penalties
  • g(g) - d (g -1) e
  • where d is opening gap penalty and e an
    extension gap penalty.

12
Dynamic programming algorithm for string alignment
  • Our goal is to find an optimal matching for two
    strings S1 a1a2an and S2 b1b2bm over a
    certain alphabet S, given a scoring matrix s(a,b)
    for each a and b in S and (for simplicity) a
    linear gap penalty
  • Relation to minimal edit distance (number of
    insertions, deletions and substitutions required
    to transform one string into the other) problem
  • The similarity measure (scoring matrix) should
    represent biological relatedness and separate
    true matches from random alignments (find more in
    Chapter 2 of Biological Sequence Analysis by
    Durbin et. al.)

13
How many alignments are there?
All the possible alignments (with gaps) may be
represented in the Form of a DP graph (DP
table). Consider an example with two strings of
length 2
\ ? a1b1a2b2 \
a1 a2
b1
b2
? b1a1b2a2 \_
0
1
1
\ \ _ _ ? a1b1b2a2
_ ? a1a2b1b2 \
1
3
5
_ _ \ ?
b1a1a2b2 \
1
5
13
_ _ _ _ _ _ _ _ _ _ _
? b1b2a1a2
_
14
Computing the number of alignments with gaps
Definition A string of length nm, obtained
by intercalating two strings S1 a1a2an and S2
b1b2bm , while preserving the order of the
symbols in S1 and S2, will be referred to as an
intercalated string and denoted by S1/2. Note
that S1 and S2 are subsequences of S1/2 but in
general they are not substrings of S1/2.
Definition Two alignments are called redundant
if their score is identical. The relationship of
having the same score may be used to define
equivalence classes of non-redundant alignments.
For example, the class a1b1b2a2
a1b1b2a2 ? a1-a2
a1a2- b1b2- b1-b2
15
Computing the number of alignments with gaps
Lemma There is one-to-one correspondence
(bijection) between the set of the non-redundant
gapped alignments of two strings S1 and S2 and
the set of the intercalated strings S1/2.
Corollary The number of non-redundant gapped
alignments of two strings, of length n and m,
respectively, is equal to (nm)!/m!n!.
Proof Since the order of each of the sequences
is preserved when intercalating them, we have in
fact nm positions to put m elements of the
second sequence (once this is done the position
of each of the elements of the first sequence is
fixed unambiguously). Hence, the total number of
intercalated sequences S1/2 is given by the
number of m-element combinations of nm elements
and the corollary is a simple consequence of the
one-to-one correspondence between alignments and
intercalated sequences stated in the lemma. QED
16
Computing the number of alignments with gaps
Problem Consider for simplicity two strings
of the same length and using the Stirling formula
(x! (2p)1/2 xx1/2 e-x ) show that
(nn)!/n!n! 22n /
(2pn)1/2 Note that for a very short by
biology standards sequence of length n50 one
needs to perform about 1030 basic operations for
an exhaustive search, making the naïve approach
infeasible.
Dynamic programming provides in polynomial time
an optimal solution for a class of optimization
problems with exponentially scaling search space,
including the approximate string matching.
Write a Comment
User Comments (0)
About PowerShow.com