DynamicProgramming Strategies for Analyzing Biomolecular Sequences - PowerPoint PPT Presentation

About This Presentation
Title:

DynamicProgramming Strategies for Analyzing Biomolecular Sequences

Description:

Each substructure is optimal. (Principle of optimality) 1. optimal substructures. 2. overlapping subproblems. Subproblems are dependent. ... – PowerPoint PPT presentation

Number of Views:51
Avg rating:3.0/5.0
Slides: 99
Provided by: VERI152
Category:

less

Transcript and Presenter's Notes

Title: DynamicProgramming Strategies for Analyzing Biomolecular Sequences


1
Dynamic-Programming Strategies for Analyzing
Biomolecular Sequences
  • Kun-Mao Chao (???)
  • Department of Computer Science and Information
    Engineering
  • National Taiwan University, Taiwan
  • E-mail kmchao_at_csie.ntu.edu.tw
  • WWW http//www.csie.ntu.edu.tw/kmchao

2
Dynamic Programming
  • Dynamic programming is a class of solution
    methods for solving sequential decision problems
    with a compositional cost structure.
  • Richard Bellman was one of the principal founders
    of this approach.

3
Two key ingredients
  • Two key ingredients for an optimization problem
    to be suitable for a dynamic-programming solution

2. overlapping subproblems
1. optimal substructures
Subproblems are dependent. (otherwise, a
divide-and-conquer approach is the choice.)
Each substructure is optimal. (Principle of
optimality)
4
Three basic components
  • The development of a dynamic-programming
    algorithm has three basic components
  • The recurrence relation (for defining the value
    of an optimal solution)
  • The tabular computation (for computing the value
    of an optimal solution)
  • The traceback (for delivering an optimal
    solution).

5
Fibonacci numbers
The Fibonacci numbers are defined by the
following recurrence



6
How to compute F10?


7
Tabular computation
  • The tabular computation can avoid recompuation.

8
Maximum-sum interval
  • Given a sequence of real numbers a1a2an , find a
    consecutive subsequence with the maximum sum.

9 3 1 7 15 2 3 4 2 7 6 2 8 4 -9
For each position, we can compute the maximum-sum
interval starting at that position in O(n) time.
Therefore, a naive algorithm runs in O(n2) time.
9
O-notation an asymptotic upper bound
  • f(n) O(g(n)) iff there exist two positive
    constant c and n0 such that 0 f(n) cg(n)
    for all n n0

cg(n)
f(n)
n0
10
How functions grow?
function
n
(Assume one million operations per second.)
  • For large data sets, algorithms with a complexity
    greater than O(n log n) are often impractical!

11
Maximum-sum interval(The recurrence relation)
  • Define S(i) to be the maximum sum of the
    intervals ending at position i.

If S(i-1) lt 0, concatenating ai with its previous
interval gives less sum than ai itself.
12
Maximum-sum interval(Tabular computation)
9 3 1 7 15 2 3 4 2 7 6 2 8 4 -9
S(i) 9 6 7 14 1 2 5 1 3 4 6 4 12 16 7
The maximum sum
13
Maximum-sum interval(Traceback)
9 3 1 7 15 2 3 4 2 7 6 2 8 4 -9
S(i) 9 6 7 14 1 2 5 1 3 4 6 4 12 16 7
The maximum-sum interval 6 -2 8 4
14
Two fundamental problems we recently solved (I)
(joint work with Lin and Jiang)
  • Given a sequence of real numbers of length n and
    an upper bound U, find a consecutive subsequence
    of length at most U with the maximum sum --- an
    O(n)-time algorithm.

U 3 9 3 1 7 15 2 3 4 2 7 6 2 8 4 -9
15
Joint work with Huang, Jiang and Lin
  • Xiaoqiu Huang (???)Iowa State University, USA
  • Tao Jiang (??) University of California
    Riverside, USA
  • Yaw-Ling Lin (???)Providence University, Taiwan

16
Two fundamental problems we recently solved (II)
(joint work with Lin and Jiang)
  • Given a sequence of real numbers of length n and
    a lower bound L, find a consecutive subsequence
    of length at least L with the maximum average.
    --- an O(n log L)-time algorithm.

L 4 3 2 14 6 6 2 10 2 6 6 14 2 1
17
Another example
  • Given a sequence as follows 2, 6.6, 6.6, 3, 7,
    6, 7, 2
  • and L 2, the highest-average interval is the
    squared area, which has the average value 20/3.
  • 2, 6.6, 6.6, 3, 7, 6, 7, 2

18
CG rich regions
  • Our method can be used to locate a region of
    length at least L with the highest CG ratio in
    O(n log L) time.

ATGACTCGAGCTCGTCA 00101011011011010
Search for an interval of length at least L with
the highest average.
19
Length-unconstrained version
  • Maximum-average interval

3 2 14 6 6 2 10 2 6 6 14 2 1
The maximum element is the answer. It can be done
in O(n) time.
20
Q computing density in O(1) time?
  • prefix-sum(i) S1S2Si,
  • all n prefix sums are computable in O(n) time.
  • sum(i, j) prefix-sum(j) prefix-sum(i-1)
  • density(i, j) sum(i, j) / (j-i1)

j
i
prefix-sum(j)
prefix-sum(i-1)
21
A naive algorithm
  • A simple shift algorithm can compute the
    highest-average interval of a fixed length in
    O(n) time
  • Try L, L1, L2, ..., n. In total, O(n2).

22
An O(nL)-time algorithm (Huang, CABIOS 94)
  • Observing that there exists an optimal interval
    of length bounded by 2L, we immediately have an
    O(nL)-time algorithm.

We can bisect a region of length gt 2L into two
segments, where each of them is of length gt L.
23
Good partners
  • Finding good partner g(i) for each i1, 2, ,
    n-L1.

i L
g(i)
L
maximing avgi, g(i)
24
Right-Skew Decomposition
  • Partition S into substrings S1,S2,,Sk such that
  • each Si is a right-skew substring of S
  • the average of any prefix is always less than or
    equal to the average of the remaining suffix.
  • density(S1) gt density(S2) gt gt density(Sk)
  • Lin, Jiang, Chao
  • Unique
  • Computable in linear time.

25
An O(n log L)-time algorithm (Lin, Jiang, Chao,
JCSS02)
  • Decreasingly right-skew decomposition (O(n) time)

5
6
7.5
5
9
8
9
8
7
8
9 7 8 1 9 8
3 7 2 8
26
An O(n log L)-time algorithm
  • Jumping tables that allows binary search.
  • O(log L) time for each position to find its good
    partner, therefore the total running time is O(n
    log L).
  • We also implemented a program that linearly scans
    right-skew segments for the good partnership. Our
    empirical tests showed that it ran in linear time
    empirically.

27
Goldwasswer, Kao, and Lus recent progress
  • An O(n) time algorithm for the problem
  • Appeared in Proceedings of the Second Workshop on
    Algorithms in Bioinformatics (WABI), Rome, Itay,
    Sep. 17-21, 2002.

28
A new important observation
  • i lt j lt g(j) lt g(i) implies
  • density(i, g(i)) is no more than density(j, g(j))

29
k non-overlapping maximum-average segments (Lin,
Huang, Jiang, Chao, Bioinformatics)
  • Maintain all candidates in a priority queue.
  • A new k-best algorithm algorithms on trees (Lin
    and Chao (2003))

30
Longest increasing subsequence(LIS)
  • The longest increasing subsequence is to find a
    longest increasing subsequence of a given
    sequence of distinct integers a1a2an .
  • e.g. 9 2 5 3 7 11 8 10 13 6
  • 3 7
  • 7 10 13
  • 7 11
  • 3 5 11 13

are increasing subsequences.
We want to find a longest one.
are not increasing subsequences.
31
A naive approach for LIS
  • Let Li be the length of a longest increasing
    subsequence ending at position i.

Li 1 max j 0..i-1Lj aj lt ai(use a
dummy a0 minimum, and L00)
9 2 5 3 7 11 8 10 13 6
Li 1 1 2 2 3 4 ?
32
A naive approach for LIS
  • Li 1 max j 0..i-1 Lj aj lt ai

9 2 5 3 7 11 8 10 13 6
Li 1 1 2 2 3 4 4 5
6 3
The maximum length
The subsequence 2, 3, 7, 8, 10, 13 is a longest
increasing subsequence. This method runs in O(n2)
time.
33
Binary search
  • Given an ordered sequence x1x2 ... xn, where
    x1ltx2lt ... ltxn, and a number y, a binary search
    finds the largest xi such that xilt y in O(log n)
    time.

n/2
...
n/4
n
34
Binary search
  • How many steps would a binary search reduce the
    problem size to 1?n n/2 n/4 n/8 n/16
    ... 1

How many steps? O(log n) steps.
35
An O(n log n) method for LIS
  • Define BestEndk to be the smallest number of an
    increasing subsequence of length k.

9 2 5 3 7 11 8 10 13 6
9
2
2
2
2
2
2
2
2
BestEnd1
5
3
3
3
3
3
3
BestEnd2
7
7
7
7
7
BestEnd3
11
8
8
8
BestEnd4
10
10
BestEnd5
13
BestEnd6
36
An O(n log n) method for LIS
  • Define BestEndk to be the smallest number of an
    increasing subsequence of length k.

9 2 5 3 7 11 8 10 13 6
9
2
2
2
2
2
2
2
2
2
BestEnd1
5
3
3
3
3
3
3
3
BestEnd2
7
7
7
7
7
6
BestEnd3
11
8
8
8
8
BestEnd4
For each position, we perform a binary search to
update BestEnd. Therefore, the running time is
O(n log n).
10
10
10
BestEnd5
13
13
BestEnd6
37
Longest Common Subsequence (LCS)
  • A subsequence of a sequence S is obtained by
    deleting zero or more symbols from S. For
    example, the following are all subsequences of
    president pred, sdn, predent.
  • The longest common subsequence problem is to find
    a maximum-length common subsequence between two
    sequences.

38
LCS
  • For instance,
  • Sequence 1 president
  • Sequence 2 providence
  • Its LCS is priden.

president providence
39
LCS
  • Another example
  • Sequence 1 algorithm
  • Sequence 2 alignment
  • One of its LCS is algm.

a l g o r i t h m a l i g n m e n t
40
How to compute LCS?
  • Let Aa1a2am and Bb1b2bn .
  • len(i, j) the length of an LCS between
    a1a2ai and b1b2bj
  • With proper initializations, len(i, j)can be
    computed as follows.

41
(No Transcript)
42
(No Transcript)
43
(No Transcript)
44
(No Transcript)
45
Bioinformatics
46
Bioinformatics and Computational Biology-Related
Journals
  • Bioinformatics (previously called CABIOS)
  • Bulletin of Mathematical Biology
  • Computers and Biomedical Research
  • Genome Research
  • Genomics
  • Journal of Bioinformatics and Computational
    Biology
  • Journal of Computational Biology
  • Journal of Molecular Biology
  • Nature
  • Science

47
Bioinformatics and Computational Biology-Related
Conferences
  • the first IEEE Computer Society Bioinformatics
    Conference (CSB 2002, CA, USA)
  • Intelligent Systems for Molecular Biology (ISMB
    2003, Brisbane, Australia)
  • Pacific Symposium on Biocomputing(PSB 2003,
    Kauai, Hawaii, USA)
  • The Seventh Annual International Conference on
    Research in Computational Molecular
    Biology (RECOMB 2003, Berlin, Germany)

48
Bioinformatics and Computational Biology-Related
Books
  • Calculating the Secrets of Life Applications of
    the Mathematical Sciences in Molecular Biology,
    by Eric S. Lander and Michael S. Waterman (1995)
  • Introduction to Computational Biology Maps,
    Sequences, and Genomes, by Michael S. Waterman
    (1995)
  • Introduction to Computational Molecular Biology,
    by Joao Carlos Setubal and Joao Meidanis (1996)
  • Algorithms on Strings, Trees, and Sequences
    Computer Science and Computational Biology, by
    Dan Gusfield (1997)
  • Computational Molecular Biology An Algorithmic
    Approach, by Pavel Pevzner (2000)
  • Introduction to Bioinformatics, by Arthur M. Lesk
    (2002)

49
Useful Websites
  • MIT Biology Hypertextbook
  • http//www.mit.edu8001/afs/athena/course/other/es
    gbio/www/7001main.html
  • The International Society for Computational
    Biology
  • http//www.iscb.org/
  • National Center for Biotechnology Information
    (NCBI, NIH)
  • http//www.ncbi.nlm.nih.gov/
  • European Bioinformatics Institute (EBI)
  • http//www.ebi.ac.uk/
  • DNA Data Bank of Japan (DDBJ)
  • http//www.ddbj.nig.ac.jp/

50
Sequence Alignment
51
Dot Matrix
  • Sequence ACTTAACT
  • Sequence BCGGATCAT

C G G A T C A T
CTTAACT
52
Pairwise Alignment
  • Sequence A CTTAACT
  • Sequence B CGGATCAT
  • An alignment of A and B

C---TTAACTCGGATCA--T
Sequence A
Sequence B
53
Pairwise Alignment
  • Sequence A CTTAACT
  • Sequence B CGGATCAT
  • An alignment of A and B

Mismatch
Match
C---TTAACTCGGATCA--T
Deletion gap
Insertion gap
54
Alignment Graph
  • Sequence A CTTAACT
  • Sequence B CGGATCAT

C G G A T C A T
CTTAACT
C---TTAACTCGGATCA--T
55
A simple scoring scheme
  • Match 8 (w(x, y) 8, if x y)
  • Mismatch -5 (w(x, y) -5, if x ? y)
  • Each gap symbol -3 (w(-,x)w(x,-)-3)

C - - - T T A A C TC G G A T C A - - T
8 -3 -3 -3 8 -5 8 -3 -3
8 12
Alignment score
56
An optimal alignment-- the alignment of maximum
score
  • Let Aa1a2am and Bb1b2bn .
  • Si,j the score of an optimal alignment between
    a1a2ai and b1b2bj
  • With proper initializations, Si,j can be
    computedas follows.

57
Computing Si,j

j
w(ai,bj)
w(ai,-)
i
w(-,bj)
Sm,n
58
Initializations

C G G A T C A T
CTTAACT
59
S3,5 ?

C G G A T C A T
CTTAACT
60
S3,5 ?

C G G A T C A T
CTTAACT
optimal score
61
C T T A A C TC G G A T C A T
8 5 5 8 -5 8 -3 8 14

C G G A T C A T
CTTAACT
62
Global Alignment vs. Local Alignment
  • global alignment
  • local alignment

63
An optimal local alignment
  • Si,j the score of an optimal local alignment
    ending at ai and bj
  • With proper initializations, Si,j can be
    computedas follows.

64
local alignment
Match 8 Mismatch -5 Gap symbol -3

C G G A T C A T
CTTAACT
65
local alignment
Match 8 Mismatch -5 Gap symbol -3

C G G A T C A T
CTTAACT
The best score
66
A C - TA T C A T 8-38-38 18

C G G A T C A T
CTTAACT
The best score
67
Affine gap penalties
  • Match 8 (w(x, y) 8, if x y)
  • Mismatch -5 (w(x, y) -5, if x ? y)
  • Each gap symbol -3 (w(-,x)w(x,-)-3)
  • Each gap is charged an extra gap-open penalty -4.

-4
-4
C - - - T T A A C TC G G A T C A - - T
8 -3 -3 -3 8 -5 8 -3 -3
8 12
Alignment score 12 4 4 4
68
Affine gap panalties
  • A gap of length k is penalized x ky.

gap-open penalty
  • Three cases for alignment endings
  • ...x...x
  • ...x...-
  • ...-...x

gap-symbol penalty
an aligned pair
a deletion
an insertion
69
Affine gap penalties
  • Let D(i, j) denote the maximum score of any
    alignment between a1a2ai and b1b2bj ending with
    a deletion.
  • Let I(i, j) denote the maximum score of any
    alignment between a1a2ai and b1b2bj ending with
    an insertion.
  • Let S(i, j) denote the maximum score of any
    alignment between a1a2ai and b1b2bj.

70
Affine gap penalties
71
Affine gap penalties
-y
w(ai,bj)
-x-y
D
-x-y
S
I
-y
72
k best local alignments
  • Smith-Waterman(Smith and Waterman, 1981
    Waterman and Eggert, 1987)
  • FASTA(Wilbur and Lipman, 1983 Lipman and
    Pearson, 1985)
  • BLAST(Altschul et al., 1990 Altschul et al.,
    1997)

73
k best local alignments
  • Smith-Waterman(Smith and Waterman, 1981
    Waterman and Eggert, 1987)
  • linear-space version sim (Huang and Miller,
    1991)
  • linear-space variants sim2 (Chao et al., 1995)
    sim3 (Chao et al., 1997)
  • Chaining local alignments Gap3 (Huang and Chao,
    2003)
  • FASTA(Wilbur and Lipman, 1983 Lipman and
    Pearson, 1985)
  • linear-space band alignment (Chao et al., 1992)
  • BLAST(Altschul et al., 1990 Altschul et al.,
    1997)
  • Enhanced PatternHunter (Yang et al., coming soon)

74
FASTA
  • Find runs of identities, and identify regions
    with the highest density of identities.
  • Re-score using PAM matrix, and keep top scoring
    segments.
  • Eliminate segments that are unlikely to be part
    of the alignment.
  • Optimize the alignment in a band.

75
FASTA
Step 1 Find runes of identities, and identify
regions with the highest density of identities.
Sequence B
Sequence A
76
FASTA
Step 2 Re-score using PAM matrix, andkeep top
scoring segments.
77
FASTA
Step 3 Eliminate segments that are unlikely to
be part of the alignment.
78
FASTA
Step 4 Optimize the alignment in a band.
79
BLAST
  • Basic Local Alignment Search Tool(by Altschul,
    Gish, Miller, Myers and Lipman)
  • The central idea of the BLAST algorithm is that a
    statistically significant alignment is likely to
    contain a high-scoring pair of aligned words.

80
The maximal segment pair measure
  • A maximal segment pair (MSP) is defined to be the
    highest scoring pair of identical length segments
    chosen from 2 sequences.(for DNA Identities
    5 Mismatches -4)
  • The MSP score may be computed in time
    proportional to the product of their lengths.
    (How?) An exact procedure is too time consuming.
  • BLAST heuristically attempts to calculate the MSP
    score.

the highest scoring pair
81
BLAST
  • Build the hash table for Sequence A.
  • Scan Sequence B for hits.
  • Extend hits.

82
BLAST
Step 1 Build the hash table for Sequence A.
(3-tuple example)
For protein sequences Seq. A ELVISAdd xyz to
the hash table if Score(xyz, ELV) ? TAdd
xyz to the hash table if Score(xyz, LVI) ?
TAdd xyz to the hash table if Score(xyz,
VIS) ? T
For DNA sequences Seq. A AGATCGAT
12345678 AAAAAC..AGA 1..ATC 3..CGA
5..GAT 2 6..TCG 4..TTT

83
BLAST
Step2 Scan sequence B for hits.
84
BLAST
Step2 Scan sequence B for hits.
Step 3 Extend hits.
BLAST 2.0 saves the time spent in extension, and
considers gapped alignments.
hit
Terminate if the score of the sxtension fades
away. (That is, when we reach a segment pair
whose score falls a certain distance below the
best score found for shorter extensions.)
85
Remarks
  • Filtering is based on the observation that a good
    alignment usually includes short identical or
    very similar fragments.
  • The idea of filtration was used in both FASTA and
    BLAST.

86
Linear-space ideasHirschberg, 1975 Myers and
Miller, 1988
m/2
87
Two subproblems½ original problem size
m/4
m/2
3m/4
88
Four subproblems¼ original problem size
m/4
m/2
3m/4
89
Time and Space Complexity
  • Space O(MN)
  • TimeO(MN)(1 ½ ¼ ) O(MN)

2
90
Band Alignment(Joint work with W. Pearson and W.
Miller)
Sequence A

Sequence B
91
Band Alignment in Linear Space
The remaining subproblems are no longer only half
of the original problem. In worst case, this
could cause an additional log n factor in time.
92
Band Alignment in Linear Space
93
Multiple sequence alignment (MSA)
  • The multiple sequence alignment problem is to
    simultaneously align more than two sequences.

Seq1 GCTC Seq2 AC Seq3 GATC
GC-TC A---C G-ATC
94
How to score an MSA?
  • Sum-of-Pairs (SP-score)

GC-TC A---C
Score

GC-TC A---C G-ATC
GC-TC G-ATC
Score
Score


A---C G-ATC
Score
95
MSA for three sequences
  • an O(n3) algorithm

96
General MSA
  • For k sequences of length n O(nk)
  • NP-Complete (Wang and Jiang)
  • The exact multiple alignment algorithms for many
    sequences are not feasible.
  • Some approximation algorithms are given.(e.g.,
    2- l/k for any fixed l by Bafna et al.)

97
Progressive alignment
  • A heuristic approach proposed by Feng and
    Doolittle.
  • It iteratively merges the most similar pairs.
  • Once a gap, always a gap

The time for progressive alignment in most cases
is roughly the order of the time for computing
all pairwise alignment, i.e., O(k2n2) .
A B C D E
98
Concluding remarks
  • Three essential components of the
    dynamic-programming approach
  • the recurrence relation
  • the tabular computation
  • the traceback
  • The dynamic-programming approach has been used in
    a vast number of computational problems in
    bioinformatics.
Write a Comment
User Comments (0)
About PowerShow.com