Title: CSE 5290: Algorithms for Bioinformatics Fall 2011
1CSE 5290 Algorithms for Bioinformatics Fall 2011
 Suprakash Datta
 datta_at_cse.yorku.ca
 Office CSEB 3043
 Phone 4167362100 ext 77875
 Course page http//www.cse.yorku.ca/course/5290
2Last time
 Dynamic programming algorithms for Sequence
alignment (global, local)  Next Divide and conquer algorithms
 The following slides are based on slides by the
authors of our text.
3Divide and Conquer Algorithms
 Steps
 Divide problem into subproblems
 Conquer by solving subproblems recursively. If
the subproblems are small enough, solve them in
brute force fashion  Combine the solutions of subproblems into a
solution of the original problem (tricky part)
4Examples of divideandconquer
5Divide and Conquer Approach to LCS
 Path(source, sink)
 if(source sink are in consecutive columns)
 output the longest path from source to sink
 else
 middle ? middle vertex between source sink
 Path(source, middle)
 Path(middle, sink)
The only problem left is how to find this middle
vertex!
6Computing Alignment Path Requires Quadratic Memory
 Alignment Path
 Space complexity for computing alignment path for
sequences of length n and m is O(nm)  We need to keep all backtracking references in
memory to reconstruct the path (backtracking)
m
n
7Computing Alignment Score with Linear Memory
 Alignment Score
 Space complexity of computing just the score
itself is O(n)  We only need the previous column to calculate the
current column, and we can then throw away that
previous column once were done using it
2
n
n
8Recall Computing LCS
Let vi prefix of v of length i v1
vi and wj prefix of w of length j w1 wj
The length of LCS(vi,wj) is computed by
9Computing Alignment Score Recycling Columns
Only two columns of scores are saved at any given
time
memory for column 1 is used to calculate column 3
memory for column 2 is used to calculate column 4
10Crossing the Middle Line
We want to calculate the longest path from (0,0)
to (n,m) that passes through (i,m/2) where i
ranges from 0 to n and represents the ith
row Define length(i) as the
length of the longest path from (0,0) to (n,m)
that passes through vertex (i, m/2)
m/2 m m/2 m m/2 m m/2 m
n
(i, m/2)
Prefix(i)
Suffix(i)
11Crossing the Middle Line
m/2 m m/2 m m/2 m m/2 m
n
(i, m/2)
Prefix(i)
Suffix(i)
Define (mid,m/2) as the vertex where the longest
path crosses the middle column.
length(mid) optimal length max0?i ?n
length(i)
12Computing Prefix(i)
 prefix(i) is the length of the longest path from
(0,0) to (i,m/2)  Compute prefix(i) by dynamic programming in the
left half of the matrix
store prefix(i) column
0 m/2 m
13Computing Suffix(i)
 suffix(i) is the length of the longest path from
(i,m/2) to (n,m)  suffix(i) is the length of the longest path from
(n,m) to (i,m/2) with all edges reversed  Compute suffix(i) by dynamic programming in the
right half of the reversed matrix
store suffix(i) column
0 m/2 m
14Length(i) Prefix(i) Suffix(i)
 Add prefix(i) and suffix(i) to compute length(i)
 length(i)prefix(i) suffix(i)
 You now have a middle vertex of the maximum path
(i,m/2) as maximum of length(i)
0 i
middle point found
0 m/2 m
15Finding the Middle Point
0 m/4 m/2 3m/4 m 0 m/4 m/2 3m/4 m
16Finding the Middle Point again
0 m/4 m/2 3m/4 m 0 m/4 m/2 3m/4 m 0 m/4 m/2 3m/4 m 0 m/4 m/2 3m/4 m
17And Again
0 m/8 m/4 3m/8 m/2 5m/8 3m/4 7m/8 m 0 m/8 m/4 3m/8 m/2 5m/8 3m/4 7m/8 m 0 m/8 m/4 3m/8 m/2 5m/8 3m/4 7m/8 m 0 m/8 m/4 3m/8 m/2 5m/8 3m/4 7m/8 m 0 m/8 m/4 3m/8 m/2 5m/8 3m/4 7m/8 m 0 m/8 m/4 3m/8 m/2 5m/8 3m/4 7m/8 m 0 m/8 m/4 3m/8 m/2 5m/8 3m/4 7m/8 m 0 m/8 m/4 3m/8 m/2 5m/8 3m/4 7m/8 m
18Time Area First Pass
 On first pass, the algorithm covers the entire
area
Area n?m
19Time Area First Pass
 On first pass, the algorithm covers the entire
area
Area n?m
Computing prefix(i)
Computing suffix(i)
20Time Area Second Pass
 On second pass, the algorithm covers only 1/2 of
the area
Area/2
21Time Area Third Pass
 On third pass, only 1/4th is covered.
Area/4
22Geometric Reduction At Each Iteration
 1 ½ ¼ ... (½)k 2
 Runtime O(Area) O(nm)
5th pass 1/16
3rd pass 1/4
first pass 1
4th pass 1/8
2nd pass 1/2
23Is It Possible to Align Sequences in Subquadratic
Time?
 Dynamic Programming takes O(n2) for global
alignment  Can we do better?
 Yes, use FourRussians Speedup
24Partitioning Sequences into Blocks
 Partition the n x n grid into blocks of size t x
t  We are comparing two sequences, each of size n,
and each sequence is sectioned off into chunks,
each of length t  Sequence u u1un becomes
 u1ut ut1u2t unt1un
 and sequence v v1vn becomes
 v1vt vt1v2t vnt1vn
25Partitioning Alignment Grid into Blocks
n/t
n
t
t
n/t
n
partition
26Block Alignment
 Block alignment of sequences u and v
 An entire block in u is aligned with an entire
block in v  An entire block is inserted
 An entire block is deleted
 Block path a path that traverses every t x t
square through its corners
27Block Alignment Examples
valid
invalid
28Block Alignment Problem
 Goal Find the longest block path through an edit
graph  Input Two sequences, u and v partitioned into
blocks of size t. This is equivalent to an n x n
edit graph partitioned into t x t subgrids  Output The block alignment of u and v with the
maximum score (longest block path through the
edit graph
29Constructing Alignments within Blocks
 To solve compute alignment score ßi,j for each
pair of blocks u(i1)t1uit and
v(j1)t1vjt  How many blocks are there per sequence?
 (n/t) blocks of size t
 How many pairs of blocks for aligning the two
sequences?  (n/t) x (n/t)
 For each block pair, solve a minialignment
problem of size t x t
30Constructing Alignments within Blocks
n/t
Solve minialignmnent problems
Block pair represented by each small square
31Block Alignment Dynamic Programming
 Let si,j denote the optimal block alignment score
between the first i blocks of u and first j
blocks of v
?block is the penalty for inserting or deleting
an entire block ?i,j is score of pair of blocks
in row i and column j.
si1,j  ?block si,j1  ?block si1,j1  ?i,j
si,j max
32Block Alignment Runtime
 Indices i,j range from 0 to n/t
 Running time of algorithm is
 O( n/tn/t) O(n2/t2)
 if we dont count the time to compute each
??i,j
33Block Alignment Runtime (contd)
 Computing all ??i,j requires solving (n/t)(n/t)
mini block alignments, each of size (tt)  So computing ?all ?i,j takes time
 O(n/tn/ttt) O(n2)
 This is the same as dynamic programming
 How do we speed this up?
34Four Russians Technique
 Let t log(n), where t is block size, n is
sequence size.  Instead of having (n/t)(n/t) minialignments,
construct 4t x 4t minialignments for all pairs
of strings of t nucleotides (huge size), and put
in a lookup table.  However, size of lookup table is not really that
huge if t is small. Let t (log
n)/4. Then 4t x 4t n
35Lookup Table for Four Russians Technique
AAAAAA AAAAAC AAAAAG AAAAAT AAAACA
each sequence has t nucleotides
Lookup table Score
AAAAAA AAAAAC AAAAAG AAAAAT AAAACA
size is only n, instead of (n/t)(n/t)
36New Recurrence
 The new lookup table Score is indexed by a pair
of tnucleotide strings, so
si1,j  ?block si,j1  ?block si1,j1
Score(ith block of v, jth block of u)
si,j max
37Four Russians Speedup Runtime
 Since computing the lookup table Score of size n
takes O(n) time, the running time is mainly
limited by the (n/t)(n/t) accesses to the lookup
table  Each access takes O(logn) time
 Overall running time O( n2/t2logn )
 Since t logn, substitute in
 O( n2/logn2logn) gt O( n2/logn )
38So Far
 We can divide up the grid into blocks and run
dynamic programming only on the corners of these
blocks  In order to speed up the minialignment
calculations to under n2, we create a lookup
table of size n, which consists of all scores for
all tnucleotide pairs  Running time goes from quadratic, O(n2), to
subquadratic O(n2/logn)
39Four Russians Speedup for LCS
 Unlike the block partitioned graph, the LCS path
does not have to pass through the vertices of the
blocks.
block alignment
longest common subsequence
40Block Alignment vs. LCS
 In block alignment, we only care about the
corners of the blocks.  In LCS, we care about all points on the edges of
the blocks, because those are points that the
path can traverse.  Recall, each sequence is of length n, each block
is of size t, so each sequence has (n/t) blocks.
41Block Alignment vs. LCS Points Of Interest
block alignment has (n/t)(n/t) (n2/t2) points
of interest
LCS alignment has O(n2/t) points of interest
42Traversing Blocks for LCS
 Given alignment scores si, in the first row and
scores s,j in the first column of a t x t mini
square, compute alignment scores in the last row
and column of the minisquare.  To compute the last row and the last column
score, we use these 4 variables  alignment scores si, in the first row
 alignment scores s,j in the first column
 substring of sequence u in this block (4t
possibilities)  substring of sequence v in this block (4t
possibilities)
43Traversing Blocks for LCS (contd)
 If we used this to compute the grid, it would
take quadratic, O(n2) time, but we want to do
better.
we can calculate these scores
we know these scores
t x t block
44Four Russians Speedup
 Build a lookup table for all possible values of
the four variables  all possible scores for the first row s,j
 all possible scores for the first column s,j
 substring of sequence u in this block (4t
possibilities)  substring of sequence v in this block (4t
possibilities)  For each quadruple we store the value of the
score for the last row and last column.  Creates a huge table  can eliminate alignments
scores that dont make sense
45Reducing Table Size
 Alignment scores in LCS are monotonically
increasing, and adjacent elements cant differ by
more than 1  Example 0,1,2,2,3,4 is ok 0,1,2,4,5,8, is not
because 2 and 4 differ by more than 1 (and so do
5 and 8)  Therefore, we only need to store quadruples whose
scores are monotonically increasing and differ by
at most 1
46Efficient Encoding of Alignment Scores
 Instead of recording numbers that correspond to
the index in the sequences u and v, we can use
binary to encode the differences between the
alignment scores
original encoding
0 1 2 2 3 4
1 1 0 0 1 1
binary encoding
47Reducing Lookup Table Size
 2t possible scores (t size of blocks)
 4t possible strings
 Lookup table size is (2t 2t)(4t 4t) 26t
 Let t (logn)/4
 Table size is 26((logn)/4) n(6/4) n(3/2)
 Time O( n2/t2logn )
 O( n2/logn2logn) gt O( n2/logn )
48Summary
 We take advantage of the fact that for each block
of t log(n), we can precompute all possible
scores and store them in a lookup table of size
n(3/2)  We used the Four Russian speedup to go from a
quadratic running time for LCS to subquadratic
running time O(n2/log n)
49Next
 Graph algorithms
 Some of the following slides are based on slides
by the authors of our text.
50DNA Sequencing
 Shear DNA into millions of small fragments
 Read 500 700 nucleotides at a time from the
small fragments (Sanger method)
51Fragment Assembly
 Computational Challenge assemble individual
short fragments (reads) into a single genomic
sequence (superstring)  Until late 1990s the shotgun fragment assembly of
human genome was viewed as intractable problem
52Shortest Superstring Problem
 Problem Given a set of strings, find a shortest
string that contains all of them  Input Strings s1, s2,., sn
 Output A string s that contains all strings
 s1, s2,., sn as substrings, such that the
length of s is minimized  Complexity NP complete
 Note this formulation does not take into
account sequencing errors
53Shortest Superstring Problem Example
54Reducing SSP to TSP
 Define overlap ( si, sj ) as the length of the
longest prefix of sj that matches a suffix of si.  aaaggcatcaaatctaaaggcatcaaa

aaaggcatcaaatctaaaggcatcaaa 
What is overlap ( si, sj ) for these strings?
55Reducing SSP to TSP
 Define overlap ( si, sj ) as the length of the
longest prefix of sj that matches a suffix of si.  aaaggcatcaaatctaaaggcatcaaa

aaaggcatcaaatctaaaggcatcaaa  aaaggcatcaaatctaaag
gcatcaaa  overlap12
56Reducing SSP to TSP
 Define overlap ( si, sj ) as the length of the
longest prefix of sj that matches a suffix of si.  aaaggcatcaaatctaaaggcatcaaa

aaaggcatcaaatctaaaggcatcaaa  aaaggcatcaaatctaaag
gcatcaaa  Construct a graph with n vertices representing
the n strings s1, s2,., sn.  Insert edges of length overlap ( si, sj ) between
vertices si and sj.  Find the shortest path which visits every vertex
exactly once. This is the Traveling Salesman
Problem (TSP), which is also NP complete.
57Reducing SSP to TSP (contd)
58SSP to TSP An Example
 S ATC, CCA, CAG, TCC, AGT
 SSP
 AGT
 CCA
 ATC
 ATCCAGT
 TCC
 CAG
TSP
ATC
2
0
1
1
AGT
CCA
1
1
2
2
2
1
TCC
CAG
ATCCAGT
59Sequencing by Hybridization (SBH) History
 1988 SBH suggested as an an alternative
sequencing method. Nobody believed it will ever
work  1991 Light directed polymer synthesis developed
by Steve Fodor and colleagues.  1994 Affymetrix develops first 64kb DNA
microarray
First microarray prototype (1989)
First commercial DNA microarray prototype
w/16,000 features (1994)
500,000 features per chip (2002)
60How SBH Works
 Attach all possible DNA probes of length l to a
flat surface, each probe at a distinct and known
location. This set of probes is called the DNA
array.  Apply a solution containing fluorescently labeled
DNA fragment to the array.  The DNA fragment hybridizes with those probes
that are complementary to substrings of length l
of the fragment.
61How SBH Works (contd)
 Using a spectroscopic detector, determine which
probes hybridize to the DNA fragment to obtain
the lmer composition of the target DNA fragment.  Apply a combinatorial algorithm to reconstruct
the sequence of the target DNA fragment from the
l mer composition.
62Hybridization on DNA Array
63lmer composition
 Spectrum ( s, l )  unordered multiset of all
possible (n l 1) lmers in a string s, length
n  The order of individual elements in Spectrum
 ( s, l ) does not matter
 For s TATGGTGC all of the following are
equivalent representations of Spectrum ( s, 3 )
 TAT, ATG, TGG, GGT, GTG, TGC
 ATG, GGT, GTG, TAT, TGC, TGG
 TGG, TGC, TAT, GTG, GGT, ATG
We usually choose the lexicographically ordered
representation as the canonical one.
64Different sequences same spectrum
 Different sequences may have the same spectrum
 Spectrum(GTATCT,2)
 Spectrum(GTCTAT,2)
 AT, CT, GT, TA, TC
65The SBH Problem
 Goal Reconstruct a string from its lmer
composition  Input A set S, representing all lmers from an
(unknown) string s  Output String s such that Spectrum ( s,l ) S
66SBH Hamiltonian Path Approach
 S ATG AGG TGC TCC GTC GGT GCA CAG
H
ATG
AGG
TGC
TCC
GTC
GCA
CAG
GGT
ATG
C
A
G
G
T
C
C
Path visited every VERTEX once
67SBH Hamiltonian Path Approach
 A more complicated graph

 S ATG TGG TGC GTG GGC
GCA GCG CGT
68SBH Hamiltonian Path Approach
 S ATG TGG TGC GTG GGC
GCA GCG CGT  Path 1
ATGCGTGGCA
Path 2
ATGGCGTGCA
69SBH Eulerian Path Approach
 S ATG, TGC, GTG, GGC, GCA, GCG, CGT
 Vertices correspond to ( l 1 ) mers
AT, TG, GC, GG, GT, CA, CG  Edges correspond to l mers from S
70SBH Eulerian Path Approach
 S AT, TG, GC, GG, GT, CA, CG corresponds
to two different paths
GT
CG
GT
CG
AT
TG
AT
GC
TG
GC
CA
CA
GG
GG
ATGGCGTGCA
ATGCGTGGCA
71Euler Theorem
 A graph is balanced if for every vertex the
number of incoming edges equals to the number of
outgoing edges  in(v)out(v)
 Theorem A connected graph is Eulerian if and
only if each of its vertices is balanced.
72Euler Theorem Proof
 Eulerian ? balanced
 for every edge entering v (incoming edge)
there exists an edge leaving v (outgoing edge).
Therefore  in(v)out(v)
 Balanced ? Eulerian
 ???
73Algorithm for Constructing an Eulerian Cycle
 Start with an arbitrary vertex v and form an
arbitrary cycle with unused edges until a dead
end is reached. Since the graph is Eulerian this
dead end is necessarily the starting point, i.e.,
vertex v.
74Algorithm for Constructing an Eulerian Cycle
(contd)
 b. If cycle from (a) above is not an Eulerian
cycle, it must contain a vertex w, which has
untraversed edges. Perform step (a) again, using
vertex w as the starting point. Once again, we
will end up in the starting vertex w.
75Algorithm for Constructing an Eulerian Cycle
(contd)
 c. Combine the cycles from (a) and (b) into a
single cycle and iterate step (b).
76Euler Theorem Extension
 Theorem A connected graph has an Eulerian path
if and only if it contains at most two
semibalanced vertices and all other vertices are
balanced.
77Some Difficulties with SBH
 Fidelity of Hybridization difficult to detect
differences between probes hybridized with
perfect matches and 1 or 2 mismatches  Array Size Effect of low fidelity can be
decreased with longer lmers, but array size
increases exponentially in l. Array size is
limited with current technology.  Practicality SBH is still impractical. As DNA
microarray technology improves, SBH may become
practical in the future  Practicality again Although SBH is still
impractical, it spearheaded expression analysis
and SNP analysis techniques
78Traditional DNA Sequencing
DNA
Shake
DNA fragments
Known location (restriction site)
Vector Circular genome (bacterium, plasmid)
79Different Types of Vectors
VECTOR Size of insert (bp)
Plasmid 2,000  10,000
Cosmid 40,000
BAC (Bacterial Artificial Chromosome) 70,000  300,000
YAC (Yeast Artificial Chromosome) gt 300,000 Not used much recently
80Shotgun Sequencing
genomic segment
cut many times at random (Shotgun)
Get one or two reads from each segment
500 bp
500 bp
81Fragment Assembly
reads
Cover region with 7fold redundancy
Overlap reads and extend to reconstruct the
original genomic region
82Read Coverage
C
 Length of genomic segment L
 Number of reads n
Coverage C n l / L  Length of each read l
 How much coverage is enough?
 LanderWaterman model
 Assuming uniform distribution of reads, C10
results in 1 gapped region per 1,000,000
nucleotides
83Challenges in Fragment Assembly
 Repeats A major problem for fragment assembly
 gt 50 of human genome are repeats
  over 1 million Alu repeats (about 300 bp)
  about 200,000 LINE repeats (1000 bp and
longer)
84Triazzle A Fun Example
The puzzle looks simple BUT there are
repeats!!! The repeats make it very
difficult. Try it available at www.triazzle.com
85Repeat Types
 LowComplexity DNA (e.g. ATATATATACATA)
 Microsatellite repeats (a1ak)N where k 36
 (e.g. CAGCAGTAGCAGCACCAG)
 Transposons/retrotransposons
 SINE Short Interspersed Nuclear Elements
 (e.g., Alu 300 bp long, 106 copies)
 LINE Long Interspersed Nuclear Elements
 500  5,000 bp long, 200,000 copies
 LTR retroposons Long Terminal Repeats (700 bp)
at each end  Gene Families genes duplicate then diverge
 Segmental duplications very long, very similar
copies