# CSE 5290: Algorithms for Bioinformatics Fall 2011 - PowerPoint PPT Presentation

PPT – CSE 5290: Algorithms for Bioinformatics Fall 2011 PowerPoint presentation | free to download - id: 5d2edc-OTYzY The Adobe Flash plugin is needed to view this content

Get the plugin now

View by Category
Title:

## CSE 5290: Algorithms for Bioinformatics Fall 2011

Description:

### Title: PowerPoint Presentation Author: Wim van Dam Last modified by: SD Created Date: 8/27/2001 7:35:01 AM Document presentation format: Letter Paper (8.5x11 in) – PowerPoint PPT presentation

Number of Views:112
Avg rating:3.0/5.0
Slides: 86
Provided by: Wim89
Category:
Tags:
Transcript and Presenter's Notes

Title: CSE 5290: Algorithms for Bioinformatics Fall 2011

1
CSE 5290 Algorithms for Bioinformatics Fall 2011
• Suprakash Datta
• datta_at_cse.yorku.ca
• Office CSEB 3043
• Phone 416-736-2100 ext 77875
• Course page http//www.cse.yorku.ca/course/5290

2
Last time
• Dynamic programming algorithms for Sequence
alignment (global, local)
• Next Divide and conquer algorithms
• The following slides are based on slides by the
authors of our text.

3
Divide and Conquer Algorithms
• Steps
• Divide problem into sub-problems
• Conquer by solving sub-problems recursively. If
the sub-problems are small enough, solve them in
brute force fashion
• Combine the solutions of sub-problems into a
solution of the original problem (tricky part)

4
Examples of divide-and-conquer
• Merge sort

5
Divide and Conquer Approach to LCS
• Path(source, sink)
• if(source sink are in consecutive columns)
• output the longest path from source to sink
• else
• middle ? middle vertex between source sink
• Path(source, middle)
• Path(middle, sink)

The only problem left is how to find this middle
vertex!
6
Computing Alignment Path Requires Quadratic Memory
• Alignment Path
• Space complexity for computing alignment path for
sequences of length n and m is O(nm)
• We need to keep all backtracking references in
memory to reconstruct the path (backtracking)

m
n
7
Computing Alignment Score with Linear Memory
• Alignment Score
• Space complexity of computing just the score
itself is O(n)
• We only need the previous column to calculate the
current column, and we can then throw away that
previous column once were done using it

2
n
n
8
Recall Computing LCS
Let vi prefix of v of length i v1
vi and wj prefix of w of length j w1 wj
The length of LCS(vi,wj) is computed by
9
Computing Alignment Score Recycling Columns
Only two columns of scores are saved at any given
time
memory for column 1 is used to calculate column 3
memory for column 2 is used to calculate column 4
10
Crossing the Middle Line
We want to calculate the longest path from (0,0)
to (n,m) that passes through (i,m/2) where i
ranges from 0 to n and represents the i-th
row Define length(i) as the
length of the longest path from (0,0) to (n,m)
that passes through vertex (i, m/2)
m/2 m m/2 m m/2 m m/2 m
n
(i, m/2)
Prefix(i)
Suffix(i)
11
Crossing the Middle Line
m/2 m m/2 m m/2 m m/2 m
n
(i, m/2)
Prefix(i)
Suffix(i)
Define (mid,m/2) as the vertex where the longest
path crosses the middle column.
length(mid) optimal length max0?i ?n
length(i)
12
Computing Prefix(i)
• prefix(i) is the length of the longest path from
(0,0) to (i,m/2)
• Compute prefix(i) by dynamic programming in the
left half of the matrix

store prefix(i) column
0 m/2 m
13
Computing Suffix(i)
• suffix(i) is the length of the longest path from
(i,m/2) to (n,m)
• suffix(i) is the length of the longest path from
(n,m) to (i,m/2) with all edges reversed
• Compute suffix(i) by dynamic programming in the
right half of the reversed matrix

store suffix(i) column
0 m/2 m
14
Length(i) Prefix(i) Suffix(i)
• Add prefix(i) and suffix(i) to compute length(i)
• length(i)prefix(i) suffix(i)
• You now have a middle vertex of the maximum path
(i,m/2) as maximum of length(i)

0 i
middle point found
0 m/2 m
15
Finding the Middle Point
0 m/4 m/2 3m/4 m 0 m/4 m/2 3m/4 m

16
Finding the Middle Point again
0 m/4 m/2 3m/4 m 0 m/4 m/2 3m/4 m 0 m/4 m/2 3m/4 m 0 m/4 m/2 3m/4 m

17
And Again
0 m/8 m/4 3m/8 m/2 5m/8 3m/4 7m/8 m 0 m/8 m/4 3m/8 m/2 5m/8 3m/4 7m/8 m 0 m/8 m/4 3m/8 m/2 5m/8 3m/4 7m/8 m 0 m/8 m/4 3m/8 m/2 5m/8 3m/4 7m/8 m 0 m/8 m/4 3m/8 m/2 5m/8 3m/4 7m/8 m 0 m/8 m/4 3m/8 m/2 5m/8 3m/4 7m/8 m 0 m/8 m/4 3m/8 m/2 5m/8 3m/4 7m/8 m 0 m/8 m/4 3m/8 m/2 5m/8 3m/4 7m/8 m

18
Time Area First Pass
• On first pass, the algorithm covers the entire
area

Area n?m
19
Time Area First Pass
• On first pass, the algorithm covers the entire
area

Area n?m
Computing prefix(i)
Computing suffix(i)
20
Time Area Second Pass
• On second pass, the algorithm covers only 1/2 of
the area

Area/2
21
Time Area Third Pass
• On third pass, only 1/4th is covered.

Area/4
22
Geometric Reduction At Each Iteration
• 1 ½ ¼ ... (½)k 2
• Runtime O(Area) O(nm)

5th pass 1/16
3rd pass 1/4
first pass 1
4th pass 1/8
2nd pass 1/2
23
Is It Possible to Align Sequences in Subquadratic
Time?
• Dynamic Programming takes O(n2) for global
alignment
• Can we do better?
• Yes, use Four-Russians Speedup

24
Partitioning Sequences into Blocks
• Partition the n x n grid into blocks of size t x
t
• We are comparing two sequences, each of size n,
and each sequence is sectioned off into chunks,
each of length t
• Sequence u u1un becomes
• u1ut ut1u2t un-t1un
• and sequence v v1vn becomes
• v1vt vt1v2t vn-t1vn

25
Partitioning Alignment Grid into Blocks
n/t
n
t
t
n/t
n
partition
26
Block Alignment
• Block alignment of sequences u and v
• An entire block in u is aligned with an entire
block in v
• An entire block is inserted
• An entire block is deleted
• Block path a path that traverses every t x t
square through its corners

27
Block Alignment Examples
valid
invalid
28
Block Alignment Problem
• Goal Find the longest block path through an edit
graph
• Input Two sequences, u and v partitioned into
blocks of size t. This is equivalent to an n x n
edit graph partitioned into t x t subgrids
• Output The block alignment of u and v with the
maximum score (longest block path through the
edit graph

29
Constructing Alignments within Blocks
• To solve compute alignment score ßi,j for each
pair of blocks u(i-1)t1uit and
v(j-1)t1vjt
• How many blocks are there per sequence?
• (n/t) blocks of size t
• How many pairs of blocks for aligning the two
sequences?
• (n/t) x (n/t)
• For each block pair, solve a mini-alignment
problem of size t x t

30
Constructing Alignments within Blocks
n/t
Solve mini-alignmnent problems
Block pair represented by each small square
31
Block Alignment Dynamic Programming
• Let si,j denote the optimal block alignment score
between the first i blocks of u and first j
blocks of v

?block is the penalty for inserting or deleting
an entire block ?i,j is score of pair of blocks
in row i and column j.
si-1,j - ?block si,j-1 - ?block si-1,j-1 - ?i,j
si,j max
32
Block Alignment Runtime
• Indices i,j range from 0 to n/t
• Running time of algorithm is
• O( n/tn/t) O(n2/t2)
• if we dont count the time to compute each
??i,j

33
Block Alignment Runtime (contd)
• Computing all ??i,j requires solving (n/t)(n/t)
mini block alignments, each of size (tt)
• So computing ?all ?i,j takes time
• O(n/tn/ttt) O(n2)
• This is the same as dynamic programming
• How do we speed this up?

34
Four Russians Technique
• Let t log(n), where t is block size, n is
sequence size.
• Instead of having (n/t)(n/t) mini-alignments,
construct 4t x 4t mini-alignments for all pairs
of strings of t nucleotides (huge size), and put
in a lookup table.
• However, size of lookup table is not really that
huge if t is small. Let t (log
n)/4. Then 4t x 4t n

35
Look-up Table for Four Russians Technique
AAAAAA AAAAAC AAAAAG AAAAAT AAAACA
each sequence has t nucleotides
Lookup table Score
AAAAAA AAAAAC AAAAAG AAAAAT AAAACA
size is only n, instead of (n/t)(n/t)
36
New Recurrence
• The new lookup table Score is indexed by a pair
of t-nucleotide strings, so

si-1,j - ?block si,j-1 - ?block si-1,j-1
Score(ith block of v, jth block of u)
si,j max
37
Four Russians Speedup Runtime
• Since computing the lookup table Score of size n
takes O(n) time, the running time is mainly
limited by the (n/t)(n/t) accesses to the lookup
table
• Each access takes O(logn) time
• Overall running time O( n2/t2logn )
• Since t logn, substitute in
• O( n2/logn2logn) gt O( n2/logn )

38
So Far
• We can divide up the grid into blocks and run
dynamic programming only on the corners of these
blocks
• In order to speed up the mini-alignment
calculations to under n2, we create a lookup
table of size n, which consists of all scores for
all t-nucleotide pairs
• Running time goes from quadratic, O(n2), to

39
Four Russians Speedup for LCS
• Unlike the block partitioned graph, the LCS path
does not have to pass through the vertices of the
blocks.

block alignment
longest common subsequence
40
Block Alignment vs. LCS
• In block alignment, we only care about the
corners of the blocks.
• In LCS, we care about all points on the edges of
the blocks, because those are points that the
path can traverse.
• Recall, each sequence is of length n, each block
is of size t, so each sequence has (n/t) blocks.

41
Block Alignment vs. LCS Points Of Interest
block alignment has (n/t)(n/t) (n2/t2) points
of interest
LCS alignment has O(n2/t) points of interest
42
Traversing Blocks for LCS
• Given alignment scores si, in the first row and
scores s,j in the first column of a t x t mini
square, compute alignment scores in the last row
and column of the minisquare.
• To compute the last row and the last column
score, we use these 4 variables
• alignment scores si, in the first row
• alignment scores s,j in the first column
• substring of sequence u in this block (4t
possibilities)
• substring of sequence v in this block (4t
possibilities)

43
Traversing Blocks for LCS (contd)
• If we used this to compute the grid, it would
take quadratic, O(n2) time, but we want to do
better.

we can calculate these scores
we know these scores
t x t block
44
Four Russians Speedup
• Build a lookup table for all possible values of
the four variables
• all possible scores for the first row s,j
• all possible scores for the first column s,j
• substring of sequence u in this block (4t
possibilities)
• substring of sequence v in this block (4t
possibilities)
• For each quadruple we store the value of the
score for the last row and last column.
• Creates a huge table -- can eliminate alignments
scores that dont make sense

45
Reducing Table Size
• Alignment scores in LCS are monotonically
increasing, and adjacent elements cant differ by
more than 1
• Example 0,1,2,2,3,4 is ok 0,1,2,4,5,8, is not
because 2 and 4 differ by more than 1 (and so do
5 and 8)
• Therefore, we only need to store quadruples whose
scores are monotonically increasing and differ by
at most 1

46
Efficient Encoding of Alignment Scores
• Instead of recording numbers that correspond to
the index in the sequences u and v, we can use
binary to encode the differences between the
alignment scores

original encoding
0 1 2 2 3 4
1 1 0 0 1 1
binary encoding
47
Reducing Lookup Table Size
• 2t possible scores (t size of blocks)
• 4t possible strings
• Lookup table size is (2t 2t)(4t 4t) 26t
• Let t (logn)/4
• Table size is 26((logn)/4) n(6/4) n(3/2)
• Time O( n2/t2logn )
• O( n2/logn2logn) gt O( n2/logn )

48
Summary
• We take advantage of the fact that for each block
of t log(n), we can pre-compute all possible
scores and store them in a lookup table of size
n(3/2)
• We used the Four Russian speedup to go from a
running time O(n2/log n)

49
Next
• Graph algorithms
• Some of the following slides are based on slides
by the authors of our text.

50
DNA Sequencing
• Shear DNA into millions of small fragments
• Read 500 700 nucleotides at a time from the
small fragments (Sanger method)

51
Fragment Assembly
• Computational Challenge assemble individual
short fragments (reads) into a single genomic
sequence (superstring)
• Until late 1990s the shotgun fragment assembly of
human genome was viewed as intractable problem

52
Shortest Superstring Problem
• Problem Given a set of strings, find a shortest
string that contains all of them
• Input Strings s1, s2,., sn
• Output A string s that contains all strings
• s1, s2,., sn as substrings, such that the
length of s is minimized
• Complexity NP complete
• Note this formulation does not take into
account sequencing errors

53
Shortest Superstring Problem Example
54
Reducing SSP to TSP
• Define overlap ( si, sj ) as the length of the
longest prefix of sj that matches a suffix of si.
• aaaggcatcaaatctaaaggcatcaaa

• aaaggcatcaaatctaaaggcatcaaa

What is overlap ( si, sj ) for these strings?
55
Reducing SSP to TSP
• Define overlap ( si, sj ) as the length of the
longest prefix of sj that matches a suffix of si.
• aaaggcatcaaatctaaaggcatcaaa

• aaaggcatcaaatctaaaggcatcaaa
• aaaggcatcaaatctaaag
gcatcaaa
• overlap12

56
Reducing SSP to TSP
• Define overlap ( si, sj ) as the length of the
longest prefix of sj that matches a suffix of si.
• aaaggcatcaaatctaaaggcatcaaa

• aaaggcatcaaatctaaaggcatcaaa
• aaaggcatcaaatctaaag
gcatcaaa
• Construct a graph with n vertices representing
the n strings s1, s2,., sn.
• Insert edges of length overlap ( si, sj ) between
vertices si and sj.
• Find the shortest path which visits every vertex
exactly once. This is the Traveling Salesman
Problem (TSP), which is also NP complete.

57
Reducing SSP to TSP (contd)
58
SSP to TSP An Example
• S ATC, CCA, CAG, TCC, AGT
• SSP
• AGT
• CCA
• ATC
• ATCCAGT
• TCC
• CAG

TSP
ATC
2
0
1
1
AGT
CCA
1
1
2
2
2
1
TCC
CAG
ATCCAGT
59
Sequencing by Hybridization (SBH) History
• 1988 SBH suggested as an an alternative
sequencing method. Nobody believed it will ever
work
• 1991 Light directed polymer synthesis developed
by Steve Fodor and colleagues.
• 1994 Affymetrix develops first 64-kb DNA
microarray

First microarray prototype (1989)
First commercial DNA microarray prototype
w/16,000 features (1994)
500,000 features per chip (2002)
60
How SBH Works
• Attach all possible DNA probes of length l to a
flat surface, each probe at a distinct and known
location. This set of probes is called the DNA
array.
• Apply a solution containing fluorescently labeled
DNA fragment to the array.
• The DNA fragment hybridizes with those probes
that are complementary to substrings of length l
of the fragment.

61
How SBH Works (contd)
• Using a spectroscopic detector, determine which
probes hybridize to the DNA fragment to obtain
the lmer composition of the target DNA fragment.
• Apply a combinatorial algorithm to reconstruct
the sequence of the target DNA fragment from the
l mer composition.

62
Hybridization on DNA Array
63
l-mer composition
• Spectrum ( s, l ) - unordered multiset of all
possible (n l 1) l-mers in a string s, length
n
• The order of individual elements in Spectrum
• ( s, l ) does not matter
• For s TATGGTGC all of the following are
equivalent representations of Spectrum ( s, 3 )
• TAT, ATG, TGG, GGT, GTG, TGC
• ATG, GGT, GTG, TAT, TGC, TGG
• TGG, TGC, TAT, GTG, GGT, ATG

We usually choose the lexicographically ordered
representation as the canonical one.
64
Different sequences same spectrum
• Different sequences may have the same spectrum
• Spectrum(GTATCT,2)
• Spectrum(GTCTAT,2)
• AT, CT, GT, TA, TC

65
The SBH Problem
• Goal Reconstruct a string from its l-mer
composition
• Input A set S, representing all l-mers from an
(unknown) string s
• Output String s such that Spectrum ( s,l ) S

66
SBH Hamiltonian Path Approach
• S ATG AGG TGC TCC GTC GGT GCA CAG

H
ATG
AGG
TGC
TCC
GTC
GCA
CAG
GGT
ATG
C
A
G
G
T
C
C
Path visited every VERTEX once
67
SBH Hamiltonian Path Approach
• A more complicated graph
• S ATG TGG TGC GTG GGC
GCA GCG CGT

68
SBH Hamiltonian Path Approach
• S ATG TGG TGC GTG GGC
GCA GCG CGT
• Path 1

ATGCGTGGCA
Path 2
ATGGCGTGCA
69
SBH Eulerian Path Approach
• S ATG, TGC, GTG, GGC, GCA, GCG, CGT
• Vertices correspond to ( l 1 ) mers
AT, TG, GC, GG, GT, CA, CG
• Edges correspond to l mers from S

70
SBH Eulerian Path Approach
• S AT, TG, GC, GG, GT, CA, CG corresponds
to two different paths

GT
CG
GT
CG
AT
TG
AT
GC
TG
GC
CA
CA
GG
GG
ATGGCGTGCA
ATGCGTGGCA
71
Euler Theorem
• A graph is balanced if for every vertex the
number of incoming edges equals to the number of
outgoing edges
• in(v)out(v)
• Theorem A connected graph is Eulerian if and
only if each of its vertices is balanced.

72
Euler Theorem Proof
• Eulerian ? balanced
• for every edge entering v (incoming edge)
there exists an edge leaving v (outgoing edge).
Therefore
• in(v)out(v)
• Balanced ? Eulerian
• ???

73
Algorithm for Constructing an Eulerian Cycle
arbitrary cycle with unused edges until a dead
end is reached. Since the graph is Eulerian this
dead end is necessarily the starting point, i.e.,
vertex v.

74
Algorithm for Constructing an Eulerian Cycle
(contd)
• b. If cycle from (a) above is not an Eulerian
cycle, it must contain a vertex w, which has
untraversed edges. Perform step (a) again, using
vertex w as the starting point. Once again, we
will end up in the starting vertex w.

75
Algorithm for Constructing an Eulerian Cycle
(contd)
• c. Combine the cycles from (a) and (b) into a
single cycle and iterate step (b).

76
Euler Theorem Extension
• Theorem A connected graph has an Eulerian path
if and only if it contains at most two
semi-balanced vertices and all other vertices are
balanced.

77
Some Difficulties with SBH
• Fidelity of Hybridization difficult to detect
differences between probes hybridized with
perfect matches and 1 or 2 mismatches
• Array Size Effect of low fidelity can be
decreased with longer l-mers, but array size
increases exponentially in l. Array size is
limited with current technology.
• Practicality SBH is still impractical. As DNA
microarray technology improves, SBH may become
practical in the future
• Practicality again Although SBH is still
and SNP analysis techniques

78
DNA
Shake
DNA fragments
Known location (restriction site)
Vector Circular genome (bacterium, plasmid)

79
Different Types of Vectors
VECTOR Size of insert (bp)
Plasmid 2,000 - 10,000
Cosmid 40,000
BAC (Bacterial Artificial Chromosome) 70,000 - 300,000
YAC (Yeast Artificial Chromosome) gt 300,000 Not used much recently
80
Shotgun Sequencing
genomic segment
cut many times at random (Shotgun)
Get one or two reads from each segment
500 bp
500 bp
81
Fragment Assembly
Cover region with 7-fold redundancy
Overlap reads and extend to reconstruct the
original genomic region
82
C
• Length of genomic segment L
Coverage C n l / L
• Length of each read l
• How much coverage is enough?
• Lander-Waterman model
• Assuming uniform distribution of reads, C10
results in 1 gapped region per 1,000,000
nucleotides

83
Challenges in Fragment Assembly
• Repeats A major problem for fragment assembly
• gt 50 of human genome are repeats
• - over 1 million Alu repeats (about 300 bp)
• - about 200,000 LINE repeats (1000 bp and
longer)

84
Triazzle A Fun Example
The puzzle looks simple BUT there are
repeats!!! The repeats make it very
difficult. Try it available at www.triazzle.com
85
Repeat Types
• Low-Complexity DNA (e.g. ATATATATACATA)
• Microsatellite repeats (a1ak)N where k 3-6
• (e.g. CAGCAGTAGCAGCACCAG)
• Transposons/retrotransposons
• SINE Short Interspersed Nuclear Elements
• (e.g., Alu 300 bp long, 106 copies)
• LINE Long Interspersed Nuclear Elements
• 500 - 5,000 bp long, 200,000 copies
• LTR retroposons Long Terminal Repeats (700 bp)
at each end
• Gene Families genes duplicate then diverge
• Segmental duplications very long, very similar
copies