CSE 5290: Algorithms for Bioinformatics Fall 2011 - PowerPoint PPT Presentation

Loading...

PPT – CSE 5290: Algorithms for Bioinformatics Fall 2011 PowerPoint presentation | free to download - id: 5d2edc-OTYzY



Loading


The Adobe Flash plugin is needed to view this content

Get the plugin now

View by Category
About This Presentation
Title:

CSE 5290: Algorithms for Bioinformatics Fall 2011

Description:

Title: PowerPoint Presentation Author: Wim van Dam Last modified by: SD Created Date: 8/27/2001 7:35:01 AM Document presentation format: Letter Paper (8.5x11 in) – PowerPoint PPT presentation

Number of Views:112
Avg rating:3.0/5.0
Slides: 86
Provided by: Wim89
Learn more at: http://www.eecs.yorku.ca
Category:

less

Write a Comment
User Comments (0)
Transcript and Presenter's Notes

Title: CSE 5290: Algorithms for Bioinformatics Fall 2011


1
CSE 5290 Algorithms for Bioinformatics Fall 2011
  • Suprakash Datta
  • datta_at_cse.yorku.ca
  • Office CSEB 3043
  • Phone 416-736-2100 ext 77875
  • Course page http//www.cse.yorku.ca/course/5290

2
Last time
  • Dynamic programming algorithms for Sequence
    alignment (global, local)
  • Next Divide and conquer algorithms
  • The following slides are based on slides by the
    authors of our text.

3
Divide and Conquer Algorithms
  • Steps
  • Divide problem into sub-problems
  • Conquer by solving sub-problems recursively. If
    the sub-problems are small enough, solve them in
    brute force fashion
  • Combine the solutions of sub-problems into a
    solution of the original problem (tricky part)

4
Examples of divide-and-conquer
  • Merge sort

5
Divide and Conquer Approach to LCS
  • Path(source, sink)
  • if(source sink are in consecutive columns)
  • output the longest path from source to sink
  • else
  • middle ? middle vertex between source sink
  • Path(source, middle)
  • Path(middle, sink)

The only problem left is how to find this middle
vertex!
6
Computing Alignment Path Requires Quadratic Memory
  • Alignment Path
  • Space complexity for computing alignment path for
    sequences of length n and m is O(nm)
  • We need to keep all backtracking references in
    memory to reconstruct the path (backtracking)

m
n
7
Computing Alignment Score with Linear Memory
  • Alignment Score
  • Space complexity of computing just the score
    itself is O(n)
  • We only need the previous column to calculate the
    current column, and we can then throw away that
    previous column once were done using it

2
n
n
8
Recall Computing LCS
Let vi prefix of v of length i v1
vi and wj prefix of w of length j w1 wj
The length of LCS(vi,wj) is computed by
9
Computing Alignment Score Recycling Columns
Only two columns of scores are saved at any given
time
memory for column 1 is used to calculate column 3
memory for column 2 is used to calculate column 4
10
Crossing the Middle Line
We want to calculate the longest path from (0,0)
to (n,m) that passes through (i,m/2) where i
ranges from 0 to n and represents the i-th
row Define length(i) as the
length of the longest path from (0,0) to (n,m)
that passes through vertex (i, m/2)
m/2 m m/2 m m/2 m m/2 m
n
(i, m/2)
Prefix(i)
Suffix(i)
11
Crossing the Middle Line
m/2 m m/2 m m/2 m m/2 m
n
(i, m/2)
Prefix(i)
Suffix(i)
Define (mid,m/2) as the vertex where the longest
path crosses the middle column.
length(mid) optimal length max0?i ?n
length(i)
12
Computing Prefix(i)
  • prefix(i) is the length of the longest path from
    (0,0) to (i,m/2)
  • Compute prefix(i) by dynamic programming in the
    left half of the matrix

store prefix(i) column
0 m/2 m
13
Computing Suffix(i)
  • suffix(i) is the length of the longest path from
    (i,m/2) to (n,m)
  • suffix(i) is the length of the longest path from
    (n,m) to (i,m/2) with all edges reversed
  • Compute suffix(i) by dynamic programming in the
    right half of the reversed matrix

store suffix(i) column
0 m/2 m
14
Length(i) Prefix(i) Suffix(i)
  • Add prefix(i) and suffix(i) to compute length(i)
  • length(i)prefix(i) suffix(i)
  • You now have a middle vertex of the maximum path
    (i,m/2) as maximum of length(i)

0 i
middle point found
0 m/2 m
15
Finding the Middle Point
0 m/4 m/2 3m/4 m 0 m/4 m/2 3m/4 m


16
Finding the Middle Point again
0 m/4 m/2 3m/4 m 0 m/4 m/2 3m/4 m 0 m/4 m/2 3m/4 m 0 m/4 m/2 3m/4 m




17
And Again
0 m/8 m/4 3m/8 m/2 5m/8 3m/4 7m/8 m 0 m/8 m/4 3m/8 m/2 5m/8 3m/4 7m/8 m 0 m/8 m/4 3m/8 m/2 5m/8 3m/4 7m/8 m 0 m/8 m/4 3m/8 m/2 5m/8 3m/4 7m/8 m 0 m/8 m/4 3m/8 m/2 5m/8 3m/4 7m/8 m 0 m/8 m/4 3m/8 m/2 5m/8 3m/4 7m/8 m 0 m/8 m/4 3m/8 m/2 5m/8 3m/4 7m/8 m 0 m/8 m/4 3m/8 m/2 5m/8 3m/4 7m/8 m








18
Time Area First Pass
  • On first pass, the algorithm covers the entire
    area

Area n?m
19
Time Area First Pass
  • On first pass, the algorithm covers the entire
    area

Area n?m
Computing prefix(i)
Computing suffix(i)
20
Time Area Second Pass
  • On second pass, the algorithm covers only 1/2 of
    the area

Area/2
21
Time Area Third Pass
  • On third pass, only 1/4th is covered.

Area/4
22
Geometric Reduction At Each Iteration
  • 1 ½ ¼ ... (½)k 2
  • Runtime O(Area) O(nm)

5th pass 1/16
3rd pass 1/4
first pass 1
4th pass 1/8
2nd pass 1/2
23
Is It Possible to Align Sequences in Subquadratic
Time?
  • Dynamic Programming takes O(n2) for global
    alignment
  • Can we do better?
  • Yes, use Four-Russians Speedup

24
Partitioning Sequences into Blocks
  • Partition the n x n grid into blocks of size t x
    t
  • We are comparing two sequences, each of size n,
    and each sequence is sectioned off into chunks,
    each of length t
  • Sequence u u1un becomes
  • u1ut ut1u2t un-t1un
  • and sequence v v1vn becomes
  • v1vt vt1v2t vn-t1vn

25
Partitioning Alignment Grid into Blocks
n/t
n
t
t
n/t
n
partition
26
Block Alignment
  • Block alignment of sequences u and v
  • An entire block in u is aligned with an entire
    block in v
  • An entire block is inserted
  • An entire block is deleted
  • Block path a path that traverses every t x t
    square through its corners

27
Block Alignment Examples
valid
invalid
28
Block Alignment Problem
  • Goal Find the longest block path through an edit
    graph
  • Input Two sequences, u and v partitioned into
    blocks of size t. This is equivalent to an n x n
    edit graph partitioned into t x t subgrids
  • Output The block alignment of u and v with the
    maximum score (longest block path through the
    edit graph

29
Constructing Alignments within Blocks
  • To solve compute alignment score ßi,j for each
    pair of blocks u(i-1)t1uit and
    v(j-1)t1vjt
  • How many blocks are there per sequence?
  • (n/t) blocks of size t
  • How many pairs of blocks for aligning the two
    sequences?
  • (n/t) x (n/t)
  • For each block pair, solve a mini-alignment
    problem of size t x t

30
Constructing Alignments within Blocks
n/t
Solve mini-alignmnent problems
Block pair represented by each small square
31
Block Alignment Dynamic Programming
  • Let si,j denote the optimal block alignment score
    between the first i blocks of u and first j
    blocks of v

?block is the penalty for inserting or deleting
an entire block ?i,j is score of pair of blocks
in row i and column j.
si-1,j - ?block si,j-1 - ?block si-1,j-1 - ?i,j
si,j max
32
Block Alignment Runtime
  • Indices i,j range from 0 to n/t
  • Running time of algorithm is
  • O( n/tn/t) O(n2/t2)
  • if we dont count the time to compute each
    ??i,j

33
Block Alignment Runtime (contd)
  • Computing all ??i,j requires solving (n/t)(n/t)
    mini block alignments, each of size (tt)
  • So computing ?all ?i,j takes time
  • O(n/tn/ttt) O(n2)
  • This is the same as dynamic programming
  • How do we speed this up?

34
Four Russians Technique
  • Let t log(n), where t is block size, n is
    sequence size.
  • Instead of having (n/t)(n/t) mini-alignments,
    construct 4t x 4t mini-alignments for all pairs
    of strings of t nucleotides (huge size), and put
    in a lookup table.
  • However, size of lookup table is not really that
    huge if t is small. Let t (log
    n)/4. Then 4t x 4t n

35
Look-up Table for Four Russians Technique
AAAAAA AAAAAC AAAAAG AAAAAT AAAACA
each sequence has t nucleotides
Lookup table Score
AAAAAA AAAAAC AAAAAG AAAAAT AAAACA
size is only n, instead of (n/t)(n/t)
36
New Recurrence
  • The new lookup table Score is indexed by a pair
    of t-nucleotide strings, so

si-1,j - ?block si,j-1 - ?block si-1,j-1
Score(ith block of v, jth block of u)
si,j max
37
Four Russians Speedup Runtime
  • Since computing the lookup table Score of size n
    takes O(n) time, the running time is mainly
    limited by the (n/t)(n/t) accesses to the lookup
    table
  • Each access takes O(logn) time
  • Overall running time O( n2/t2logn )
  • Since t logn, substitute in
  • O( n2/logn2logn) gt O( n2/logn )

38
So Far
  • We can divide up the grid into blocks and run
    dynamic programming only on the corners of these
    blocks
  • In order to speed up the mini-alignment
    calculations to under n2, we create a lookup
    table of size n, which consists of all scores for
    all t-nucleotide pairs
  • Running time goes from quadratic, O(n2), to
    subquadratic O(n2/logn)

39
Four Russians Speedup for LCS
  • Unlike the block partitioned graph, the LCS path
    does not have to pass through the vertices of the
    blocks.

block alignment
longest common subsequence
40
Block Alignment vs. LCS
  • In block alignment, we only care about the
    corners of the blocks.
  • In LCS, we care about all points on the edges of
    the blocks, because those are points that the
    path can traverse.
  • Recall, each sequence is of length n, each block
    is of size t, so each sequence has (n/t) blocks.

41
Block Alignment vs. LCS Points Of Interest
block alignment has (n/t)(n/t) (n2/t2) points
of interest
LCS alignment has O(n2/t) points of interest
42
Traversing Blocks for LCS
  • Given alignment scores si, in the first row and
    scores s,j in the first column of a t x t mini
    square, compute alignment scores in the last row
    and column of the minisquare.
  • To compute the last row and the last column
    score, we use these 4 variables
  • alignment scores si, in the first row
  • alignment scores s,j in the first column
  • substring of sequence u in this block (4t
    possibilities)
  • substring of sequence v in this block (4t
    possibilities)

43
Traversing Blocks for LCS (contd)
  • If we used this to compute the grid, it would
    take quadratic, O(n2) time, but we want to do
    better.

we can calculate these scores
we know these scores
t x t block
44
Four Russians Speedup
  • Build a lookup table for all possible values of
    the four variables
  • all possible scores for the first row s,j
  • all possible scores for the first column s,j
  • substring of sequence u in this block (4t
    possibilities)
  • substring of sequence v in this block (4t
    possibilities)
  • For each quadruple we store the value of the
    score for the last row and last column.
  • Creates a huge table -- can eliminate alignments
    scores that dont make sense

45
Reducing Table Size
  • Alignment scores in LCS are monotonically
    increasing, and adjacent elements cant differ by
    more than 1
  • Example 0,1,2,2,3,4 is ok 0,1,2,4,5,8, is not
    because 2 and 4 differ by more than 1 (and so do
    5 and 8)
  • Therefore, we only need to store quadruples whose
    scores are monotonically increasing and differ by
    at most 1

46
Efficient Encoding of Alignment Scores
  • Instead of recording numbers that correspond to
    the index in the sequences u and v, we can use
    binary to encode the differences between the
    alignment scores

original encoding
0 1 2 2 3 4
1 1 0 0 1 1
binary encoding
47
Reducing Lookup Table Size
  • 2t possible scores (t size of blocks)
  • 4t possible strings
  • Lookup table size is (2t 2t)(4t 4t) 26t
  • Let t (logn)/4
  • Table size is 26((logn)/4) n(6/4) n(3/2)
  • Time O( n2/t2logn )
  • O( n2/logn2logn) gt O( n2/logn )

48
Summary
  • We take advantage of the fact that for each block
    of t log(n), we can pre-compute all possible
    scores and store them in a lookup table of size
    n(3/2)
  • We used the Four Russian speedup to go from a
    quadratic running time for LCS to subquadratic
    running time O(n2/log n)

49
Next
  • Graph algorithms
  • Some of the following slides are based on slides
    by the authors of our text.

50
DNA Sequencing
  • Shear DNA into millions of small fragments
  • Read 500 700 nucleotides at a time from the
    small fragments (Sanger method)

51
Fragment Assembly
  • Computational Challenge assemble individual
    short fragments (reads) into a single genomic
    sequence (superstring)
  • Until late 1990s the shotgun fragment assembly of
    human genome was viewed as intractable problem

52
Shortest Superstring Problem
  • Problem Given a set of strings, find a shortest
    string that contains all of them
  • Input Strings s1, s2,., sn
  • Output A string s that contains all strings
  • s1, s2,., sn as substrings, such that the
    length of s is minimized
  • Complexity NP complete
  • Note this formulation does not take into
    account sequencing errors

53
Shortest Superstring Problem Example
54
Reducing SSP to TSP
  • Define overlap ( si, sj ) as the length of the
    longest prefix of sj that matches a suffix of si.
  • aaaggcatcaaatctaaaggcatcaaa

  • aaaggcatcaaatctaaaggcatcaaa

What is overlap ( si, sj ) for these strings?
55
Reducing SSP to TSP
  • Define overlap ( si, sj ) as the length of the
    longest prefix of sj that matches a suffix of si.
  • aaaggcatcaaatctaaaggcatcaaa

  • aaaggcatcaaatctaaaggcatcaaa
  • aaaggcatcaaatctaaag
    gcatcaaa
  • overlap12

56
Reducing SSP to TSP
  • Define overlap ( si, sj ) as the length of the
    longest prefix of sj that matches a suffix of si.
  • aaaggcatcaaatctaaaggcatcaaa

  • aaaggcatcaaatctaaaggcatcaaa
  • aaaggcatcaaatctaaag
    gcatcaaa
  • Construct a graph with n vertices representing
    the n strings s1, s2,., sn.
  • Insert edges of length overlap ( si, sj ) between
    vertices si and sj.
  • Find the shortest path which visits every vertex
    exactly once. This is the Traveling Salesman
    Problem (TSP), which is also NP complete.

57
Reducing SSP to TSP (contd)
58
SSP to TSP An Example
  • S ATC, CCA, CAG, TCC, AGT
  • SSP
  • AGT
  • CCA
  • ATC
  • ATCCAGT
  • TCC
  • CAG

TSP
ATC
2
0
1
1
AGT
CCA
1
1
2
2
2
1
TCC
CAG
ATCCAGT
59
Sequencing by Hybridization (SBH) History
  • 1988 SBH suggested as an an alternative
    sequencing method. Nobody believed it will ever
    work
  • 1991 Light directed polymer synthesis developed
    by Steve Fodor and colleagues.
  • 1994 Affymetrix develops first 64-kb DNA
    microarray

First microarray prototype (1989)
First commercial DNA microarray prototype
w/16,000 features (1994)
500,000 features per chip (2002)
60
How SBH Works
  • Attach all possible DNA probes of length l to a
    flat surface, each probe at a distinct and known
    location. This set of probes is called the DNA
    array.
  • Apply a solution containing fluorescently labeled
    DNA fragment to the array.
  • The DNA fragment hybridizes with those probes
    that are complementary to substrings of length l
    of the fragment.

61
How SBH Works (contd)
  • Using a spectroscopic detector, determine which
    probes hybridize to the DNA fragment to obtain
    the lmer composition of the target DNA fragment.
  • Apply a combinatorial algorithm to reconstruct
    the sequence of the target DNA fragment from the
    l mer composition.

62
Hybridization on DNA Array
63
l-mer composition
  • Spectrum ( s, l ) - unordered multiset of all
    possible (n l 1) l-mers in a string s, length
    n
  • The order of individual elements in Spectrum
  • ( s, l ) does not matter
  • For s TATGGTGC all of the following are
    equivalent representations of Spectrum ( s, 3 )
  • TAT, ATG, TGG, GGT, GTG, TGC
  • ATG, GGT, GTG, TAT, TGC, TGG
  • TGG, TGC, TAT, GTG, GGT, ATG

We usually choose the lexicographically ordered
representation as the canonical one.
64
Different sequences same spectrum
  • Different sequences may have the same spectrum
  • Spectrum(GTATCT,2)
  • Spectrum(GTCTAT,2)
  • AT, CT, GT, TA, TC

65
The SBH Problem
  • Goal Reconstruct a string from its l-mer
    composition
  • Input A set S, representing all l-mers from an
    (unknown) string s
  • Output String s such that Spectrum ( s,l ) S

66
SBH Hamiltonian Path Approach
  • S ATG AGG TGC TCC GTC GGT GCA CAG

H
ATG
AGG
TGC
TCC
GTC
GCA
CAG
GGT
ATG
C
A
G
G
T
C
C
Path visited every VERTEX once
67
SBH Hamiltonian Path Approach
  • A more complicated graph
  • S ATG TGG TGC GTG GGC
    GCA GCG CGT

68
SBH Hamiltonian Path Approach
  • S ATG TGG TGC GTG GGC
    GCA GCG CGT
  • Path 1

ATGCGTGGCA
Path 2
ATGGCGTGCA
69
SBH Eulerian Path Approach
  • S ATG, TGC, GTG, GGC, GCA, GCG, CGT
  • Vertices correspond to ( l 1 ) mers
    AT, TG, GC, GG, GT, CA, CG
  • Edges correspond to l mers from S

70
SBH Eulerian Path Approach
  • S AT, TG, GC, GG, GT, CA, CG corresponds
    to two different paths

GT
CG
GT
CG
AT
TG
AT
GC
TG
GC
CA
CA
GG
GG
ATGGCGTGCA
ATGCGTGGCA
71
Euler Theorem
  • A graph is balanced if for every vertex the
    number of incoming edges equals to the number of
    outgoing edges
  • in(v)out(v)
  • Theorem A connected graph is Eulerian if and
    only if each of its vertices is balanced.

72
Euler Theorem Proof
  • Eulerian ? balanced
  • for every edge entering v (incoming edge)
    there exists an edge leaving v (outgoing edge).
    Therefore
  • in(v)out(v)
  • Balanced ? Eulerian
  • ???

73
Algorithm for Constructing an Eulerian Cycle
  1. Start with an arbitrary vertex v and form an
    arbitrary cycle with unused edges until a dead
    end is reached. Since the graph is Eulerian this
    dead end is necessarily the starting point, i.e.,
    vertex v.

74
Algorithm for Constructing an Eulerian Cycle
(contd)
  • b. If cycle from (a) above is not an Eulerian
    cycle, it must contain a vertex w, which has
    untraversed edges. Perform step (a) again, using
    vertex w as the starting point. Once again, we
    will end up in the starting vertex w.

75
Algorithm for Constructing an Eulerian Cycle
(contd)
  • c. Combine the cycles from (a) and (b) into a
    single cycle and iterate step (b).

76
Euler Theorem Extension
  • Theorem A connected graph has an Eulerian path
    if and only if it contains at most two
    semi-balanced vertices and all other vertices are
    balanced.

77
Some Difficulties with SBH
  • Fidelity of Hybridization difficult to detect
    differences between probes hybridized with
    perfect matches and 1 or 2 mismatches
  • Array Size Effect of low fidelity can be
    decreased with longer l-mers, but array size
    increases exponentially in l. Array size is
    limited with current technology.
  • Practicality SBH is still impractical. As DNA
    microarray technology improves, SBH may become
    practical in the future
  • Practicality again Although SBH is still
    impractical, it spearheaded expression analysis
    and SNP analysis techniques

78
Traditional DNA Sequencing
DNA
Shake
DNA fragments
Known location (restriction site)
Vector Circular genome (bacterium, plasmid)


79
Different Types of Vectors
VECTOR Size of insert (bp)
Plasmid 2,000 - 10,000
Cosmid 40,000
BAC (Bacterial Artificial Chromosome) 70,000 - 300,000
YAC (Yeast Artificial Chromosome) gt 300,000 Not used much recently
80
Shotgun Sequencing
genomic segment
cut many times at random (Shotgun)
Get one or two reads from each segment
500 bp
500 bp
81
Fragment Assembly
reads
Cover region with 7-fold redundancy
Overlap reads and extend to reconstruct the
original genomic region
82
Read Coverage
C
  • Length of genomic segment L
  • Number of reads n
    Coverage C n l / L
  • Length of each read l
  • How much coverage is enough?
  • Lander-Waterman model
  • Assuming uniform distribution of reads, C10
    results in 1 gapped region per 1,000,000
    nucleotides

83
Challenges in Fragment Assembly
  • Repeats A major problem for fragment assembly
  • gt 50 of human genome are repeats
  • - over 1 million Alu repeats (about 300 bp)
  • - about 200,000 LINE repeats (1000 bp and
    longer)

84
Triazzle A Fun Example
The puzzle looks simple BUT there are
repeats!!! The repeats make it very
difficult. Try it available at www.triazzle.com
85
Repeat Types
  • Low-Complexity DNA (e.g. ATATATATACATA)
  • Microsatellite repeats (a1ak)N where k 3-6
  • (e.g. CAGCAGTAGCAGCACCAG)
  • Transposons/retrotransposons
  • SINE Short Interspersed Nuclear Elements
  • (e.g., Alu 300 bp long, 106 copies)
  • LINE Long Interspersed Nuclear Elements
  • 500 - 5,000 bp long, 200,000 copies
  • LTR retroposons Long Terminal Repeats (700 bp)
    at each end
  • Gene Families genes duplicate then diverge
  • Segmental duplications very long, very similar
    copies
About PowerShow.com