1
Chapter 4: Local (motif finding) and Global Multiple Alignment
2
  • When a problem is too hard, make it easy by
    finding the right problem.

3
1st Topic: (Global) Multiple Alignment
  • To do phylogenetic analysis
  • Same protein from different species
  • Optimal multiple alignment probably implies
    history
  • Discover irregularities, such as the Cystic
    Fibrosis (CFTR) gene: just one gap (for a
    one-codon deletion)
  • To find conserved regions
  • Local multiple alignment reveals conserved
    regions
  • Conserved regions usually are key functional
    regions
  • These regions are prime targets for drug
    development

4
Definitions
  • Given sequences s1, ..., sn, a multiple alignment M
    puts them in n rows, one sequence per row, with
    spaces inserted to get supersequences S1, ..., Sn,
    each of length |Si| = L.
  • caaccca
  • ca-cccc
  • ca-cccg
  • ca-ccct
  • SP-Alignment: minimize the sum of dH(Si,Sj) over all
    pairs i, j:
  • minM Σi≠j dH(Si,Sj)
  • Star-Alignment (also called Consensus Alignment):
    find a center sequence S minimizing the sum of
    dH(S,Si) over all i:
  • minS Σi=1..n dH(S,Si)
  • Three types of algorithms: exact, heuristic,
    approximate
  • Approximation ratio for minimization problems:
    r = cappr / copt. If r = 1+ε is achievable for any
    ε > 0, then it is an approximation scheme. If it runs
    in polynomial time for each fixed ε, it is a PTAS
    (polynomial-time approximation scheme).
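To make the two objectives concrete, here is a minimal Python sketch (the function names are mine, not from the slides) that scores a given multiple alignment under both criteria, using Hamming distance over the padded rows:

```python
from itertools import combinations

def hamming(a: str, b: str) -> int:
    """Hamming distance between two equal-length padded rows."""
    assert len(a) == len(b)
    return sum(x != y for x, y in zip(a, b))

def sp_score(rows):
    """SP objective: sum of pairwise Hamming distances over all pairs."""
    return sum(hamming(a, b) for a, b in combinations(rows, 2))

def star_score(rows, center):
    """Star objective: sum of distances from a chosen center string."""
    return sum(hamming(center, r) for r in rows)

if __name__ == "__main__":
    M = ["caaccca", "ca-cccc", "ca-cccg", "ca-ccct"]  # alignment from the slide
    print(sp_score(M))            # SP cost of this alignment
    print(star_score(M, M[1]))    # star cost with "ca-cccc" as the center
```

For the four-row example above, sp_score charges all 6 pairs of rows, while star_score only charges each row against the chosen center.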

5
Dynamic Programming for Multiple Alignment
  • Given A = a1...an, B = b1...bn, C = c1...cn
  • S(i,j,k) =
    max { S(i-1,j-1,k-1) + d(ai,bj,ck),
          S(i-1,j-1,k)   + d(ai,bj,-),
          S(i-1,j,k-1)   + d(ai,-,ck),
          S(i,j-1,k-1)   + d(-,bj,ck),
          S(i-1,j,k)     + d(ai,-,-),
          S(i,j-1,k)     + d(-,bj,-),
          S(i,j,k-1)     + d(-,-,ck) }
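A minimal Python sketch of this three-sequence DP. The column scoring function used here (match +1, mismatch 0, letter-vs-gap -1) is only an illustrative stand-in for the slides' d(.,.,.):

```python
def column_score(x: str, y: str, z: str) -> int:
    """Illustrative sum-of-pairs column score: +1 match, 0 mismatch,
    -1 letter-vs-gap, 0 gap-vs-gap (not the slides' scoring function)."""
    def pair(p, q):
        if p == '-' and q == '-':
            return 0
        if p == '-' or q == '-':
            return -1
        return 1 if p == q else 0
    return pair(x, y) + pair(x, z) + pair(y, z)

def align3(A: str, B: str, C: str) -> float:
    """Fill the 3D table S(i,j,k) from the recurrence above; returns the
    optimal score of aligning A, B, C (O(|A||B||C|) time and space)."""
    nA, nB, nC = len(A), len(B), len(C)
    NEG = float('-inf')
    S = [[[NEG] * (nC + 1) for _ in range(nB + 1)] for _ in range(nA + 1)]
    S[0][0][0] = 0
    for i in range(nA + 1):
        for j in range(nB + 1):
            for k in range(nC + 1):
                if i == j == k == 0:
                    continue
                a = A[i-1] if i else None
                b = B[j-1] if j else None
                c = C[k-1] if k else None
                best = NEG
                if i and j and k:
                    best = max(best, S[i-1][j-1][k-1] + column_score(a, b, c))
                if i and j:
                    best = max(best, S[i-1][j-1][k] + column_score(a, b, '-'))
                if i and k:
                    best = max(best, S[i-1][j][k-1] + column_score(a, '-', c))
                if j and k:
                    best = max(best, S[i][j-1][k-1] + column_score('-', b, c))
                if i:
                    best = max(best, S[i-1][j][k] + column_score(a, '-', '-'))
                if j:
                    best = max(best, S[i][j-1][k] + column_score('-', b, '-'))
                if k:
                    best = max(best, S[i][j][k-1] + column_score('-', '-', c))
                S[i][j][k] = best
    return S[nA][nB][nC]

print(align3("caaccca", "cacccc", "cacccg"))
```

The table has (|A|+1)(|B|+1)(|C|+1) cells, which is the n^3 behaviour for three sequences; for k sequences the same idea costs n^k, as the next slides point out.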

6
CLUSTAL-W
  • Standard, popular software
  • It does multiple alignment as follows:
  • Align 2 sequences
  • Repeat: keep adding a new sequence to the
    alignment until no more remain, using tree-like
    heuristics.
  • Problem: it is simply a heuristic.
  • Alternative: dynamic programming takes n^k time
    for k sequences. This is too slow.
  • We need to understand the problem and solve it
    right.

7
Making the Problem Simpler!
  • Multiple alignment is hard
  • For k sequences, dynamic programming takes n^k time
  • NP-hard in general; not clear how to approximate
  • Popular practice -- alignment within a band: the
    p-th letter in one sequence is not more than c
    places away from the p-th letter in another
    sequence in the final alignment, so the alignment
    runs along a diagonal band of width 2c.
  • Used in the final stage of the FASTA program.
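A sketch of the banded idea for just two sequences, using plain edit distance for illustration (the scoring and the function name are my choices): only cells with |i - j| ≤ c are filled, so the work drops from O(nm) to O(c·n).

```python
def banded_edit_distance(s: str, t: str, c: int):
    """Edit distance restricted to a diagonal band |i - j| <= c.
    Returns None if no alignment stays inside the band
    (e.g. when the length difference exceeds c)."""
    n, m = len(s), len(t)
    if abs(n - m) > c:
        return None
    INF = float('inf')
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0
    for i in range(n + 1):
        lo, hi = max(0, i - c), min(m, i + c)
        for j in range(lo, hi + 1):
            if i == 0 and j == 0:
                continue
            best = INF
            if i and j:
                best = min(best, D[i-1][j-1] + (s[i-1] != t[j-1]))
            if i and abs((i - 1) - j) <= c:        # cell above is inside the band
                best = min(best, D[i-1][j] + 1)    # gap in t
            if j and abs(i - (j - 1)) <= c:        # cell to the left is inside the band
                best = min(best, D[i][j-1] + 1)    # gap in s
            D[i][j] = best
    return D[n][m]

print(banded_edit_distance("caaccca", "cacccc", c=2))
```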

8
Literature
  • NP-hardness under various models: Wang-Jiang
    (JCB), Li-Ma-Wang (STOC'99), W. Just
  • Approximation results: Gusfield (2 - 1/L), Bafna,
    Lawler, Pevzner (CPM'94, 2 - k/L), star alignment.
  • Sankoff and Kruskal discussed alignment within a band
  • Pearson showed alignment within a band gives very
    good results for a lot of protein superfamilies.
  • Altschul and Lipman, Chao-Pearson-Miller,
    Fickett, Ukkonen, Spouge (survey) have all
    studied alignment within a band.
  • Li, Ma, Wang (STOC 2000): approximation scheme
    within a band (and show the problem is still NP-hard).

9
Alignment within a c-band
  • SP-Alignment
  • NP-hard (even 0-gap)
  • PTAS
  • PTAS for a constant number of insertion/deletion
    gaps per sequence on average (for coding regions,
    this assumption makes a lot of sense)
  • Star-Alignment
  • NP-hard (even 0-gap)
  • PTAS
  • PTAS for a constant number of insertion/deletion
    gaps per sequence on average

10
Star Alignment, constant gaps
  • We first consider the simpler Consensus Pattern
    Problem: given a set S = {s1, s2, ..., sn} of
    sequences, each of length m, and an integer L,
    find a median string s of length L and a
    substring ti (consensus pattern) of length L
    from each si, minimizing Σi=1..n dH(s, ti), where
    dH(x,y) is the Hamming distance between x and y.
  • The use of Hamming distance is for convenience
    only
  • Using the consensusPattern algorithm, we will
    design a PTAS for star alignment with a constant
    number of gaps.
  • Then, using the same idea, we extend the algorithm
    to SP alignment.

11
Idea: Sampling r substrings
[Figure: the consensus of all t1, ..., tn versus the consensus of a
random sample of r of the ti's. If column j has k percent a's among
all the ti's, we expect roughly k percent a's in column j of the sample.]
12
  • Chernoff's Bound: Consider n independent random
    trials with success probability p. Let m be the
    number of successes; then
  • (1) P(|m - np| ≥ εn) ≤ e^(-ε²n/(3p))
  • Lemma. Let h(a) be the number of occurrences of
    letter a in a1, ..., an, and let a* be a majority
    letter. Randomly pick r of the ai's and let ar be the
    majority among these r letters. Then
    E[h(a*) - h(ar)] = O(log r / √r)·(n - h(a*)).
  • Proof. Consider picking letter a from a1 ... an in r
    independent trials with success probability
    pa = h(a)/n. Let ma be the number of successes of
    getting a. Define S to be the set of letters a whose
    frequencies are close to that of the majority a*, i.e.
  • S = { a : r·pa* - r·pa < 2√r·log r + 1 }.
  • (2) h(a) > h(a*) - (2n·log r)/√r,
        for all a in S.
  • Thus E[h(ar)] = Σ(all a) Pr(ar = a)·h(a)
  •   = Σ(a in S) Pr(ar = a)·h(a) + Σ(a not in S) Pr(ar = a)·h(a)
  •   (the first term is large; the second term is tiny/unlikely)
  •   ≥ Σ(a in S) Pr(ar = a)·h(a)   // second term dropped
  •   ≥ Pr(ma* ≥ r·pa* - √r·log r) · Pr(ma ≤ r·pa + √r·log r for all a not in S)
  •       · (h(a*) - (2n·log r)/√r)
  •   ≥ (1 - e^(-O(log r)))·(1 - e^(-O(log r)))·(h(a*) - (2n·log r)/√r)
        by Chernoff bound (1)
  •   ≥ (1 - O(1/r))·(h(a*) - (2n·log r)/√r)
  • Thus E[h(a*) - h(ar)] = h(a*) - E[h(ar)] =
    O(log r / √r)·(n - h(a*)), assuming h(a*) < 3n/4.
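A quick Monte Carlo sanity check of the lemma (a sketch; the column composition, sample sizes, and trial count are arbitrary choices of mine): sample r letters from a column, take the sample majority, and compare its true count with that of the overall majority.

```python
import random
from collections import Counter

def sample_majority_gap(letters, r, trials=2000):
    """Average of h(a*) - h(a_r): true count of the overall majority
    minus true count of the majority of a random sample of size r."""
    counts = Counter(letters)
    a_star, h_star = counts.most_common(1)[0]
    total_gap = 0
    for _ in range(trials):
        sample = random.choices(letters, k=r)          # r independent picks
        a_r = Counter(sample).most_common(1)[0][0]     # sample majority
        total_gap += h_star - counts[a_r]
    return total_gap / trials

# A column with 60% 'a' out of n = 1000 letters (illustrative numbers).
column = ['a'] * 600 + ['c'] * 200 + ['g'] * 150 + ['t'] * 50
for r in (5, 20, 80):
    print(r, sample_majority_gap(column, r))   # gap shrinks roughly like log r / sqrt(r)
```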

13
Algorithm consensusPattern
  • Algorithm consensusPattern
  • Input: n sequences S = {s1, ..., sn}, integers L and r.
  • Output: the median string and consensus patterns.
  • for every r length-L substrings u1, u2, ..., ur
    from S do
  •   (a) Set u to be the column-wise majority
        string of the ui's.
  •   (b) for i = 1, 2, ..., n do
  •       set ti to be the substring of length L
          from si that is closest to u
  •   (c) Let c(u) = Σi=1..n dH(u, ti)
  • Output the u and the corresponding t1, t2, ..., tn
    minimizing c(u)
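A direct Python sketch of consensusPattern (the names and the brute-force enumeration are mine). The outer loop ranges over all choices of r length-L substrings, so this is only practical for tiny inputs, matching the n^O(r) running time claimed on the next slide:

```python
from itertools import combinations_with_replacement
from collections import Counter

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def consensus_pattern(seqs, L, r):
    """Enumerate every choice of r length-L substrings, take their
    column-wise majority u, pull the closest length-L substring ti
    from every sequence, and keep the u with the smallest total cost."""
    pool = [s[p:p+L] for s in seqs for p in range(len(s) - L + 1)]
    best = None
    for chosen in combinations_with_replacement(pool, r):
        # (a) column-wise majority of the r chosen substrings
        u = ''.join(Counter(col).most_common(1)[0][0] for col in zip(*chosen))
        # (b) closest length-L substring of each sequence
        ts = [min((s[p:p+L] for p in range(len(s) - L + 1)),
                  key=lambda t: hamming(u, t)) for s in seqs]
        # (c) total Hamming cost of this candidate median
        cost = sum(hamming(u, t) for t in ts)
        if best is None or cost < best[0]:
            best = (cost, u, ts)
    return best

print(consensus_pattern(["caaccca", "cacccc", "cacccg", "caccct"], L=4, r=2))
```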

14
Theorem. consensusPattern achieves a 1 + O(1/√r)
approximation ratio in n^O(r) time.
  • Proof. Suppose s and t1, ..., tn are an optimal
    solution. Obviously s is the column-wise
    majority of t1, ..., tn. Let copt = Σi=1..n dH(s,ti).
    Let R contain any r indices in {1, ..., n}, let sR be
    the column-wise majority of the tj's for j ∈ R, and
    let cR be the corresponding cost. Let hj(a) be the
    number of ti's such that ti[j] = a. Then we have
  • Σi=1..n dH(ti,s) = Σj=1..L (n - hj(s[j]))     (*)
  • Thus
  • cR - copt ≤ Σi=1..n dH(ti,sR) - Σi=1..n dH(ti,s)
    ≤ Σj=1..L [hj(s[j]) - hj(sR[j])]
  • The average value of cR - copt over all choices of R is
  • n^-r Σ(all R) (cR - copt) ≤ n^-r Σ(all R) Σj=1..L [hj(s[j]) - hj(sR[j])]
  •   = Σj=1..L n^-r Σ(all R) [hj(s[j]) - hj(sR[j])]
  •   ≤ O(1/√r) · Σj=1..L (n - hj(s[j]))     by the Lemma
  •   = O(1/√r) · copt     by (*)
  • Hence, there must be a set R of r indices such that
  • cR - copt ≤ O(1/√r) · copt


15
Star-alignment
  • Notation: in an alignment, a block of inserted
    "-" characters is called a gap. If a multiple
    alignment has c gaps on average for each sequence,
    we call it an average c-gap alignment.
  • We first design a PTAS for the average c-gap SP
    alignment.
  • Then, using the PTAS for the average c-gap SP
    alignment, we design a PTAS for SP-alignment
    within a band.

16
Star Alignment with c gaps
  • StarAlign
  • Input: S = {s1, ..., sn}, |si| = m
  • Output: a median string s.
  • for every set of r si's in S do
  •   for every alignment of these r sequences, with
      no more than c·r gaps in each sequence, each gap
      of size < c·r·m, do
  •     (a) Find the column-wise majority to form the
          median s.
  •     (b) Compute d = Σi=1..n dE(s,si).
  • Output the s which minimizes d.

17
Average c-gap SP Alignment
  • Key Idea: choose r representative sequences and
    find their correct alignment in the optimal
    alignment, by exhaustive search. Then we use this
    alignment as a reference.
  • Then we align every other sequence against this
    reference alignment.
  • Then choose the best.
  • All we have to show is that there are r sequences
    whose letter frequencies in each column of their
    alignment approximate those of the complete
    alignment --- just like in the Star Alignment case.

18
Sampling r sequences
[Figure: the complete alignment versus an alignment of r sampled
sequences. If column j of the complete alignment has k percent a's,
we expect roughly k percent a's in column j of the sample.]
19
AverageSPAlign
  • Input: S = {s1, ..., sn}, |si| = m, integers c, l, r
  • Output: a multiple alignment M
  • for L = m to n·m
  •   for any r sequences in S
  •     for any possible alignment M (of the r sequences)
        of length L and with no more than c·l gaps in
        each sequence
  •       align all other sequences to M   // one sequence
          at a time, against the fixed alignment M
  • Output the best alignment.

20
SP Alignment within c-Band
  • Basic Idea
  • Dynamically cut the sequences into segments
  • Each segment satisfies the average c-gap
    condition, hence we can use the previous algorithm
  • Then assemble the segments together
  • Divide and conquer.

[Figure: the sequences cut into 6 segments; each segment has c gaps
per sequence on average in the optimal alignment.]
21
The final algorithm: diagonalAlign
  • while (not finished)
  •   find a maximal prefix of each sequence (all of the
      same length) such that AverageSPAlign returns a low
      cost. Keep the multiple alignment for this
      segment.
  • Concatenate the multiple alignments for all the
    segments to form the final alignment.

22
2nd Topic: Motif Finding (Local Multiple
Alignment)
  • Finding motifs/conserved regions in proteins is
    important in drug design and proteomics.
  • We have already solved one version: the consensus
    pattern problem.
  • We will present 3 different methods:
  • (1) Heuristics: the Gibbs sampling method.
  • (2) Approximation algorithm.
  • (3) LP relaxation.

23
Many versions
  • Input: S = {s1, ..., sn}
  • Consensus Pattern: find s and a substring ti of each
    si, minimizing Σi=1..n dH(s,ti).
  • General Consensus Pattern: as above, but maximize
    Σj=1..L cj, where cj = Σa fj(a)·log fj(a) and fj(a)
    is the frequency of letter a in the j-th column.
  • Closest String: find s minimizing d such that
    dH(s,si) ≤ d for all i.
  • Closest Substring: find s and a substring ti of each
    si, minimizing d such that dH(s,ti) ≤ d.
  • All have been solved with a PTAS.

24
Given k protein sequences, find a conserved
region
[Figure: k sequences, each with a region of length L highlighted
in red.]
Red regions are conserved regions, or motifs. They
don't have to be exactly the same; they match with
higher scores than other regions.
25
1. Gibbs Sampling Algorithm
  • Many programs exist; some use HMMs. The most popular
    is perhaps the Gibbs sampling method of Lawrence
    et al.
  • Input: S1, ..., Sm
  • Randomly choose a word wi from each Si
  • Randomly choose an index r
  • Create a frequency matrix (qij) from the m-1
    words not from Sr.
  • For each position in Sr, compute the probability that
    the word starting there fits the frequency matrix
    (qij), and use the word with the highest probability;
    repeat.
  • Works well, but with no guarantee.
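A compact Python sketch of the procedure as described above; the word length, pseudocounts, stopping rule, and alphabet are my own choices, and many Gibbs samplers sample the new start position probabilistically rather than taking the argmax:

```python
import random
from collections import Counter

ALPHABET = "acgt"

def profile(words, L, pseudocount=1.0):
    """Column-wise frequency matrix q[j][letter] built from the given words."""
    q = []
    for j in range(L):
        col = Counter(w[j] for w in words)
        denom = len(words) + pseudocount * len(ALPHABET)
        q.append({a: (col[a] + pseudocount) / denom for a in ALPHABET})
    return q

def score(word, q):
    """Probability of the word under the profile (product of column frequencies)."""
    p = 1.0
    for j, a in enumerate(word):
        p *= q[j][a]
    return p

def gibbs_motif(seqs, L, iters=200, seed=0):
    """Slide's variant of Gibbs sampling: hold out one sequence, build the
    profile from the rest, and replace the held-out word by the
    highest-probability window (a heuristic; no guarantee)."""
    rng = random.Random(seed)
    starts = [rng.randrange(len(s) - L + 1) for s in seqs]
    for _ in range(iters):
        r = rng.randrange(len(seqs))                        # held-out sequence
        others = [s[p:p+L] for i, (s, p) in enumerate(zip(seqs, starts)) if i != r]
        q = profile(others, L)
        sr = seqs[r]
        starts[r] = max(range(len(sr) - L + 1),
                        key=lambda p: score(sr[p:p+L], q))  # best-fitting window
    return [s[p:p+L] for s, p in zip(seqs, starts)]

print(gibbs_motif(["acgtacgtggac", "ttggacacgt", "ccggacttacgt"], L=4))
```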

26
2. PTAS
  • Input: S1, ..., Sm, integer L.
  • Output: t1, ..., tm, |ti| = L (motifs)
  • For every choice of r length-L substrings, compute
    their consensus (under a suitable metric) M.
  • In each string, find the substring of length L
    closest to M.
  • Choose the best.
  • Theorem: this algorithm outputs a solution no
    worse than (1 + 1/sqrt(r)) times optimal. Time
    complexity is (nm)^O(r).
  • Has a guarantee, but slow.

27
3. Linear Programming Relaxation (Reference: the book
Randomized Algorithms, Motwani and Raghavan)
  • We now introduce a new technique.
  • For simplicity, let's consider a simplified
    problem. Given a set S = {s1, ..., sn} of sequences,
    each of length n, find a string t that is far away
    from all si in S: find t and the largest d such that
    dH(t,si) ≥ d for all i.
  • Formulating an integer program: for binary strings
    x = x1 ... xn and si = si1 ... sin, dH(x,si) =
    x'1 + x'2 + ... + x'n, where x'j = xj if sij = 0 and
    x'j = 1 - xj if sij = 1. Now we have an integer
    program:
  • max d
  • Σj=1..n aij·xj + Ci ≥ d for each si, with
    aij ∈ {0, 1, -1} and Ci a constant determined by si
  • xj ∈ {0, 1}

28
LP Relaxation
  • The Integer Program (as just described) is
    NP-hard, so it is hard to solve directly.
  • However, a Linear Program is polynomial-time
    solvable. So we relax the requirement that the
    solutions x1, ..., xn be 0 or 1; instead we allow
    0 ≤ xj ≤ 1.
  • Then randomized rounding --- convert the fractional
    solution to an integer solution by setting xj = 1 with
    probability xj and xj = 0 with probability 1-xj.
  • What does this give us? For each Σj=1..n aij·xj + Ci,
    after randomized rounding we lose about log n; if d is
    O(n), we are within d - log d with high probability.
    When d is small, we use other, simpler techniques.
    Thus we have a very good approximation.
  • A similar idea, but much more involved, applies to
    the closest (sub)string problems.
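A sketch of the relaxation-plus-rounding pipeline for the simplified "farthest string" problem above, assuming SciPy's linprog is available as the LP solver; the formulation follows the integer program of the previous slide with the 0/1 constraint relaxed to 0 ≤ xj ≤ 1:

```python
import random
import numpy as np
from scipy.optimize import linprog

def farthest_string_lp(strings, seed=0):
    """LP relaxation + randomized rounding for the simplified problem:
    find a binary string t maximizing min_i dH(t, s_i)."""
    n = len(strings[0])
    # Variables: x_1..x_n in [0,1], then d.  Minimize -d (i.e. maximize d).
    c = np.zeros(n + 1)
    c[-1] = -1.0
    A_ub, b_ub = [], []
    for s in strings:
        a = np.array([1.0 if ch == '0' else -1.0 for ch in s])  # a_ij
        C = s.count('1')                                         # constant C_i
        # dH(x, s) = C + a.x >= d   <=>   -a.x + d <= C
        A_ub.append(np.append(-a, 1.0))
        b_ub.append(C)
    bounds = [(0, 1)] * n + [(0, n)]
    res = linprog(c, A_ub=np.array(A_ub), b_ub=b_ub, bounds=bounds)
    x_frac, d_frac = res.x[:n], res.x[-1]
    # Randomized rounding: set x_j = 1 with probability x_j.
    rng = random.Random(seed)
    t = ''.join('1' if rng.random() < p else '0' for p in x_frac)
    d_rounded = min(sum(a != b for a, b in zip(t, s)) for s in strings)
    return t, d_frac, d_rounded

print(farthest_string_lp(["0000", "1111", "0101"]))
```

On toy inputs one can compare d_frac (the fractional LP optimum) with d_rounded (the distance actually achieved after rounding) to see the roughly logarithmic loss the slide refers to.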

29
Project Topics
  • Combining the PTAS approach with the background
    model studied by Sinha and Tompa
  • Improve running times for the approximation
    algorithms. Implement a real system.
  • Simplify the SP alignment proof using the Chernoff
    bound
  • Compare with Pevzner-Sze, "Combinatorial
    Approaches to Finding Subtle Signals in DNA
    Sequences", ISMB 2000, 269-278, and the projection
    approach of Buhler-Tompa.

http://www.cs.washington.edu/homes/tompa/papers/tfbs.doc
http://www.cs.washington.edu/homes/tompa/papers/proj.ps
30
Open Questions
  • The PTAS is guaranteed to converge, but it is slow.
    The heuristics are fast, but come with no convergence
    guarantee.
  • Open Problem: design more practical, efficient
    algorithms for these local/global alignment
    problems, possibly under more reasonable
    restrictions. It is important that such a program
    works with huge input sizes.