Space Efficient Algorithms for Planted Motif Search

Jaime Davila, IWBRA 2006. Space Efficient Algorithms for ... Jaime Davila, Sudha Balla, Sanguthevar Rajasekaran. CSE Department at University of Connecticut ...


1
Space Efficient Algorithms for Planted Motif
Search
  • Jaime Davila, Sudha Balla, Sanguthevar
    Rajasekaran
  • CSE Department at University of Connecticut

2
Definition of (l,d) Motif Problem
  • Given sequences s1, s2, ..., sn of length m each.
  • Find a string x of size l (an l-mer) that
    appears as a substring in all of them, with at
    most d mismatches in every occurrence. That is, x
    almost appears in si for i = 1, ..., n.

3
(l,d) Motif Problem, Example
  • s1 = GGCATCCGATTATTGTAGTCTGG
  • s2 = ATTTCTATGCTAAGCTTGCTCGA
  • s3 = CAGGCTGTAAGTAGTTTGTTAGC
  • l = 5, d = 1

4
(l,d) Motif Problem, Solution
  • s1 = GGCATCCGATTATTGTAGTCTGG
  • s2 = ATTTCTATGCTAAGCTTGCTCGA
  • s3 = CAGGCTGTAAGTAGTTTGTTAGC
  • x = TTTTT is a (5,1) motif.
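The claim on this slide can be checked mechanically. The following is a minimal sketch, not the authors' code; the helper names `hamming`, `lmers` and `is_motif` are hypothetical and introduced only for illustration. It tests whether every sequence contains an l-mer within Hamming distance d of a candidate x.

```python
def hamming(a, b):
    """Number of positions where the equal-length strings a and b differ."""
    return sum(ca != cb for ca, cb in zip(a, b))

def lmers(s, l):
    """All length-l substrings of s, left to right."""
    return [s[i:i + l] for i in range(len(s) - l + 1)]

def is_motif(x, seqs, d):
    """True if every sequence has an l-mer at distance at most d from x."""
    return all(
        any(hamming(x, y) <= d for y in lmers(s, len(x)))
        for s in seqs
    )

seqs = [
    "GGCATCCGATTATTGTAGTCTGG",
    "ATTTCTATGCTAAGCTTGCTCGA",
    "CAGGCTGTAAGTAGTTTGTTAGC",
]
print(is_motif("TTTTT", seqs, 1))  # True
print(is_motif("TTTTT", seqs, 0))  # False: TTTTT never occurs exactly
```

The three near-occurrences are TTATT in s1, TTTCT in s2 and TTTGT in s3, each one mismatch away from TTTTT.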

5
Motivation
  • Mining of Transcription Factor Binding Sites,
    which are small sequences of DNA that mark the
    beginning of coding regions in DNA. They might
    appear slightly modified in different sequences.

6
PMS: Simple Planted Motif Search (Raj et al., 2005)
  • l = 5, d = 1
  • s1 = GGCATCCGATTATTGTAGTCTGG
  • s2 = ATTTCTATGCTAAGCTTGCTCGA
  • s3 = CAGGCTGTAAGTAGTTTGTTAGC

7
PMS Description
  • 1) Build the d-vicinity B(x,d) of each l-mer x
  • s1 = GGCATCCGATTATTGTAGTCTGG
  • s2 = ATTTCTATGCTAAGCTTGCTCGA
  • s3 = CAGGCTGTAAGTAGTTTGTTAGC

B(ATCCG,1) = {CTCCG, TTCCG, ..., ATCCT}
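One way to enumerate a d-vicinity is sketched below. This is an illustrative recursion over the DNA alphabet, assumed here for clarity; the slides do not say how the authors generate B(x,d), and the function name `vicinity` is hypothetical.

```python
def vicinity(x, d, alphabet="ACGT"):
    """B(x, d): all strings within Hamming distance d of x."""
    if not x:
        return {""}
    out = set()
    for c in alphabet:
        cost = 0 if c == x[0] else 1  # changing the first letter uses one mismatch
        if cost <= d:
            out |= {c + rest for rest in vicinity(x[1:], d - cost, alphabet)}
    return out

b = vicinity("ATCCG", 1)
print(len(b))  # 16 = 1 + 5 * 3
print({"CTCCG", "TTCCG", "ATCCT"} <= b)  # True
```

For l = 5 and d = 1 the vicinity holds the l-mer itself plus 3 alternatives at each of the 5 positions.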
8
PMS Description
  • 2) Let Li be the union of the d-vicinities of all
    l-mers in sequence si. Sort each Li using radix sort.
  • GGCATCCGATTATTGTAGTCTGG
  • ATTTCTATGCTAAGCTTGCTCGA
  • CAGGCTGTAAGTAGTTTGTTAGC

Li = ∪ B(x,d) over all l-mers x in si, for i = 1, 2, 3
9
PMS Description
  • 3) Mi is the intersection of Lk for k = 1, ..., i
  • GGCATCCGATTATTGTAGTCTGG
  • ATTTCTATGCTAAGCTTGCTCGA
  • CAGGCTGTAAGTAGTTTGTTAGC

M3 = L1 ∩ L2 ∩ L3, with TTTTT ∈ M3
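The three PMS steps can be put together as in the sketch below. It is an illustration under stated assumptions rather than the published implementation: Python's built-in sort stands in for radix sort, and the names `vicinity`, `intersect_sorted` and `pms` are hypothetical. Sorting matters because two sorted lists can be intersected in a single linear pass.

```python
def vicinity(x, d, alphabet="ACGT"):
    """B(x, d): all strings within Hamming distance d of x."""
    if not x:
        return {""}
    out = set()
    for c in alphabet:
        cost = 0 if c == x[0] else 1
        if cost <= d:
            out |= {c + rest for rest in vicinity(x[1:], d - cost, alphabet)}
    return out

def intersect_sorted(a, b):
    """Intersect two sorted lists of distinct strings with a two-pointer merge."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out

def pms(seqs, l, d):
    """Return M_n: the sorted list of all (l, d) motifs of seqs."""
    m = None
    for s in seqs:
        li = set()
        for i in range(len(s) - l + 1):      # step 1: vicinity of every l-mer
            li |= vicinity(s[i:i + l], d)
        li = sorted(li)                      # step 2: sort L_i (assumed stand-in for radix sort)
        m = li if m is None else intersect_sorted(m, li)  # step 3: M_i
    return m

seqs = [
    "GGCATCCGATTATTGTAGTCTGG",
    "ATTTCTATGCTAAGCTTGCTCGA",
    "CAGGCTGTAAGTAGTTTGTTAGC",
]
motifs = pms(seqs, 5, 1)
print("TTTTT" in motifs)  # True
```

Note that each full list Li is materialised before it is intersected, which is the memory cost the next slide discusses.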
10
PMS Drawbacks
  • As d increases, the sizes of the Li increase
    considerably. For n = 20, m = 600, l = 15 and d = 4, the
    core memory requirement is over 1 GB.
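A back-of-the-envelope count shows where a figure of this order can come from. The sketch below assumes one byte per character and counts entries before duplicates are removed; both are assumptions made for illustration, since the slides do not describe the authors' encoding.

```python
from math import comb

def vicinity_size(l, d, sigma=4):
    """|B(x, d)| for an l-mer over an alphabet of size sigma."""
    return sum(comb(l, i) * (sigma - 1) ** i for i in range(d + 1))

m, l, d = 600, 15, 4
per_lmer = vicinity_size(l, d)       # 123841 neighbours per l-mer
entries = (m - l + 1) * per_lmer     # 72570826 entries for one sequence, duplicates included
bytes_plain = entries * l            # assumed: one byte per character
print(per_lmer, entries, bytes_plain)  # 123841 72570826 1088562390
```

Under these assumptions a single unsorted Li already needs roughly 1.09 × 10^9 bytes.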

11
PMSi Key idea
  • L1 = ∪ B(x,d) (x is an l-mer in s1)
  • L2 = ∪ B(y,d) (y is an l-mer in s2)
  • M2 = L1 ∩ L2 = (∪ B(x,d)) ∩ (∪ B(y,d))
  • = ∪ (B(x,d) ∩ B(y,d)) for all pairs (x, y)
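The identity can be confirmed on the running example. This is a small check written for illustration, with hypothetical helper names; it builds M2 both ways and compares the results.

```python
def lmers(s, l):
    """All length-l substrings of s."""
    return [s[i:i + l] for i in range(len(s) - l + 1)]

def vicinity(x, d, alphabet="ACGT"):
    """B(x, d): all strings within Hamming distance d of x."""
    if not x:
        return {""}
    out = set()
    for c in alphabet:
        cost = 0 if c == x[0] else 1
        if cost <= d:
            out |= {c + rest for rest in vicinity(x[1:], d - cost, alphabet)}
    return out

s1 = "GGCATCCGATTATTGTAGTCTGG"
s2 = "ATTTCTATGCTAAGCTTGCTCGA"
l, d = 5, 1

# Left-hand side: build both full lists, then intersect them.
L1 = set().union(*(vicinity(x, d) for x in lmers(s1, l)))
L2 = set().union(*(vicinity(y, d) for y in lmers(s2, l)))
M2_direct = L1 & L2

# Right-hand side: union of pairwise intersections, never holding L1 or L2.
M2_pairwise = set()
for x in lmers(s1, l):
    bx = vicinity(x, d)
    for y in lmers(s2, l):
        M2_pairwise |= bx & vicinity(y, d)

print(M2_direct == M2_pairwise)  # True
```

The pairwise form only ever needs two vicinities in memory at once, plus the result.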

12
PMSi Graphical Idea
  • Generate M2 by distributing the intersection over the unions
  • GGCATCCGATTATTGTAGTCTGG
  • ATTTCTATGCTAAGCTTGCTCGA
  • CAGGCTGTAAGTAGTTTGTTAGC

M2 = ∪ (B(x,d) ∩ B(y,d))

13
PMSi Refinement
  • B(x,d) ∩ B(y,d) = ∅ if dist(x,y) > 2d
  • GGCATCCGATTATTGTAGTCTGG
  • ATTTCTATGCTAAGCTTGCTCGA
  • CAGGCTGTAAGTAGTTTGTTAGC
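The refinement follows from the triangle inequality: a common neighbour z would give dist(x,y) ≤ dist(x,z) + dist(z,y) ≤ 2d. The sketch below, using hypothetical helper names, checks this on every pair of l-mers from the first two example sequences and counts how many pairs survive the filter.

```python
def hamming(a, b):
    """Number of positions where the equal-length strings a and b differ."""
    return sum(ca != cb for ca, cb in zip(a, b))

def lmers(s, l):
    """All length-l substrings of s."""
    return [s[i:i + l] for i in range(len(s) - l + 1)]

def vicinity(x, d, alphabet="ACGT"):
    """B(x, d): all strings within Hamming distance d of x."""
    if not x:
        return {""}
    out = set()
    for c in alphabet:
        cost = 0 if c == x[0] else 1
        if cost <= d:
            out |= {c + rest for rest in vicinity(x[1:], d - cost, alphabet)}
    return out

s1 = "GGCATCCGATTATTGTAGTCTGG"
s2 = "ATTTCTATGCTAAGCTTGCTCGA"
l, d = 5, 1

pairs = [(x, y) for x in lmers(s1, l) for y in lmers(s2, l)]
# Pairs further apart than 2d cannot share a neighbour, so they can be skipped.
kept = [(x, y) for x, y in pairs if hamming(x, y) <= 2 * d]

for x, y in pairs:
    if hamming(x, y) > 2 * d:
        assert not (vicinity(x, d) & vicinity(y, d))

print(len(kept), "of", len(pairs), "pairs need their vicinities intersected")
```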


14
PMSi Intersections of vicinities
  • x is a fixed l-mer in s1, y is any l-mer in s2.
  • ∪ (B(x,d) ∩ B(y,d)), taken over all y,
  • = {z ∈ B(x,d) : ∃ y, dist(z, y) ≤ d}.
  • We cache the calculations of dist to be more efficient.
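A minimal sketch of this membership test follows. The function name `pmsi_m2` is hypothetical, and `functools.lru_cache` is only a stand-in for the caching mentioned on the slide, whose actual form is not described. For each x, the code walks through B(x,d) and keeps the z that have some y within distance d, after first discarding every y further than 2d from x.

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def dist(a, b):
    """Hamming distance, memoised so repeated (a, b) queries are free."""
    return sum(ca != cb for ca, cb in zip(a, b))

def lmers(s, l):
    """All length-l substrings of s."""
    return [s[i:i + l] for i in range(len(s) - l + 1)]

def vicinity(x, d, alphabet="ACGT"):
    """B(x, d): all strings within Hamming distance d of x."""
    if not x:
        return {""}
    out = set()
    for c in alphabet:
        cost = 0 if c == x[0] else 1
        if cost <= d:
            out |= {c + rest for rest in vicinity(x[1:], d - cost, alphabet)}
    return out

def pmsi_m2(s1, s2, l, d):
    """M2 built one x at a time, without materialising L1 or L2."""
    ys = lmers(s2, l)
    m2 = set()
    for x in lmers(s1, l):
        near = [y for y in ys if dist(x, y) <= 2 * d]  # refinement: others cannot match
        for z in vicinity(x, d):
            if any(dist(z, y) <= d for y in near):
                m2.add(z)
    return m2

m2 = pmsi_m2("GGCATCCGATTATTGTAGTCTGG", "ATTTCTATGCTAAGCTTGCTCGA", 5, 1)
print("TTTTT" in m2)  # True
```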

15
PMSi Drawbacks
  • The extra running time depends on the number of
    l-mers at distance at most 2d from a given l-mer.
  • Memory use grows or shrinks with this number
    as well.

16
PMSP Key idea
  • We can iterate the basic principle of PMSi, i.e.
    x is a fixed l-mer in s1, y is any l-mer in s2, w is
    any l-mer in s3
  • ∪ (B(x,d) ∩ B(y,d) ∩ B(w,d)), taken over all y and w,
  • = {z ∈ B(x,d) : ∃ y, dist(z, y) ≤ d and ∃ w, dist(z, w) ≤ d}
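Extended to all n sequences, the idea can be sketched as below. This is an illustration of the principle as stated on the slide, not the authors' PMSP code; the name `pmsp` and the helpers are hypothetical. Only the vicinity of the current x and the short lists of nearby l-mers are held at any time, in addition to the output.

```python
def hamming(a, b):
    """Number of positions where the equal-length strings a and b differ."""
    return sum(ca != cb for ca, cb in zip(a, b))

def lmers(s, l):
    """All length-l substrings of s."""
    return [s[i:i + l] for i in range(len(s) - l + 1)]

def vicinity(x, d, alphabet="ACGT"):
    """B(x, d): all strings within Hamming distance d of x."""
    if not x:
        return {""}
    out = set()
    for c in alphabet:
        cost = 0 if c == x[0] else 1
        if cost <= d:
            out |= {c + rest for rest in vicinity(x[1:], d - cost, alphabet)}
    return out

def pmsp(seqs, l, d):
    """All (l, d) motifs, found by scanning B(x, d) for each l-mer x of seqs[0]."""
    first, rest = seqs[0], seqs[1:]
    motifs = set()
    for x in lmers(first, l):
        # For every other sequence, keep only l-mers within 2d of x.
        near = [[y for y in lmers(s, l) if hamming(x, y) <= 2 * d] for s in rest]
        if any(not ys for ys in near):
            continue  # some sequence has nothing close enough to x
        for z in vicinity(x, d):
            if all(any(hamming(z, y) <= d for y in ys) for ys in near):
                motifs.add(z)
    return motifs

seqs = [
    "GGCATCCGATTATTGTAGTCTGG",
    "ATTTCTATGCTAAGCTTGCTCGA",
    "CAGGCTGTAAGTAGTTTGTTAGC",
]
motifs = pmsp(seqs, 5, 1)
print("TTTTT" in motifs)  # True
```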

17
PMSP Graphical Idea
B(TTATT,1) = {ATATT, ..., TTTTT, ..., TTATG}
  • GGCATCCGATTATTGTAGTCTGG
  • ATTTCTATGCTAAGCTTGCTCGA
  • CAGGCTGTAAGTAGTTTGTTAGC
  • Every l-mer whose vicinity is considered is at distance
    at most 2d from the l-mer in the first sequence.

18
PMSP Observations
  • We reduce the memory usage drastically
  • The extra running time depends on the number of
    l-mers at distance at most 2d from a
    given l-mer.

19
Experimental setting
  • n = 20, m = 600.
  • Every letter of every sequence is generated
    independently and uniformly at random.
  • A challenge instance is one where the expected
    number of (l,d) motifs is greater than 1, for example
    (l,d) = (11,3), (13,4), (15,5), (17,6)
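The setting can be reproduced roughly as follows. The generator matches the description above; the estimate of the expected number of motifs is a common independence approximation that is assumed here and does not come from the slides (it treats the m − l + 1 windows of a sequence as independent). The names `random_instance` and `expected_motifs` are hypothetical.

```python
import random
from math import comb

def random_instance(n=20, m=600, seed=0, alphabet="ACGT"):
    """n sequences of length m, each letter drawn independently and uniformly."""
    rng = random.Random(seed)
    return ["".join(rng.choice(alphabet) for _ in range(m)) for _ in range(n)]

def expected_motifs(l, d, n=20, m=600, sigma=4):
    """Approximate expected number of (l, d) motifs in random sequences."""
    # Probability that one random l-mer lies within distance d of a fixed l-mer.
    p = sum(comb(l, i) * (sigma - 1) ** i for i in range(d + 1)) / sigma ** l
    # Probability that at least one window of a sequence matches (assumed independent).
    q = 1 - (1 - p) ** (m - l + 1)
    # Sum over all sigma^l candidate l-mers of the chance of matching all n sequences.
    return sigma ** l * q ** n

seqs = random_instance()
for l, d in [(11, 3), (13, 4), (15, 5), (17, 6)]:
    print((l, d), expected_motifs(l, d))
```

Under this approximation, lowering d by one at a fixed l pushes the estimate well below 1, which is why the listed pairs sit at the boundary of difficulty.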

20
Results (d = 3)
21
Results in Challenging Instances
22
Questions?