Title: Algorithms for Searching RNA Motifs in Genomic DNA Sequences
- JCIS 2005 PRESENTATION
- Jingping Liu
- Bin Ma, Kaizhong Zhang
University of Western Ontario, Computer Science Department
July 2005
2. Outline
- Motivations
- RNA Structures
- Definitions and Notations
- Tree Representation and Algorithms
- Experimental Results
- Conclusions and Further Work
3. Motivations
- Bio-molecule structures are invaluable in endeavors such as creating new drugs and understanding genetic diseases.
- Computational algorithms for genome analysis are desirable. Fichant's tRNAscan and Pavesi's EufindtRNA are popular tools, but they compute only tRNA!
- Recently, Zhang's HomoStRscan for detecting RNA has been introduced. Designing algorithms that function as a filter for Zhang's approach is one of our primary motivations.
4. RNA Structures
- RNA primary structure is a sequence of bases over the alphabet {A, C, G, U}.
- RNA secondary structure is a set of base pairs (i.e., A-U, C-G, G-U, and vice versa). Assume that (1) no base takes part in more than one base pair, and (2) base pairs never cross.
[Figure: panels (a) and (b), showing the example sequence AGCUGCUGUACGUAAA.]
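The two assumptions on a secondary structure can be checked mechanically. The following is a minimal sketch, assuming the structure is given as a list of 0-based index pairs (i, j) with i < j; the function name `is_valid_structure` and this input encoding are illustrative choices, not part of the presentation.

```python
# Allowed base pairs from the slide: A-U, C-G, G-U, in either order.
ALLOWED = {("A", "U"), ("U", "A"), ("C", "G"), ("G", "C"), ("G", "U"), ("U", "G")}

def is_valid_structure(seq, pairs):
    """Check a set of base pairs against the two assumptions.

    seq   -- RNA primary structure, a string over A, C, G, U
    pairs -- iterable of (i, j) index pairs, 0-based, with i < j (assumed encoding)
    """
    pairs = list(pairs)
    used = set()
    for i, j in pairs:
        if not (0 <= i < j < len(seq)):
            return False
        if (seq[i], seq[j]) not in ALLOWED:
            return False
        # (1) no base takes part in more than one base pair
        if i in used or j in used:
            return False
        used.update((i, j))
    # (2) base pairs never cross: reject i < k < j < l
    ordered = sorted(pairs)
    for a, (i, j) in enumerate(ordered):
        for k, l in ordered[a + 1:]:
            if i < k < j < l:
                return False
    return True
```

For example, `is_valid_structure("GCAAAGC", [(0, 6), (1, 5)])` accepts two nested pairs, while the crossing pairs `[(0, 2), (1, 3)]` on `"GGCC"` are rejected.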
5. Example of RNA Secondary Structure
- The paired region is called a stem.
- The unpaired region is called a loop-segment.
- A stem-loop consists of a stem surmounted by a loop-segment (a.k.a. hairpin loop).
[Figure: an RNA secondary structure drawn from the 5' end to the 3' end, with a stem, a loop-segment, and a stem-loop labeled.]
6. Problem Statement
In a given genomic sequence, efficiently determine candidate segments that can potentially form RNA secondary structures similar to a given RNA secondary structure.
[Figure: a given RNA structure and a genomic sequence (AGCUGCUGUACGUAAA), with a candidate segment marked from position p to position q.]
One sequence may form many different secondary structures!
7. Definitions and Notations
- Stem size: the number of base pairs in a stem, denoted by SM.
- Loop-segment size: the number of unpaired bases in a loop-segment, denoted by D.
- Hairpin size: the number of unpaired bases in a hairpin loop, denoted by H.
[Figure: an example structure with a hairpin loop (H = 4), a loop-segment (D = 6), and a stem (SM = 4).]
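As a small illustration of the stem size and hairpin size definitions, the sketch below reads SM and H off a single stem-loop. It assumes the stem-loop is written in dot-bracket notation (parentheses for paired bases, dots for unpaired bases), which is a common convention but not one used in the slides; the function name is hypothetical, and loop-segment size D is not covered here.

```python
def stem_loop_sizes(dot_bracket):
    """Return (SM, H) for one plain stem-loop such as '((((....))))'.

    Assumes a single uninterrupted stem around one hairpin loop
    (no bulges or internal loops); raises ValueError otherwise.
    """
    sm = len(dot_bracket) - len(dot_bracket.lstrip("("))  # base pairs in the stem
    h = dot_bracket.count(".")                            # unpaired bases in the hairpin
    if dot_bracket != "(" * sm + "." * h + ")" * sm:
        raise ValueError("not a plain stem-loop")
    return sm, h
```

A stem of four base pairs closing a hairpin of four unpaired bases, `"((((....))))"`, gives SM = 4 and H = 4.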
8. Definitions and Notations (cont'd)
- Similar stem: [SM/2] ≤ SMx ≤ SM + [SM/2]
- Similar loop-segment: [D/2] ≤ Dx ≤ D + [D/2]
Here SMx and Dx are the sizes of the candidate stem and loop-segment, and [x] denotes x rounded to an integer.
[Figure: (1) similar stems: for a stem with SM = 4, stems of sizes 2, 3, 5, and 6; (2) similar loop-segments: for a loop-segment with D = 4, loop-segments of sizes 2, 3, 5, and 6.]
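The similarity ranges can be turned into a small helper. This is a sketch under one stated assumption: the half-size term is rounded down (floor), which the slide does not settle. With that choice, a size of 4 yields the range 2 to 6, matching the sizes shown in the figure. The function names are illustrative.

```python
def similar_range(size):
    """Return (min, max) of sizes considered similar to `size`.

    Assumption: size/2 is rounded down. The same rule is applied
    to stem size SM and to loop-segment size D.
    """
    half = size // 2
    return half, size + half

def is_similar(size, other):
    """True if `other` falls inside the similar range of `size`."""
    lo, hi = similar_range(size)
    return lo <= other <= hi
```

These (min, max) pairs are the kind of values that could be stored as SMmin/SMmax and Dmin/Dmax when a query structure is prepared for searching.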
9. Tree Representation
- Leaf node: represents a stem-loop and contains the necessary values, e.g. Hmin and Hmax of the hairpin loop.
- Internal node (including the root): represents a stem and contains the necessary values, e.g. SMmin and SMmax of the stem, and Dmin and Dmax of the loop-segments.
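One way to hold these values is a pair of small record types. The sketch below is an assumed layout, not the authors' data structure: the field names, the choice to store stem bounds on leaves as well, and every number in the example tree are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Tuple, Union

@dataclass
class LeafNode:
    """A stem-loop: bounds on its stem size and on its hairpin size."""
    sm_range: Tuple[int, int]   # (SMmin, SMmax); storing this on a leaf is an assumption
    h_range: Tuple[int, int]    # (Hmin, Hmax)

@dataclass
class InternalNode:
    """A stem enclosing child substructures separated by loop-segments."""
    sm_range: Tuple[int, int]          # (SMmin, SMmax)
    d_ranges: List[Tuple[int, int]]    # one (Dmin, Dmax) per loop-segment
    children: List[Union["LeafNode", "InternalNode"]] = field(default_factory=list)

def leaves(node):
    """Collect the leaf nodes (stem-loops) under `node`, left to right."""
    if isinstance(node, LeafNode):
        return [node]
    found = []
    for child in node.children:
        found.extend(leaves(child))
    return found

# Hypothetical example: a root stem with three stem-loop children.
root = InternalNode(
    sm_range=(3, 10),
    d_ranges=[(1, 3), (0, 2), (2, 8), (0, 1)],
    children=[
        LeafNode(sm_range=(2, 6), h_range=(4, 12)),
        LeafNode(sm_range=(2, 7), h_range=(3, 10)),
        LeafNode(sm_range=(2, 7), h_range=(3, 10)),
    ],
)
```

A node with d children is given d + 1 loop-segment ranges here, one before, between, and after the children; that count is also an assumption about how the loop-segments are indexed.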
10. Example of Tree Representation
[Figure: (1) an RNA secondary structure with its 5' and 3' ends and labeled positions 1, 10, 26, 28, 44, 49, 65, and 72; (2) its tree representation with nodes n1, n2, n3, and n4.]
11. Bottom-Up Approach
- 1. Leaf nodes (stem-loops): locate candidates using SEARCH.
- 2. Internal node (stem) with degree d:
  - Integrate step
  - Extend step
- 3. Repeat the integrate and extend steps until the root is reached.
- 4. Reduce the number of candidates by using additional biological features.
[Figure: RNA tree with nodes n1, n2, n3, and n4.]
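The bottom-up order amounts to a post-order traversal: every child is processed before its parent. A minimal sketch follows, assuming that n4 is the root and n1, n2, n3 are its leaf children (the figure suggests this but the text does not state it); the dictionary layout is hypothetical.

```python
# Hypothetical tree: n4 is assumed to be the root, with leaves n1, n2, n3.
TREE = {
    "name": "n4",
    "children": [
        {"name": "n1", "children": []},
        {"name": "n2", "children": []},
        {"name": "n3", "children": []},
    ],
}

def bottom_up_order(node):
    """Return node names in post-order (children before their parent)."""
    order = []
    for child in node["children"]:
        order.extend(bottom_up_order(child))
    order.append(node["name"])
    return order
```

Under that assumption the leaves are visited first (SEARCH on each stem-loop), and the internal node is visited last (integrate, then extend).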
12. SEARCH Algorithm
- SEARCH: based on the corresponding min and max size values, search for one potential stem-loop/stem in the given genomic sequence S.
- Given an index pair (i, j), let i go left and j go right to search each possible stem-loop/stem in S by looking for consecutive base pairs (i.e., A-U, C-G, and G-U, and vice versa) until no further such base pair is found.
[Figure: indices i and j moving outward along a stem, bounded by SMmax.]
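The outward extension from an index pair can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: indices are 0-based, the return format is invented, and the helper `find_stem_loops`, which tries every start position and hairpin size, is one hypothetical way of choosing the initial (i, j) pairs for leaf nodes.

```python
# Base pairs named on the slide: A-U, C-G, G-U, in either order.
PAIRS = {("A", "U"), ("U", "A"), ("C", "G"), ("G", "C"), ("G", "U"), ("U", "G")}

def search(seq, i, j, sm_min, sm_max):
    """Extend a stem outward from (i, j): i moves left, j moves right.

    Stops at the first non-pairing bases, at a sequence end, or once
    sm_max consecutive base pairs are found. Returns (start, end, size),
    or None if fewer than sm_min pairs were found.
    """
    size = 0
    while i >= 0 and j < len(seq) and size < sm_max and (seq[i], seq[j]) in PAIRS:
        size += 1
        i -= 1
        j += 1
    if size < sm_min:
        return None
    return i + 1, j - 1, size

def find_stem_loops(seq, sm_min, sm_max, h_min, h_max):
    """Try every innermost pair (i, j) whose enclosed hairpin has h_min..h_max bases."""
    hits = []
    for i in range(len(seq)):
        for h in range(h_min, h_max + 1):
            j = i + h + 1  # h unpaired bases lie strictly between i and j
            if j >= len(seq):
                break
            hit = search(seq, i, j, sm_min, sm_max)
            if hit is not None:
                hits.append(hit)
    return hits
```

On `"GGGAAAACCC"`, starting at the innermost pair (2, 7), the stem extends to the span 0 to 9 with three base pairs; capping SMmax at 2 stops the extension one pair earlier.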
13. Integrate and Extend Steps
- INTEGRATE (left to right): if the loop-segment size D between two located substructures satisfies Dmin ≤ D ≤ Dmax, integrate them to form a more complicated substructure (SP, EP).
- EXTEND: use the SEARCH algorithm, starting at (SP, EP), to compute the extended stem such that each size satisfies the corresponding min and max values.
[Figure: (1) an RNA with substructures 1 to 4, loop-segment sizes D12, D23, D34, and D41, and the integrated span (SP, EP), annotated "integrate steps: step 1 and 2"; (2) the tree with nodes n1, n2, n3, and n4.]
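The integrate condition can be sketched for two adjacent substructures. The code below assumes each located substructure is a (start, end) span of 0-based inclusive indices and that D is the number of bases strictly between the left span's end and the right span's start; both the encoding and the function name are illustrative.

```python
def integrate(left_hits, right_hits, d_min, d_max):
    """Combine adjacent substructures whose separating loop-segment fits.

    left_hits, right_hits -- lists of (start, end) spans, inclusive
    Returns the (SP, EP) spans of the integrated substructures.
    """
    combined = []
    for l_start, l_end in left_hits:
        for r_start, r_end in right_hits:
            d = r_start - l_end - 1  # unpaired bases between the two spans
            if d_min <= d <= d_max:
                combined.append((l_start, r_end))
    return combined
```

With a left span (0, 9) and right spans (12, 20) and (30, 40), bounds of 1 to 4 keep only the first combination, giving (SP, EP) = (0, 20). The extend step would then grow a stem outward from such a span.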
14. mSearch Experimental Results

tRNA
Sequence                              NCBI   mSearch
Mycoplasma genitalium (0.4M)          36     992 (5s)
Helicobacter pylori (1.7M)            36     1640 (20s)

5S rRNA
Sequence                              NCBI   mSearch
Staphylococcus MW2 (partial, 0.4M)    3      1986 (5s)
Escherichia coli K12 (partial, 0.4M)  4      1349 (5s)

Experimental results indicate that our mSearch tool can locate all true tRNAs and 5S rRNAs in the tested sequences: zero false negatives!
15. Conclusions and Further Work
- Theoretical and practical aspects:
  - Develop efficient pattern matching algorithms.
  - Search RNA motifs in genomic sequences.
- Our algorithms are suitable for searching any type of RNA.
- Currently, a web-based mSearch tool is under construction.
- Add biological constraints to further reduce the number of candidates.
16. References
- D. Gusfield. Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology. Cambridge University Press, 143-148, 1997.
- G. Mauri and G. Pavesi. Pattern Discovery in RNA Secondary Structures Using Affix Trees. CPM, 278-294, 2003.
- K. Zhang, S. Le, and J. Maizel. An Algorithm for Detecting Homologues of Known Structured RNAs in Genomes. IEEE Computational Systems, CSB 2004.
- K. Zhang, B. Ma, and L. Wang. Computing Similarity between RNA Structures. Theor. Comput. Sci. 276(1-2): 111-132, 2002.
- M. Zuker. The Use of Dynamic Programming Algorithms in RNA Secondary Structure Prediction, in Mathematical Methods for DNA Sequences. CRC Press, Inc., Boca Raton, 1989, 159.
17. Questions?
Thank you