Title: An $O(N^2)$ Algorithm for Discovering Optimal Boolean Pattern Pairs
Hideo Bannai, Heikki Hyyro, Ayumi Shinohara, Masayuki Takeda, Kenta Nakai and Satoru Miyano
IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS, VOL. 1, NO. 4, OCTOBER-DECEMBER 2004
Presented by Sivaramakrishnan Subramanian, Graduate Student, CPSC, TAMU. siv_at_tamu.edu
2Motive
- Finding patterns conserved across a set of biologically related sequences in order to extract meaning is a common topic in bioinformatics.
- More than one sequence element can affect the biological characteristics of the sequences.
- Past work on finding composite patterns: Structured Motifs, MITRA, Bioprospector.
3Overview
- Given a set of sequences and a numeric attribute value for each sequence, the problem is to find the optimal (with respect to a scoring function) pair of patterns combined with any Boolean function.
- Past work finds combinations of two patterns $p$ and $q$ where $(p \wedge q)$ occurs in each string.
- This paper's formulation allows all possible combinations, such as $(p \wedge \neg q)$, so conditions like the presence of one element but the absence of the other can be specified.
- Thus this method can be used to find cooperative as well as competing sequence elements.
- The $O(N^2)$ algorithm and an implementation based on suffix arrays (this is the homework!!!) are the main contributions of this paper.
4Preliminaries
- Let $\Sigma$ be a finite alphabet and let $\varepsilon$ denote the empty string.
- Let $\psi(p,s)$ be a Boolean matching function that is true if and only if $p$ is a substring of $s$.
- A Boolean pattern pair is a triplet $\langle F,p,q \rangle$, where $p$ and $q$ are patterns and $F$ is a 2-ary Boolean function.
- The matching function value for a pattern pair, $\psi(\langle F,p,q \rangle, s)$, is defined as $F(\psi(p,s), \psi(q,s))$.
- All possible $F$ values are defined in the following table.
5All Candidate Boolean Operations on $\langle F,p,q \rangle$
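One way to enumerate the 16 candidate functions is to read the index $k$ of $F_k$ as a 4-bit truth table. The sketch below assumes an encoding chosen to agree with how $F_8$, $F_4$, $F_2$ and $F_1$ are used in the second step of the algorithm: bit 8 covers (true, true), bit 4 (true, false), bit 2 (false, true) and bit 1 (false, false). The helper name `apply_f` is hypothetical.

```python
def apply_f(k: int, a: bool, b: bool) -> bool:
    """Evaluate F_k(a, b), reading k as a 4-bit truth table (assumed encoding)."""
    bit = {(True, True): 8, (True, False): 4,
           (False, True): 2, (False, False): 1}
    return bool(k & bit[(a, b)])


if __name__ == "__main__":
    # One row per function; columns are the four input combinations (p, q).
    print(" k  TT TF FT FF")
    for k in range(16):
        row = [int(apply_f(k, a, b))
               for a in (True, False) for b in (True, False)]
        print(f"{k:2d}   " + "  ".join(str(x) for x in row))
```

Under this encoding $F_8$ is $p \wedge q$, $F_{14}$ is $p \vee q$, and $F_6$ is exclusive or.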
6Preliminaries
- A pattern or a Boolean pattern pair $\pi$ matches a string $s$ if and only if $\psi(\pi,s)$ is true. The pattern $\varepsilon$ matches any string.
- For a given set of strings $S = \{s_1, \ldots, s_m\}$, let $M(\pi,S)$ denote the set of indices of strings in $S$ that $\pi$ matches, that is, $M(\pi,S) = \{i \mid \psi(\pi,s_i) = \text{true}\}$, and let its complement be denoted as $\overline{M}(\pi,S) = \{i \mid \psi(\pi,s_i) = \text{false}\}$.
- For each $s_i \in S$, we are given an associated numeric attribute value $r_i$. Let $R(\pi,S) = \sum_{i \in M(\pi,S)} r_i$ denote the sum of $r_i$ over all $s_i$ that $\pi$ matches. Let $M(\pi)$ and $R(\pi)$ be shorthand notation for $M(\pi,S)$ and $R(\pi,S)$, respectively. Note that $|M(\varepsilon)| = m$ and $R(\varepsilon) = \sum_{i=1}^{m} r_i$.
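The definitions of $M$ and $R$ can be made concrete with a small sketch. The toy strings, the attribute values, and the function names `psi`, `matches` and `attribute_sum` are made up for illustration; indices are 1-based to follow the slide notation.

```python
def psi(p: str, s: str) -> bool:
    """Matching function: true iff p is a substring of s."""
    return p in s


def matches(p: str, strings: list) -> set:
    """M(p, S): the 1-based indices of the strings that p matches."""
    return {i for i, s in enumerate(strings, start=1) if psi(p, s)}


def attribute_sum(p: str, strings: list, r: list) -> float:
    """R(p, S): the sum of r_i over the strings that p matches."""
    return sum(r[i - 1] for i in matches(p, strings))


if __name__ == "__main__":
    S = ["acct", "gctt", "ctctg"]   # toy strings (illustrative only)
    r = [1, 2, 4]                   # toy attribute values (illustrative only)
    for p in ["", "ct", "tt"]:
        print(repr(p), sorted(matches(p, S)), attribute_sum(p, S, r))
```

The empty pattern matches every string, so its index set has size $m$ and its attribute sum is the total of all $r_i$.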
7Scoring Function
- The objective is to find a pattern that maximizes a suitable scoring function $score$.
- The paper concentrates on scoring functions whose values for a pattern $\pi$ depend on values cumulated over the strings in $S$ that match $\pi$.
- The scoring function $score$ takes the parameters $|M(\pi)|$ and $R(\pi)$.
- It is also assumed that the score value can be computed in constant time if the parameter values are known.
- The specific choice of scoring function depends heavily on the particular application.
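As a purely hypothetical illustration (this is not a score taken from the paper), the sketch below builds a constant-time scoring function from the two parameters: the absolute gap between the mean attribute value of the matched strings and that of the unmatched strings. The factory name `make_mean_gap_score` and the choice to return 0 when either group is empty are assumptions.

```python
def make_mean_gap_score(m: int, r_total: float):
    """Return score(count, r_sum) for a data set with m strings and total attribute r_total."""
    def score(count: float, r_sum: float) -> float:
        rest = m - count                 # number of unmatched strings
        if count == 0 or rest == 0:      # one group is empty: no gap to measure
            return 0.0
        return abs(r_sum / count - (r_total - r_sum) / rest)
    return score


if __name__ == "__main__":
    score = make_mean_gap_score(m=3, r_total=7.0)
    print(score(1, 4.0))   # one matched string carrying attribute 4
```

Because $m$ and $R(\varepsilon)$ are fixed for the data set, each call needs only a constant number of arithmetic operations.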
8Problem Definition
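- Given a set $S = \{s_1, \ldots, s_m\}$ of strings, where each string $s_i$ is assigned a numeric attribute value $r_i$, and a scoring function $score: \mathbb{R} \times \mathbb{R} \to \mathbb{R}$, find the Boolean pattern pair $\pi \in \{\langle F,p,q \rangle \mid p,q \in \Sigma^*, F \in \{F_0, \ldots, F_{15}\}\}$ that maximizes $score(|M(\pi)|, R(\pi))$.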
9Suffix tree GST
- Edges are labeled with substrings of $s$.
- For a node $v$, $l(v)$ is the string obtained by concatenating the edge labels from the root to $v$.
- For each leaf node $v$, $l(v)$ is a distinct suffix of $s$, and for each suffix there exists such a leaf $v$.
- Each internal node has at least 2 children, and the first characters of the labels on the edges to its children are distinct.
- GST: given a set $S = \{s_1, \ldots, s_m\}$, the GST is a suffix tree for the string $s_1\$_1 \cdots s_m\$_m$, where each $\$_i$ is a distinct character that does not belong to $\Sigma$.
- All paths are ended at the first appearance of any $\$_i$, and each leaf is labeled with $id_i$.
- $O(N)$ space and time.
10Suffix tree
$s$ = caggaggaccat. The paths of the suffix tree from the root to the leaves (suffixes) are sorted in lexicographic order from left to right, each leaf corresponding to a position in the suffix array. The integer in the suffix array represents the position in the string from which the corresponding suffix starts: $A_s[i] = j$ indicates that $s[j:n]$ is the $i$th suffix in the lexicographic ordering. The lcp array represents the length of the longest common prefix (the shared path from the root) of consecutive suffixes in the suffix array.
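The relationship between the sorted suffixes, the suffix array and the lcp array can be checked with a naive sketch. It assumes 0-based positions and no terminator character, and it sorts the suffixes directly, so it illustrates the data structures rather than the linear-time constructions; the function names are hypothetical.

```python
def suffix_array(s: str) -> list:
    """Start positions of the suffixes of s in lexicographic order (naive sort)."""
    return sorted(range(len(s)), key=lambda j: s[j:])


def lcp_array(s: str, sa: list) -> list:
    """lcp[i] = length of the longest common prefix of suffixes sa[i-1] and sa[i]."""
    def common(a: str, b: str) -> int:
        n = 0
        while n < min(len(a), len(b)) and a[n] == b[n]:
            n += 1
        return n
    # The first suffix has no predecessor, so its entry is set to 0.
    return [0] + [common(s[sa[i - 1]:], s[sa[i]:]) for i in range(1, len(sa))]


if __name__ == "__main__":
    s = "caggaggaccat"
    sa = suffix_array(s)
    lcp = lcp_array(s, sa)
    for i, j in enumerate(sa):
        print(i, j, lcp[i], s[j:])   # rank, start position, lcp, suffix
```

Reading the printed suffixes top to bottom corresponds to visiting the leaves of the suffix tree from left to right.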
11GST (Generalized Suffix Tree)
A Generalized Suffix Tree and its corresponding
suffix array for the strings facct, gctt, ctctg.
12A Naïve $O(N^3)$ Algorithm
- Let $N = \sum_{i=1}^{m} \text{length}(s_i)$.
- $O(N)$ candidates for a single pattern: patterns of the form $l(v)$, where $v$ is a node in the GST over the set $S$. (Why???)
- Hence $O(N^2)$ candidate pattern pairs.
- For a given pair $\pi = \langle F, l(v_1), l(v_2) \rangle$, the values $|M(\pi)|$ and $R(\pi)$ can be computed in $O(N)$ time by any of the linear-time string matching algorithms.
- Then the scoring function value is calculated in constant time given $|M(\pi)|$ and $R(\pi)$.
- Time: $O(N^3)$. Space: $O(N)$ for the suffix tree.
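The naive search can be sketched as a brute-force loop. This is a simplification under stated assumptions: the candidates are all distinct substrings (a superset of the $l(v)$ patterns, so more than $O(N)$ of them), the built-in `in` operator stands in for a linear-time matcher, $F_k$ uses an assumed 4-bit truth-table encoding, and the score in the usage example (mean attribute of the matched strings) is hypothetical.

```python
from itertools import product


def substrings(strings):
    """All distinct substrings plus the empty string (superset of the l(v) candidates)."""
    subs = {""}
    for s in strings:
        for a in range(len(s)):
            for b in range(a + 1, len(s) + 1):
                subs.add(s[a:b])
    return sorted(subs)


def apply_f(k, a, b):
    """F_k(a, b) with k read as a 4-bit truth table (assumed encoding)."""
    return bool(k & {(True, True): 8, (True, False): 4,
                     (False, True): 2, (False, False): 1}[(a, b)])


def naive_best_pair(strings, r, score):
    """Try every (F, p, q); return (best score, k, p, q)."""
    best = None
    cands = substrings(strings)
    for p, q in product(cands, repeat=2):
        hits = [(p in s, q in s) for s in strings]   # one matching pass per pair
        for k in range(16):
            sel = [ri for ri, (a, b) in zip(r, hits) if apply_f(k, a, b)]
            value = score(len(sel), sum(sel))        # score(|M|, R)
            if best is None or value > best[0]:
                best = (value, k, p, q)
    return best


if __name__ == "__main__":
    S = ["acct", "gctt", "ctctg"]   # toy data (illustrative only)
    r = [1, 2, 4]
    mean = lambda count, r_sum: r_sum / count if count else 0.0
    print(naive_best_pair(S, r, mean))
```

The matching pass inside the double loop over candidates is the cost that the two-step algorithm avoids.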
13$O(N^2)$ Algorithm
- Two steps:
- Find $|M(l(v))|$ and $R(l(v))$ for all nodes $v$ of the GST in $O(N)$ time and space.
- Solve the optimal pair of substring patterns problem in $O(N^2)$ time and $O(N)$ space for any scoring function $score$, provided that it can be calculated in constant time given its inputs.
14Algorithm- First step
- If $R(l(v))$ can be found for all $v$ in $O(N)$ time, so can $|M(l(v))|$ (when $r_i = 1$ for all $i$, $R(l(v)) = |M(l(v))|$).
- Let $LF(v)$ be the set of all leaf nodes in the subtree rooted at node $v$.
- Let $c_i(v)$ denote the number of leaves in $LF(v)$ that have the label $id_i$.
- Let the sum of leaf attributes be $\sum_{LF(v)} r_i$.
15Algorithm- First step
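- $\sum_{LF(v)} r_i = \sum_{i \in M(l(v))} (c_i(v) \cdot r_i)$
- $R(l(v)) = \sum_{i \in M(l(v))} r_i = \sum_{LF(v)} r_i - \sum_{i \in M(l(v))} ((c_i(v) - 1) \cdot r_i)$ (1)
- Let the correction factor be $corr(l(v),S) = \sum_{i \in M(l(v))} ((c_i(v) - 1) \cdot r_i)$.
- In (1), $\sum_{LF(v)} r_i$ can be calculated for all $v$ using a linear-time post-order traversal as $\sum_{LF(v)} r_i = \sum_{v'} \left( \sum_{LF(v')} r_i \mid v' \text{ is a child node of } v \right)$.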
16Algorithm- First step
- How to remove the redundancies (correction factors) in (1)?
- Let $I(id_i)$ be the list of all leaves with the label $id_i$ in the order they appear in the post-order traversal of the tree. Constructing the lists $I$ can be done in linear time for all labels $id_i$.
- The leaves in $LF(v)$ with the label $id_i$ form a continuous interval of length $c_i(v)$ in the list $I(id_i)$.
- If $c_i(v) > 0$, a length-$c_i(v)$ interval in $I(id_i)$ contains $(c_i(v) - 1)$ adjacent (overlapping) leaf pairs.
- If $x, y \in LF(v)$, the node $lca(x,y)$ belongs to the subtree rooted at $v$.
- For any $s_i \in S$, $\psi(l(v), s_i) = \text{true}$, that is, $i \in M(l(v))$, if and only if there is a leaf $x \in LF(v)$ with the label $id_i$.
17Algorithm- First step
- Initially the correction value is 0 for all $v$.
- For each adjacent leaf pair $(x, y)$ in $I(id_i)$, add $r_i$ to the correction value of the node $lca(x,y)$.
- For each $v$ with $c_i(v) > 0$, the sum of the correction values in the nodes of the subtree rooted at $v$ is then $(c_i(v) - 1) \cdot r_i$.
- Repeat this for all lists $I(id_i)$: the preceding total sum becomes $\sum_{i \in M(l(v))} ((c_i(v) - 1) \cdot r_i) = corr(l(v),S)$.
- Perform a linear-time bottom-up (post-order) traversal to find $R(l(v))$.
18Algorithm- First step
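Correction values at $v_1$, $v_2$, $v_3$ are set to $r_3$, $r_2$, $r_3$.
$v_3$: $r_3 + r_2 + r_3 - r_3 = r_2 + r_3 = R(l(v_3))$
$v_2$: $R(l(v_3)) + r_2 - r_2 = r_2 + r_3 = R(l(v_2))$
$v_1$: $r_1 + R(l(v_2)) + r_3 - r_3 = r_1 + r_2 + r_3 = R(l(v_1))$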
19Pseudo code for Step 1
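A minimal Python sketch of Step 1 follows; it is an illustration under simplifying assumptions, not the paper's pseudocode. It uses an uncompressed suffix trie instead of a linear-size GST, a walk-up LCA instead of a constant-time one, and a tuple `("$", i)` as a stand-in for the terminator $\$_i$, so it reproduces the logic (leaf sums minus corrections) but not the $O(N)$ bound. All function names are hypothetical.

```python
from collections import defaultdict


def build_suffix_trie(strings):
    """Uncompressed generalized suffix trie; ("$", i) stands in for the terminator $_i."""
    children, parent, label, leaf_id = [{}], [-1], [""], {}
    for i, s in enumerate(strings):
        symbols = list(s) + [("$", i)]
        for start in range(len(symbols)):
            v = 0
            for sym in symbols[start:]:
                if sym not in children[v]:
                    children[v][sym] = len(children)
                    children.append({})
                    parent.append(v)
                    # Leaves (reached via a terminator) get no string label.
                    label.append(label[v] + sym if isinstance(sym, str) else None)
                v = children[v][sym]
            leaf_id[v] = i                  # leaf labelled id_i
    return children, parent, label, leaf_id


def attribute_sums(strings, r):
    """Return {l(v): R(l(v))} for every non-leaf node v."""
    children, parent, label, leaf_id = build_suffix_trie(strings)
    n = len(children)
    order, stack = [], [0]                  # depth-first order, parents before children
    while stack:
        v = stack.pop()
        order.append(v)
        stack.extend(children[v].values())
    depth = [0] * n
    for v in order[1:]:
        depth[v] = depth[parent[v]] + 1

    def lca(x, y):                          # simple walk-up LCA (not constant time)
        while x != y:
            if depth[x] < depth[y]:
                x, y = y, x
            x = parent[x]
        return x

    leaf_sum, corr = [0] * n, [0] * n
    lists = defaultdict(list)               # I(id_i): leaves of string i in traversal order
    for v in order:
        if v in leaf_id:
            leaf_sum[v] = r[leaf_id[v]]
            lists[leaf_id[v]].append(v)
    for i, leaves in lists.items():
        for x, y in zip(leaves, leaves[1:]):
            corr[lca(x, y)] += r[i]         # one r_i per adjacent leaf pair
    for v in reversed(order[1:]):           # every child is finished before its parent
        leaf_sum[parent[v]] += leaf_sum[v]
        corr[parent[v]] += corr[v]
    return {label[v]: leaf_sum[v] - corr[v] for v in range(n) if label[v] is not None}


if __name__ == "__main__":
    print(attribute_sums(["acct", "gctt", "ctctg"], [1, 2, 4])["ct"])
```

Any depth-first leaf order works here, because the only property used is that the leaves of a subtree with a given label form a contiguous run in the list for that label.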
20$O(N^2)$ Algorithm
- Two steps:
- Find $|M(l(v))|$ and $R(l(v))$ for all nodes $v$ of the GST in $O(N)$ time and space.
- Solve the optimal pair of substring patterns problem in $O(N^2)$ time and $O(N)$ space for any scoring function $score$, provided that it can be calculated in constant time given its inputs.
21Algorithm- Second step
- $O(N)$ choices for the first pattern, $l(v_1)$.
- For each $l(v_1)$, use a modified version of the previous algorithm for the $O(N)$ choices for the second pattern, $l(v_2)$.
- Given a fixed $l(v_1)$, we additionally label each string $s_i \in S$ and the corresponding leaves in the GST with the Boolean value $\psi(l(v_1), s_i)$; this takes $O(N)$ time.
- Cumulate the sums and correction values separately for the true and false values of the additional label.
22Algorithm- Second step
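- $\sum_{i \in M(l(v_2))} (r_i \mid \psi(l(v_1),s_i) = \text{true}) = \sum_{i \in M(l(v_2))} (r_i \mid \psi(l(v_1),s_i) = \text{true}, \psi(l(v_2),s_i) = \text{true}) = R(\langle F_8, l(v_1), l(v_2) \rangle)$
- $\sum_{i \in M(l(v_2))} (r_i \mid \psi(l(v_1),s_i) = \text{false}) = \sum_{i \in M(l(v_2))} (r_i \mid \psi(l(v_1),s_i) = \text{false}, \psi(l(v_2),s_i) = \text{true}) = R(\langle F_2, l(v_1), l(v_2) \rangle)$
- $\sum_{i \in \overline{M}(l(v_2))} (r_i \mid \psi(l(v_1),s_i) = \text{true}) = \sum_{i \in \overline{M}(l(v_2))} (r_i \mid \psi(l(v_1),s_i) = \text{true}, \psi(l(v_2),s_i) = \text{false}) = R(\langle F_4, l(v_1), l(v_2) \rangle) = R(l(v_1)) - R(\langle F_8, l(v_1), l(v_2) \rangle)$
- $\sum_{i \in \overline{M}(l(v_2))} (r_i \mid \psi(l(v_1),s_i) = \text{false}) = \sum_{i \in \overline{M}(l(v_2))} (r_i \mid \psi(l(v_1),s_i) = \text{false}, \psi(l(v_2),s_i) = \text{false}) = R(\langle F_1, l(v_1), l(v_2) \rangle) = R(\varepsilon) - R(l(v_1)) - R(\langle F_2, l(v_1), l(v_2) \rangle)$
- where $R(\varepsilon) - R(l(v_1))$ can be computed in linear time.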
23Algorithm- Second step
- All cumulative values of the form $\sum_i (r_i \mid \psi(l(v_1),s_i) = b_1, \psi(l(v_2),s_i) = b_2)$, where $b_1, b_2 \in \{\text{true}, \text{false}\}$, can be computed in linear time.
- Thus $R(\langle F, l(v_1), l(v_2) \rangle)$, and hence the score, can be computed in linear time for all pairs of the form $\langle F, l(v_1), l(v_2) \rangle$, given a fixed $l(v_1)$.
- Thus $O(N^2)$ for all pattern pairs.
- Since the $O(N)$ calculations for each $l(v_1)$ are independent, the same GST can be reused. Hence the space complexity is $O(N)$.
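The bookkeeping of the second step can be sketched for one fixed pair of patterns. As assumptions for illustration: a brute-force `r_sum` stands in for the linear-time Step 1 traversal, the class keys 8, 4, 2, 1 follow the $F_8$, $F_4$, $F_2$, $F_1$ identities above, and $F_k$ for other $k$ is read as a 4-bit truth table over those classes. The function names are hypothetical.

```python
def r_sum(q, strings, weights):
    """Stand-in for Step 1: sum of weights over the strings containing q (brute force)."""
    return sum(w for s, w in zip(strings, weights) if q in s)


def class_sums(p, q, strings, r):
    """Sums of r_i for the four (psi(p, s_i), psi(q, s_i)) classes, keyed 8, 4, 2, 1."""
    flag = [p in s for s in strings]                    # extra label psi(p, s_i)
    r_true = [ri if f else 0 for ri, f in zip(r, flag)]
    r_false = [0 if f else ri for ri, f in zip(r, flag)]
    r8 = r_sum(q, strings, r_true)                      # p present, q present
    r2 = r_sum(q, strings, r_false)                     # p absent,  q present
    r4 = sum(r_true) - r8                               # R(p) - R(F8)
    r1 = sum(r) - sum(r_true) - r2                      # R(eps) - R(p) - R(F2)
    return {8: r8, 4: r4, 2: r2, 1: r1}


def pair_sum(k, sums):
    """R(<F_k, p, q>): add up the classes whose bit is set in k (assumed encoding)."""
    return sum(v for bit, v in sums.items() if k & bit)


if __name__ == "__main__":
    S = ["acct", "gctt", "ctctg"]   # toy data (illustrative only)
    r = [1, 2, 4]
    sums = class_sums("cc", "tt", S, r)
    print(sums, [pair_sum(k, sums) for k in range(16)])
```

Running the same computation with every $r_i$ set to 1 yields the match counts, so the score for all 16 functions follows from four numbers per pattern pair.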
24Algorithm- Second step
25The rest of the paper in a nutshell
- Extension for k-ary Boolean function.
- Implementation using suffix arrays.
- Computational experiments and results.
- Algorithm variations: multiple string attributes, distance restrictions.
26Homework
- Explain the implementation of the Optimal Boolean Pattern Pair problem using suffix arrays in your own words. Also explain why it is more efficient than the suffix tree approach.
Email: siv_at_tamu.edu