An $O(N^2)$ Algorithm for Discovering Optimal Boolean Pattern Pairs. Hideo Bannai, Heikki Hyyro, Ayumi Shinohara, Masayuki Takeda, Kenta Nakai and Satoru Miyano. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS, VOL. 1, NO. 4

1
An $O(N^2)$ Algorithm for Discovering Optimal Boolean Pattern Pairs
Hideo Bannai, Heikki Hyyro, Ayumi Shinohara, Masayuki Takeda, Kenta Nakai and Satoru Miyano
IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS, VOL. 1, NO. 4, OCTOBER-DECEMBER 2004
Presented by Sivaramakrishnan Subramanian, Graduate Student, CPSC, TAMU. siv_at_tamu.edu
2
Motive
  • Finding patterns conserved across a set of
    biologically related sequences to extract meaning
    is a common topic in Bioinformatics.
  • More than one sequence element can affect the
    biological characteristics of the sequences.
  • Past work on finding composite patterns includes
    Structured Motifs, MITRA and Bioprospector.

3
Overview
  • Given a set of sequences and a numeric attribute
    value for each sequence, the problem is to find
    the optimal (with respect to a scoring function)
    pair of patterns combined with any Boolean function.
  • Past work finds combinations of 2 patterns $p$ and
    $q$ where $(p \land q)$ occurs in each string.
  • This paper's formulation allows all possible
    combinations, such as $(p \land \lnot q)$, so
    conditions like the presence of one element but
    the absence of the other can be specified.
  • Thus this method can be used to find cooperative
    as well as competing sequence elements.
  • The $O(N^2)$ algorithm and an implementation based
    on suffix arrays (this is the homework!) are the
    main contributions of this paper.

4
Preliminaries
  • Let $\Sigma$ be a finite alphabet and let
    $\varepsilon$ denote the empty string.
  • Let $\psi(p, s)$ be a Boolean matching function
    that is true if and only if $p$ is a substring of $s$.
  • Boolean pattern pair: a triplet
    $\langle F, p, q \rangle$, where $p$ and $q$ are
    patterns and $F$ is a 2-ary Boolean function.
  • The matching function value for a pattern pair,
    $\psi(\langle F, p, q \rangle, s)$, is defined as
    $F(\psi(p, s), \psi(q, s))$.
  • All possible choices of $F$ are listed in the
    following table.

5
All Candidate Boolean Operations on $\langle F, p, q \rangle$
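A minimal Python sketch of how the 16 candidate functions can be enumerated and applied to a pattern pair. The numbering is an assumption: $F_k$ is taken to be the function whose outputs for the inputs (true, true), (true, false), (false, true), (false, false) are the bits of $k$ from most to least significant. That choice agrees with the way $F_1$, $F_2$, $F_4$ and $F_8$ are used on the "Second step" slides, but the rest of the paper's table may be ordered differently. The names `psi`, `make_F` and `psi_pair` are hypothetical.

```python
def psi(p: str, s: str) -> bool:
    """Boolean matching function: true iff p is a substring of s."""
    return p in s


def make_F(k: int):
    """Return F_k under the assumed numbering (bit 3 = F(T,T), ..., bit 0 = F(F,F))."""
    if not 0 <= k <= 15:
        raise ValueError("k must be in 0..15")

    def F(a: bool, b: bool) -> bool:
        # (T,T) -> bit 3, (T,F) -> bit 2, (F,T) -> bit 1, (F,F) -> bit 0
        bit = 3 - (2 * (not a) + (not b))
        return bool((k >> bit) & 1)

    return F


def psi_pair(k: int, p: str, q: str, s: str) -> bool:
    """Matching function for the pattern pair <F_k, p, q> on string s."""
    return make_F(k)(psi(p, s), psi(q, s))


# Under this numbering F_8 is AND, F_14 is OR, F_6 is XOR and F_4 is "p but not q".
print(psi_pair(8, "ab", "bc", "abc"))   # both patterns occur
print(psi_pair(4, "ab", "cd", "abx"))   # p occurs, q does not
```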
6
Preliminaries
  • A pattern or a Boolean pattern pair $\pi$ matches a
    string $s$ if and only if $\psi(\pi, s)$ is true.
    The pattern $\varepsilon$ matches any string.
  • For a given set of strings $S = \{s_1, \ldots, s_m\}$,
    let $M(\pi, S)$ denote the set of indices of strings
    in $S$ that $\pi$ matches, that is,
    $M(\pi, S) = \{i \mid \psi(\pi, s_i) = \mathrm{true}\}$,
    and let its complement be denoted as
    $\overline{M}(\pi, S) = \{i \mid \psi(\pi, s_i) = \mathrm{false}\}$.
  • For each $s_i \in S$, we are given an associated
    numeric attribute value $r_i$. Let
    $R(\pi, S) = \sum_{i \in M(\pi, S)} r_i$ denote the
    sum of $r_i$ over all $s_i$ that $\pi$ matches.
    Let $M(\pi)$ and $R(\pi)$ be shorthand notation for
    $M(\pi, S)$ and $R(\pi, S)$, respectively. Note that
    $|M(\varepsilon)| = m$ and
    $R(\varepsilon) = \sum_{i=1}^{m} r_i$.
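The definitions above can be sketched directly in Python. This is a brute-force illustration only, with 0-based indices instead of the slides' 1-based ones; the function names and the example strings and attribute values are made up for the sketch.

```python
def matched(p, S):
    """M(p, S): the set of (0-based) indices i such that p is a substring of S[i]."""
    return {i for i, s in enumerate(S) if p in s}


def attribute_sum(p, S, r):
    """R(p, S): the sum of r[i] over all strings S[i] that p matches."""
    return sum(r[i] for i in matched(p, S))


# Hypothetical example data.
S = ["acgt", "ccgt", "ttac"]
r = [1, 2, 4]

print(matched("cg", S))           # strings 0 and 1 contain "cg"
print(attribute_sum("ac", S, r))  # strings 0 and 2 contain "ac"
print(len(matched("", S)))        # the empty pattern matches every string
```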

7
Scoring Function
  • The objective is to find a pattern that maximizes a
    suitable scoring function $\mathit{score}$.
  • The paper concentrates on scoring functions whose
    values for a pattern $\pi$ depend on values cumulated
    over the strings in $S$ that match $\pi$.
  • The scoring function $\mathit{score}$ takes the
    parameters $|M(\pi)|$ and $R(\pi)$.
  • It is also assumed that the score value can be
    computed in constant time once the parameter values
    are known.
  • The specific choice of scoring function depends
    heavily on the particular application.
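As an illustration of a scoring function with the required shape, the sketch below scores a pattern by how far the mean attribute value of the matched strings lies from the mean of the unmatched ones. This particular function is an assumption chosen for the example, not one taken from the paper; the point is only that it depends on $|M(\pi)|$ and $R(\pi)$ alone (plus the constants $m$ and $R(\varepsilon)$) and is evaluated in constant time.

```python
def make_score(m: int, r_total: float):
    """Build score(x, y) for x = |M(pi)| and y = R(pi).

    m is the number of strings and r_total is R(epsilon); both are fixed
    for a given data set, so each evaluation takes constant time.
    """
    def score(x: int, y: float) -> float:
        if x == 0 or x == m:
            return float("-inf")   # one side is empty, so nothing is separated
        mean_matched = y / x
        mean_unmatched = (r_total - y) / (m - x)
        return abs(mean_matched - mean_unmatched)

    return score


score = make_score(m=4, r_total=10.0)
print(score(2, 8.0))   # matched mean 4.0 versus unmatched mean 1.0
```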

8
Problem Definition
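  • Given a set $S = \{s_1, \ldots, s_m\}$ of strings,
    where each string $s_i$ is assigned a numeric
    attribute value $r_i$, and a scoring function
    $\mathit{score}: \mathbb{R} \times \mathbb{R} \to \mathbb{R}$,
    find the Boolean pattern pair
    $\pi \in \{\langle F, p, q \rangle \mid p, q \in \Sigma^*,\ F \in \{F_0, \ldots, F_{15}\}\}$
    that maximizes $\mathit{score}(|M(\pi)|, R(\pi))$.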

9
Suffix tree GST
  • Edges are labeled with substrings of $s$.
  • For a node $v$, $l(v)$ is the string obtained by
    concatenating edge labels from the root to $v$.
  • For each leaf node $v$, $l(v)$ is a distinct suffix
    of $s$; for each suffix there exists a leaf $v$.
  • Each internal node has at least 2 children; the
    first characters of the labels on the edges to its
    children are distinct.
  • GST: given a set $S = \{s_1, \ldots, s_m\}$, the GST
    is a suffix tree for the string
    $s_1 \$_1 \cdots s_m \$_m$, where each $\$_i$ is a
    distinct character that does not belong to $\Sigma$.
  • All paths are ended at the first appearance of any
    $\$_i$, and each leaf is labeled with $id_i$.
  • $O(N)$ space and time.

10
Suffix tree
S = caggaggaccat. The paths of the suffix tree
from the root to the leaves (suffixes) are sorted
in lexicographic order from left to right, each
leaf corresponding to a position in the suffix
array. The integer in the suffix array represents
the position in the string from which the
corresponding suffix starts: $A_s[i] = j$ indicates
that $s[j:n]$ is the $i$th suffix in the
lexicographic ordering. The lcp array represents
the length of the longest common prefix (the
longest shared path from the root) of consecutive
suffixes in the suffix array.
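A small Python sketch of the two arrays described in the caption. It sorts the suffixes directly, which is far from linear time, and it uses 0-based positions with `lcp[0] = 0` by convention; both are simplifications for illustration rather than the construction used in the paper.

```python
def suffix_array(s: str) -> list[int]:
    """Start positions of the suffixes of s in lexicographic order (naive sort)."""
    return sorted(range(len(s)), key=lambda j: s[j:])


def lcp_array(s: str, sa: list[int]) -> list[int]:
    """lcp[i] = length of the longest common prefix of suffixes sa[i-1] and sa[i]."""
    def common(a: str, b: str) -> int:
        n = 0
        while n < len(a) and n < len(b) and a[n] == b[n]:
            n += 1
        return n

    return [0] + [common(s[sa[i - 1]:], s[sa[i]:]) for i in range(1, len(sa))]


s = "caggaggaccat"          # the string from the slide
sa = suffix_array(s)
for pos, shared in zip(sa, lcp_array(s, sa)):
    print(pos, shared, s[pos:])
```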
11
GST (Generalized Suffix Tree)
A Generalized Suffix Tree and its corresponding
suffix array for the strings facct, gctt, ctctg.
12
A Naïve $O(N^3)$ Algorithm
  • Let $N = \sum_{i=1}^{m} \mathrm{length}(s_i)$.
  • $O(N)$ candidates for a single pattern: patterns of
    the form $l(v)$, where $v$ is a node in the GST over
    the set $S$. (Why?)
  • Hence $O(N^2)$ candidate pattern pairs.
  • For a given pair $\langle F, l(v_1), l(v_2) \rangle$,
    the values $|M(\pi)|$ and $R(\pi)$ can be computed in
    $O(N)$ time by any of the linear-time string matching
    algorithms.
  • The scoring function value is then calculated in
    constant time given $|M(\pi)|$ and $R(\pi)$.
  • Time: $O(N^3)$. Space: $O(N)$ for the suffix tree.
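A brute-force baseline in the spirit of the naive algorithm can be sketched as follows. It is deliberately simpler than what the slide describes: it enumerates every distinct substring as a candidate instead of only the $O(N)$ node labels of the GST, so it examines more pairs than the naive algorithm would and is only useful for checking a faster implementation on tiny inputs. The numbering of $F_k$ in `apply_F` is the same assumed bit order as $F_8$ = "both", $F_4$ = "p only", $F_2$ = "q only", $F_1$ = "neither"; all function names are hypothetical.

```python
from itertools import product


def substrings(S):
    """All distinct substrings of the strings in S, plus the empty string."""
    subs = {""}
    for s in S:
        for a in range(len(s)):
            for b in range(a + 1, len(s) + 1):
                subs.add(s[a:b])
    return sorted(subs)


def apply_F(k, a, b):
    """F_k(a, b) under the assumed numbering (bit 3 = F(T,T), ..., bit 0 = F(F,F))."""
    bit = 3 - (2 * (not a) + (not b))
    return bool((k >> bit) & 1)


def best_pair(S, r, score):
    """Return (best score, k, p, q) over all candidate pairs <F_k, p, q>."""
    best = None
    cands = substrings(S)
    for p, q in product(cands, repeat=2):
        # Matching step: one pass over all strings for this pair of patterns.
        hits = [(p in s, q in s) for s in S]
        for k in range(16):
            idx = [i for i, (a, b) in enumerate(hits) if apply_F(k, a, b)]
            val = score(len(idx), sum(r[i] for i in idx))
            if best is None or val > best[0]:
                best = (val, k, p, q)
    return best


# Hypothetical data and score: reward attribute mass, penalise each matched string.
S = ["ab", "bc", "ca"]
r = [5, 1, 1]
print(best_pair(S, r, lambda x, y: y - 2 * x))
```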

13
$O(N^2)$ Algorithm
  • Two steps:
  • Find $|M(l(v))|$ and $R(l(v))$ for all nodes $v$ of
    the GST in $O(N)$ time and space.
  • Solve the optimal pair of substring patterns problem
    in $O(N^2)$ time and $O(N)$ space for any scoring
    function $\mathit{score}$, provided that it can be
    calculated in constant time given its inputs.

14
Algorithm- First step
  • If $R(l(v))$ can be found for all $v$ in $O(N)$ time,
    so can $|M(l(v))|$ (when $r_i = 1$ for all $i$,
    $R(l(v)) = |M(l(v))|$).
  • Let $LF(v)$ be the set of all leaf nodes in the
    subtree rooted by node $v$.
  • Let $c_i(v)$ denote the number of leaves in $LF(v)$
    that have the label $id_i$.
  • Let the sum of leaf attributes over $LF(v)$ be
    denoted $\sum_{LF(v)} r_i$.

15
Algorithm- First step
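  • $\sum_{LF(v)} r_i = \sum_{i \in M(l(v))} (c_i(v) \cdot r_i)$
  • $R(l(v)) = \sum_{i \in M(l(v))} r_i = \sum_{LF(v)} r_i - \sum_{i \in M(l(v))} ((c_i(v) - 1) \cdot r_i)$   (1)
  • Let the correction factor be
    $\mathit{corr}(l(v), S) = \sum_{i \in M(l(v))} ((c_i(v) - 1) \cdot r_i)$.
  • In (1), $\sum_{LF(v)} r_i$ can be calculated for all
    $v$ using a linear-time post-order traversal, as
    $\sum_{LF(v)} r_i = \sum_{v'} (\sum_{LF(v')} r_i \mid v' \text{ is a child node of } v)$.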

16
Algorithm- First step
  • How to remove the redundancies (correcting
    factors) in (1)?
  • Let $I(id_i)$ be the list of all leaves with the
    label $id_i$ in the order they appear in the
    post-order traversal of the tree. Constructing
    the lists $I$ can be done in linear time for all
    labels $id_i$.
  • The leaves in $LF(v)$ with the label $id_i$ form a
    continuous interval of length $c_i(v)$ in the list
    $I(id_i)$.
  • If $c_i(v) > 0$, a length-$c_i(v)$ interval in
    $I(id_i)$ contains $(c_i(v) - 1)$ adjacent
    (overlapping) leaf pairs.
  • If $x, y \in LF(v)$, the node $\mathit{lca}(x, y)$
    belongs to the subtree rooted by $v$.
  • For any $s_i \in S$, $\psi(l(v), s_i) = \mathrm{true}$,
    that is, $i \in M(l(v))$, if and only if there is a
    leaf $x \in LF(v)$ with the label $id_i$.

17
Algorithm- First step
  • Initially, the correction value is 0 for all $v$.
  • For each adjacent leaf pair $x, y$ in $I(id_i)$, add
    $r_i$ to the correction value of the node
    $\mathit{lca}(x, y)$.
  • For each $v$ with $c_i(v) > 0$, the sum of correction
    values in the nodes of the subtree rooted by $v$ is
    then $(c_i(v) - 1) \cdot r_i$.
  • Repeat this for all lists $I(id_i)$; the preceding
    total sum becomes
    $\sum_{i \in M(l(v))} ((c_i(v) - 1) \cdot r_i) = \mathit{corr}(l(v), S)$.
  • Perform a linear-time bottom-up (post-order)
    traversal to find $R(l(v))$.

18
Algorithm- First step
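Correction values at $v_1$, $v_2$, $v_3$ are set to $r_3$, $r_2$, $r_3$.
$v_3$: $r_3 + r_2 + r_3 - r_3 = r_2 + r_3 = R(l(v_3))$
$v_2$: $R(l(v_3)) + r_2 - r_2 = r_2 + r_3 = R(l(v_2))$
$v_1$: $r_1 + R(l(v_2)) + r_3 - r_3 = r_1 + r_2 + r_3 = R(l(v_1))$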
19
Pseudo code for Step 1
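A minimal Python sketch of the first step as described on the preceding slides, written under several simplifying assumptions rather than as the authors' pseudocode. An uncompacted suffix trie stands in for the GST (so the tree itself has $O(N^2)$ nodes), and the lowest common ancestor is found by walking parent pointers; reaching the $O(N)$ bound would need a real generalized suffix tree and constant-time LCA queries. All names are hypothetical.

```python
def build_suffix_trie(strings):
    """Uncompacted generalized suffix trie; a small stand-in for the GST."""
    children, parent, depth, leaf_id, label = [{}], [-1], [0], [None], [""]

    def new_node(p, lab):
        children.append({})
        parent.append(p)
        depth.append(depth[p] + 1)
        leaf_id.append(None)
        label.append(lab)
        return len(children) - 1

    for i, s in enumerate(strings):
        for start in range(len(s) + 1):        # every suffix, including the empty one
            v = 0
            for ch in s[start:]:
                if ch not in children[v]:
                    children[v][ch] = new_node(v, label[v] + ch)
                v = children[v][ch]
            leaf = new_node(v, label[v])       # terminator edge for string i ends the path
            children[v][("$", i, start)] = leaf
            leaf_id[leaf] = i                  # leaf carries the label id_i
    return children, parent, depth, leaf_id, label


def lca(x, y, parent, depth):
    """Naive lowest common ancestor by walking parent pointers."""
    while depth[x] > depth[y]:
        x = parent[x]
    while depth[y] > depth[x]:
        y = parent[y]
    while x != y:
        x, y = parent[x], parent[y]
    return x


def attribute_sums(strings, r):
    """Return {l(v): R(l(v))} for every internal trie node v."""
    children, parent, depth, leaf_id, label = build_suffix_trie(strings)
    order, stack = [], [(0, False)]
    while stack:                               # iterative post-order traversal
        v, expanded = stack.pop()
        if expanded:
            order.append(v)
        else:
            stack.append((v, True))
            stack.extend((c, False) for c in children[v].values())

    total = [0] * len(children)                # sum of leaf attributes below v
    corr = [0] * len(children)                 # correction values
    previous = {}                              # last leaf seen so far for each id_i
    for v in order:
        i = leaf_id[v]
        if i is None:
            continue
        total[v] = r[i]
        if i in previous:                      # adjacent leaf pair in the list I(id_i)
            corr[lca(previous[i], v, parent, depth)] += r[i]
        previous[i] = v

    for v in order:                            # children precede parents in post-order
        if parent[v] >= 0:
            total[parent[v]] += total[v]
            corr[parent[v]] += corr[v]

    return {label[v]: total[v] - corr[v]
            for v in range(len(children)) if leaf_id[v] is None}


R = attribute_sums(["ab", "bc", "abc"], [1, 10, 100])
print(R["b"], R["ab"], R["c"])
```

Because every substring is a node label in the uncompacted trie, the returned dictionary can be checked entry by entry against a direct substring test, which is a convenient way to validate a faster suffix-tree or suffix-array version.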
20
$O(N^2)$ Algorithm
  • Two steps:
  • Find $|M(l(v))|$ and $R(l(v))$ for all nodes $v$ of
    the GST in $O(N)$ time and space.
  • Solve the optimal pair of substring patterns problem
    in $O(N^2)$ time and $O(N)$ space for any scoring
    function $\mathit{score}$, provided that it can be
    calculated in constant time given its inputs.

21
Algorithm- Second step
  • $O(N)$ choices for the first pattern, $l(v_1)$.
  • For each $l(v_1)$, use a modified version of the
    previous algorithm for the $O(N)$ choices of the
    second pattern, $l(v_2)$.
  • Given a fixed $l(v_1)$, we additionally label each
    string $s_i \in S$ and the corresponding leaves in
    the GST with the Boolean value $\psi(l(v_1), s_i)$;
    this takes $O(N)$ time.
  • Cumulate the sums and correction values
    separately for true and false values of the
    additional label.

22
Algorithm- Second step
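  • $\sum_{i \in M(l(v_2))} (r_i \mid \psi(l(v_1), s_i) = \mathrm{true})$
    $= \sum_i (r_i \mid \psi(l(v_1), s_i) = \mathrm{true},\ \psi(l(v_2), s_i) = \mathrm{true})$
    $= R(\langle F_8, l(v_1), l(v_2) \rangle)$
  • $\sum_{i \in M(l(v_2))} (r_i \mid \psi(l(v_1), s_i) = \mathrm{false})$
    $= \sum_i (r_i \mid \psi(l(v_1), s_i) = \mathrm{false},\ \psi(l(v_2), s_i) = \mathrm{true})$
    $= R(\langle F_2, l(v_1), l(v_2) \rangle)$
  • $\sum_{i \in \overline{M}(l(v_2))} (r_i \mid \psi(l(v_1), s_i) = \mathrm{true})$
    $= \sum_i (r_i \mid \psi(l(v_1), s_i) = \mathrm{true},\ \psi(l(v_2), s_i) = \mathrm{false})$
    $= R(\langle F_4, l(v_1), l(v_2) \rangle)$
    $= R(l(v_1)) - R(\langle F_8, l(v_1), l(v_2) \rangle)$
  • $\sum_{i \in \overline{M}(l(v_2))} (r_i \mid \psi(l(v_1), s_i) = \mathrm{false})$
    $= \sum_i (r_i \mid \psi(l(v_1), s_i) = \mathrm{false},\ \psi(l(v_2), s_i) = \mathrm{false})$
    $= R(\langle F_1, l(v_1), l(v_2) \rangle)$
    $= R(\varepsilon) - R(l(v_1)) - R(\langle F_2, l(v_1), l(v_2) \rangle)$
  • where $R(\varepsilon) - R(l(v_1))$ can be computed in
    linear time.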

23
Algorithm- Second step
  • All cumulative values of the form
    $\sum_i (r_i \mid \psi(l(v_1), s_i) = b_1,\ \psi(l(v_2), s_i) = b_2)$,
    where $b_1, b_2 \in \{\mathrm{true}, \mathrm{false}\}$,
    can be computed in linear time.
  • Thus $R(\langle F, l(v_1), l(v_2) \rangle)$, and hence
    the score, can be computed in linear time for all
    pairs of the form $\langle F, l(v_1), l(v_2) \rangle$,
    given a fixed $l(v_1)$.
  • Thus $O(N^2)$ for all pattern pairs.
  • Since the $O(N)$ calculations for each $l(v_1)$ are
    independent, the same GST can be reused. Hence
    the space complexity is $O(N)$.
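How the cumulative values combine into $R$ for every Boolean function can be sketched as below. For a fixed $l(v_1)$ and one node $v_2$, only two sums need to come out of the tree traversal (strings matching both patterns, and strings matching only the second); the other two cells follow from the identities on the previous slides, and $R(\langle F_k, l(v_1), l(v_2) \rangle)$ is the sum of the cells that $F_k$ selects. The function name `pair_sums` and the assumption that bit values 8, 4, 2, 1 of $k$ correspond to the cells "both", "first only", "second only", "neither" are illustrative; the usage example obtains the inputs by direct substring tests instead of a traversal.

```python
def pair_sums(R_eps, R_p, R_both, R_q_only):
    """Return [R(<F_k, p, q>) for k in 0..15] from two accumulated sums.

    R_eps    = R(epsilon), the sum of all attribute values
    R_p      = R(p), known from the first step
    R_both   = R(<F_8, p, q>): strings matching p and q
    R_q_only = R(<F_2, p, q>): strings matching q but not p
    """
    cell = {
        8: R_both,
        4: R_p - R_both,                # p but not q
        2: R_q_only,
        1: R_eps - R_p - R_q_only,      # neither p nor q
    }
    return [sum(val for bit, val in cell.items() if k & bit) for k in range(16)]


# Hypothetical data; the inputs are computed by brute force for the example.
S = ["ab", "bc", "abc", "cd"]
r = [1, 10, 100, 1000]
p, q = "ab", "bc"
R_eps = sum(r)
R_p = sum(ri for ri, s in zip(r, S) if p in s)
R_both = sum(ri for ri, s in zip(r, S) if p in s and q in s)
R_q_only = sum(ri for ri, s in zip(r, S) if p not in s and q in s)

sums = pair_sums(R_eps, R_p, R_both, R_q_only)
print(sums[8], sums[14], sums[6], sums[1])   # AND, OR, XOR, neither
```

Running the same combination with every $r_i$ set to 1 yields the counts $|M(\pi)|$, so a constant-time score can be evaluated for all 16 functions at each node.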

24
Algorithm- Second step
25
The rest of the paper in a nutshell
  • Extension for k-ary Boolean function.
  • Implementation using suffix arrays.
  • Computational experiments and results.
  • Algorithm variations: multiple string attributes,
    distance restrictions.

26
Homework
  • Explain the implementation of the Optimal Boolean
    Pattern Pair problem using suffix arrays in your
    own words. Also explain why it is more efficient
    than the suffix tree approach.

Email siv_at_tamu.edu
27
  • THANK YOU