Combinatorial pattern discovery in biological sequences: the TEIRESIAS algorithm

1
Combinatorial pattern discovery in biological
sequences: the TEIRESIAS algorithm
Bioinformatics, 1998
Authors: Isidore Rigoutsos, Aris Floratos
  • Speaker: Minghua ZHANG

March 12, 2003
2
Content
  • Background
  • Problem definition
  • Algorithm
  • Performance
  • Future work
  • Conclusion

3
Background
  • The development of biology provides us with a lot
    of previously unknown information.
  • DNA sequences, protein sequences
  • One way to make use of this information: discover
    patterns in biological sequences.
  • Applications
  • Classification
  • Given a protein family, suppose we find a pattern P
    of the family.
  • Later, when a new protein comes in, if the protein
    matches P, it is very likely that the new protein
    also belongs to the same family.
  • Function prediction

4
Problem Definition
  • Σ: the alphabet of the characters
  • E.g. for DNA sequences, Σ = {A, C, G, T}.
  • Pattern
  • A sequence of characters in Σ and wild-cards
    (.)
  • A wild-card cannot be at the beginning or end of
    the sequence.
  • E.g. A..G is a pattern; AG. is not.
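
The pattern definition above can be checked mechanically. The following is a minimal sketch, assuming the DNA alphabet as the default and treating the empty string as "not a pattern" (the slides do not say); the function name `is_pattern` is made up for illustration.

```python
def is_pattern(text, alphabet="ACGT", wildcard="."):
    """Return True if `text` is a pattern over `alphabet` in the sense above."""
    if not text:
        # Assumption: the empty string is not considered a pattern.
        return False
    # Every symbol must be an alphabet character or the wild-card.
    if any(ch != wildcard and ch not in alphabet for ch in text):
        return False
    # A wild-card may not appear at either end.
    return text[0] != wildcard and text[-1] != wildcard


print(is_pattern("A..G"))  # True
print(is_pattern("AG."))   # False
```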

5
Problem Definition (contd)
  • Given a pattern P
  • Its length |P|: No. of characters and wild-cards
    in P.
  • char(P): No. of characters in P.
  • E.g. |A.G.T| = 5, char(A.G.T) = 3.
  • Its subpattern: a substring of P which is itself
    a pattern.
  • A.G is a subpattern of A.G.T, while A.G. is not.
  • Its language G(P) = { p | p is obtained
    by replacing all wild-cards in P with characters
    in Σ }
  • E.g. G(A.G) = { AAG, ACG, AGG, ATG }.
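
As a small illustration of these definitions, the sketch below computes char(P), the language G(P), and the set of subpatterns. The helper names are hypothetical, and the DNA alphabet is assumed as the default.

```python
from itertools import product


def char_count(p):
    """char(P): number of non-wild-card symbols in p."""
    return sum(ch != "." for ch in p)


def language(p, alphabet="ACGT"):
    """G(P): every string obtained by filling each wild-card with a character."""
    slots = [alphabet if ch == "." else ch for ch in p]
    return {"".join(combo) for combo in product(*slots)}


def subpatterns(p):
    """All substrings of p that begin and end with a character."""
    n = len(p)
    return {
        p[i:j]
        for i in range(n)
        for j in range(i + 1, n + 1)
        if p[i] != "." and p[j - 1] != "."
    }


print(len("A.G.T"), char_count("A.G.T"))  # 5 3
print(sorted(language("A.G")))            # ['AAG', 'ACG', 'AGG', 'ATG']
print("A.G" in subpatterns("A.G.T"))      # True
```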

6
Problem Definition (contd)
  • Given a character sequence s, we say it matches a
    pattern P in position y
  • if a substring of s is in G(P) and
  • the substring begins at offset y in s.
  • E.g. if s = ACGATCA, P = CG.T, then s matches P in
    position 2.
  • Given a sequence database D = { s1, s2, ..., sn }
  • support(P): No. of sequences in D that match P.
  • List(P) = { (x, y) | s_x matches P in position y }
  • Frequent
  • If support(P) ≥ K
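
A brute-force sketch of matching, List(P) and support(P) follows. It uses 1-based sequence indices and positions to agree with the example above; the function names and the toy database are assumptions for illustration only.

```python
def matches_at(s, p, y):
    """True if sequence s matches pattern p at 1-based position y."""
    start = y - 1
    if start < 0 or start + len(p) > len(s):
        return False
    return all(pc == "." or pc == sc for pc, sc in zip(p, s[start:]))


def offset_list(db, p):
    """List(P): all (x, y) such that sequence x matches p at position y."""
    return [
        (x, y)
        for x, s in enumerate(db, start=1)
        for y in range(1, len(s) - len(p) + 2)
        if matches_at(s, p, y)
    ]


def support(db, p):
    """support(P): number of distinct sequences that match p somewhere."""
    return len({x for x, _ in offset_list(db, p)})


db = ["ACGATCA", "TTCGGT", "AAAA"]  # toy database, not from the slides
print(offset_list(db, "CG.T"))      # [(1, 2), (2, 3)]
print(support(db, "CG.T"))          # 2
```

With a threshold K = 2, CG.T would be frequent in this toy database.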

7
Problem Definition (contd)
  • <L,W> pattern
  • L, W are 2 integers, 0 < L ≤ W
  • P is an <L,W> pattern if
  • for every subpattern Q of P with char(Q) = L, we have |Q| ≤ W.
  • E.g. A.G.C.T is a <3,5> pattern; A..G.C.T is not,
    because its subpattern A..G.C has 3 characters but length 6.
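
The density condition can be tested by looking only at runs of L consecutive characters, since any subpattern with exactly L characters starts at one character and ends at the one L-1 characters later. The sketch below assumes that reading; the name `is_lw_pattern` and the values L = 3, W = 5 in the usage lines are illustrative choices.

```python
def is_lw_pattern(p, L, W):
    """True if every subpattern of p with exactly L characters has length <= W."""
    # Indices of the characters (non-wild-cards) in p.
    pos = [i for i, ch in enumerate(p) if ch != "."]
    # A subpattern with exactly L characters spans pos[k] .. pos[k + L - 1].
    return all(
        pos[k + L - 1] - pos[k] + 1 <= W
        for k in range(len(pos) - L + 1)
    )


print(is_lw_pattern("A.G.C.T", 3, 5))   # True
print(is_lw_pattern("A..G.C.T", 3, 5))  # False
```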

8
Problem Definition (contd)
  • Pattern P' is more specific than P
  • if P' is obtained from P by
  • replacing some wild-cards with characters, and/or
  • appending a string to the left/right of P
  • E.g. P = A..G, P1 = AC.G, or P2 = A..GT.
  • |List(P')| ≤ |List(P)|
  • Given a set of patterns, a pattern p is maximal
    if
  • for every p' that is more specific than p
  • we have |List(p')| < |List(p)|
  • E.g. P = A.GT, P' = ACGT, if List(P) = List(P') = { (1,1),
    (2,2), (3,1) }, then P is not maximal.
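
The "more specific" relation and the maximality test can be sketched as follows. This is an illustrative brute-force check against an explicit dictionary of offset lists, not the bookkeeping TEIRESIAS itself uses; both function names are hypothetical.

```python
def is_more_specific(q, p):
    """True if q can be obtained from p by filling wild-cards and/or
    appending symbols on the left or right (and q differs from p)."""
    if q == p or len(q) < len(p):
        return False
    # Try every alignment of p inside q.
    return any(
        all(pc == "." or pc == qc for pc, qc in zip(p, q[d:]))
        for d in range(len(q) - len(p) + 1)
    )


def is_maximal(p, offset_lists):
    """p is maximal if no more specific pattern has as many offsets as p.

    offset_lists maps each pattern in the set to its List(P)."""
    return not any(
        is_more_specific(q, p) and len(lst) >= len(offset_lists[p])
        for q, lst in offset_lists.items()
    )


lists = {
    "A.GT": [(1, 1), (2, 2), (3, 1)],
    "ACGT": [(1, 1), (2, 2), (3, 1)],
}
print(is_maximal("A.GT", lists))  # False
print(is_maximal("ACGT", lists))  # True
```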

9
Problem Definition (contd)
  • Problem
  • Inputs
  • A character sequence database D
  • Support threshold K
  • Parameters L, W.
  • Outputs
  • { p | p is an <L,W> pattern ∧ p is frequent ∧ p
    is maximal ∧ char(p) ≥ L }.

10
Algorithm
  • Two parts
  • 1. Find elementary patterns.
  • Frequent <L,W> patterns having exactly L
    characters
  • 2. Find required patterns by combining elementary patterns.

11
Algorithm part 1
  • Find the set of elementary patterns EP
  • Algorithm:
    EP = empty
    Scan D
    For all c in Σ
        if support(c) ≥ K
            Extend(c)
  • Function Extend(pattern P):
    If char(P) = L: add P to EP, return
    For i = 0 to W - |P| - L + char(P)
        P' = P followed by i dots
        for all (x, y) in List(P)
            if (y + |P| + i ≤ |s_x|)
                a = s_x[y + |P| + i]
                update List(P'a)
        For all c in Σ
            if |List(P'c)| ≥ K
                Extend(P'c)
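
The Extend procedure can be turned into runnable code fairly directly. The following is a sketch under a few assumptions that are not in the slides: offsets are 0-based (x is the sequence index, y the start position), the alphabet is taken from the data, and frequency is tested with support (distinct sequences) rather than the raw size of the offset list.

```python
def elementary_patterns(db, L, W, K):
    """Return {pattern: offset list} for frequent patterns with exactly
    L characters and total length at most W."""
    found = {}

    def support(offsets):
        return len({x for x, _ in offsets})

    def extend(pattern, offsets):
        chars = sum(ch != "." for ch in pattern)
        if chars == L:
            found[pattern] = offsets
            return
        # i = number of dots inserted before the next character.
        for i in range(W - len(pattern) - L + chars + 1):
            grown = {}
            for x, y in offsets:
                pos = y + len(pattern) + i  # index of the next character
                if pos < len(db[x]):
                    grown.setdefault(db[x][pos], []).append((x, y))
            for c, lst in grown.items():
                if support(lst) >= K:
                    extend(pattern + "." * i + c, lst)

    # Start from every single character that is frequent on its own.
    singles = {}
    for x, s in enumerate(db):
        for y, c in enumerate(s):
            singles.setdefault(c, []).append((x, y))
    for c, lst in singles.items():
        if support(lst) >= K:
            extend(c, lst)
    return found


db = ["ACGT", "ACGA", "TCGT"]  # toy database, not from the slides
print(elementary_patterns(db, L=2, W=3, K=3))
# {'CG': [(0, 1), (1, 1), (2, 1)]}
```

The loop bound on i is what keeps every result within length W: after adding i dots and one character, there must still be room for the remaining L - char(P) - 1 characters.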

12
Algorithm part 2
  • Terminologies
  • Given a pattern P with char(P) ≥ L,
  • Prefix(P): its subpattern containing the first
    L-1 characters.
  • Suffix(P): its subpattern containing the last L-1
    characters.
  • E.g. L = 3, prefix(A.G.TG) = A.G, suffix(A.G.TG) = TG
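
A short sketch of prefix and suffix as defined above. It assumes L ≥ 2 and that the pattern has at least L-1 characters; the function names mirror the slide's terms but are otherwise an illustration.

```python
def prefix(p, L):
    """Shortest leading subpattern of p that contains L-1 characters."""
    seen = 0
    for i, ch in enumerate(p):
        if ch != ".":
            seen += 1
            if seen == L - 1:
                return p[: i + 1]
    raise ValueError("pattern has fewer than L-1 characters")


def suffix(p, L):
    """Shortest trailing subpattern of p that contains L-1 characters."""
    return prefix(p[::-1], L)[::-1]


print(prefix("A.G.TG", 3))  # A.G
print(suffix("A.G.TG", 3))  # TG
```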

13
Algorithm part 2 (contd)
  • Terminologies
  • Convolution
  • If suffix(P) = prefix(Q), then
  • conv(P, Q) = PQ' (where Q = prefix(Q)Q')
  • E.g. P = AC.G, Q = C.GT, then Q' = T, conv(P,Q) = AC.GT
  • If R = conv(P,Q), then
  • List(R) = { (x,y) | (x,y) in List(P) ∧ (x, y + |P|
    - |suffix(P)|) in List(Q) }
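
Convolution and the derived offset list can be sketched as below. The helper `_lead` and the function names are hypothetical; the offset lists in the usage lines are made-up toy values, and L = 3 is assumed as in the example.

```python
def _lead(p, n):
    """Shortest leading part of p that contains n characters."""
    seen = 0
    for i, ch in enumerate(p):
        seen += ch != "."
        if seen == n:
            return p[: i + 1]
    raise ValueError("pattern has too few characters")


def prefix(p, L):
    return _lead(p, L - 1)


def suffix(p, L):
    return _lead(p[::-1], L - 1)[::-1]


def conv(p, q, L):
    """conv(P, Q) = P followed by Q', or None if suffix(P) != prefix(Q)."""
    suf = suffix(p, L)
    if suf != prefix(q, L):
        return None
    return p + q[len(suf):]


def conv_offsets(list_p, list_q, p, L):
    """List(R) for R = conv(P, Q), computed from List(P) and List(Q)."""
    shift = len(p) - len(suffix(p, L))
    q_set = set(list_q)
    return [(x, y) for x, y in list_p if (x, y + shift) in q_set]


print(conv("AC.G", "C.GT", 3))                                     # AC.GT
print(conv_offsets([(1, 1), (2, 5)], [(1, 2), (2, 7)], "AC.G", 3))  # [(1, 1)]
```

Intersecting the two lists this way avoids rescanning the database for every candidate R.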

14
Algorithm part 2 (contd)
  • Terminologies
  • Partial orders on a set of patterns
  • Prefix-wise less than
  • Align the two patterns along their left-most columns
  • Find the first column that holds a . in one pattern
    and a character in the other
  • The pattern with the character in that column is
    smaller.
  • E.g. A . G T
  •      A G T . C
  • So AGT.C is prefix-wise less than A.GT.

15
Algorithm part 2 (contd)
  • Terminologies
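  • Partial orders on a set of patterns
  • Suffix-wise less than: the same comparison, with the
    patterns aligned along their right-most columns
  • E.g.   A . G T
  •      A G T . C
  • So A.GT is suffix-wise less than AGT.C.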
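
Both orders reduce to one column scan. The sketch below assumes that when no column pairs a wild-card with a character, neither pattern is less than the other (which is what makes these partial orders); the function names are illustrative.

```python
def prefix_wise_less(p, q):
    """True if p is prefix-wise less than q (left-aligned comparison)."""
    for a, b in zip(p, q):
        if (a == ".") != (b == "."):
            # First column with one wild-card and one character:
            # the pattern holding the character is the smaller one.
            return b == "."
    return False


def suffix_wise_less(p, q):
    """True if p is suffix-wise less than q (right-aligned comparison)."""
    return prefix_wise_less(p[::-1], q[::-1])


print(prefix_wise_less("AGT.C", "A.GT"))  # True
print(suffix_wise_less("A.GT", "AGT.C"))  # True
```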

16
Algorithm part 2 (contd)
  • Works on a stack
  • At first, push all elementary patterns onto the
    stack.
  • Prefix-wise less patterns are pushed later, so
    that they are closer to the top of the stack.
  • Process the pattern on top of the stack, one at a time,
  • until the stack is empty
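
One way to obtain this push order is to sort on a "dot mask" of each pattern. This is a sketch of an assumed implementation trick, not something stated in the slides: if p is prefix-wise less than q, the mask of p is lexicographically smaller, so sorting masks in descending order leaves prefix-wise less patterns nearer the end of the list, which serves as the top of the stack.

```python
def dot_mask(p):
    """Mask with '1' for a wild-card and '0' for a character."""
    return "".join("1" if ch == "." else "0" for ch in p)


def initial_stack(elementary):
    """List used as a stack (top = last element), with prefix-wise
    less patterns closer to the top."""
    return sorted(elementary, key=dot_mask, reverse=True)


stack = initial_stack(["A.G", "AC", "C.T", "CG"])
print(stack)        # ['A.G', 'C.T', 'AC', 'CG']
print(stack.pop())  # CG
```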

17
Algorithm part 2 (contd)
  • P is on top of the stack
  • Extend P to the right
  • Find the elementary patterns whose prefix is the same
    as suffix(P).
  • Convolve P with these patterns one by one,
    prefix-wise less patterns first.
  • Let R = conv(P,Q).
  • If |List(R)| = |List(P)|, pop the stack.
  • If support(R) ≥ K and R is maximal, push R and
    start over.
  • Extend P to the left.
  • Pop P from the stack. If P is maximal, report it.

18
Performance
  • Experiments
  • Effectiveness
  • Real data
  • Successfully finds patterns previously known to
    biologists.
  • Efficiency
  • Synthetic data

19
Performance (contd)
  • Core histones H3 and H4
  • Database: 13 proteins from the H3 family, 7 from H4.
  • L = 3, W = 35
  • Finds 9 patterns in all 20 proteins
  • For those patterns with the largest No. of
    characters

20
Performance (contd)
  • Synthetic data generation
  • Use the generator at http://www.expasy.ch/sprot/randseq.html
    to get
  • A random protein P of 400 amino acids
  • Obtain 20 derivative proteins from P
  • With X% similarity to P
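
For readers who want to reproduce a comparable data set locally, here is a rough stand-in for the web generator. It rests on assumptions the slides do not confirm: residues are drawn uniformly from the 20 standard amino acids, and "X% similarity" is read as X% of positions being identical to P, achieved by substituting the remaining positions.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def random_protein(n, rng):
    """Random sequence of n amino acids (uniform, an assumption)."""
    return "".join(rng.choice(AMINO_ACIDS) for _ in range(n))


def derive(parent, similarity_pct, rng):
    """Copy of parent with (100 - similarity_pct)% of positions substituted."""
    n_mut = round(len(parent) * (100 - similarity_pct) / 100)
    seq = list(parent)
    for i in rng.sample(range(len(parent)), n_mut):
        # Always pick a different residue so identity is exactly X%.
        seq[i] = rng.choice([a for a in AMINO_ACIDS if a != parent[i]])
    return "".join(seq)


rng = random.Random(0)  # fixed seed for repeatability
P = random_protein(400, rng)
derivatives = [derive(P, 90, rng) for _ in range(20)]
identity = sum(a == b for a, b in zip(P, derivatives[0]))
print(len(derivatives), identity)  # 20 360
```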

21
Performance (contd)
  • Synthetic data
  • X = 40, 50, 60, 70, 80, 90
  • L = 3, W = 10 (15)
  • K = 12, 14, 16, 18, 20
  • Results
  • The algorithm is efficient.
  • The running time is almost linear in the actual
    size of the output.

22
Performance (contd)
  • Algorithm analysis
  • Bad factors
  • ACGT and GTA will generate ACGTA. However, CGTA
    may be infrequent, so extra candidates are generated.
  • Good factors
  • By using the partial orders, more specific patterns
    are generated before less specific ones.
  • E.g. ACG is considered before AC.T ⇒ ACGTG is
    generated before AC.TG ⇒ if AC.TG is not maximal
    because of ACGTG, it will not be pushed onto the stack
    for further candidate generation.
  • When R is generated from P, if |List(R)| =
    |List(P)|, P will be popped off the stack.
  • This reduces the number of candidates.

23
Performance (contd)
  • A simple comparison
  • Compared with an Apriori-like algorithm, which
    finds all patterns, not only maximal ones.
  • X = 90, L = 3, W = 10, K = 16

(Running time does not include the time to output patterns.)
24
Future work
  • Find more flexible patterns
  • Patterns with ambiguous characters
  • E.g. A[CG]T is a pattern that can match both
    ACT and AGT. [IALYD2000]
  • Patterns with flexible gaps
  • Gap: one or several consecutive wild-cards
  • E.g. in A..G, .. is a gap of length 2
  • Flexible gap
  • A-x(3,5)-G
  • A--G.
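
A flexible gap such as A-x(3,5)-G maps naturally onto a bounded repetition in a regular expression. The sketch below assumes the notation means "A, then 3 to 5 arbitrary symbols, then G" and handles only the x(m,n) form; the function name is hypothetical.

```python
import re


def flexible_to_regex(pattern):
    """Translate a dash-separated pattern with x(m,n) gaps into a regex."""
    parts = []
    for token in pattern.split("-"):
        gap = re.fullmatch(r"x\((\d+),(\d+)\)", token)
        if gap:
            # x(m,n): between m and n arbitrary symbols.
            parts.append(".{%s,%s}" % gap.groups())
        else:
            parts.append(re.escape(token))
    return re.compile("".join(parts))


rx = flexible_to_regex("A-x(3,5)-G")
print(rx.pattern)                   # A.{3,5}G
print(bool(rx.search("TACCCGT")))   # True  (gap of length 3)
print(bool(rx.search("TACCGT")))    # False (gap of length 2)
```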

25
Conclusion
  • TEIRESIAS can find patterns composed of
    characters and wild-cards.
  • It successfully finds patterns in
    biological sequences.
  • Its running time is roughly linear in the size of
    the output.
  • Future research: discover more flexible patterns.

26
Reference
  • [IA1998] Combinatorial pattern discovery in
    biological sequences: the TEIRESIAS algorithm.
    Isidore Rigoutsos et al. Bioinformatics, Vol. 14,
    No. 1, 1998, 55-67.
  • [IALYD2000] The emergence of pattern discovery
    techniques in computational biology. Isidore
    Rigoutsos et al. Metabolic Engineering 2, 159-177
    (2000).
  • [AI] On the time complexity of the TEIRESIAS
    algorithm. Aris Floratos et al. IBM research
    report.
