Combinatorial pattern discovery in biological sequences: the TEIRESIAS algorithm

Provided by: hkucsis


1
Combinatorial pattern discovery in biological
sequences: the TEIRESIAS algorithm
Bioinformatics 1998
Authors: Isidore Rigoutsos, Aris Floratos

Speaker: Minghua ZHANG

March 12, 2003
2
Content

Background
Problem definition
Algorithm
Performance
Future work
Conclusion

3
Background

The development of biology provides us with a lot
of previously unknown information.
DNA sequences, protein sequences
One way to make use of this information: discover
patterns in biological sequences.
Applications
Classification
Given a protein family, suppose we find a pattern P
of the family.
Later, when a new protein comes in, if the protein
matches P, it is very likely that the new protein
also belongs to the same family.
Function prediction.

4
Problem Definition

Σ: the alphabet of the characters
E.g. for DNA sequences, Σ = {A, C, G, T}.
Pattern
A sequence of characters in Σ and wild-cards
(.)
A wild-card cannot be at the beginning or end of
the sequence.
E.g. A..G is a pattern; AG. is not.

5
Problem Definition (contd)

Given a pattern P
Its length |P|: No. of characters and wild-cards
in P.
char(P): No. of characters in P.
E.g. |A.G.T| = 5, char(A.G.T) = 3.
Its subpattern: a substring of P which itself is
a pattern.
A.G is a subpattern of A.G.T while A.G. is not.
Its language G(P) = { p | p is a string obtained
by replacing all wild-cards in P with characters
in Σ }
E.g. G(A.G) = { AAG, ACG, AGG, ATG }.
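
The definitions above can be tried out with a small Python sketch. The function names (`length`, `char_count`, `is_pattern`, `language`) and the choice of the DNA alphabet as default are assumptions made for illustration, not part of the slides.

```python
from itertools import product

SIGMA = "ACGT"  # assumed alphabet: DNA, as in the earlier example

def length(pattern):
    """|P|: number of characters and wild-cards."""
    return len(pattern)

def char_count(pattern):
    """char(P): number of non-wild-card characters."""
    return sum(1 for c in pattern if c != ".")

def is_pattern(text):
    """A pattern may not start or end with a wild-card."""
    return bool(text) and text[0] != "." and text[-1] != "."

def language(pattern, alphabet=SIGMA):
    """G(P): every string obtained by filling each wild-card with a character."""
    slots = [alphabet if c == "." else c for c in pattern]
    return {"".join(choice) for choice in product(*slots)}
```

Note that |G(P)| grows as |Σ| to the power of the number of wild-cards, so enumerating the language is only practical for tiny examples.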

6
Problem Definition (contd)

Given a character sequence s, we say it matches a
pattern P in position y
if a substring of s is in G(P) and
the substring begins at offset y in s.
E.g. if s = ACGATCA, P = CG.T, then s matches P in
position 2.
Given a sequence database D = { s1, s2, ..., sn }
support(P): No. of sequences in D that match P.
List(P) = { (x,y) | sx matches P in position y }
Frequent
If support(P) ≥ K
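
A minimal sketch of matching, List(P) and support(P), assuming 1-based sequence indices x and positions y as in the slide's example. The function names are hypothetical, and the brute-force scan is for illustration only; it is not how TEIRESIAS itself builds offset lists.

```python
def matches_at(seq, pattern, y):
    """True if seq matches pattern at 1-based position y."""
    start = y - 1
    if start < 0 or start + len(pattern) > len(seq):
        return False
    # A wild-card matches anything; a character must match exactly.
    return all(p == "." or p == c for p, c in zip(pattern, seq[start:]))

def offset_list(db, pattern):
    """List(P): all (x, y) such that sequence x matches pattern at position y."""
    return [(x, y)
            for x, s in enumerate(db, start=1)
            for y in range(1, len(s) - len(pattern) + 2)
            if matches_at(s, pattern, y)]

def support(db, pattern):
    """support(P): number of distinct sequences with at least one match."""
    return len({x for x, _ in offset_list(db, pattern)})
```

For instance, `matches_at("ACGATCA", "CG.T", 2)` reproduces the slide's example.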

7
Problem Definition (contd)

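<L,W> pattern
L, W are 2 integers, 0 < L ≤ W
P is an <L,W> pattern if
for every subpattern Q of P with char(Q) = L, we have |Q| ≤ W.
E.g. with L = 3, W = 5: A.G.C.T is an <L,W> pattern; A..G.C.T is not,
since its subpattern A..G.C has 3 characters but length 6.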
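
The density condition can be checked directly from the positions of the non-wild-card characters, since a subpattern with exactly L characters must start and end on a character. The following is a sketch; the name `is_lw_pattern` and the values L = 3, W = 5 in the usage are assumptions chosen for illustration.

```python
def is_lw_pattern(pattern, L, W):
    """True if every subpattern with exactly L characters has length <= W."""
    # Indices of the real (non-wild-card) characters.
    pos = [i for i, c in enumerate(pattern) if c != "."]
    # A subpattern with exactly L characters spans pos[j] .. pos[j + L - 1].
    return all(pos[j + L - 1] - pos[j] + 1 <= W
               for j in range(len(pos) - L + 1))
```

A pattern with fewer than L characters passes the check vacuously, because it has no subpattern with L characters.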

8
Problem Definition (contd)

Pattern P' is more specific than P if
P' is obtained from P by
Replacing some wild-cards with characters, and/or
Appending a string to the left/right of P
E.g. P = A..G, P1 = AC.G, or P2 = A..GT.
|List(P')| ≤ |List(P)|
Given a set of patterns, a pattern p is maximal
if
for every p' that is more specific than p
we have |List(p')| < |List(p)|
E.g. P = A.GT, P' = ACGT, if List(P) = List(P') = { (1,1),
(2,2), (3,1) }, then P is not maximal.
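
The "more specific" relation and the maximality test can be sketched as below. This assumes both arguments are valid patterns and that the sizes of the offset lists are already known; the names `is_more_specific` and `is_maximal` and the `list_sizes` dictionary are hypothetical.

```python
def is_more_specific(p2, p):
    """True if p2 arises from p by filling wild-cards and/or appending on either side."""
    if p2 == p or len(p2) < len(p):
        return False
    # Try every placement of p inside p2; text outside the window was appended.
    for shift in range(len(p2) - len(p) + 1):
        window = p2[shift:shift + len(p)]
        # Wild-cards in p may become anything; characters in p must be kept.
        if all(a == "." or a == b for a, b in zip(p, window)):
            return True
    return False

def is_maximal(p, list_sizes):
    """list_sizes maps each pattern in the set to |List(pattern)|."""
    return not any(is_more_specific(q, p) and n == list_sizes[p]
                   for q, n in list_sizes.items())
```

Because a more specific pattern never has a larger offset list, testing for equal sizes is enough to detect non-maximality.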

9
Problem Definition (contd)

Problem
Inputs
A character sequence database D
Support threshold K
Parameters L, W.
Outputs
{ p | p is an <L,W> pattern, p is frequent, p
is maximal, char(p) ≥ L }.

10
Algorithm

Two parts
1. Find elementary patterns.
Frequent <L,W> patterns having exactly L
characters
2. Find required patterns.

11
Algorithm part 1

Find the set of elementary patterns EP

Algorithm
  EP = empty
  Scan D
  For all c in Σ
    if support(c) ≥ K
      Extend(c)

Function Extend(pattern P)
  If char(P) = L: add P to EP, return
  For i = 0 to W - |P| - L + char(P)
    P' = P + i dots
    for all (x,y) in List(P)
      if (y + |P| + i ≤ |sx|)
        a = sx[y + |P| + i]
        update List(P'a)
    For all c in Σ
      if support(P'c) ≥ K
        Extend(P'c)
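
The scanning phase can be sketched in Python as follows. This is an illustrative reading of the pseudocode above, not the authors' implementation: it uses 0-based sequence indices and positions (the slides use 1-based), treats any symbol seen in the data as part of the alphabet, and the name `elementary_patterns` is hypothetical.

```python
from collections import defaultdict

def elementary_patterns(seqs, L, W, K):
    """Return {pattern: offset list} for frequent <L,W> patterns with exactly L characters."""
    found = {}

    def support(offsets):
        # Number of distinct sequences represented in an offset list.
        return len({x for x, _ in offsets})

    def extend(pattern, offsets, n_chars):
        if n_chars == L:
            found[pattern] = sorted(offsets)
            return
        # Largest number of dots that still leaves room for the missing characters.
        max_dots = W - len(pattern) - L + n_chars
        for i in range(max_dots + 1):
            grown = defaultdict(list)
            for x, y in offsets:
                pos = y + len(pattern) + i  # index of the character after i dots
                if pos < len(seqs[x]):
                    grown[seqs[x][pos]].append((x, y))
            for c, offs in grown.items():
                if support(offs) >= K:
                    extend(pattern + "." * i + c, offs, n_chars + 1)

    # Initial scan: offset lists of the single characters.
    singles = defaultdict(list)
    for x, s in enumerate(seqs):
        for y, c in enumerate(s):
            singles[c].append((x, y))
    for c, offs in singles.items():
        if support(offs) >= K:
            extend(c, offs, 1)
    return found
```

Offset lists are carried along with each partial pattern, so extending a pattern only revisits the positions where it already occurs rather than rescanning the database.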

12
Algorithm part 2

Terminologies
Given a pattern P with char(P) ≥ L,
Prefix(P): its subpattern containing the first
L-1 characters.
Suffix(P): its subpattern containing the last L-1
characters.
E.g. L = 3, prefix(A.G.TG) = A.G, suffix(A.G.TG) = TG
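
A short sketch of prefix and suffix, assuming L ≥ 2 and a pattern with at least L - 1 characters. The function names and the explicit `L` argument are choices made for this illustration.

```python
def _char_positions(pattern):
    """Indices of the non-wild-card characters."""
    return [i for i, c in enumerate(pattern) if c != "."]

def prefix(pattern, L):
    """Subpattern ending at the (L-1)-th character from the left."""
    pos = _char_positions(pattern)
    return pattern[:pos[L - 2] + 1]

def suffix(pattern, L):
    """Subpattern starting at the (L-1)-th character from the right."""
    pos = _char_positions(pattern)
    return pattern[pos[-(L - 1)]:]
```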

13
Algorithm part 2 (contd)

Terminologies
Convolution
If suffix(P) = prefix(Q), then
conv(P, Q) = PQ' (where Q = prefix(Q)Q')
E.g. P = AC.G, Q = C.GT, then Q' = T, conv(P,Q) = AC.GT
If R = conv(P,Q), then
List(R) = { (x,y) | (x,y) in List(P) and (x, y + |P|
- |suffix(P)|) in List(Q) }
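
Convolution and the derived offset list can be sketched as below. The helper names are hypothetical, `conv` returns `None` when the suffix and prefix do not agree (a convention chosen here), and offsets are plain (x, y) tuples.

```python
def _char_positions(pattern):
    return [i for i, c in enumerate(pattern) if c != "."]

def prefix(pattern, L):
    return pattern[:_char_positions(pattern)[L - 2] + 1]

def suffix(pattern, L):
    return pattern[_char_positions(pattern)[-(L - 1)]:]

def conv(p, q, L):
    """Glue q onto p where suffix(p) and prefix(q) overlap; None if they differ."""
    s = suffix(p, L)
    if s != prefix(q, L):
        return None
    return p + q[len(s):]

def conv_list(list_p, list_q, p, L):
    """List(conv(p, q)): offsets of p that are followed by q at the right shift."""
    shift = len(p) - len(suffix(p, L))
    q_offsets = set(list_q)
    return [(x, y) for x, y in list_p if (x, y + shift) in q_offsets]
```

The offset list of the result is computed from the two input lists alone, so the sequences never need to be rescanned during this phase.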

14
Algorithm part 2 (contd)

Terminologies
Partial orders on a set of patterns
prefix_wise less than
Align two patterns along their left-most columns
Find the first column with a . and a character
The pattern with the character in that column is
smaller.
E.g.  A . G T
      A G T . C
So AGT.C is prefix-wise less than A.GT.

15
Algorithm part 2 (contd)

Terminologies
Partial orders on a set of patterns
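suffix_wise less than
Same comparison, with the patterns aligned along
their right-most columns and scanned right to left.
E.g.    A . G T
      A G T . C
So A.GT is suffix-wise less than AGT.C.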
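
Both orderings can be sketched with one column-by-column comparison. The function names are hypothetical, and returning `False` when no column distinguishes the two patterns is a convention chosen here.

```python
def prefix_wise_less(p, q):
    """True if, at the first left-aligned column pairing a dot with a character, p has the character."""
    for a, b in zip(p, q):
        if (a == ".") != (b == "."):
            return b == "."
    return False  # no distinguishing column: neither is smaller

def suffix_wise_less(p, q):
    """Same test with the patterns aligned on the right and scanned right to left."""
    return prefix_wise_less(p[::-1], q[::-1])
```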

16
Algorithm part 2 (contd)

Works on a stack
At first, push all elementary patterns into the
stack.
Prefix-wise less patterns are pushed later, so
that they are closer to the top of the stack.
Process the pattern on top of the stack, one at a
time, until the stack is empty

17
Algorithm part 2 (contd)

P is on top of the stack
Extend P to the right
Find the elementary patterns whose prefix is the same
as suffix(P).
Convolve P with these patterns one by one,
prefix-wise less patterns first.
Let R = conv(P,Q).
If |List(R)| = |List(P)|, pop the stack.
If support(R) ≥ K and R is maximal, push R and
start over again.
Extend P to the left.
Pop P from the stack. If P is maximal, report it.

18
Performance

Experiments
Effectiveness
real data
successfully finds patterns previously known to
biologists.
Efficiency
synthetic data

19
Performance (contd)

Core histones H3 and H4
Database: 13 proteins from the H3 family, 7 from H4.
L = 3, W = 35
Finds 9 patterns in all 20 proteins
For those patterns with the largest No. of
characters

20
Performance (contd)

Synthetic data generation
Use the generator at http://www.expasy.ch/sprot/randseq.html
to get
A random protein P of 400 amino acids
Obtain 20 derivative proteins from P
With X% similarity to P
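
A comparable data set can be produced locally with a few lines of Python. This is only a sketch under stated assumptions: it does not use the ExPASy generator named above, it draws residues uniformly from the 20 standard amino acids, and it interprets "X% similarity" as exact positional identity obtained by point substitutions. All names are hypothetical.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20 standard residues, one-letter codes

def random_protein(n, rng):
    """Uniformly random sequence of n residues (assumed composition)."""
    return "".join(rng.choice(AMINO_ACIDS) for _ in range(n))

def derive(parent, similarity_pct, rng):
    """Copy parent, substituting just enough positions to reach the target identity."""
    n_changes = round(len(parent) * (100 - similarity_pct) / 100)
    child = list(parent)
    for i in rng.sample(range(len(parent)), n_changes):
        # Always pick a different residue so the position really changes.
        child[i] = rng.choice([a for a in AMINO_ACIDS if a != parent[i]])
    return "".join(child)

def identity_pct(a, b):
    """Percentage of positions at which two equal-length sequences agree."""
    return 100 * sum(x == y for x, y in zip(a, b)) / len(a)

rng = random.Random(0)  # fixed seed for reproducibility
parent = random_protein(400, rng)
family = [derive(parent, 80, rng) for _ in range(20)]
```

Substitution-only mutation keeps every derivative at length 400, which makes the identity figure easy to verify; insertions and deletions would need an alignment-based measure instead.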

21
Performance (contd)

Synthetic data
X = 40, 50, 60, 70, 80, 90
L = 3, W = 10 (15)
K = 12, 14, 16, 18, 20
Results
The algorithm is efficient.
The running time is almost linear in the actual
size of the output.

22
Performance (contd)

Algorithm analysis
Bad factors
ACGT and GTA will generate ACGTA. However, CGTA
may be infrequent. More candidates are generated.
Good factors
By using the partial orders, more specific patterns
will be generated before less specific patterns.
E.g. ACG is considered before AC.T, so ACGTG is
generated before AC.TG. If AC.TG is not maximal
due to ACGTG, it will not be pushed onto the stack
for further candidate generation.
When R is generated from P, if |List(R)| =
|List(P)|, P will be popped off the stack.
Both reduce the no. of candidates.

23
Performance (contd)

A simple comparison
Compared with an Apriori-like algorithm, which
finds all patterns, not only maximal ones.
X = 90, L = 3, W = 10, K = 16

(Running time does not include time to output
patterns)

24
Future work

Find more flexible patterns
Patterns with ambiguous characters
E.g. A[CG]T denotes a pattern that can match both
ACT and AGT. [IALYD2000]
Patterns with flexible gaps
Gap: one or several consecutive wild-cards
E.g. in A..G, .. is a gap of length 2
Flexible gap
A-x(3,5)-G
A--G.
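
Both kinds of flexible pattern map naturally onto regular expressions, which gives a quick way to experiment with them. The sketch below uses Python's `re` module; the converter `flexible_to_regex` is a hypothetical helper that handles only dash-separated tokens and `x(m,n)` gaps, not a full pattern syntax.

```python
import re

def flexible_to_regex(pattern):
    """Translate e.g. 'A-x(3,5)-G' into the regex 'A.{3,5}G'."""
    parts = []
    for token in pattern.split("-"):
        gap = re.fullmatch(r"x\((\d+),(\d+)\)", token)
        # A gap of m to n wild-cards becomes .{m,n}; other tokens pass through.
        parts.append(".{%s,%s}" % gap.groups() if gap else token)
    return "".join(parts)

ambiguous = re.compile("A[CG]T")                           # matches ACT and AGT
flexible_gap = re.compile(flexible_to_regex("A-x(3,5)-G"))  # 3 to 5 wild-cards
```

Matching such patterns is straightforward; discovering them is the harder problem, because the search space grows with every allowed character set and gap range.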

25
Conclusion

TEIRESIAS can find patterns composed of
characters and wild-cards.
It successfully finds known patterns in
biological sequences.
Its running time is roughly linear in the size of
the output.
Future research: discover more flexible patterns.

26
Reference

[IA1998] Combinatorial pattern discovery in
biological sequences: the TEIRESIAS algorithm.
Isidore Rigoutsos et al. Bioinformatics, Vol. 14,
No. 1, 1998, 55-67.
[IALYD2000] The emergence of pattern discovery
techniques in computational biology. Isidore
Rigoutsos et al. Metabolic Engineering 2, 159-177
(2000).
[AI] On the time complexity of the TEIRESIAS
algorithm. Aris Floratos et al. IBM research
report.