Finding Regulatory Signals in Genomic Sequences

Weeder ProFind

Giancarlo Mauri

Bioinformatics and Natural Computing

Group Dipartimento di Informatica, Sistemistica e

Comunicazione Università degli Studi di

Milano-Bicocca

Gene Expression Data

- When and how much a gene is expressed under some

given conditions (tissue, external stimuli,

disease...) - We can group genes according to their expression

profile - We can suspect a common cause for their

expression

Transcription Factors

- The expression of a gene starts with

transcription from DNA to RNA - Transcription is modulated by dedicated proteins

called transcription factors (TFs) - TFs bind to DNA in the regions surrounding the

starting site of the gene (mostly upstream), and

direct polymerases to the right spot to start

transcription - Different effects may enhance or block

transcription

Transcription Factors

TF

Transcription Starts

TFBS

AND

TF1

TF2

Transcription Starts

TFBS

NOT

TF1

TF2

Transcription Starts

TFBS

TFs Binding Sites (TFBSs)

Fundamental in regulatory analysis is the

identification of potential TFBSs

- Bound by transcription factors
- Short degenerate sequences, 5-16 nucleotides

long, (gaps possible but rare) - Each TF does not recognize a single fragment but

a set of them (similar to each other) called

signal or motif - Can be illustrated by profiles and/or consensi

(computational models)

TFs Binding Sites (TFBSs)

- Fundamental in regulatory analysis is the

identification of potential TFBSs

- Bound by transcription factors
- Short degenerate sequences, 5-16 nucleotides

long, (gaps possible but rare) - Each TF does not recognize a single fragment but

a set of them (similar to each other) called

signal or motif - Can be illustrated by profiles and/or consensi

(computational models)

Finding TFBSs

- We have a set of related genes
- similar expression profile
- similar biological function
- anything else..
- We take their upstream regulatory regions
- If they are regulated by the same TF(s), then we

should find its (their) binding sites in the

sequences - We should find short patterns conserved in the

sequences - We could use the detected TFBS to predict the

behavior of a gene

Finding Novel TFBSs

- Over-representation
- First, detect groups of similar oligos
- Describe each group with a consensus or a profile

(or in some other smart way) - Find the most over-represented groups

If sequences were built at random and/or we

picked sequences at random, the group should not

appear with the same size/conservation

Finding Novel TFBSs

- Most of early research has focused on the first

point how to detect the best groups

(unfortunately, there are thousands of

candidates) given simple score measures - Recent research has followed the second point

which is the best measure to tell significant

groups from random similarities? - Is it expected or not, to find a group that is

conserved? - Can we take advantage from the wealth of sequence

data available?

Weeder A tool for pattern discovery in Genomic

Sequences

Giancarlo Mauri Giulio Pavesi Graziano Pesole

Università degli Studi di Milano-Bicocca

Università degli Studi di Milano

References

- G. Pavesi, G.Mauri, G.Pesole. An Algorithm for

Finding Signals of Unknown Length in Unaligned

DNA Sequences. Bioinformatics 17, S207-S214,

2001 - G.Pavesi, P.Mereghetti, G.Mauri, G.Pesole.

WeederWeb Discovery of Transcription Factor

Binding Sites in a Set of Sequences from

Co-Regulated Genes. Nucleic Acids Research Web

Server Issue 2004, 32 W199-W203 - http//159.149.109.168080/weederWeb/

Weeder (2001)

- Idea instead of reducing the set of candidate

patterns, reduce the set of possible matches for

each pattern, trying to save a significant

number of valid occurrences - Instead of searching exhaustively for patterns

that occur in every sequence, we

short-sightedly look for patterns that occur in

a subset of them - The algorithm needs as input only a given error

ratio e

Suffix Trees

- A suffix tree is a data structure that exposes

the internal structure of a string in a very deep

and meaningful way - Suffix tree T for S s1sn
- rooted directed tree
- exactly n leaves numbered 1 to n
- internal nodes with at least two sons
- edges labeled by non empty substrings of S
- labels out of the same node begin with different

symbols - the concatenation of the edge labels on the path

from the root to any leaf i exactly spells the

suffix of S starting at position i, i.e., s1sn

Suffix Trees

- The same structure can be built also for a set of

k sequences - To distinguish which sequence a suffix belongs

to, it appends a different marker symbol, not

occurring elsewhere, to each sequence in the set. - It is also possible to annotate each node of the

tree with a k-bit string, where the i-th bit is

set if the word spelled by the path ending at the

node occurs in the i-th sequence.

Suffix Trees

Suffix tree for ACCA (end with ) and CCAAG (end

with )

Suffix Trees

- A generalized suffix tree can be built in O(N)

time and takes O(N) space, where N is the overall

length of the sequences - Annotating it with the bit strings takes

additional O(kN) time - Each pattern occurring in the strings is spelled

by a path starting from the root of the tree - The time needed to search for a pattern depends

only on the length of the pattern - The structure allows to implement recursively the

exhaustive enumeration of all the candidate

patterns of a given length - The time complexity is thus reduced to be

exponential in the maximum number of mutations

allowed (Sagot, 1998)

Searching for an Exact Pattern

- Given a set of sequences and the annotated suffix

tree, every pattern appearing in at least one

sequence of the set is spelled by a unique path

starting from the root - We match the symbols of pattern p along the

unique path in the tree until - p is exhausted
- In this case, the bit string on the next node on

the path specifies which sequences p appears in - no more matches are possible

Searching for a Pattern with Mismatches

- We can also search for a pattern p with at most e

mismatches in a similar way. - We match p along different paths on the tree at

the same time, keeping track of the number of

mismatches encountered on each path. - Whenever the number of errors on a path is

greater than e, we discard that path. - The sequences p appears in are given by the

logical-OR of the bit strings corresponding to

the different paths.

Searching for (M, e) Patterns

- The algorithm starts with the empty pattern from

the root of the tree, and recursively expands it - Let us suppose we have found on the tree the

endpoints of paths corresponding to the

occurrences of a pattern pp1pm in the

sequences, where all the paths spell words within

distance e from p, with mltM - If p occurs in at least q sequences, we try and

expand it by one symbol

Searching for (M, e) Patterns

- Expanding a pattern by one symbol
- For each character b ? A, C, G, T, we match b

against the next symbol on each path - If a path ends just before a node V of the tree,

we match b against the first symbol on each edge

leaving V - When we encounter a mismatch, we increase the

previous error along the path by one - If the new error is greater than e, we discard

the path

Searching for (M, e) Patterns

- Once all paths have been checked, the surviving

ones represent the approximate occurrences of

pp1pmb - If p occurs in at least q sequences, and is

shorter than M, we expand p as well. Otherwise,

we continue with p and the next character in ?

Searching for (M, e) Patterns

- For example
- It matches the first symbol on each edge leaving

the root against A. - If A is valid, i.e., A occurs in at least q

sequences, it is expanded to AA. - If also AA is valid, we move to AAA, and so on.
- If it is not valid, we proceed to look for

occurrences of AC. - In this method, patterns dont have to occur

exactly in the sequences.

Searching for (M, e) Patterns

- The main drawback is that every pattern of length

e satisfies the input constraints, since every

other pattern of length e found in the tree is a

valid occurrence for it - Thus, the method works well only for small values

of e

Searching for (M, e) Patterns

At the beginning of the search, all paths of

length e are valid

Searching for (M, e) Patterns

- To apply the algorithm also to longer patterns

with higher values of e, instead of reducing the

set of patterns that have to be searched, we

restrict the number of paths that have to be

followed for each pattern. - That is, we narrow down the set of valid

occurrences-the WEEDER algorithm.

Searching for Approximate Occurrences of Patterns

- Problem Definition
- Given a set of k sequences on the alphabet ?

A, C, G, T , we want to find all (M, e)

patterns - (M, e) patterns patterns of length M that occur

with at most e mismatches in at least q sequences

of the set

The Outline of WEEDER

- WEEDER fixes an initial error ratio ?
- Given a pattern p, a path is valid if the

distance from p to the path is not greater than

?? p? - p is the length of the pattern
- When we expand p by one symbol, the error

threshold is set to ?? (p1)?

Block Decomposition of a Pattern

- Each block size is ?1/??
- Let p p1pm. We can see p as composed of

??m? blocks

Valid Occurrences

- For every pattern p p1pm, valid occurrences

are words si1sim occurring in the sequences

for which ? j ?1,, m d(p1pj, si1sij) ?

??j? - d(p1pj, si1sij) is the number of mismatches

between p1pj and si1sij - si1sim is a valid occurrence for p if it is a

valid occurrence for all its prefixes - p1, p1p2, , p1p2pm-1

An Example for WEEDER

G

C

A

GCTCA

C

T

T

ATCACGC

C

GC

GC

T

T

C

CA

TC

GCT

A

C

CACGC

A

A

GCT

CGC

CGC

ACTC? ACTCA

G

C

A

GCTCA

T

C

T

ATCACGC

C

GC

GC

T

CA

T

C

TC

GCT

A

C

CACGC

A

A

GCT

CGC

CGC

ACTC error max 1. S1, S2, S4 contain ACTC.

S2 AGCTCA

S1 AATCACGC

S3 ATGCT

S4 ACTC

ACTCA ? ACTCAA

G

C

A

GCTCA

T

C

T

ATCACGC

C

GC

GC

T

T

C

CA

TC

A

GCT

C

CACGC

A

A

GCT

CGC

CGC

ACTCAA error max 2. S1, S2 contain

ACTCAA. ACTCAC, ACTCAG, ACTCAT are also patterns.

S1 AATCACGC

S2 AGCTCA

S4 ACTC

S3 ATGCT

Weeder (2001)

- Given a pattern P p1p2....pm, the algorithm

can find all the valid occurrences of P (with at

most ?P mutations), such that at most ?i

mutations occur in the first i letters of the

pattern - But some occurrences of a pattern can be missed

altogether - Are DNA signals always so polite to show up in

blocks-decomposed form? - The answer is no, but we can use Weeder with a

grain of salt

Using Weeder

- Example (15,4) pattern occurring in 20 sequences
- Valid (block decomposed) possible occurrences

829 - Total possible occurrences1365
- Probability if hitting a possible occurrence in

a sequence phit.61 - Probability of finding the pattern in every

sequence like trying to win the national lottery

- If we search for patterns occurring in at least

10 sequences, the probability of seeing at

least 10 times the pattern is - Phit(20,10) .89

Using Weeder

- Thus, we can use Weeder as a sieve, to filter the

set of candidate patterns - All patterns that are found to occur in at least

q of the sequences by Weeder can be searched

again in the sequences, but this time with no

restriction on the position of mismatches - We expect the number of patterns (random patterns

other than the real signal) passed to the second

phase to be much smaller than the original number

(and no longer exponential)

Using Weeder

- The probability of finding a pattern in a

sequence depends on its length and the error

ratio - The probability of finding a pattern in a set of

sequences (and thus the choice of the quorum q

for the first phase) depends on the number of

sequences - The same approach can be applied also when the

signal does not show up in each sequence

Using Weeder

- When the signal to be found is expected to be

short, the algorithm can be used in exact mode - For longer signal, the lower is the quorum q, the

higher is the probability of finding the signal - But also the number of patterns satisfying the

input constraints is higher, and the program is

slower - Users can choose a suitable trade-off between

time and accuracy

Theoretical Time Complexity

- Naïve approach O(4men)
- Suffix tree approach (Sagot, 1998) O(4emekn)
- where n is the input size, m is the pattern

length, and e is the number of mutations allowed - Weeder
- O((1/?)e4ekn)
- where e is the number of mutations occurring in

the longest pattern found

Weeder Web

- Weeder Web is a web interface to the Weeder

algorithm, where all the parameters concerning

the motifs are automatically set for the

discovery of transcription factor binding sites - Although there is no pre-set limit on the length

of the input sequences, feasible results can be

obtained by submitting sequences of "typical"

length for regulatory/promoter regions (i.e. from

500 to 5000 bps) - A priori, there's no limit on the number of

sequences you can input. Also, for the moment we

do not consider correlations among different

motifs (i.e. cis-regulatory modules)

Weeder Web

- All the statistical measures (background oligo

frequencies, expected occurrences and so on) used

to score/rank motifs and to post-process the

output have been derived from the analysis of

promoter/enhancer and 5'UTR regions only (taken

from different organisms) - If you submit something else (i.e. 3'UTRs, coding

regions, noncoding RNAs, and so on) the

statistical evaluation probably will not be

consistent with your data, and thus produce

unreliable results - http//www.pesolelab.it/Tool/ind.php

Post-Processing

- Real motifs have different degrees of variation

in different positions - Some admit any nucleotide
- Some are (almost) perfectly conserved
- We should find redundant motifs among the

highest-scoring ones - Pieces of a long motif should appear also in

shorter results

Post-Processing

- Look for redundant (either in length or in

conservation) motifs in the reports of each run - Collect the instances of each one and build a

frequency matrix - Scan the sequences looking for matches
- Report the best matches (with no constraint on

the substitutions allowed)

Assessment results

Tompa et.al., Nat Biotechnol. 2005

Jan23(1)137-44.

Assessment results

ProFind A GA Approach to the Definition of

Regulatory Signals in Genomic Sequences

- Characterization of CAP and TATA-box through

probability matrices with a genetic algorithm

Giancarlo Mauri, Roberto Mosca and Giulio Pavesi

Bioinformatics and Natural Computing

Group Dipartimento di Informatica, Sistemistica e

Comunicazione Università degli studi di

Milano-Bicocca

TATA Box and CAP Binding Sites

- A large number of genes present two

characteristic signals - TATA-box
- 25-35 bp upstream of the TSS
- When discovered it was given a TATA consensus
- Bound by the TATA Binding Protein (TBP) part of a

large complex of some 50 different proteins

including TFIID and TFIIB - CAP (also called Initiator or Inr)
- Straddles the TSS
- Experimental evidence that it is bound by TFIID

too - Previous characterization by Bucher 1990 with a

CAPy consensus - Very strong positional preference for the two

signals with respect to the TSS

Describing Binding Sites

- Frequency Matrix nij

.........0........... ...CGTGCCATTTGTTGT... ...TCC

TACAGTGCAGCA... ...TCACATATTATTGTC... ...GAAAGCAAC

AACTAA... ...TAAATCGTCAGTGTA... ...CCGACCAGAGTGAAA

... ...GGGTTTGGTTTGATA... ...GCGTGCAGTTGTGAA... ..

.GTCGCCATATACACA... ...GTGGCCGTATGCGCT...

.........0........... ...CGTGCCATTTGTTGT... ...TCC

TACAGTGCAGCA... ...TCACATATTATTGTC... ...GAAAGCAAC

AACTAA... ...TAAATCGTCAGTGTA... ...CCGACCAGAGTGAAA

... ...GGGTTTGGTTTGATA... ...GCGTGCAGTTGTGAA... ..

.GTCGCCATATACACA... ...GTGGCCGTATGCGCT...

-2 -1 0 1 2

A 2 0 7 1 3

C 4 8 0 0 2

G 2 0 3 4 0

T 2 2 0 5 5

-2 -1 0 1 2

A 0.2 0 0.7 0.1 0.3

C 0.4 0.8 0 0 0.2

G 0.2 0 0.3 0.4 0

T 0.2 0.2 0 0.5 0.5

Frequency Matrix

- Given a signal of length m, a matrix P of

dimension 4?m is built up, taking

Pi,jPnucleotide i at position j - We suppose the existence of two different models
- The signal
- The background

Signal occurence

Weights stored in the matrix

The Problem

- Dataset

... TTTTGTTTTTTTATTTCCTGTATTTTT ... ...

TCCAGCCCGAACAAAATCGATCAAAAT ... ...

ATCCCTCTGGCCATTGGCAATCGATCC ... ...

AGAAACAAAACGGCTTGTAAAACAAAC ... ...

GTGCAGTGAGTCAGTGTGTTGTGTGCC ... ...

GAGCGTAAGCAAGAGAGAGAGGTGAAG ... ...

AGGTGAAGCCAGGGGCGGAGGCGCAAG ... ...

AGAAAAGAGAGAGTGAAAGCATAGTCC ... ...

AGTTTTCATATTGTTACCGTTTGAGTT ... ...

TTCACCAGCCACTTTCAGTCGGTTTAT ... ...

GCATAACGAATCACTCTGATCGCTGTC ... ...

GGTCCAGCGACCACTCGCAGTTCTACA ... ...

GATCGGCGTGCCATTTGTTGTTGAATC ... ...

CCGCTCTCCTCCAGTGCAGCAGCAGCA ... ...

TCCAAGTCACCGATTATTGTCTCAGTG ... ...

GAACTGGAAACCAACAACTAACGGAGC ... ...

TCAGTCTAAATTTACCCTGTAAAATTC ... ...

GTAGTTCCGACCAGAGTGAAACTGAAC ... ...

CTTTATGGGTTTAGTTTGATAGGAGTC ... ...

TCACTGGCGTTGTTAGAGTTGTGAATG ...

... TTTTGTTTTTTTATTTCCTGTATTTTT ... ...

TCCAGCCCGAACAAAATCGATCAAAAT ... ...

ATCCCTCTGGCCATTGGCAATCGATCC ... ...

AGAAACAAAACGGCTTGTAAAACAAAC ... ...

GTGCAGTGAGTCAGTGTGTTGTGTGCC ... ...

GAGCGTAAGCAAGAGAGAGAGGTGAAG ... ...

AGGTGAAGCCAGGGGCGGAGGCGCAAG ... ...

AGAAAAGAGAGAGTGAAAGCATAGTCC ... ...

AGTTTTCATATTGTTACCGTTTGAGTT ... ...

TTCACCAGCCACTTTCAGTCGGTTTAT ... ...

GCATAACGAATCACTCTGATCGCTGTC ... ...

GGTCCAGCGACCACTCGCAGTTCTACA ... ...

GATCGGCGTGCCATTTGTTGTTGAATC ... ...

CCGCTCTCCTCCAGTGCAGCAGCAGCA ... ...

TCCAAGTCACCGATTATTGTCTCAGTG ... ...

GAACTGGAAACCAACAACTAACGGAGC ... ...

TCAGTCTAAATTTACCCTGTAAAATTC ... ...

GTAGTTCCGACCAGAGTGAAACTGAAC ... ...

CTTTATGGGTTTAGTTTGATAGGAGTC ... ...

TCACTGGCGTTGTTAGAGTTGTGAATG ...

-2 -1 0 1 2

A 0.2 0 0.7 0.1 0.2

C 0.3 0.7 0 0 0.2

G 0.2 0 0.2 0.4 0

T 0.2 0.2 0 0.4 0.5

The Score Function

- The score function is made up of two terms
- A positive term
- A negative term

Total score positive score negative score

The Score Function

The Genetic Algorithm

... TTTTGTTTTTTTATTTCCTGTATTTTT ... ...

TCCAGCCCGAACAAAATCGATCAAAAT ... ...

ATCCCTCTGGCCATTGGCAATCGATCC ... ...

AGAAACAAAACGGCTTGTAAAACAAAC ... ...

GTGCAGTGAGTCAGTGTGTTGTGTGCC ... ...

GAGCGTAAGCAAGAGAGAGAGGTGAAG ... ...

AGGTGAAGCCAGGGGCGGAGGCGCAAG ... ...

AGAAAAGAGAGAGTGAAAGCATAGTCC ... ...

AGTTTTCATATTGTTACCGTTTGAGTT ... ...

TTCACCAGCCACTTTCAGTCGGTTTAT ... ...

GCATAACGAATCACTCTGATCGCTGTC ... ...

GGTCCAGCGACCACTCGCAGTTCTACA ... ...

GATCGGCGTGCCATTTGTTGTTGAATC ... ...

CCGCTCTCCTCCAGTGCAGCAGCAGCA ... ...

TCCAAGTCACCGATTATTGTCTCAGTG ... ...

GAACTGGAAACCAACAACTAACGGAGC ... ...

TCAGTCTAAATTTACCCTGTAAAATTC ... ...

GTAGTTCCGACCAGAGTGAAACTGAAC ... ...

CTTTATGGGTTTAGTTTGATAGGAGTC ... ...

TCACTGGCGTTGTTAGAGTTGTGAATG ...

1

1

1

0

1

0

0

0

0

0

1

1

1

1

0

1

0

1

1

0

-2 -1 0 1 2

A 0.2 0 0.7 0.1 0.2

C 0.3 0.7 0 0 0.2

G 0.2 0 0.2 0.4 0

T 0.2 0.2 0 0.4 0.5

The Genetic Algorithm

- One-point crossover
- Mutation operator flips one bit with a

probability pm - Given a genome the frequency matrix is calculated

from the sequences selected by the genome itself - The matrix is scored according to the previously

defined score function - A local optimization procedure is applied to the

resulting best individual

Results

- Three different datasets
- Eukaryotic Promoter Database (EPD,

www.epd.isb-sib.ch, rel. 74) - Vertebrates 2199 seqs, from -50 to 50
- Homo Sapiens 1796 seqs (included in the previous

one) - Drosophila Core Promoters Database (DCPD,

www-biology.ucsd.edui/labs/kadonaga/DCPD.htm) - 205 seqs, from -47 to 44

Results - CAP

- Start position -2
- Length 6
- Population 500
- Generations 20000
- Crossover Probability 0.9
- Mutation Probability 0.04

Results - CAP

- CA, CG, TA and TG dinucleotides starting at

position -1 in for the EPD dataset - CA, TA dinucleotides starting at position -1 for

the DCPD dataset (consistently with

Kadonaga,2000

Kutach, A., Kadonaga, J. The downstream promoter

element DPE appears to be as widely used as the

TATA box in Drosophila core promoters. Mol. Cell

Biol. 20 (2000) 475464

Results - CAP

Results TATA Box

- Start position -30
- Length 8
- Population 500
- Generations 20000
- Crossover Probability 0.9
- Mutation Probability 0.04

Consistent with the finding that the TBP

recognizes the minor grove of DNA, where

protein-DNA interactions are typically influenced

by A/T-content, but not by the specific

nucleotide sequence.

Kim, J.L., Nikolov, D.B., Burley, S.I.

Co-crystal structure of TBP Recognizing the minor

groove of a TATA element. Nature 365 (1993)

520527 Lo, K., Smale, S.T. Generality of a

functional initiator consensus sequence. Gene 182

(1996) 1322

- The signals found have proved to be consistent

with those described experimentally, as in the

case of fruit fly signals - Differently from previous approaches to the same

problem - it does not make any prior assumption about the

signals structure - it takes advantage of the specific localization

of the signals considered - We do not need to know in advance which sequences

contain the signal, this is taken care of by the

algorithm - Further improvements include
- Development of models taking into account

correlation between adjacent positions - Application of the method to new datasets with a

better characterization of the TSS to investigate

positional correlation between the TATA-box and

the TSS and the presence of alternative TSSs

SeQuAl

- Large Scale Multiple Alignment of Genomic

Sequences

Summary

- Multiple Sequence Alignment
- The Methods So Far
- SeQuAl project
- Producing a Threaded Blockset Alignment
- Results possible improvements

Multiple Sequence Alignment (1)

- The multiple sequence alignment problem is the

process of taking three or more input sequences

and forcing them to have the same length by

inserting a universal gap symbol, in order to

maximize their similarity as measured by a score

function.

Multiple Sequence Alignment (2)

Dynamic Programming Vs Heuristics

- Dynamic Programming Techniques
- O(NxM) for two sequences of length N and M
- Exponential for multiple sequences
- O(nxM2) for n sequences of length approximately M

using some heuristics (ClustalW)

Prohibitively time-consuming for sequences of

length exceeding 10 kb

Need for more heuristics to speed up the alignment

The Methods So Far

- Pairwise sequence alignment tools
- MUMmer, GLASS, WABA, BLASTZ
- Multiple sequence alignment tools
- MAVID, MLAGAN, MGA, Mauve
- Anchoring techniques
- Seed ungapped alignments and extensions
- Filling gaps using dynamic programming

SeQuAl project

- Goals
- Ability to find conservation even in the presence

of mutations on amino acids - Ability to find conservation in subsets of the

original set - Anchoring technique based on text indexing

structures - Hashing functions on nucleotides and amino acids

SeQuAl project

- Four steps alignment
- Anchors search
- Anchors chaining
- Anchors refinement
- Gaps filling

New!

New!

Anchors Discovery (1)

- Discovery of MUMs and MEMs using Suffix Trees
- Longer than a minimum threshold
- Shorter than a maximum length for speeding up the

time complexity - Statistically significant
- Searched on
- the nucleotide sequence
- the amino acid sequence
- on the sequence of classes of aminoacids
- Fragments generation

New!

New!

Anchors Discovery (2)

S ACGTCA R TGTCTA

Fragments Chaining

- Ordering relation between fragments

lt

lt

lt

lt

- Graph induced by the ordering relation

Shortest Path on a DAG

Partial Anchors (1)

- Ordering relation?

lt

lt

- Order the partial fragments based on their

barycenter - Use a shifting barrier to impose a total ordering

between partial fragments

Partial Anchors (2)

Producing a Threaded Blockset Alignment

- Block is a rectangular array of symbols such that

removing dashes from any row produces a run of

one or more consecutive positions in one of the

original sequences - Blockset a set of such blocks
- A ref-blockset" consists of a blockset in which

every block has a designated row, all of which

come from the same original sequence, called the

reference for that ref-blockset. - If a blockset is threaded by each of the original

sequences, we call it a threaded blockset.

Local Alignment with ClustalW

Results (1)

Results (2)

- Approximate anchors in non-coding areas of the

genome - Developing a hashing function for regulatory

regions - Important evolutionary events
- Modeling rearrangements and inversions
- Memory-saving implementation of suffix trees
- Custom visualization tool and classification of

aligned regions based on the nature of anchors

Conclusions

- Exhaustive methods
- Suitable for short patterns and a once-for-all

analysis of data (e.g. whole genomes) - Sequence driven methods
- Give faster but less accurate answers
- Limited data sizes
- It is possible to choose a trade-off between time

and accuracy

Block Decomposition

- Where (3-4-4-4) comes from
- ?0.261? 1
- ?0.262? 1
- ?0.263? 1
- ?0.264? 2
- ?0.265? 2
- ..
- ?0.2615? 4

Number of Patterns

- 82934

- Probability phit 829/1365 0.61
- Probability of finding pattern in all 20

sequences is (0.61)20 0

It works

- Use the algorithm as a sift, i.e. we run it to

find all patterns that occur in at least half of

the sequences. - Search the patterns reported by the algorithm and

use the suffix tree to check which ones actually

appear in all the sequences.

The next

- The probability that a pattern of length 15

occurs with up to four mutations of a give

position of a random sequence - The probability is seen by WEEDER is

- The probability that a pattern occurs in a

position in a form valid for WEEDER is

Finally

- The pattern occurs at least once in a sequence of

length n is - The probability that the pattern occurs at least

half of the sequence of set and is found by

WEEDER is - The expected number of patterns passed by WEEDER

to the second phase is

How good it works..

- A set of 20 sequences of length 600 raises less

than 10 patterns. - A set of 20 sequences of length 1000 raises to

about 300 patterns. - If we look for patterns that occur in at least 9

sequences - 50 for sequences of length 600
- 3500 for sequence of length 1000

Performance Evaluation (contd)

- We could partition the set of sequences in two

subsets of ten sequences each and the probability

that the pattern will pop up among the ones found

by the algorithm in either subset is - where pmiss 1 - phit

Performance Evaluation (contd)

- A pattern is expanded whenever at least one of

the two counters is greater than q. - q is a minimum number of sequences for each

subset

Performance Evaluation (contd)

- This approach can be pushed even further.
- Partition the set of sequences as long as the

parameters are such that only a few patterns

satisfy them. - And it works well on long signals, where the phit

value is lower and random patterns are unlikely

to be picked up by the sifting phase.

Thus

- The final exact search has to be performed on a

limited number of patterns. - When the signal length and the number of

mutations are known in advance, we can determine

the best parameter setting and search strategy

for WEEDER.

If the length of the pattern is not known in

advance

- Therere some additional problems to be faced
- If we choose
- The probability of hitting a (16, 4) pattern

is 0.54. - But the chances of finding a (15, 4) have

dropped to 0.45. (since its decomposed as

(4-4-4-3)) - gtThe probabilities of finding a pattern depends

on the block decomposition induced by .

Moreover

- Setting a lower threshold value q would increase

the number of candidates. - For example
- Run a algorithm with

on 20 sequences of length 400 hoping to find a

(16, 4) pattern. - gtThe constraints are satisfied by hundreds of

patterns of length 12.

What is the solution?

- One possible solution could be to investigate

only the longest patterns found. - gtBut a significant signal could be hidden also

among the shorter ones.

The solution we adopt

- Expand all patterns that appear in at least q

sequences, but report only those that occur in

q(m), which is set according to the pattern

length. - In the previous example
- A pattern of length 12 with 3 mutations can

be found with probability of about 90 by setting

q11. - gtThus, if a pattern of length 12 appears in at

least q(m)11 sequences, we can pass it to the

second phase. - Since q(m) can be set according to the number of

sequences given as input, and the parameters q

and .

If the nucleotide composition of the sequences is

not uniform

- For example, 1112

(T occurs twice as often as the others) - In order to avoid spending too much time in

the final phase, we can set the threshold q(m)

according to the pattern probability.

The points just discussed

- At the end, the algorithm might report more than

one pattern satisfying the constraints. - gtWe need to introduce significance
- measures to sort the output.

Sorting the Output

- The algorithm may output more than one pattern

under - The length of the signal to be found is unknown
- The sequences contain more than one signal

Example

- A successful (15,4) pattern can be expanded by

one symbol and becomes a (16,5) pattern. - All its occurrences are also valid occurrences of

the longer one.

A Method for Sorting Outputs

- A pattern P is given
- is the best instance of P in the sequence i
- is the total number of sequences P occurs in
- is 1 if P appears in sequence i,

otherwise 0 - means the distance between P and

Relative Entropy

- When the nucleotide composition of the sequences

is biased, we use the background probabilities to

define new match premiums and penalties.

Relative Entropy (contd)

- is the frequency with which residue r

occurs in position j in the occurrences of P - is the frequency of r in the sequences

Relative Entropy (contd)

- It is suitable for sorting patterns under

following situations - Patterns that appear the same number of times,

for example once in every sequence of the set. - Sequences containing multiple signals.

Another Measure of Significance

- We can define statistical measures of

significance, that compare the actual number of

occurrences of a pattern with the expected value.

Measure of Significance (contd)

- is the probability that P occurs in a

sequence of the set with at most e mutations. - N is the overall length of the sequences.
- This value can be computed explicitly (Tompa,1999)

Software Implementation

- Machine Pentium II class computer with 64MB RAM
- OS Linux
- Program Codes written in ANSI C, about 2500

lines long - Testing Data challenge problem as described

before, varying the length from 100 to 1000

nucleotides

Experiment Results -- Time

- Construction of the suffix tree always within

one second - To find the (15,4) signal with 89 success

probability, and run the algorithm with 0.26

and q 10. for sequence lengths up to 400

nucleotides. it took less than one minute

(including the final exact search). - Length 500 execution time grows significantly
- Length 500 and 600 average time 125 and 200

seconds - q 11 execution time drops to 100 and 120

Experiment Results Time (contd)

- Increasing the number of sequences did not

influence the execution time very much. - Example
- For every sequence length, the algorithm took

just a few more seconds when run over 30 or 40

sequences with q set to 16 and 21, respectively. - Sequence length set to 800, the program took on

the average 320 and 450 seconds to complete the

execution with q set to 11 and 10 respectively.

When length is 1000 long, it took about 15

minutes. - Thus the WEEDER algorithm is suited to work on a

large set of relatively short (up to 600

nucleotide) sequences than a small set of very

long sequences.