Finding%20Regulatory%20Signals%20in%20Genomic%20Sequences - PowerPoint PPT Presentation

View by Category
About This Presentation
Title:

Finding%20Regulatory%20Signals%20in%20Genomic%20Sequences

Description:

... bit is set if the word spelled by the path ending at the node ... pattern occurring in the strings is spelled by a path starting from the root of the tree ... – PowerPoint PPT presentation

Number of Views:58
Avg rating:3.0/5.0
Slides: 108
Provided by: giuliopav
Learn more at: http://www.bio.disco.unimib.it
Category:

less

Write a Comment
User Comments (0)
Transcript and Presenter's Notes

Title: Finding%20Regulatory%20Signals%20in%20Genomic%20Sequences


1
Finding Regulatory Signals inGenomic Sequences
Weeder ProFind
Giancarlo Mauri
Bioinformatics and Natural Computing
Group Dipartimento di Informatica, Sistemistica e
Comunicazione Università degli Studi di
Milano-Bicocca
2
Gene Expression Data
  • When and how much a gene is expressed under some
    given conditions (tissue, external stimuli,
    disease...)
  • We can group genes according to their expression
    profile
  • We can suspect a common cause for their
    expression

3
Transcription Factors
  • The expression of a gene starts with
    transcription from DNA to RNA
  • Transcription is modulated by dedicated proteins
    called transcription factors (TFs)
  • TFs bind to DNA in the regions surrounding the
    starting site of the gene (mostly upstream), and
    direct polymerases to the right spot to start
    transcription
  • Different effects may enhance or block
    transcription

4
Transcription Factors
5
TF
Transcription Starts
TFBS
6
AND
TF1
TF2
Transcription Starts
TFBS
7
NOT
TF1
TF2
Transcription Starts
TFBS
8
TFs Binding Sites (TFBSs)
Fundamental in regulatory analysis is the
identification of potential TFBSs
  • Bound by transcription factors
  • Short degenerate sequences, 5-16 nucleotides
    long, (gaps possible but rare)
  • Each TF does not recognize a single fragment but
    a set of them (similar to each other) called
    signal or motif
  • Can be illustrated by profiles and/or consensi
    (computational models)

9
TFs Binding Sites (TFBSs)
  • Fundamental in regulatory analysis is the
    identification of potential TFBSs
  • Bound by transcription factors
  • Short degenerate sequences, 5-16 nucleotides
    long, (gaps possible but rare)
  • Each TF does not recognize a single fragment but
    a set of them (similar to each other) called
    signal or motif
  • Can be illustrated by profiles and/or consensi
    (computational models)

10
Finding TFBSs
  • We have a set of related genes
  • similar expression profile
  • similar biological function
  • anything else..
  • We take their upstream regulatory regions
  • If they are regulated by the same TF(s), then we
    should find its (their) binding sites in the
    sequences
  • We should find short patterns conserved in the
    sequences
  • We could use the detected TFBS to predict the
    behavior of a gene

11
Finding Novel TFBSs
  • Over-representation
  • First, detect groups of similar oligos
  • Describe each group with a consensus or a profile
    (or in some other smart way)
  • Find the most over-represented groups

If sequences were built at random and/or we
picked sequences at random, the group should not
appear with the same size/conservation
12
Finding Novel TFBSs
  • Most of early research has focused on the first
    point how to detect the best groups
    (unfortunately, there are thousands of
    candidates) given simple score measures
  • Recent research has followed the second point
    which is the best measure to tell significant
    groups from random similarities?
  • Is it expected or not, to find a group that is
    conserved?
  • Can we take advantage from the wealth of sequence
    data available?

13
Weeder A tool for pattern discovery inGenomic
Sequences
Giancarlo Mauri Giulio Pavesi Graziano Pesole
Università degli Studi di Milano-Bicocca
Università degli Studi di Milano
14
References
  • G. Pavesi, G.Mauri, G.Pesole. An Algorithm for
    Finding Signals of Unknown Length in Unaligned
    DNA Sequences. Bioinformatics 17, S207-S214,
    2001
  • G.Pavesi, P.Mereghetti, G.Mauri, G.Pesole.
    WeederWeb Discovery of Transcription Factor
    Binding Sites in a Set of Sequences from
    Co-Regulated Genes. Nucleic Acids Research Web
    Server Issue 2004, 32 W199-W203
  • http//159.149.109.168080/weederWeb/

15
Weeder (2001)
  • Idea instead of reducing the set of candidate
    patterns, reduce the set of possible matches for
    each pattern, trying to save a significant
    number of valid occurrences
  • Instead of searching exhaustively for patterns
    that occur in every sequence, we
    short-sightedly look for patterns that occur in
    a subset of them
  • The algorithm needs as input only a given error
    ratio e

16
Suffix Trees
  • A suffix tree is a data structure that exposes
    the internal structure of a string in a very deep
    and meaningful way
  • Suffix tree T for S s1sn
  • rooted directed tree
  • exactly n leaves numbered 1 to n
  • internal nodes with at least two sons
  • edges labeled by non empty substrings of S
  • labels out of the same node begin with different
    symbols
  • the concatenation of the edge labels on the path
    from the root to any leaf i exactly spells the
    suffix of S starting at position i, i.e., s1sn

17
Suffix Trees
  • The same structure can be built also for a set of
    k sequences
  • To distinguish which sequence a suffix belongs
    to, it appends a different marker symbol, not
    occurring elsewhere, to each sequence in the set.
  • It is also possible to annotate each node of the
    tree with a k-bit string, where the i-th bit is
    set if the word spelled by the path ending at the
    node occurs in the i-th sequence.

18
Suffix Trees
Suffix tree for ACCA (end with ) and CCAAG (end
with )
19
Suffix Trees
  • A generalized suffix tree can be built in O(N)
    time and takes O(N) space, where N is the overall
    length of the sequences
  • Annotating it with the bit strings takes
    additional O(kN) time
  • Each pattern occurring in the strings is spelled
    by a path starting from the root of the tree
  • The time needed to search for a pattern depends
    only on the length of the pattern
  • The structure allows to implement recursively the
    exhaustive enumeration of all the candidate
    patterns of a given length
  • The time complexity is thus reduced to be
    exponential in the maximum number of mutations
    allowed (Sagot, 1998)

20
Searching for an Exact Pattern
  • Given a set of sequences and the annotated suffix
    tree, every pattern appearing in at least one
    sequence of the set is spelled by a unique path
    starting from the root
  • We match the symbols of pattern p along the
    unique path in the tree until
  • p is exhausted
  • In this case, the bit string on the next node on
    the path specifies which sequences p appears in
  • no more matches are possible

21
Searching for a Pattern with Mismatches
  • We can also search for a pattern p with at most e
    mismatches in a similar way.
  • We match p along different paths on the tree at
    the same time, keeping track of the number of
    mismatches encountered on each path.
  • Whenever the number of errors on a path is
    greater than e, we discard that path.
  • The sequences p appears in are given by the
    logical-OR of the bit strings corresponding to
    the different paths.

22
Searching for (M, e) Patterns
  • The algorithm starts with the empty pattern from
    the root of the tree, and recursively expands it
  • Let us suppose we have found on the tree the
    endpoints of paths corresponding to the
    occurrences of a pattern pp1pm in the
    sequences, where all the paths spell words within
    distance e from p, with mltM
  • If p occurs in at least q sequences, we try and
    expand it by one symbol

23
Searching for (M, e) Patterns
  • Expanding a pattern by one symbol
  • For each character b ? A, C, G, T, we match b
    against the next symbol on each path
  • If a path ends just before a node V of the tree,
    we match b against the first symbol on each edge
    leaving V
  • When we encounter a mismatch, we increase the
    previous error along the path by one
  • If the new error is greater than e, we discard
    the path

24
Searching for (M, e) Patterns
  • Once all paths have been checked, the surviving
    ones represent the approximate occurrences of
    pp1pmb
  • If p occurs in at least q sequences, and is
    shorter than M, we expand p as well. Otherwise,
    we continue with p and the next character in ?

25
Searching for (M, e) Patterns
  • For example
  • It matches the first symbol on each edge leaving
    the root against A.
  • If A is valid, i.e., A occurs in at least q
    sequences, it is expanded to AA.
  • If also AA is valid, we move to AAA, and so on.
  • If it is not valid, we proceed to look for
    occurrences of AC.
  • In this method, patterns dont have to occur
    exactly in the sequences.

26
Searching for (M, e) Patterns
  • The main drawback is that every pattern of length
    e satisfies the input constraints, since every
    other pattern of length e found in the tree is a
    valid occurrence for it
  • Thus, the method works well only for small values
    of e

27
Searching for (M, e) Patterns
At the beginning of the search, all paths of
length e are valid
28
Searching for (M, e) Patterns
  • To apply the algorithm also to longer patterns
    with higher values of e, instead of reducing the
    set of patterns that have to be searched, we
    restrict the number of paths that have to be
    followed for each pattern.
  • That is, we narrow down the set of valid
    occurrences-the WEEDER algorithm.

29
Searching for Approximate Occurrences of Patterns
  • Problem Definition
  • Given a set of k sequences on the alphabet ?
    A, C, G, T , we want to find all (M, e)
    patterns
  • (M, e) patterns patterns of length M that occur
    with at most e mismatches in at least q sequences
    of the set

30
The Outline of WEEDER
  • WEEDER fixes an initial error ratio ?
  • Given a pattern p, a path is valid if the
    distance from p to the path is not greater than
    ?? p?
  • p is the length of the pattern
  • When we expand p by one symbol, the error
    threshold is set to ?? (p1)?

31
Block Decomposition of a Pattern
  • Each block size is ?1/??
  • Let p p1pm. We can see p as composed of
    ??m? blocks

32
Valid Occurrences
  • For every pattern p p1pm, valid occurrences
    are words si1sim occurring in the sequences
    for which ? j ?1,, m d(p1pj, si1sij) ?
    ??j?
  • d(p1pj, si1sij) is the number of mismatches
    between p1pj and si1sij
  • si1sim is a valid occurrence for p if it is a
    valid occurrence for all its prefixes
  • p1, p1p2, , p1p2pm-1

33
An Example for WEEDER
G

C
A
GCTCA
C
T
T


ATCACGC
C

GC
GC

T
T
C
CA
TC

GCT
A

C

CACGC
A
A
GCT
CGC


CGC

34
ACTC? ACTCA
G

C
A
GCTCA
T
C
T


ATCACGC
C
GC

GC

T
CA
T
C
TC

GCT
A

C

CACGC
A
A
GCT
CGC


CGC

ACTC error max 1. S1, S2, S4 contain ACTC.
S2 AGCTCA
S1 AATCACGC
S3 ATGCT
S4 ACTC
35
ACTCA ? ACTCAA
G

C
A
GCTCA
T
C
T


ATCACGC
C
GC

GC

T
T
C
CA
TC

A
GCT


C
CACGC
A
A
GCT
CGC


CGC

ACTCAA error max 2. S1, S2 contain
ACTCAA. ACTCAC, ACTCAG, ACTCAT are also patterns.
S1 AATCACGC
S2 AGCTCA
S4 ACTC
S3 ATGCT
36
Weeder (2001)
  • Given a pattern P p1p2....pm, the algorithm
    can find all the valid occurrences of P (with at
    most ?P mutations), such that at most ?i
    mutations occur in the first i letters of the
    pattern
  • But some occurrences of a pattern can be missed
    altogether
  • Are DNA signals always so polite to show up in
    blocks-decomposed form?
  • The answer is no, but we can use Weeder with a
    grain of salt

37
Using Weeder
  • Example (15,4) pattern occurring in 20 sequences
  • Valid (block decomposed) possible occurrences
    829
  • Total possible occurrences1365
  • Probability if hitting a possible occurrence in
    a sequence phit.61
  • Probability of finding the pattern in every
    sequence like trying to win the national lottery
  • If we search for patterns occurring in at least
    10 sequences, the probability of seeing at
    least 10 times the pattern is
  • Phit(20,10) .89

38
Using Weeder
  • Thus, we can use Weeder as a sieve, to filter the
    set of candidate patterns
  • All patterns that are found to occur in at least
    q of the sequences by Weeder can be searched
    again in the sequences, but this time with no
    restriction on the position of mismatches
  • We expect the number of patterns (random patterns
    other than the real signal) passed to the second
    phase to be much smaller than the original number
    (and no longer exponential)

39
Using Weeder
  • The probability of finding a pattern in a
    sequence depends on its length and the error
    ratio
  • The probability of finding a pattern in a set of
    sequences (and thus the choice of the quorum q
    for the first phase) depends on the number of
    sequences
  • The same approach can be applied also when the
    signal does not show up in each sequence

40
Using Weeder
  • When the signal to be found is expected to be
    short, the algorithm can be used in exact mode
  • For longer signal, the lower is the quorum q, the
    higher is the probability of finding the signal
  • But also the number of patterns satisfying the
    input constraints is higher, and the program is
    slower
  • Users can choose a suitable trade-off between
    time and accuracy

41
Theoretical Time Complexity
  • Naïve approach O(4men)
  • Suffix tree approach (Sagot, 1998) O(4emekn)
  • where n is the input size, m is the pattern
    length, and e is the number of mutations allowed
  • Weeder
  • O((1/?)e4ekn)
  • where e is the number of mutations occurring in
    the longest pattern found

42
Weeder Web
  • Weeder Web is a web interface to the Weeder
    algorithm, where all the parameters concerning
    the motifs are automatically set for the
    discovery of transcription factor binding sites
  • Although there is no pre-set limit on the length
    of the input sequences, feasible results can be
    obtained by submitting sequences of "typical"
    length for regulatory/promoter regions (i.e. from
    500 to 5000 bps)
  • A priori, there's no limit on the number of
    sequences you can input. Also, for the moment we
    do not consider correlations among different
    motifs (i.e. cis-regulatory modules)

43
Weeder Web
  • All the statistical measures (background oligo
    frequencies, expected occurrences and so on) used
    to score/rank motifs and to post-process the
    output have been derived from the analysis of
    promoter/enhancer and 5'UTR regions only (taken
    from different organisms)
  • If you submit something else (i.e. 3'UTRs, coding
    regions, noncoding RNAs, and so on) the
    statistical evaluation probably will not be
    consistent with your data, and thus produce
    unreliable results
  • http//www.pesolelab.it/Tool/ind.php

44
Post-Processing
  • Real motifs have different degrees of variation
    in different positions
  • Some admit any nucleotide
  • Some are (almost) perfectly conserved
  • We should find redundant motifs among the
    highest-scoring ones
  • Pieces of a long motif should appear also in
    shorter results

45
Post-Processing
  • Look for redundant (either in length or in
    conservation) motifs in the reports of each run
  • Collect the instances of each one and build a
    frequency matrix
  • Scan the sequences looking for matches
  • Report the best matches (with no constraint on
    the substitutions allowed)

46
Assessment results
Tompa et.al., Nat Biotechnol. 2005
Jan23(1)137-44.
47
Assessment results
48
ProFind A GA Approach to the Definitionof
Regulatory Signals inGenomic Sequences
  • Characterization of CAP and TATA-box through
    probability matrices with a genetic algorithm

Giancarlo Mauri, Roberto Mosca and Giulio Pavesi
Bioinformatics and Natural Computing
Group Dipartimento di Informatica, Sistemistica e
Comunicazione Università degli studi di
Milano-Bicocca
49
TATA Box and CAP Binding Sites
  • A large number of genes present two
    characteristic signals
  • TATA-box
  • 25-35 bp upstream of the TSS
  • When discovered it was given a TATA consensus
  • Bound by the TATA Binding Protein (TBP) part of a
    large complex of some 50 different proteins
    including TFIID and TFIIB
  • CAP (also called Initiator or Inr)
  • Straddles the TSS
  • Experimental evidence that it is bound by TFIID
    too
  • Previous characterization by Bucher 1990 with a
    CAPy consensus
  • Very strong positional preference for the two
    signals with respect to the TSS

50
Describing Binding Sites
  • Frequency Matrix nij

.........0........... ...CGTGCCATTTGTTGT... ...TCC
TACAGTGCAGCA... ...TCACATATTATTGTC... ...GAAAGCAAC
AACTAA... ...TAAATCGTCAGTGTA... ...CCGACCAGAGTGAAA
... ...GGGTTTGGTTTGATA... ...GCGTGCAGTTGTGAA... ..
.GTCGCCATATACACA... ...GTGGCCGTATGCGCT...
.........0........... ...CGTGCCATTTGTTGT... ...TCC
TACAGTGCAGCA... ...TCACATATTATTGTC... ...GAAAGCAAC
AACTAA... ...TAAATCGTCAGTGTA... ...CCGACCAGAGTGAAA
... ...GGGTTTGGTTTGATA... ...GCGTGCAGTTGTGAA... ..
.GTCGCCATATACACA... ...GTGGCCGTATGCGCT...
-2 -1 0 1 2
A 2 0 7 1 3
C 4 8 0 0 2
G 2 0 3 4 0
T 2 2 0 5 5
-2 -1 0 1 2
A 0.2 0 0.7 0.1 0.3
C 0.4 0.8 0 0 0.2
G 0.2 0 0.3 0.4 0
T 0.2 0.2 0 0.5 0.5
51
Frequency Matrix
  • Given a signal of length m, a matrix P of
    dimension 4?m is built up, taking
    Pi,jPnucleotide i at position j
  • We suppose the existence of two different models
  • The signal
  • The background

Signal occurence
Weights stored in the matrix
52
The Problem
  • Dataset

... TTTTGTTTTTTTATTTCCTGTATTTTT ... ...
TCCAGCCCGAACAAAATCGATCAAAAT ... ...
ATCCCTCTGGCCATTGGCAATCGATCC ... ...
AGAAACAAAACGGCTTGTAAAACAAAC ... ...
GTGCAGTGAGTCAGTGTGTTGTGTGCC ... ...
GAGCGTAAGCAAGAGAGAGAGGTGAAG ... ...
AGGTGAAGCCAGGGGCGGAGGCGCAAG ... ...
AGAAAAGAGAGAGTGAAAGCATAGTCC ... ...
AGTTTTCATATTGTTACCGTTTGAGTT ... ...
TTCACCAGCCACTTTCAGTCGGTTTAT ... ...
GCATAACGAATCACTCTGATCGCTGTC ... ...
GGTCCAGCGACCACTCGCAGTTCTACA ... ...
GATCGGCGTGCCATTTGTTGTTGAATC ... ...
CCGCTCTCCTCCAGTGCAGCAGCAGCA ... ...
TCCAAGTCACCGATTATTGTCTCAGTG ... ...
GAACTGGAAACCAACAACTAACGGAGC ... ...
TCAGTCTAAATTTACCCTGTAAAATTC ... ...
GTAGTTCCGACCAGAGTGAAACTGAAC ... ...
CTTTATGGGTTTAGTTTGATAGGAGTC ... ...
TCACTGGCGTTGTTAGAGTTGTGAATG ...
... TTTTGTTTTTTTATTTCCTGTATTTTT ... ...
TCCAGCCCGAACAAAATCGATCAAAAT ... ...
ATCCCTCTGGCCATTGGCAATCGATCC ... ...
AGAAACAAAACGGCTTGTAAAACAAAC ... ...
GTGCAGTGAGTCAGTGTGTTGTGTGCC ... ...
GAGCGTAAGCAAGAGAGAGAGGTGAAG ... ...
AGGTGAAGCCAGGGGCGGAGGCGCAAG ... ...
AGAAAAGAGAGAGTGAAAGCATAGTCC ... ...
AGTTTTCATATTGTTACCGTTTGAGTT ... ...
TTCACCAGCCACTTTCAGTCGGTTTAT ... ...
GCATAACGAATCACTCTGATCGCTGTC ... ...
GGTCCAGCGACCACTCGCAGTTCTACA ... ...
GATCGGCGTGCCATTTGTTGTTGAATC ... ...
CCGCTCTCCTCCAGTGCAGCAGCAGCA ... ...
TCCAAGTCACCGATTATTGTCTCAGTG ... ...
GAACTGGAAACCAACAACTAACGGAGC ... ...
TCAGTCTAAATTTACCCTGTAAAATTC ... ...
GTAGTTCCGACCAGAGTGAAACTGAAC ... ...
CTTTATGGGTTTAGTTTGATAGGAGTC ... ...
TCACTGGCGTTGTTAGAGTTGTGAATG ...
-2 -1 0 1 2
A 0.2 0 0.7 0.1 0.2
C 0.3 0.7 0 0 0.2
G 0.2 0 0.2 0.4 0
T 0.2 0.2 0 0.4 0.5
53
The Score Function
  • The score function is made up of two terms
  • A positive term
  • A negative term

Total score positive score negative score
54
The Score Function
55
The Genetic Algorithm
... TTTTGTTTTTTTATTTCCTGTATTTTT ... ...
TCCAGCCCGAACAAAATCGATCAAAAT ... ...
ATCCCTCTGGCCATTGGCAATCGATCC ... ...
AGAAACAAAACGGCTTGTAAAACAAAC ... ...
GTGCAGTGAGTCAGTGTGTTGTGTGCC ... ...
GAGCGTAAGCAAGAGAGAGAGGTGAAG ... ...
AGGTGAAGCCAGGGGCGGAGGCGCAAG ... ...
AGAAAAGAGAGAGTGAAAGCATAGTCC ... ...
AGTTTTCATATTGTTACCGTTTGAGTT ... ...
TTCACCAGCCACTTTCAGTCGGTTTAT ... ...
GCATAACGAATCACTCTGATCGCTGTC ... ...
GGTCCAGCGACCACTCGCAGTTCTACA ... ...
GATCGGCGTGCCATTTGTTGTTGAATC ... ...
CCGCTCTCCTCCAGTGCAGCAGCAGCA ... ...
TCCAAGTCACCGATTATTGTCTCAGTG ... ...
GAACTGGAAACCAACAACTAACGGAGC ... ...
TCAGTCTAAATTTACCCTGTAAAATTC ... ...
GTAGTTCCGACCAGAGTGAAACTGAAC ... ...
CTTTATGGGTTTAGTTTGATAGGAGTC ... ...
TCACTGGCGTTGTTAGAGTTGTGAATG ...
1
1
1
0
1
0
0
0
0
0
1
1
1
1
0
1
0
1
1
0
-2 -1 0 1 2
A 0.2 0 0.7 0.1 0.2
C 0.3 0.7 0 0 0.2
G 0.2 0 0.2 0.4 0
T 0.2 0.2 0 0.4 0.5
56
The Genetic Algorithm
  • One-point crossover
  • Mutation operator flips one bit with a
    probability pm
  • Given a genome the frequency matrix is calculated
    from the sequences selected by the genome itself
  • The matrix is scored according to the previously
    defined score function
  • A local optimization procedure is applied to the
    resulting best individual

57
Results
  • Three different datasets
  • Eukaryotic Promoter Database(EPD,
    www.epd.isb-sib.ch, rel. 74)
  • Vertebrates 2199 seqs, from -50 to 50
  • Homo Sapiens 1796 seqs (included in the previous
    one)
  • Drosophila Core Promoters Database(DCPD,
    www-biology.ucsd.edui/labs/kadonaga/DCPD.htm)
  • 205 seqs, from -47 to 44

58
Results - CAP
  • Start position -2
  • Length 6
  • Population 500
  • Generations 20000
  • Crossover Probability 0.9
  • Mutation Probability 0.04

59
Results - CAP
  • CA, CG, TA and TG dinucleotides starting at
    position -1 in for the EPD dataset
  • CA, TA dinucleotides starting at position -1 for
    the DCPD dataset (consistently with
    Kadonaga,2000

Kutach, A., Kadonaga, J. The downstream promoter
element DPE appears to be as widely used as the
TATA box in Drosophila core promoters. Mol. Cell
Biol. 20 (2000) 475464
60
Results - CAP
61
Results TATA Box
  • Start position -30
  • Length 8
  • Population 500
  • Generations 20000
  • Crossover Probability 0.9
  • Mutation Probability 0.04

Consistent with the finding that the TBP
recognizes the minor grove of DNA, where
protein-DNA interactions are typically influenced
by A/T-content, but not by the specific
nucleotide sequence.
Kim, J.L., Nikolov, D.B., Burley, S.I.
Co-crystal structure of TBP Recognizing the minor
groove of a TATA element. Nature 365 (1993)
520527 Lo, K., Smale, S.T. Generality of a
functional initiator consensus sequence. Gene 182
(1996) 1322
62
  • The signals found have proved to be consistent
    with those described experimentally, as in the
    case of fruit fly signals
  • Differently from previous approaches to the same
    problem
  • it does not make any prior assumption about the
    signals structure
  • it takes advantage of the specific localization
    of the signals considered
  • We do not need to know in advance which sequences
    contain the signal, this is taken care of by the
    algorithm
  • Further improvements include
  • Development of models taking into account
    correlation between adjacent positions
  • Application of the method to new datasets with a
    better characterization of the TSS to investigate
    positional correlation between the TATA-box and
    the TSS and the presence of alternative TSSs

63
SeQuAl
  • Large Scale Multiple Alignment of Genomic
    Sequences

64
Summary
  • Multiple Sequence Alignment
  • The Methods So Far
  • SeQuAl project
  • Producing a Threaded Blockset Alignment
  • Results possible improvements

65
Multiple Sequence Alignment (1)
  • The multiple sequence alignment problem is the
    process of taking three or more input sequences
    and forcing them to have the same length by
    inserting a universal gap symbol, in order to
    maximize their similarity as measured by a score
    function.

66
Multiple Sequence Alignment (2)
67
Dynamic Programming Vs Heuristics
  • Dynamic Programming Techniques
  • O(NxM) for two sequences of length N and M
  • Exponential for multiple sequences
  • O(nxM2) for n sequences of length approximately M
    using some heuristics (ClustalW)

Prohibitively time-consuming for sequences of
length exceeding 10 kb
Need for more heuristics to speed up the alignment
68
The Methods So Far
  • Pairwise sequence alignment tools
  • MUMmer, GLASS, WABA, BLASTZ
  • Multiple sequence alignment tools
  • MAVID, MLAGAN, MGA, Mauve
  • Anchoring techniques
  • Seed ungapped alignments and extensions
  • Filling gaps using dynamic programming

69
SeQuAl project
  • Goals
  • Ability to find conservation even in the presence
    of mutations on amino acids
  • Ability to find conservation in subsets of the
    original set
  • Anchoring technique based on text indexing
    structures
  • Hashing functions on nucleotides and amino acids

70
SeQuAl project
  • Four steps alignment
  • Anchors search
  • Anchors chaining
  • Anchors refinement
  • Gaps filling

New!
New!
71
Anchors Discovery (1)
  • Discovery of MUMs and MEMs using Suffix Trees
  • Longer than a minimum threshold
  • Shorter than a maximum length for speeding up the
    time complexity
  • Statistically significant
  • Searched on
  • the nucleotide sequence
  • the amino acid sequence
  • on the sequence of classes of aminoacids
  • Fragments generation

New!
New!
72
Anchors Discovery (2)
S ACGTCA R TGTCTA
73
Fragments Chaining
  • Ordering relation between fragments

lt
lt
lt
lt
  • Graph induced by the ordering relation

Shortest Path on a DAG
74
Partial Anchors (1)
  • Ordering relation?

lt
lt
  • Order the partial fragments based on their
    barycenter
  • Use a shifting barrier to impose a total ordering
    between partial fragments

75
Partial Anchors (2)
76
Producing a Threaded Blockset Alignment
  • Block is a rectangular array of symbols such that
    removing dashes from any row produces a run of
    one or more consecutive positions in one of the
    original sequences
  • Blockset a set of such blocks
  • A ref-blockset" consists of a blockset in which
    every block has a designated row, all of which
    come from the same original sequence, called the
    reference for that ref-blockset.
  • If a blockset is threaded by each of the original
    sequences, we call it a threaded blockset.

Local Alignment with ClustalW
77
Results (1)
78
Results (2)
79
  • Approximate anchors in non-coding areas of the
    genome
  • Developing a hashing function for regulatory
    regions
  • Important evolutionary events
  • Modeling rearrangements and inversions
  • Memory-saving implementation of suffix trees
  • Custom visualization tool and classification of
    aligned regions based on the nature of anchors

80
Conclusions
  • Exhaustive methods
  • Suitable for short patterns and a once-for-all
    analysis of data (e.g. whole genomes)
  • Sequence driven methods
  • Give faster but less accurate answers
  • Limited data sizes
  • It is possible to choose a trade-off between time
    and accuracy

81
Block Decomposition
  • Where (3-4-4-4) comes from
  • ?0.261? 1
  • ?0.262? 1
  • ?0.263? 1
  • ?0.264? 2
  • ?0.265? 2
  • ..
  • ?0.2615? 4

82
Number of Patterns
  • 82934
  • Probability phit 829/1365 0.61
  • Probability of finding pattern in all 20
    sequences is (0.61)20 0

83
It works
  • Use the algorithm as a sift, i.e. we run it to
    find all patterns that occur in at least half of
    the sequences.
  • Search the patterns reported by the algorithm and
    use the suffix tree to check which ones actually
    appear in all the sequences.

84
The next
  • The probability that a pattern of length 15
    occurs with up to four mutations of a give
    position of a random sequence
  • The probability is seen by WEEDER is
  • The probability that a pattern occurs in a
    position in a form valid for WEEDER is

85
Finally
  • The pattern occurs at least once in a sequence of
    length n is
  • The probability that the pattern occurs at least
    half of the sequence of set and is found by
    WEEDER is
  • The expected number of patterns passed by WEEDER
    to the second phase is

86
How good it works..
  • A set of 20 sequences of length 600 raises less
    than 10 patterns.
  • A set of 20 sequences of length 1000 raises to
    about 300 patterns.
  • If we look for patterns that occur in at least 9
    sequences
  • 50 for sequences of length 600
  • 3500 for sequence of length 1000

87
Performance Evaluation (contd)
  • We could partition the set of sequences in two
    subsets of ten sequences each and the probability
    that the pattern will pop up among the ones found
    by the algorithm in either subset is
  • where pmiss 1 - phit

88
Performance Evaluation (contd)
  • A pattern is expanded whenever at least one of
    the two counters is greater than q.
  • q is a minimum number of sequences for each
    subset

89
Performance Evaluation (contd)
  • This approach can be pushed even further.
  • Partition the set of sequences as long as the
    parameters are such that only a few patterns
    satisfy them.
  • And it works well on long signals, where the phit
    value is lower and random patterns are unlikely
    to be picked up by the sifting phase.

90
Thus
  • The final exact search has to be performed on a
    limited number of patterns.
  • When the signal length and the number of
    mutations are known in advance, we can determine
    the best parameter setting and search strategy
    for WEEDER.

91
If the length of the pattern is not known in
advance
  • Therere some additional problems to be faced
  • If we choose
  • The probability of hitting a (16, 4) pattern
    is 0.54.
  • But the chances of finding a (15, 4) have
    dropped to 0.45. (since its decomposed as
    (4-4-4-3))
  • gtThe probabilities of finding a pattern depends
    on the block decomposition induced by .

92
Moreover
  • Setting a lower threshold value q would increase
    the number of candidates.
  • For example
  • Run a algorithm with
    on 20 sequences of length 400 hoping to find a
    (16, 4) pattern.
  • gtThe constraints are satisfied by hundreds of
    patterns of length 12.

93
What is the solution?
  • One possible solution could be to investigate
    only the longest patterns found.
  • gtBut a significant signal could be hidden also
    among the shorter ones.

94
The solution we adopt
  • Expand all patterns that appear in at least q
    sequences, but report only those that occur in
    q(m), which is set according to the pattern
    length.
  • In the previous example
  • A pattern of length 12 with 3 mutations can
    be found with probability of about 90 by setting
    q11.
  • gtThus, if a pattern of length 12 appears in at
    least q(m)11 sequences, we can pass it to the
    second phase.
  • Since q(m) can be set according to the number of
    sequences given as input, and the parameters q
    and .

95
If the nucleotide composition of the sequences is
not uniform
  • For example, 1112
    (T occurs twice as often as the others)
  • In order to avoid spending too much time in
    the final phase, we can set the threshold q(m)
    according to the pattern probability.

96
The points just discussed
  • At the end, the algorithm might report more than
    one pattern satisfying the constraints.
  • gtWe need to introduce significance
  • measures to sort the output.

97
Sorting the Output
  • The algorithm may output more than one pattern
    under
  • The length of the signal to be found is unknown
  • The sequences contain more than one signal

98
Example
  • A successful (15,4) pattern can be expanded by
    one symbol and becomes a (16,5) pattern.
  • All its occurrences are also valid occurrences of
    the longer one.

99
A Method for Sorting Outputs
  • A pattern P is given
  • is the best instance of P in the sequence i
  • is the total number of sequences P occurs in
  • is 1 if P appears in sequence i,
    otherwise 0
  • means the distance between P and

100
Relative Entropy
  • When the nucleotide composition of the sequences
    is biased, we use the background probabilities to
    define new match premiums and penalties.

101
Relative Entropy (contd)
  • is the frequency with which residue r
    occurs in position j in the occurrences of P
  • is the frequency of r in the sequences

102
Relative Entropy (contd)
  • It is suitable for sorting patterns under
    following situations
  • Patterns that appear the same number of times,
    for example once in every sequence of the set.
  • Sequences containing multiple signals.

103
Another Measure of Significance
  • We can define statistical measures of
    significance, that compare the actual number of
    occurrences of a pattern with the expected value.

104
Measure of Significance (contd)
  • is the probability that P occurs in a
    sequence of the set with at most e mutations.
  • N is the overall length of the sequences.
  • This value can be computed explicitly (Tompa,1999)

105
Software Implementation
  • Machine Pentium II class computer with 64MB RAM
  • OS Linux
  • Program Codes written in ANSI C, about 2500
    lines long
  • Testing Data challenge problem as described
    before, varying the length from 100 to 1000
    nucleotides

106
Experiment Results -- Time
  • Construction of the suffix tree always within
    one second
  • To find the (15,4) signal with 89 success
    probability, and run the algorithm with 0.26
    and q 10. for sequence lengths up to 400
    nucleotides. it took less than one minute
    (including the final exact search).
  • Length 500 execution time grows significantly
  • Length 500 and 600 average time 125 and 200
    seconds
  • q 11 execution time drops to 100 and 120

107
Experiment Results Time (contd)
  • Increasing the number of sequences did not
    influence the execution time very much.
  • Example
  • For every sequence length, the algorithm took
    just a few more seconds when run over 30 or 40
    sequences with q set to 16 and 21, respectively.
  • Sequence length set to 800, the program took on
    the average 320 and 450 seconds to complete the
    execution with q set to 11 and 10 respectively.
    When length is 1000 long, it took about 15
    minutes.
  • Thus the WEEDER algorithm is suited to work on a
    large set of relatively short (up to 600
    nucleotide) sequences than a small set of very
    long sequences.
About PowerShow.com