Motif Discovery in DNA Sequences with EC

1
Motif Discovery in DNA Sequences with EC

Cyrus Chan

2
Biological Background
Transcription Factor
Expressed
Coexpressed Genes
TFBS motifs
3
Problem Description
Upstream Sequences
4
TFBS Identification

Experimental methods (in vivo)
most accurate and reliable
time-consuming and expensive.
Computational methods
based on the upstream sequences
fast and inexpensive
de novo TFBS identification

5
Deterministic Methods

A quick look at deterministic methods
Enumeration of all possible motifs
Approximate matching
(l, d)-motif discovery problem
Generalized suffix trees: $O(N k^2 l^d |\Sigma|^d)$
Disadvantages
Over-prediction: a large amount of output
TFBS motifs are weakly conserved
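
The enumeration approach to the (l, d)-motif problem can be illustrated with a small brute-force sketch. It assumes the usual reading of the problem (a candidate of length l counts as a motif if every sequence contains a window within Hamming distance d of it); the function names are hypothetical, and this is not the suffix-tree method named on the slide.

```python
from itertools import product


def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def occurs_within(candidate, sequence, d):
    """True if some window of the sequence is within d mismatches of candidate."""
    l = len(candidate)
    return any(
        hamming(candidate, sequence[i:i + l]) <= d
        for i in range(len(sequence) - l + 1)
    )


def planted_motifs(sequences, l, d, alphabet="ACGT"):
    """Enumerate all 4^l candidates and keep those found in every sequence."""
    hits = []
    for letters in product(alphabet, repeat=l):
        candidate = "".join(letters)
        if all(occurs_within(candidate, s, d) for s in sequences):
            hits.append(candidate)
    return hits
```

Even on toy input the result list grows quickly as d increases, which is the over-prediction issue listed under Disadvantages.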

6
Conventional Methods

Multiple Sequence Alignment methods
Do not lose any information
Do not generalize the sequence data
Limited help for biological understanding
Large
Machine learning methods
EM, Gibbs sampling, HMM, Neural networks
Do not produce biologically meaningful results
Prior knowledge: weight matrix
Local search: local optima

7
Evolutionary Computation

Why use EC for motif discovery?
Global search, though optimal solutions are still not guaranteed
Good scaling
Flexibility of scoring
Flexibility of representation

Review: Lones et al., GECCO'05
8
EC methods for Motif Discovery

FMGA: Finding Motifs by Genetic Algorithm
MDGA: Motif Discovery Using a Genetic Algorithm
GACluster: Identification of Weak Motifs in Multiple Biological Sequences using Genetic Algorithm
St-GA: Motif Discovery in Upstream Sequences of Coordinately Expressed Genes
Discovery, validation, and genetic dissection of transcription factor binding sites by comparative and functional genomics

9
FMGA Liu et al BIBE04

Framework

IUPAC ambiguity codes
10
FMGA Liu et al BIBE04

Consensus-led: the consensus is generated randomly
Fitness function

where m is the index of sequences, i is the position within the motif, n is the index of motif patterns, k is the length of the motif pattern, and j is the number of matched regions in the sequence
11
FMGA Liu et al BIBE04

IUPAC ambiguity codes
Total fitness score function
where L is the total number of sequences
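
The fitness equations themselves did not survive in this transcript, so the following is only a sketch of one form consistent with the symbol definitions: a consensus pattern (possibly with IUPAC ambiguity codes) is scored against every window of a sequence, the best window gives the per-sequence score, and the total score sums over all L sequences. The function names and the match-fraction scoring are assumptions, not the formula from Liu et al., and the ambiguity-code penalty is not modeled.

```python
# Standard IUPAC nucleotide ambiguity codes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def window_score(pattern, window):
    """Fraction of the k pattern positions whose code admits the window base."""
    matches = sum(base in IUPAC[sym] for sym, base in zip(pattern, window))
    return matches / len(pattern)


def sequence_fitness(pattern, sequence):
    """Best window score over all start positions in one sequence."""
    k = len(pattern)
    scores = [
        window_score(pattern, sequence[j:j + k])
        for j in range(len(sequence) - k + 1)
    ]
    return max(scores, default=0.0)


def total_fitness(pattern, sequences):
    """Sum of per-sequence fitness over all sequences."""
    return sum(sequence_fitness(pattern, s) for s in sequences)
```

Because a pattern made entirely of N would match everything under this scoring, some penalty on ambiguity codes is needed in practice.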

12
FMGA Liu et al BIBE04

Operators
Mutation: create a weight matrix from the matched motif patterns
Randomly mutate the positions that are not completely conserved
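
A minimal sketch of this mutation operator, assuming the weight matrix is a per-position count table built from the matched instances and that a "not completely conserved" position is one where the instances disagree. The names `position_counts` and `mutate` are hypothetical, and drawing the replacement uniformly from A, C, G, T is an assumption.

```python
import random
from collections import Counter


def position_counts(instances):
    """Count matrix: one Counter of bases per motif position."""
    return [Counter(column) for column in zip(*instances)]


def mutate(consensus, instances, rng=random, alphabet="ACGT"):
    """Keep fully conserved positions; redraw the others at random."""
    counts = position_counts(instances)
    child = []
    for symbol, column in zip(consensus, counts):
        if len(column) == 1:
            # Every matched instance agrees at this position.
            child.append(symbol)
        else:
            child.append(rng.choice(alphabet))
    return "".join(child)
```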

13
FMGA Liu et al BIBE04

Operators
Crossover: one-point crossover
Ambiguity codes: penalty

14
FMGA Liu et al BIBE04

Rearrangement
Triggered if the predicted motif pattern is unchanged for more than K generations (e.g., K = 10)
Purpose: restore diversity

15
FMGA Liu et al BIBE04

Experiments
Compared with MEME and the Gibbs sampler, FMGA has better prediction results than both.
In computation time, FMGA is faster than MEME and slower than the Gibbs sampler.
Comments
Early method
Biological meaning: AAA..., TTT...

16
MDGA Che et al GECCO05

Position-led, like Gary B. Fogel's method
Representation: concatenated binary-encoded string
Fitness: information content
A pseudocount ($d_b$) is used

where $f_b$ is the observed frequency of nucleotide b in the column and $p_b$ is the background frequency of the same nucleotide. The summation is taken over the four possible nucleotide types (b). W is the motif width.
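
The equation is missing from this transcript. A sketch using the standard relative-entropy form of information content, summed over the W columns, might look as follows. The log base (2), the way the pseudocount enters the frequency estimate, and the absence of any normalization by W are assumptions, as is the function name.

```python
import math


def information_content(instances, background=None, pseudo=None):
    """Sum over columns of sum_b f_b * log2(f_b / p_b).

    f_b = (count_b + d_b) / (N + sum of d_b), where d_b is the pseudocount.
    """
    bases = "ACGT"
    if background is None:
        background = {b: 0.25 for b in bases}  # assumed uniform background
    if pseudo is None:
        pseudo = dict(background)              # assumed d_b = p_b
    n = len(instances)
    pseudo_total = sum(pseudo.values())
    total = 0.0
    for column in zip(*instances):             # one column per motif position
        for b in bases:
            f = (column.count(b) + pseudo[b]) / (n + pseudo_total)
            if f > 0:                          # 0 * log(0) is taken as 0
                total += f * math.log2(f / background[b])
    return total
```

With a uniform background and no pseudocount, a perfectly conserved column contributes 2 bits and a uniformly mixed column contributes 0.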
17
MDGA Che et al GECCO05

Selection: roulette wheel mechanism
The phase problem: handled by shifting all starting positions to the left or right by a small number
Crossover operators: single-point and double-point
Mutation: bitwise mutation operator
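
Two of these operators can be sketched briefly. The sketch assumes an individual has already been decoded from its binary string into one integer start position per sequence, that fitness values are non-negative with a positive sum, and that shifted positions are clamped to stay inside each sequence. The clamping rule and the function names are assumptions rather than details from Che et al.

```python
import random


def roulette_select(population, fitnesses, rng=random):
    """Pick one individual with probability proportional to its fitness."""
    total = sum(fitnesses)
    pick = rng.uniform(0, total)
    running = 0.0
    for individual, fit in zip(population, fitnesses):
        running += fit
        if pick <= running:
            return individual
    return population[-1]  # guard against floating-point round-off


def phase_shift(starts, offset, seq_lengths, width):
    """Shift every start position by the same offset, clamped to valid range."""
    return [
        min(max(s + offset, 0), n - width)
        for s, n in zip(starts, seq_lengths)
    ]
```

Shifting all positions together lets the search escape an alignment that is locked a few bases away from the true motif.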

18
MDGA Che et al GECCO05

Experiment Results
Crossover operators
Mutation rate: 0.01
Compared with the Gibbs Sampler and BioProspector on the set of 18 sequences
Computation time is better than that of AlignACE

19
MDGA Che et al GECCO05
Experiment Results
Measured by deviation from the true starting positions (ER)
[Comparison of Gibbs Sampler, BioProspector, and MDGA]
20
MDGA Che et al GECCO05

Conclusion
Better prediction accuracy
Searches the space with a better strategy
Shorter running time for long sequences
The assumption
Each sequence is assumed to contain a motif
In real cases a sequence may contain zero or more

21
GACluster Paul et al GECCO06

Consensus-led, with a GA alignment technique
Tackles the (l, d)-motif discovery problem as well as weakly conserved motif discovery
The framework

Fitness evaluation: cluster alignment score of subsequences
22
GACluster Paul et al GECCO06

Consensus from subsequences
Focus on weakly conserved motifs
Fitness Evaluation
the alignment score (Information Content)
Non-linear combination

23
GACluster Paul et al GECCO06

Fitness Evaluation (cont.)
Example
AT, AC, AG, AA, AC, TC, AG, TG
Clustering and Scoring

24
GACluster Paul et al GECCO06
Min d 1
25
GACluster Paul et al GECCO06

Fitness
Offspring generation
One point crossover
Single mutation
Dealing with poly-A and TATA box patterns
Reduce their fitness

26
GACluster Paul et al GECCO06

Experiments
CRP motifs
The method, with 3 motifs found, outperforms the binary GA, which finds no real motifs
MCB
True motifs: ACGCGT, ACGCGA, CCGCGT, TCGCGA, ACGCGT, ACGCGT (consensus WCGCGW)
GACluster: ACGCGT, ACGCGT, ACGCGT, ACGCGT, ACGCGT, ACGCGT (consensus ACGCGT)
Binary GA: TTTCGA, TCACCA, TCACGT, TGACGA, TCACGA, TAACGG (none are true motifs)

27
GACluster Paul et al GECCO06

Discussion
The starting population is very important
It should prevent the loss of some initial motifs and keep diversity
Conclusions
Drawback: like other computational methods of motif discovery, our method looks for similar subsequences in multiple biological sequences, and many of these similar subsequences have no biological significance.

28
St-GA Stine et al CEC03

Consensus-led
Representation: structured GA with binary encoding (two bits for each base)
$A = \langle S_1, S_2 \rangle$, where $S_1$ is the activation level and $S_2$ is the expression level
$A = ((a_i), (a_{ij}))$, with $a_i \in \{0, 1\}$ for $i = 0, \ldots, 14$ and $a_{ij} \in \{0, 1\}$ for $i = 0, \ldots, 14$, $j = 0, 1$
The interpreted string is constructed from A by concatenating each two-bit block of $S_2$, $(a_{i0}, a_{i1})$, for which $a_i \in S_1$ equals 1

Various Lengths of motifs
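
The decoding step can be sketched as follows, assuming S1 is a list of activation bits and S2 a flat list holding one two-bit block per position. The bit-to-base mapping (00 = A, 01 = C, 10 = G, 11 = T) and the function name `interpret` are assumptions; the slide does not state the mapping.

```python
# Assumed mapping from a two-bit block to a base (not given on the slide).
BASES = {(0, 0): "A", (0, 1): "C", (1, 0): "G", (1, 1): "T"}


def interpret(s1, s2):
    """Concatenate the S2 blocks whose S1 activation bit is 1."""
    if len(s2) != 2 * len(s1):
        raise ValueError("S2 must hold two bits per S1 position")
    motif = []
    for i, active in enumerate(s1):
        if active == 1:
            block = (s2[2 * i], s2[2 * i + 1])
            motif.append(BASES[block])
    return "".join(motif)
```

Because inactive blocks are skipped, the same fixed-length chromosome can express motifs of different lengths, from 0 up to the number of activation bits.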
29
St-GA Stine et al CEC03