Motif Discovery in DNA Sequences with EC

Provided by: cse12

1
Motif Discovery in DNA Sequences with EC
  • Cyrus Chan

2
Biological Background
[Figure labels: transcription factor; expressed; coexpressed genes; TFBS (transcription factor binding site) motifs]
3
Problem Description
Upstream Sequences
4
TFBS Identification
  • Experimental methods (in vivo): most accurate and reliable,
    but time-consuming and expensive
  • Computational methods: based on the upstream sequences;
    fast and inexpensive; de novo TFBS identification

5
Deterministic Methods
  • A quick look at deterministic methods
  • Enumeration of all possible motifs
  • Approximate matching
  • (l, d)-motif discovery problem
  • Generalized suffix trees: O(N k^2 l^d |Σ|^d)
  • Disadvantages
  • Overprediction: a large amount of output
  • TFBS motifs are weakly conserved
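
The enumeration approach with approximate matching can be sketched in a few lines. This is a brute-force illustration under stated assumptions, not the generalized suffix tree algorithm: the function names (`hamming`, `occurs`, `enumerate_ld_motifs`) and the toy sequences are hypothetical. It tests all 4^l candidates and keeps those that occur in every sequence with at most d mismatches.

```python
from itertools import product

def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))

def occurs(candidate, seq, d):
    """True if some window of seq is within Hamming distance d of candidate."""
    l = len(candidate)
    return any(hamming(candidate, seq[i:i + l]) <= d
               for i in range(len(seq) - l + 1))

def enumerate_ld_motifs(seqs, l, d):
    """Brute force: test all 4**l candidate motifs against every sequence."""
    hits = []
    for letters in product("ACGT", repeat=l):
        candidate = "".join(letters)
        if all(occurs(candidate, s, d) for s in seqs):
            hits.append(candidate)
    return hits

# Toy sequences, made up for illustration only.
seqs = ["TTACGTAA", "GGACGAAC", "CCTCGTAT"]
exact = enumerate_ld_motifs(seqs, 4, 0)
approx = enumerate_ld_motifs(seqs, 4, 1)
print(len(exact), len(approx))
```

Allowing even one mismatch makes the candidate list grow, which is the overprediction problem noted above.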

6
Conventional Methods
  • Multiple Sequence Alignment methods
  • Do not lose any information
  • Do not generalize the sequence data
  • Limited help for biological understanding
  • Large
  • Machine learning methods
  • EM, Gibbs sampling, HMM, Neural networks
  • Do not produce biologically meaningful results
  • Prior knowledge: weight matrix
  • Local search: local optima

7
Evolutionary Computation
  • Why use EC for motif discovery?
  • Global search, though it also does not guarantee optimal
    solutions
  • Good scaling
  • Flexibility of scoring
  • Flexibility of representation

Review: Lones et al., GECCO'05
8
EC methods for Motif Discovery
  • FMGA: Finding Motifs by Genetic Algorithm
  • MDGA: Motif Discovery Using a Genetic Algorithm
  • GACluster: Identification of Weak Motifs in
    Multiple Biological Sequences using Genetic
    Algorithm
  • St-GA: Motif Discovery in Upstream Sequences of
    Coordinately Expressed Genes
  • Discovery, validation, and genetic dissection of
    transcription factor binding sites by comparative
    and functional genomics

9
FMGA (Liu et al., BIBE'04)
  • Framework

IUPAC ambiguity codes
10
FMGA (Liu et al., BIBE'04)
  • Consensus-led: consensus generated randomly
  • Fitness function

m is the index of sequences, i is the position
within the motif, n is the index of motif
patterns, k is the length of the motif pattern, and j is
the number of matched regions in the sequence
11
FMGA (Liu et al., BIBE'04)
  • IUPAC ambiguity codes
  • Total fitness score function
  • where L is the total number of sequences
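
To illustrate how a consensus written with IUPAC ambiguity codes might be scored against a set of sequences, here is a minimal sketch. The scoring rule (best-matching window in each sequence, averaged over the L sequences) is an assumption chosen for illustration; it is not the FMGA fitness formula, which is not preserved in this transcript. The function names are hypothetical.

```python
# Standard IUPAC nucleotide ambiguity codes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

def window_score(consensus, window):
    """Count positions where the base is allowed by the consensus code."""
    return sum(base in IUPAC[code] for code, base in zip(consensus, window))

def sequence_score(consensus, seq):
    """Score of the best-matching window in one sequence (assumed rule)."""
    k = len(consensus)
    return max((window_score(consensus, seq[j:j + k])
                for j in range(len(seq) - k + 1)), default=0)

def total_fitness(consensus, seqs):
    """Average best-window score over all L sequences (assumed rule)."""
    return sum(sequence_score(consensus, s) for s in seqs) / len(seqs)

# Toy sequences, made up for illustration only.
score = total_fitness("WCGCGW", ["TTACGCGTAA", "GGTCGCGAGG"])
```

Because W stands for A or T, both ACGCGT and TCGCGA match the consensus WCGCGW at every position.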

12
FMGA (Liu et al., BIBE'04)
  • Operators
  • Mutation: create a weight matrix from the matched
    motif patterns
  • Randomly mutate the positions that are not completely conserved
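
The mutation step can be sketched as follows. This is a minimal illustration under assumptions: the weight matrix is a plain count matrix, one non-conserved position is changed per call, and the replacement base is drawn uniformly. None of these details are confirmed by the slides, and the function names are hypothetical.

```python
import random
from collections import Counter

def count_matrix(patterns):
    """One Counter of base counts per motif column."""
    return [Counter(column) for column in zip(*patterns)]

def variable_positions(patterns):
    """Columns in which the matched patterns are not completely conserved."""
    return [i for i, counts in enumerate(count_matrix(patterns))
            if len(counts) > 1]

def mutate(consensus, patterns, rng=random):
    """Replace one randomly chosen non-conserved position (assumed rule)."""
    positions = variable_positions(patterns)
    if not positions:
        return consensus  # everything is conserved, nothing to mutate
    i = rng.choice(positions)
    return consensus[:i] + rng.choice("ACGT") + consensus[i + 1:]

# Toy matched patterns, made up for illustration only.
patterns = ["ACGCGT", "ACGCGA", "TCGCGT"]
child = mutate("ACGCGT", patterns, random.Random(0))
```

Restricting mutation to variable columns leaves the conserved core of the motif untouched.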

13
FMGA (Liu et al., BIBE'04)
  • Operators
  • Crossover: one-point crossover
  • Ambiguity codes penalty
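
One-point crossover on two consensus strings can be sketched as below. This is a generic illustration rather than the FMGA implementation; the function name is hypothetical and the ambiguity-code penalty is not modeled here.

```python
import random

def one_point_crossover(parent_a, parent_b, rng=random):
    """Swap the tails of two equal-length strings at a random cut point."""
    if len(parent_a) != len(parent_b) or len(parent_a) < 2:
        raise ValueError("parents must have equal length of at least 2")
    cut = rng.randint(1, len(parent_a) - 1)  # cut strictly inside the string
    child_a = parent_a[:cut] + parent_b[cut:]
    child_b = parent_b[:cut] + parent_a[cut:]
    return child_a, child_b

child_a, child_b = one_point_crossover("AAAAAA", "TTTTTT", random.Random(1))
```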

14
FMGA (Liu et al., BIBE'04)
  • Rearrangement
  • If the predicted motif pattern is unchanged for
    more than K generations (e.g., K = 10)
  • For diversity

15
FMGA (Liu et al., BIBE'04)
  • Experiments
  • Compared with MEME and the Gibbs sampler, FMGA has
    better prediction results.
  • In computation time, FMGA is faster than
    MEME and slower than the Gibbs sampler.
  • Comments
  • Early method
  • Biological meaning: AAA..., TTT...

16
MDGA (Che et al., GECCO'05)
  • Positions-led, like Gary B. Fogel's method
  • Representation: concatenated binary-encoded
    string
  • Fitness: information content
  • Pseudocount (d_b) is used

where f_b is the observed frequency of nucleotide
b in the column and p_b is the background
frequency of the same nucleotide. The summation
is taken over the four possible types of
nucleotides (b). W is the motif width.
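
A common form of information content that is consistent with these definitions is IC = sum over columns i = 1..W of sum over b in {A, C, G, T} of f_b,i * log2(f_b,i / p_b). The sketch below computes it for a set of aligned sites. The exact formula on the slide is not preserved, so the log base, the uniform default background, and the way the pseudocount enters the frequency, f = (count + d_b) / (N + 4 * d_b), are all assumptions. The function name is hypothetical.

```python
import math

def information_content(sites, background=None, pseudo=0.25):
    """Sum over columns and bases of f * log2(f / p), with pseudocounts."""
    background = background or {b: 0.25 for b in "ACGT"}
    n = len(sites)
    width = len(sites[0])
    total = 0.0
    for i in range(width):
        column = [site[i] for site in sites]
        for b in "ACGT":
            f = (column.count(b) + pseudo) / (n + 4 * pseudo)
            if f > 0:  # f can only be 0 when pseudo == 0
                total += f * math.log2(f / background[b])
    return total

# Toy aligned sites, made up for illustration only.
conserved = information_content(["ACGT", "ACGT"])
mixed = information_content(["ACGT", "TGCA"])
```

A fully conserved alignment scores higher than a mixed one, which is what makes this quantity usable as a GA fitness.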
17
MDGA (Che et al., GECCO'05)
  • Selection: roulette wheel mechanism
  • The phase problem: shifting all starting positions to the left or
    right by a small number
  • Crossover operators: single-point and double-point
  • Mutation: bitwise mutation operator
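
The selection, phase-shift, and mutation operators listed here can be sketched as follows. These are generic textbook versions under assumptions, not the MDGA code: fitness values are assumed non-negative, shifted positions are clamped to the valid range, and all function names are hypothetical.

```python
import random

def roulette_select(population, fitnesses, rng=random):
    """Pick one individual with probability proportional to its fitness."""
    pick = rng.uniform(0, sum(fitnesses))
    running = 0.0
    for individual, fitness in zip(population, fitnesses):
        running += fitness
        if pick <= running:
            return individual
    return population[-1]  # guard against floating-point round-off

def shift_positions(positions, offset, seq_lens, width):
    """Shift every start position by the same offset, clamped to range."""
    return [min(max(p + offset, 0), n - width)
            for p, n in zip(positions, seq_lens)]

def bitwise_mutation(bits, rate, rng=random):
    """Flip each bit of a '0'/'1' string independently with the given rate."""
    return "".join(("1" if b == "0" else "0") if rng.random() < rate else b
                   for b in bits)

winner = roulette_select(["a", "b"], [0.0, 1.0], random.Random(0))
shifted = shift_positions([0, 5, 9], 1, [12, 12, 12], 3)
```

Shifting all start positions together lets the search escape an alignment that is locked a few bases away from the true motif, which is the phase problem named above.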

18
MDGA (Che et al., GECCO'05)
  • Experiment Results
  • Crossover operators
  • Mutation rate: 0.01
  • Compared with the Gibbs Sampler and
    BioProspector on the set of 18 sequences
  • Computational time is better than AlignACE

19
MDGA (Che et al., GECCO'05)
Measured by deviation from the true starting
positions ER
  • Experiment Results

[Figure panels: Gibbs Sampler, BioProspector, MDGA]
20
MDGA (Che et al., GECCO'05)
  • Conclusion
  • better prediction accuracy
  • searches the space with a better strategy
  • shorter running time for long sequences
  • The assumption: each sequence contains a motif;
    in real cases a sequence may contain zero or more

21
GACluster (Paul et al., GECCO'06)
  • Consensus-led with GA alignment technique
  • Tackles the (l, d)-motif discovery problem as
    well as weakly conserved motif discovery
  • The framework

Fitness Evaluation Cluster Alignment score of
subsequences
22
GACluster (Paul et al., GECCO'06)
  • Consensus from subsequences
  • Focus on weakly conserved motifs
  • Fitness Evaluation
  • the alignment score (Information Content)
  • Non-linear combination

23
GACluster (Paul et al., GECCO'06)
  • Fitness Evaluation (cont.)
  • Example
  • AT, AC, AG, AA, AC, TC, AG, TG
  • Clustering and Scoring

24
GACluster (Paul et al., GECCO'06)
[Figure: three panels, each labeled "Min d = 1"]
25
GACluster (Paul et al., GECCO'06)
  • Fitness
  • Offspring generation
  • One point crossover
  • Single mutation
  • Dealing with Poly-A and TATA box
  • Reduce the fitness

26
GACluster (Paul et al., GECCO'06)
  • Experiments
  • CRP motifs
  • The method, with 3 motifs found, outperforms the
    binary GA, which finds no real motifs
  • MCB
  • True motifs: ACGCGT, ACGCGA, CCGCGT, TCGCGA,
    ACGCGT, ACGCGT; consensus: WCGCGW
  • GACluster: ACGCGT, ACGCGT, ACGCGT, ACGCGT,
    ACGCGT, ACGCGT; consensus: ACGCGT
  • Binary GA: TTTCGA, TCACCA, TCACGT, TGACGA,
    TCACGA, TAACGG; none are true motifs

27
GACluster (Paul et al., GECCO'06)
  • Discussion
  • Starting population is very important
  • Prevent loss of some initial motifs and keep
    diversity
  • Conclusions
  • Drawback: like other computational methods of
    motif discovery, our method looks for similar
    subsequences in multiple biological sequences;
    many of these similar subsequences have no
    biological significance.

28
St-GA (Stine et al., CEC'03)
  • Consensus-led
  • Representation: structured GA with binary
    encoding (two bits for each base)
  • A = <S1, S2>; S1: activation level, S2: expression
    level
  • A = (a_i, a_ij), with a_i ∈ {0, 1}, i = 0, ..., 14,
    and a_ij ∈ {0, 1}, i = 0, ..., 14, j = 0, 1
  • The interpreted string is constructed from A by
    concatenating each two-bit block (a_i0, a_i1) of S2
    for which a_i = 1 in S1
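
Decoding such a structured chromosome can be sketched as below. The sketch assumes S1 and S2 are given as '0'/'1' strings and that the two-bit blocks map to bases as 00 = A, 01 = C, 10 = G, 11 = T; the slides do not state the mapping, so it is a labeled assumption, and the function name is hypothetical.

```python
# Assumed two-bit encoding of bases; the actual mapping is not given.
BASES = {"00": "A", "01": "C", "10": "G", "11": "T"}

def decode(s1, s2):
    """Concatenate the two-bit blocks of s2 whose activation bit in s1 is 1."""
    if len(s2) != 2 * len(s1):
        raise ValueError("s2 must hold two bits for every bit of s1")
    motif = []
    for i, active in enumerate(s1):
        if active == "1":
            motif.append(BASES[s2[2 * i:2 * i + 2]])
    return "".join(motif)

# 15 activation bits with 7 switched on yield a motif of length 7.
s1 = "1" * 7 + "0" * 8
s2 = "00" * 15
motif = decode(s1, s2)
```

Because inactive blocks are skipped, a fixed-length chromosome can express motifs of various lengths, up to 15 bases with these dimensions.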

Various Lengths of motifs
29
St-GA (Stine et al., CEC'03)
  • Use BLAST for alignment
  • bl2seq: align the motif (subsequence) against the
    sequences
  • Measures from bl2seq; P is the number of
    sequences
  • Also used as fitness
  • Results
  • Cannot work with motifs shorter than 7
  • (l, d)-motif: poor results when d > 1

30
Comparative and functional genomics (Gertz et al.,
Genome Res., vol. 15, '05)
  • Biological-experiment oriented
  • In vivo validation of de novo motifs
  • GA is used as the computational method
  • Consensus-led: randomly generated
  • Fitness
  • Information content I
  • Conservation C: scores from other species
  • Fitness

31
Summary
  • Representation: consensus vs. positions
  • Motif generation: from subsequences vs. randomly
    generated (enumeration)
  • Encoding: binary vs. natural
  • Evaluation methods: alignment vs. no alignment
  • Fitness function: information content,
    similarity
  • Prior knowledge

32
The End
  • Thank you!
  • Q & A