GAME: detecting cis-regulatory elements using a genetic algorithm

1
GAME: detecting cis-regulatory elements using a
genetic algorithm
  • Presenter: Cyrus
  • March 22, 2007

2
Outline
  • Introduction
  • Methods
  • Results
  • Discussion

3
Outline
  • Introduction
  • Methods
  • Results
  • Discussion

4
Binding of transcription factors
  • Transcription factor binding sites (TFBSs)
  • In close proximity (100-1000 bp) to a target
    gene
  • The binding sites are short (< 30 bp)
  • Share a common sequence of nucleotides
    (consensus)
  • Specific enough that the TF protein does not
    bind to many random locations
  • The specificity cannot be absolute, because
    different genes require different binding
    affinities between the transcription factor
    and its target sites

5
Goal
  • Based on the scoring function used by
    BioOptimizer
  • BioOptimizer is very simple (hill climbing) and
    can only make local improvements upon the output
    of another motif discovery method
  • Goal
  • Introduce an independent method using the
    scoring function
  • More exhaustive search
  • Eliminate the dependence on other programs
  • GAME: a genetic algorithm using this scoring
    function

6
Outline
  • Introduction
  • Methods
  • Results
  • Discussion

7
Restricted solution space
  • Dataset
  • m upstream sequences, sequence i having length l_i
  • S_ij is the nucleotide in position j of sequence i
  • A motif is represented by
  • a matrix of binding site locations A, where
    A_ij = 1 if position j of sequence i is the motif
    starting site and 0 otherwise
  • They do not expect much more than one binding
    site per sequence; multiple occurrences are
    handled later
  • Then A can be restricted to a vector
    A = (a_1, ..., a_m), where a_i = 0 means no motif

8
An artificial example
  • The upper case letters indicate TFBSs
  • A is (4, 3, 5, 0, 11) in this example
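
A minimal sketch of this vector representation, assuming 1-based start positions and a fixed motif width. The sequences below are made up for illustration (the slide's own example sequences are not in the transcript), and `extract_sites` is a hypothetical helper name.

```python
def extract_sites(seqs, A, w):
    """Return the width-w site for each sequence, or None where a_i == 0."""
    sites = []
    for seq, a in zip(seqs, A):
        if a == 0:
            sites.append(None)                   # a_i = 0: no motif here
        else:
            sites.append(seq[a - 1:a - 1 + w])   # a_i is a 1-based start
    return sites

# Illustrative sequences only; upper case marks the planted sites.
seqs = [
    "acgTTGACAgtcaagt",
    "ttTTGACAcgatcgga",
    "gcatTTGTCAacgtac",
    "acgtacgtacgtacgt",
    "gatcgatcgaTTGACA",
]
A = (4, 3, 5, 0, 11)
sites = extract_sites(seqs, A, w=6)
print(sites)
```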

9
Bayesian scoring function for motifs
  • Presented by Jensen et al. (Bioinformatics 2004)
  • The original scoring function is closely related
    to the simpler function
  • |A| is the number of predicted sites
  • The estimated motif abundance is expressed out
    of the possible site locations

Information content we discussed before!
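
The scoring formula itself is not in the transcript, so the sketch below is not the Jensen et al. score. It only illustrates the information-content idea mentioned above: the per-column relative entropy of aligned sites against a background. The uniform background and the pseudocount of 0.5 are assumptions.

```python
import math
from collections import Counter

def information_content(sites, background=None, pseudocount=0.5):
    """Sum over columns j and bases k of f_jk * log2(f_jk / theta_k)."""
    background = background or {b: 0.25 for b in "ACGT"}
    w = len(sites[0])
    n = len(sites) + 4 * pseudocount
    total = 0.0
    for j in range(w):
        counts = Counter(site[j] for site in sites)
        for base in "ACGT":
            f = (counts[base] + pseudocount) / n   # smoothed column frequency
            total += f * math.log2(f / background[base])
    return total

conserved = ["TTGACA", "TTGACA", "TTGTCA", "TTGACA"]
scrambled = ["TTGACA", "ACGTAC", "GATCGT", "CGATTG"]
print(information_content(conserved), information_content(scrambled))
```

A well-conserved alignment scores higher than a scrambled one, which is the property a motif fitness function needs.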
10
Standard genetic operators
  • The scoring function is used by GAME as the
    fitness function
  • Representation
  • Vector A = (a_1, ..., a_m), where
    0 <= a_i <= l_i - w + 1
  • Mutation operator
  • One-point mutation
  • A = (a_1, ..., a_i, ..., a_m) ->
    A' = (a_1, ..., a_i', ..., a_m) with
    probability r (0.001, according to De Jong's
    suggestion for a small mutation rate)
  • Crossover
  • Single-point crossover
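
The two operators can be sketched as follows. This assumes that the mutation probability r applies independently to each position and that a mutated position is redrawn uniformly from its valid range; the slide does not spell out either detail. The function names are hypothetical.

```python
import random

def mutate(A, lengths, w, r=0.001, rng=random):
    """Point mutation: redraw a_i from 0..l_i-w+1 with probability r."""
    return tuple(
        rng.randint(0, l - w + 1) if rng.random() < r else a
        for a, l in zip(A, lengths)
    )

def crossover(A, B, rng=random):
    """Single-point crossover: swap the tails of two parents at a random cut."""
    cut = rng.randint(1, len(A) - 1)
    return A[:cut] + B[cut:], B[:cut] + A[cut:]

rng = random.Random(0)
parent1, parent2 = (4, 3, 5, 0, 11), (1, 9, 2, 7, 0)
child1, child2 = crossover(parent1, parent2, rng)
# r is raised to 0.5 here only so that the demo visibly mutates something.
mutant = mutate(parent1, lengths=[16] * 5, w=6, r=0.5, rng=rng)
print(child1, child2, mutant)
```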

11
Illustration of genetic operators
12
Genetic algorithm configuration
  • Population size N = 500
  • The N individual configurations are randomly
    grouped into N/2 pairs; each pair produces 2
    children, giving N/2 pairs of children in total
  • Tournament selection (replacement), size 2: the
    2N individuals are randomly paired together and
    the one with the better fitness in each pair is
    selected into the next generation
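
One generation under this configuration might look like the sketch below. It assumes the 2N individuals are the N parents plus their N children, omits mutation for brevity, and uses a toy fitness that is not the GAME scoring function.

```python
import random

def next_generation(population, fitness, rng=random):
    """Pair parents, breed 2 children per pair, then run size-2
    tournaments over the pooled parents and children."""
    parents = population[:]
    rng.shuffle(parents)
    children = []
    for p, q in zip(parents[0::2], parents[1::2]):
        cut = rng.randint(1, len(p) - 1)          # single-point crossover
        children += [p[:cut] + q[cut:], q[:cut] + p[cut:]]
    pool = parents + children                      # 2N individuals
    rng.shuffle(pool)
    # Each random pair contributes its fitter member: N survivors.
    return [max(a, b, key=fitness) for a, b in zip(pool[0::2], pool[1::2])]

# Toy fitness for illustration only: closeness to a hidden target vector.
target = (4, 3, 5, 0, 11)

def toy_fitness(A):
    return -sum(abs(a - t) for a, t in zip(A, target))

rng = random.Random(1)
pop = [tuple(rng.randint(0, 11) for _ in target) for _ in range(500)]
new_pop = next_generation(pop, toy_fitness, rng)
print(max(map(toy_fitness, pop)), max(map(toy_fitness, new_pop)))
```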

13
Additional genetic operators
  • Maximum number of generations: 3000
  • Convergence: no changes in 50 generations
  • Finally, Aopt is obtained from the GA
  • Additional operators
  • Actually not genetic operators but
    post-processing operators, applied only to Aopt
  • ADJUST
  • SHIFT
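
The stopping rule at the top of this slide can be sketched as a driver loop. It assumes "no changes" means the best fitness has not improved for 50 consecutive generations; `step` is a hypothetical stand-in for one GA generation.

```python
def run_until_converged(population, step, fitness,
                        max_generations=3000, patience=50):
    """Iterate `step` until the best fitness stalls or the budget runs out."""
    best = max(population, key=fitness)
    stale = 0
    for _ in range(max_generations):
        population = step(population)
        champion = max(population, key=fitness)
        if fitness(champion) > fitness(best):
            best, stale = champion, 0
        else:
            stale += 1
            if stale >= patience:
                break                 # converged: no improvement for `patience` steps
    return best

# Toy demo: individuals are integers that creep upward to a cap of 10.
calls = []

def toy_step(pop):
    calls.append(1)
    return [min(x + 1, 10) for x in pop]

best = run_until_converged([0, 3], toy_step, fitness=lambda x: x)
print(best, len(calls))
```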

14
ADJUST
  • The first type of local optimum
  • occurs when a majority of the motif sites have
    been aligned, with a few sites remaining to be
    aligned correctly
  • Even if only one motif site has not been aligned
    correctly, there are so many such local optima
    surrounding the true optimum that our standard
    genetic algorithm is easily trapped in one of
    them
  • ADJUST gets rid of this type of local optimum

15
ADJUST
  • Exhaustively check every possible better site
    position in each sequence
  • Until no further improvement
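
A minimal sketch of ADJUST as a coordinate-wise exhaustive search, assuming that 0 ("no site") is among the positions tried. The toy fitness is for illustration only and is not the GAME scoring function.

```python
def adjust(A, lengths, w, fitness):
    """For each sequence, try every start position; repeat until no
    single-site move improves the fitness."""
    A = list(A)
    best = fitness(tuple(A))
    improved = True
    while improved:
        improved = False
        for i, l in enumerate(lengths):
            for pos in range(0, l - w + 2):        # 0 .. l_i - w + 1
                candidate = A[:i] + [pos] + A[i + 1:]
                score = fitness(tuple(candidate))
                if score > best:
                    A, best, improved = candidate, score, True
    return tuple(A)

# Toy fitness: closeness to a hidden target alignment.
target = (4, 3, 5, 0, 11)

def toy_fitness(A):
    return -sum(abs(a - t) for a, t in zip(A, target))

# One site (the third) is misplaced; ADJUST repairs it.
adjusted = adjust((4, 3, 9, 0, 11), [16] * 5, 6, toy_fitness)
print(adjusted)
```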

16
SHIFT
  • All motif sites are slightly mis-aligned
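
SHIFT can be sketched as trying a common offset for all predicted sites and keeping the best-scoring one. The maximum offset (`max_shift`) and the clamping at sequence ends are assumptions; the slide gives no details.

```python
def shift(A, lengths, w, fitness, max_shift=3):
    """Move every predicted site by the same offset; keep the best offset."""
    def shifted(offset):
        out = []
        for a, l in zip(A, lengths):
            if a == 0:
                out.append(0)                                    # no site: unchanged
            else:
                out.append(min(max(a + offset, 1), l - w + 1))   # clamp to valid starts
        return tuple(out)

    candidates = [shifted(d) for d in range(-max_shift, max_shift + 1)]
    return max(candidates, key=fitness)        # offset 0 is included

# Toy fitness: closeness to a hidden target alignment.
target = (4, 3, 5, 0, 11)

def toy_fitness(A):
    return -sum(abs(a - t) for a, t in zip(A, target))

# Every site is two positions too far to the right.
realigned = shift((6, 5, 7, 0, 13), [20] * 5, 6, toy_fitness)
print(realigned)
```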

17
PWM-Scan
  • Extracts additional motif sites within a set of
    sequences, to handle multiple occurrences
  • Starts from Aopt and cycles through all
    remaining potential motif sites
  • Selects any additional sites that give a
    superior fitness score
  • Accept a new configuration with additional sites
    Aopt only if

18
The whole framework
GA Part
Post-processing
19
Multiple motifs
  • Iterative-masking approach
  • the binding sites of a discovered motif are
    masked out of the sequence dataset
  • then GAME is re-applied to this masked dataset to
    find additional motifs
  • Another issue: the repeated-sequence trap
  • the presence of repeated segments of DNA which
    are not transcription factor binding sites
  • these can be discovered and then masked out by
    GAME using the iterative-masking approach
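
The iterative-masking loop can be sketched as below. The mask character and the `search` callback (a stand-in for one full GAME run) are assumptions, and the toy search simply looks for upper-case windows.

```python
def mask_sites(seqs, A, w, mask_char="n"):
    """Replace each predicted site (1-based start a_i, width w) with mask_char."""
    masked = []
    for seq, a in zip(seqs, A):
        if a == 0:
            masked.append(seq)
        else:
            masked.append(seq[:a - 1] + mask_char * w + seq[a - 1 + w:])
    return masked

def find_motifs(seqs, w, search, n_motifs):
    """Run `search` repeatedly, masking each discovered motif before the next run."""
    found = []
    for _ in range(n_motifs):
        A = search(seqs, w)
        found.append(A)
        seqs = mask_sites(seqs, A, w)
    return found

def toy_search(seqs, w):
    # Stand-in for a GAME run: first all-upper-case window, or 0 if none.
    return tuple(
        next((i + 1 for i in range(len(s) - w + 1) if s[i:i + w].isupper()), 0)
        for s in seqs
    )

seqs = ["acgTTGACAgtGGCCAA", "ttTTGACAcgatcgga"]
motifs = find_motifs(seqs, 6, toy_search, n_motifs=2)
print(motifs)
```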

20
Extension to unknown motif width
  • Consider the motif width w to also be a random
    variable with a prior distribution p(w), such as
    Poisson(w0) with prior motif width w0
  • This variable-width model has a more complicated
    scoring function
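
The Poisson prior term alone is easy to write down; the full variable-width scoring function is not in the transcript and is not reproduced here. The value of w0 below is illustrative.

```python
import math

def log_poisson_prior(w, w0):
    """log p(w) for w ~ Poisson(w0): w*log(w0) - w0 - log(w!)."""
    return w * math.log(w0) - w0 - math.lgamma(w + 1)

w0 = 11
for w in range(w0 - 3, w0 + 4):
    print(w, round(log_poisson_prior(w, w0), 3))
```

Widths near w0 receive the highest prior weight, so the prior penalizes candidate motifs that are much shorter or longer than expected.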

21
Outline
  • Introduction
  • Methods
  • Results
  • Discussion

22
Simulation evaluation of GAME
  • Simulations approximating different biological
    scenarios
  • 200 sequence datasets
  • Number of sequences: small (20) or large (100)
  • Width of motif: short (8 bp) or long (16 bp)
  • Degree of conservation: high (91%) or low (70%)
  • Data scenario: noise-free or noisy (10% of
    sequences contain no motifs)

23
Metrics
  • F-scores
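
The slide's formula is not in the transcript. The sketch below assumes the usual balanced F-score, the harmonic mean of precision and recall, and counts a prediction as correct only when it matches a true site exactly. The site lists are made up.

```python
def f_score(predicted, true):
    """Balanced F-score from sets of predicted and true sites."""
    predicted, true = set(predicted), set(true)
    tp = len(predicted & true)                 # correctly predicted sites
    if tp == 0:
        return 0.0
    precision = tp / len(predicted)
    recall = tp / len(true)
    return 2 * precision * recall / (precision + recall)

# Sites as (sequence index, start position) pairs; values are illustrative.
true_sites = {(0, 4), (1, 3), (2, 5), (4, 11)}
pred_sites = {(0, 4), (1, 3), (2, 7), (3, 2), (4, 11)}
score = f_score(pred_sites, true_sites)
print(score)
```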

24
Noise Free
Noisy
In the noise-free scenario, GAME performs best,
in terms of F score, in each condition
In the noisy scenario, GAME generally shows
superior performance (in terms of F score) over
MEME, BioProspector and BioOptimizer
GAME's advantage becomes more pronounced as
conditions become more difficult (e.g. lower motif
conservation)
25
Real-data applications
  • CRP: 18 sequences, 105 bp, 23 sites
  • ERE: 25 sequences, 200 bp, 25 sites (one per
    sequence)
  • E2F family: 25 sequences, 200 bp, 27 sites
  • Five additional datasets for the TFs CREB, MEF2,
    MYOD, SRF and TBP from the recently published ABS
    database of annotated regulatory binding sites

26
Motif width
  • Variable-width version of GAME
  • Finds the optimal width within a range of 3 bp on
    either side of the expected motif width w0
  • Expected motif width w0
  • CRP: 22; ERE: 13; E2F: 11
  • 8, 7, 6, 10 and 6 for CREB, MEF2, MYOD, SRF and
    TBP
  • These are also the widths determined by
    experiments

27
GAME gives superior recall and comparable
precision when compared to BioOptimizer, MEME and
BioProspector, which results in a better F score
for GAME in each application
28
(No Transcript)
29
Outline
  • Introduction
  • Methods
  • Results
  • Discussion

30
Discussion
  • A Bayesian scoring function for motifs is
    adopted
  • Deals with sequences without motifs, multiple
    occurrences of motifs, varying motif lengths
  • GA is not fully utilized
  • Additional operators are not genetic
  • The refinement methods are post processing and ad
    hoc (only applied on Aopt)
  • An independent and improved method over
    BioOptimizer with global search

31
The End
  • Thank you!