Title: GAME: detecting cis-regulatory elements using a genetic algorithm
1GAME: detecting cis-regulatory elements using a genetic algorithm
- Presented by Cyrus
- March 22, 2007
2Outline
- Introduction
- Methods
- Results
- Discussion
4Binding of transcription factors
- TFBSs (transcription factor binding sites)
- Lie in close proximity (100-1000 bps) to a target gene
- The binding sites are short (<30 bps)
- Share a common sequence of nucleotides (consensus)
- Specific enough that the TF protein does not bind to many random locations
- The specificity cannot be absolute, because different genes require different binding affinities between the transcription factor and its target sites
5Goal
- Based on the scoring function used by BioOptimizer
- BioOptimizer is very simple (hill climbing) and can only make local improvements upon the result of another motif discovery method
- Goal
- Introduce an independent method using the scoring function
- More exhaustive search
- Eliminates the dependence on other programs
- GAME: a genetic algorithm using this scoring function
6Outline
- Introduction
- Methods
- Results
- Discussion
7Restricted solution space
- Dataset
- m upstream sequences, each of length l_i
- S_ij is the nucleotide in position j of sequence i
- A motif is represented by
- a matrix of binding site locations A, where A_ij = 1 if position j of sequence i is the motif starting site and 0 otherwise
- They do not expect much more than one binding site per sequence; multiple occurrences are handled later
- Then A can be restricted to a vector A = (a_1, ..., a_m), where a_i = 0 means no motif in sequence i
8An artificial example
- The upper case letters indicate TFBSs
- A is (4, 3, 5, 0, 11) in this example
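To make the vector representation concrete, here is a minimal Python sketch. The slide's figure is not available, so the five sequences below are made up for illustration; only the vector A = (4, 3, 5, 0, 11) comes from the example, and the motif width of 6 is an assumption.

```python
# Hypothetical toy data: upper-case letters mark the binding sites.
sequences = [
    "acgTTGACAgtacctag",
    "ttTTGACTcagtacgga",
    "gcatTTGACAcgtagct",
    "acgtacgtagctagcta",   # no site in this sequence
    "gcatcgatcgTAGACAt",
]
A = [4, 3, 5, 0, 11]   # 1-based start positions; 0 means "no motif"
W = 6                  # assumed motif width


def extract_sites(seqs, starts, width):
    """Return the site in each sequence, or None where a_i == 0."""
    sites = []
    for seq, a in zip(seqs, starts):
        sites.append(seq[a - 1:a - 1 + width] if a > 0 else None)
    return sites


sites = extract_sites(sequences, A, W)
```

Each entry of A is a 1-based start position, so the site in sequence i is the substring of width w beginning at a_i.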
9Bayesian scoring function for motifs
- Presented by Jensen et al. (Bioinformatics 2004)
- The original scoring function is closely related to a simpler function (formula not preserved in this transcript)
- |A| is the number of predicted sites
- The formula also uses the estimated motif abundance out of the possible site locations (symbols not preserved in this transcript)
- Information content: we discussed it before!
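Since the scoring formula itself is missing here, the sketch below shows only the information-content term, using the standard relative-entropy definition (sum over motif columns j and nucleotides k of f_jk * log(f_jk / theta_0k)). It is not the full Bayesian score; the pseudocount and the uniform background are assumptions made for the example.

```python
import math
from collections import Counter


def information_content(sites, background, pseudocount=0.5):
    """Relative entropy of the aligned sites against the background."""
    width = len(sites[0])
    alphabet = sorted(background)
    denom = len(sites) + pseudocount * len(alphabet)
    total = 0.0
    for j in range(width):
        counts = Counter(site[j] for site in sites)
        for k in alphabet:
            f = (counts[k] + pseudocount) / denom   # column frequency
            total += f * math.log(f / background[k])
    return total


uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
conserved = ["TTGACA", "TTGACA", "TTGACA", "TTGACA"]
mixed = ["TTGACA", "ACGTAC", "GGTCAT", "CATGTG"]
ic_conserved = information_content(conserved, uniform)
ic_mixed = information_content(mixed, uniform)
```

A well-conserved alignment scores higher than a set of unrelated sites, which is what makes this term useful inside a fitness function.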
10Standard genetic operators
- The scoring function is used by GAME as the fitness function
- Representation
- Vector A = (a_1, ..., a_m), where 0 <= a_i <= l_i - w + 1
- Mutation operator
- One-point mutation
- A = (a_1, ..., a_i, ..., a_m) -> A' = (a_1, ..., a_i', ..., a_m) with probability r (0.001, following De Jong's suggestion of a small mutation rate)
- Crossover
- Single-point crossover
11Illustration of genetic operators
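A minimal Python sketch of the two standard operators follows. The slides do not say whether r applies per position or per individual, nor how the new a_i is chosen; this sketch assumes a per-position probability and a uniform redraw over the allowed range. The function names are illustrative only.

```python
import random


def mutate(A, lengths, w, r=0.001, rng=random):
    """One-point mutation: each a_i is redrawn with probability r."""
    child = list(A)
    for i, l_i in enumerate(lengths):
        if rng.random() < r:
            child[i] = rng.randint(0, l_i - w + 1)   # 0 means "no site"
    return child


def crossover(parent1, parent2, rng=random):
    """Single-point crossover: swap the tails after a random cut."""
    cut = rng.randint(1, len(parent1) - 1)
    return (parent1[:cut] + parent2[cut:],
            parent2[:cut] + parent1[cut:])


rng = random.Random(0)
lengths = [20, 20, 20, 20, 20]
p1, p2 = [4, 3, 5, 0, 11], [1, 9, 0, 7, 2]
c1, c2 = crossover(p1, p2, rng)
# r is raised to 0.5 here only so the toy run shows some mutations.
m1 = mutate(c1, lengths, w=6, r=0.5, rng=rng)
```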
12Genetic algorithm configuration
- Population size N = 500
- The N individual configurations are randomly grouped into N/2 pairs; each pair produces 2 children, giving N/2 pairs of children in total
- Tournament selection (replacement) of size 2: the 2N individuals (parents and children) are randomly paired, and the one with the better fitness in each pair is selected into the next generation
13Additional genetic operators
- Maximum number of generations: 3000
- Convergence: no changes in 50 generations
- Finally, Aopt is obtained from the GA
- Additional operators
- Actually not genetic operators but post-processing operators, which are applied only to Aopt
- ADJUST
- SHIFT
14ADJUST
- The first type of local optimum
- Occurs when a majority of the motif sites have been aligned, with a few sites remaining to be aligned correctly
- Even if only one motif site has not been aligned correctly, there are so many such local optima surrounding the true optimum that the standard genetic algorithm is easily trapped in one of them
- ADJUST gets rid of this type of local optimum
15ADJUST
- Exhaustively check every possible better site position in each sequence
- Repeat until no further improvement
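One way to read the ADJUST step is as a coordinate-wise exhaustive search, sketched below. The toy fitness (matches to a made-up consensus) and the toy sequences are assumptions standing in for the Bayesian score and real data.

```python
def adjust(A, lengths, w, fitness):
    """Try every position in each sequence; repeat until no gain."""
    best = list(A)
    best_score = fitness(best)
    improved = True
    while improved:
        improved = False
        for i, l_i in enumerate(lengths):
            for pos in range(0, l_i - w + 2):       # 0 = no site
                if pos == best[i]:
                    continue
                trial = best[:i] + [pos] + best[i + 1:]
                score = fitness(trial)
                if score > best_score:
                    best, best_score, improved = trial, score, True
    return best, best_score


# Hypothetical toy data and fitness, for illustration only.
seqs = ["acgTTGACAgta", "ttTTGACAcagt", "gcatTTGACAcg"]


def toy_fitness(A, w=6, target="TTGACA"):
    score = 0
    for seq, a in zip(seqs, A):
        if a > 0:
            site = seq[a - 1:a - 1 + w].upper()
            score += sum(x == y for x, y in zip(site, target))
    return score


start = [4, 3, 1]                                   # third site misaligned
adjusted, score = adjust(start, [len(s) for s in seqs], 6, toy_fitness)
```

In this toy run only the third site is wrong, which is exactly the situation ADJUST targets: a single coordinate search moves it to the correct position.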
16SHIFT
- All motif sites are slightly mis-aligned
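A sketch of how a SHIFT step could correct this: move every site by the same offset and keep the offset that scores best. The maximum offset of 3 and the toy fitness are assumptions; the slides do not specify either.

```python
def shift(A, lengths, w, fitness, max_shift=3):
    """Try shifting all sites together; keep the best-scoring offset."""
    best, best_score = list(A), fitness(A)
    for d in range(-max_shift, max_shift + 1):
        if d == 0:
            continue
        trial, valid = [], True
        for a, l_i in zip(A, lengths):
            if a == 0:                      # "no site" stays "no site"
                trial.append(0)
            elif 1 <= a + d <= l_i - w + 1:
                trial.append(a + d)
            else:                           # site would fall off the end
                valid = False
                break
        if valid:
            score = fitness(trial)
            if score > best_score:
                best, best_score = trial, score
    return best, best_score


# Hypothetical toy data and fitness, for illustration only.
seqs = ["acgTTGACAgta", "ttTTGACAcagt", "gcatTTGACAcg"]


def toy_fitness(A, w=6, target="TTGACA"):
    score = 0
    for seq, a in zip(seqs, A):
        if a > 0:
            site = seq[a - 1:a - 1 + w].upper()
            score += sum(x == y for x, y in zip(site, target))
    return score


shifted, score = shift([6, 5, 7], [len(s) for s in seqs], 6, toy_fitness)
```

Here every site starts two positions too far to the right, so no single-site change helps much, but one joint shift recovers the alignment.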
17PWM-Scan
- Extracts additional motif sites within a set of sequences, to handle multiple occurrences
- Starts from Aopt and cycles through all remaining potential motif sites
- Selects any additional sites that give a superior fitness score
- Accepts a new configuration with additional sites over Aopt only if (acceptance condition not preserved in this transcript)
18The whole framework
- GA part
- Post-processing
19Multiple motifs
- Iterative-masking approach
- The binding sites of a discovered motif are masked out of the sequence dataset
- Then GAME is re-applied to this masked dataset to find additional motifs
- Another issue: the repeated-sequence trap
- Repeated segments of DNA that are not transcription factor binding sites may be present
- These can be discovered and then masked out by GAME using the iterative-masking approach
20Extension to unknown motif width
- Consider the motif width w to also be a random variable with a prior distribution p(w), such as a Poisson(w_0) with prior motif width w_0
- This variable-width model has a more complicated scoring function
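As a small illustration of the prior alone, the sketch below evaluates log p(w) for a Poisson(w_0) prior over a few candidate widths. How this term enters the variable-width scoring function is not shown in the slides, so nothing beyond the prior is attempted; the value w_0 = 11 and the candidate range are example choices.

```python
import math


def log_poisson_prior(w, w0):
    """log p(w) for w ~ Poisson(w0): w*log(w0) - w0 - log(w!)."""
    return w * math.log(w0) - w0 - math.lgamma(w + 1)


w0 = 11
log_priors = {w: log_poisson_prior(w, w0) for w in range(w0 - 3, w0 + 4)}
```

Widths close to w_0 receive the highest prior weight, so a wider or narrower motif has to earn its place through a better fit to the data.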
21Outline
- Introduction
- Methods
- Results
- Discussion
22Simulation evaluation of GAME
- Simulations approximating different biological scenarios
- 200 sequence datasets
- Number of sequences: small (20) or large (100)
- Width of motif: short (8 bp) or long (16 bp)
- Degree of conservation: high (91%) or low (70%)
- Data scenario: noise-free or noisy (10% contain no motifs)
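A rough sketch of how one such dataset might be generated is given below. Several details are assumptions, since the slides do not state them: the sequence length (200 here), the uniform background, the consensus string, and reading "conservation" as the per-position probability of keeping the consensus nucleotide.

```python
import random


def simulate_dataset(n_seqs, seq_len, consensus, conservation,
                     frac_without_motif=0.0, rng=random):
    """Plant one noisy copy of `consensus` in most random sequences."""
    seqs, starts = [], []
    w = len(consensus)
    for _ in range(n_seqs):
        seq = [rng.choice("ACGT") for _ in range(seq_len)]
        if rng.random() < frac_without_motif:
            starts.append(0)                        # no motif planted
        else:
            site = [c if rng.random() < conservation
                    else rng.choice("ACGT".replace(c, ""))
                    for c in consensus]
            start = rng.randint(1, seq_len - w + 1)  # 1-based start
            seq[start - 1:start - 1 + w] = site
            starts.append(start)
        seqs.append("".join(seq))
    return seqs, starts


# Small, short-motif, high-conservation, noisy scenario.
seqs, starts = simulate_dataset(20, 200, "TTGACATA", 0.91, 0.10,
                                rng=random.Random(0))
```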
23Metrics
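The metric definitions did not survive in this transcript. Because the results are reported as precision, recall and F score, the sketch below uses the standard definitions, with F as the harmonic mean of precision and recall. Treating a prediction as correct only when its (sequence, start) pair matches exactly is an assumption; the matching criterion used in the talk is not known.

```python
def precision_recall_f(predicted, true):
    """Standard precision, recall and F score over sets of sites."""
    predicted, true = set(predicted), set(true)
    tp = len(predicted & true)                      # correctly found sites
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(true) if true else 0.0
    f = (2 * precision * recall / (precision + recall)
         if precision + recall else 0.0)
    return precision, recall, f


# Hypothetical (sequence index, start position) pairs.
true_sites = {(0, 4), (1, 3), (2, 5), (4, 11)}
predicted_sites = {(0, 4), (1, 3), (2, 6)}
p, r, f = precision_recall_f(predicted_sites, true_sites)
```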
24Noise-free and noisy scenarios
- In the noise-free scenario, GAME performs best, in terms of F score, in each condition
- In the noisy scenario, GAME generally shows superior performance (in terms of F score) over MEME, BioProspector and BioOptimizer
- GAME's advantage becomes more pronounced as conditions become more difficult (e.g. lower motif conservation)
25Real-data applications
- CRP: 18 sequences, 105 bps, 23 sites
- ERE: 25 sequences, 200 bps, 25 sites (one per sequence)
- E2F family: 25 sequences, 200 bps, 27 sites
- Five additional datasets for the TFs CREB, MEF2, MYOD, SRF and TBP, from the recently published ABS database of annotated regulatory binding sites
26Motif width
- Variable-width version of GAME
- Finds the optimal width within a range of 3 bps on either side of the expected motif width w_0
- Expected motif width w_0
- CRP: 22, ERE: 13, E2F: 11
- 8, 7, 6, 10 and 6 for CREB, MEF2, MYOD, SRF and TBP
- These are also the widths determined by experiments
27GAME gives superior recall and comparable
precision when compared to BioOptimizer, MEME and
BioProspector, which results in a better F score
for GAME in each application
29Outline
- Introduction
- Methods
- Results
- Discussion
30Discussion
- A Bayesian scoring function for motifs is adopted
- Deals with sequences without motifs, multiple occurrences of motifs, and varying motif lengths
- The GA is not fully utilized
- The additional operators are not genetic
- The refinement methods are post-processing and ad hoc (applied only to Aopt)
- An independent and improved method over BioOptimizer, with global search
31The End