Title: Chipmunk: a fast DNA motif finder for ChIP data and its application to data integration from differe
1Chipmunk: a fast DNA motif finder for ChIP data
and its application to data integration from
different experimental sources
- Ivan V. Kulakovskiy
- Valentina A. Boeva
- Alexander V. Favorov
- Vsevolod J. Makeev
- Engelhardt Institute of Molecular Biology,
Russian Academy of Sciences, Vavilov str. 32,
Moscow 119991, Russia
2Overview
- general concepts and basic ideas
- current realization and applications
- further enhancements
3General concepts
Transcription factor
DNA
Recognition motif
Experimental methods:
- Classic (~10 sequences): SELEX, DNase footprinting, etc.
- High-throughput (~1000 sequences): ChIP-based
Models incorporating positional dependencies
Motif finding
TF binding sites:
..GGATTA..
..ttGCATTAaa..
..aactgtattcgtgatgctaggattaatgatcga
..acacgtagctagctagatcgatcgattagacac
Binding site model: consensus, magic word, ..., P(C\P\W)M
Motif discovery
Position (Count, Probability, Frequency, Weight, Specific Score) Matrix
4P(C\P\W)M as the motif model
GCTCATAAAT
AACGATAATG
CGAAATAAAA
TGTCATAAAA
AATCATAAAA
GATCATAAAA
AGCCGTAAAA
[Figure: matrices with rows A, C, G, T over positions 1-10; pseudocount (10.254)]
5From GMLA to PWM
- Gapless Multiple Local Alignment
- PCM → PP(F)M → PW(SS)M
weighted PCM
S_a: score of the letter a; n: total number of sequences; a: pseudocount; n_a: PCM element; i: position within the motif (Lifanov, 2003)
Sequences: aaGGATTAaaGCATTAaa, ccGGATCAaa
Words: GGATCA (weight 1), GCATTA (weight ½), GGATTA (weight ½)
wPCM:
     1    2    3    4    5    6
A    0    0    2    0    0    2
G    2    1.5  0    0    0    0
C    0    0.5  0    0    1    0
T    0    0    0    2    1    0
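The weighted PCM example above (GGATCA with weight 1, GCATTA and GGATTA with weight ½ each) can be reproduced with a few lines of Python. This is only an illustrative sketch, not the Chipmunk source code; the function name `weighted_pcm` and the dictionary layout are assumptions made for the example.

```python
from typing import Dict, List, Sequence, Tuple


def weighted_pcm(sites: Sequence[Tuple[str, float]]) -> Dict[str, List[float]]:
    """Build a weighted position count matrix from aligned, equal-length words.

    Each word adds its weight (instead of 1) to the cell of the letter
    observed at every position.
    """
    length = len(sites[0][0])
    pcm = {letter: [0.0] * length for letter in "ACGT"}
    for word, weight in sites:
        if len(word) != length:
            raise ValueError("all words in a gapless alignment must have equal length")
        for position, letter in enumerate(word.upper()):
            pcm[letter][position] += weight
    return pcm


# Words and weights taken from the slide example.
sites = [("GGATCA", 1.0), ("GCATTA", 0.5), ("GGATTA", 0.5)]
pcm = weighted_pcm(sites)
for letter in "AGCT":
    print(letter, pcm[letter])
```

Every column of the resulting matrix sums to the total weight of the words (2 in this example), which is the quantity that plays the role of the sequence count in an unweighted PCM.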
6Estimating quality of the GMLA: the Discrete
Information Content
I = 0 corresponds to 100% conservation; m is the motif size
7Chipmunk assumptions
- WPCM and DIC
- very similar to PCM and Is
- one-occurrence-per-sequence (OOPS)
- can have a negative effect on performance
- each data source has the same impact on the resulting motif, but within one dataset sequences can have different weights
- able to handle datasets of different sizes
- EM accompanied by bootstrapping
- explicit randomization: performance gain (up to 50)
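One possible reading of the "same impact per data source" assumption is that the weights inside each dataset are rescaled to the same total. The sketch below illustrates that reading only; the slides do not spell out the normalization, so `balance_datasets`, the target total of 1 and the dataset names are assumptions. The helper `ordered_weights` follows the "ordered" weighting mode described on the details slide (weight inversely proportional to the position of the sequence in the input file).

```python
from typing import Dict, List


def ordered_weights(n_sequences: int) -> List[float]:
    """Weights inversely proportional to the 1-based position in the input file."""
    return [1.0 / rank for rank in range(1, n_sequences + 1)]


def balance_datasets(datasets: Dict[str, List[float]]) -> Dict[str, List[float]]:
    """Rescale each dataset so that its weights sum to 1 (an assumed convention).

    Relative weights inside a dataset are preserved; only the total changes,
    so a small footprint set and a large ChIP set contribute equally.
    """
    balanced = {}
    for name, weights in datasets.items():
        total = sum(weights)
        if total <= 0:
            raise ValueError(f"dataset {name!r} has no positive weight")
        balanced[name] = [w / total for w in weights]
    return balanced


# Hypothetical input: four equally trusted footprints and three ranked ChIP regions.
raw = {"footprints": [1.0] * 4, "chip": ordered_weights(3)}
balanced = balance_datasets(raw)
for name, weights in balanced.items():
    print(name, [round(w, 3) for w in weights])
```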
8Chipmunk weighting procedure
9Chipmunk algorithm
10Chipmunk pseudocode
11Chipmunk-on-the-web
http://line.imb.ac.ru/Chipmunk/
12Results on Human ChIP-seq data
The tests were performed on an Intel Core Duo T2500 (2 GHz) running Ubuntu 8.10, with Sun Java 1.6 for Chipmunk. MEME was used in the one-occurrence-per-sequence mode, with sequences weighted according to procedure 2.2.1 and scaled to fit the standard MEME 0-1 weight interval. Chipmunk was run with the standard parameters (100, 10, 1). Sequences were gathered by CisGenome (Ji et al., 2008) from the data of Johnson et al. (2007) and Valouev et al. (2008).
13Results on Human ChIP-seq data
Motif logos for the motifs identified in ChIP-seq data by MEME and Chipmunk using the weighting scheme described in 2.2.1. Sequence weights for MEME were proportionally scaled to fit the standard MEME 0-1 weight interval. The TRANSFAC motif represents the matrix available in the TRANSFAC database. Letters are scaled in proportion to the column DIC. For the GABP motif, the MEME result is missing because of the excessive computation time.
14Data integration for Human Sp1 TF
Sp1 motifs are obtained in three ways: as is from TRANSFAC; by Chipmunk from the TRRD Sp1 dataset (mostly footprint-based); and by integrating GABP ChIP-seq data with the TRRD sequences in the standard weighting mode.
15Drosophila TF binding data integration
The Bicoid motif was constructed from 329 sequences in total: 48 footprints (Bergman et al., 2005), 35 + 24 SELEX sequences (from the curated sets of C. Bergman and D. Papatsenko), 22 sequences from the bacterial one-hybrid system (Noyes et al., 2008) and the top 200 ChIP-chip regions (BDTNP project).
16Different experimental data sources integration
(for D. melanogaster)
- ChIP-chip
- BDTNP: 7 TFs, 1000 sequences, 1000 bp
- SELEX
- C. Bergman: 26 TFs, 10, 10
- Footprinting, 2005
- C. Bergman: 87, 1-100, 1-10
- Noyes bacterial one-hybrid system, 2008
- M. Noyes: 26 84 HD, 10, 10
- For 34 TFs there are two or more data sources
17iDMMPMM resource
http://line.imb.ac.ru/iDMMPMM/
18Chipmunk enhancements
- Already done
- Multicore CPU support
- 2x faster using 2 threads, more than 5x faster using 8 virtual threads (Sun Java 6 on Core i7)
- Weight profiles
- using complete positional information (from ChIP-related data or some other constraints, like nucleosome positioning) to ensure correct motif extraction
- Seeds support
- use the given set of sequences to create starting points for the EM
- To be done
- Get rid of the OOPS model
- Improve convergence stability
- Gather statistics for the algorithm parameters limiting iteration numbers
19Coauthors and affiliations
- Valentina A. Boeva
- Curie Institute, 26 rue d'Ulm, 75248, Paris cedex 05, France
- State Scientific Institute of Genetics and Selection of Industrial Microorganisms, GosNIIgenetika, 1st Dorozhny proezd, 1, Moscow 117545, Russia
- Alexander V. Favorov
- The Sidney Kimmel Comprehensive Cancer Center at Johns Hopkins, Baltimore, MD 21231, USA
- State Scientific Institute of Genetics and Selection of Industrial Microorganisms, GosNIIgenetika, 1st Dorozhny proezd, 1, Moscow 117545, Russia
- Vsevolod J. Makeev
- State Scientific Institute of Genetics and Selection of Industrial Microorganisms, GosNIIgenetika, 1st Dorozhny proezd, 1, Moscow 117545, Russia
- Engelhardt Institute of Molecular Biology, Russian Academy of Sciences, Vavilov str. 32, Moscow 119991, Russia
20Thank you for your attention
21Supplementary slides
- EM meets a random subset
- Chipmunk details
- DIC and its connection to the classic information content (Schneider, 1986)
- DIC and its connection to the entropy
22EM meets a random subset
EMUC: EM up to convergence; EMIL: EM with a limited number of iterations
[Figure panels:]
Initial optimization task and a simplified IC landscape
EMUC on the total data set using a random starting point
EMUC on a subset using a random starting point
EMIL on a subset using a random starting point
EMIL on a subset using a local-maximum starting point
23Chipmunk details
- Program parameters and length estimation
- Chipmunk searches the motif space starting from the maximum allowed motif length down to the minimum, stopping on the first strong motif. Each sequence in the set has an associated floating-point weight, which determines its contribution to the final motif.
- sequence weighting model: "standard" for equal sequence quality in the set (footprinting, SELEX, etc.); "ordered" for a sequence set ordered by sequence quality (each sequence receives a weight inversely proportional to its number in the input file; intended for ChIP-related data).
- try-limit: the number of random matrices to generate. 100 is suitable for good sets. May be extended up to 1000 (and even more for the stand-alone version) for precise motif search.
- step-limit: the number of double-optimization steps for each turn of the random matrix generation. May be increased up to 100 (and even more for the stand-alone version) for precise motif search.
- iteration-limit: the maximum number of iterations on the random subset at each double-optimization step. Should be one or, in rare cases, two. Larger values will slow down the convergence of the double-step optimization.
- Strong motif definition
- We call a motif strong if it has high-conservation alignment columns somewhere to the left and somewhere to the right of every no-conservation column. So we allow non-conservative regions (i.e. fixed-length gaps) only if they are surrounded by columns with high conservation. In particular, this means that in a strong motif the first and the last alignment columns have at least low conservation.
- A column has high conservation if its DIC is greater than or equal to the high-conservation threshold Thc. A column has low conservation if its DIC is below Thc but not below the low-conservation threshold Tlc. A column has no conservation if its DIC is lower than Tlc.
- Thc is defined as the discrete information content of a column where only 3 of the 4 possible nucleotides are present in equal amounts, i.e. (N/3, N/3, N/3, 0). Tlc is defined as the discrete information content of a column where one pair of nucleotides is 2 times more frequent than the other pair, i.e. (2N/6, 2N/6, N/6, N/6). N here is the total weight of the sequence set (the total number of sequences).
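The strong-motif rule and the longest-first length search can be sketched as follows. This is a minimal illustration, not the Chipmunk implementation: the per-column DIC values and the thresholds Thc and Tlc are taken as plain numbers supplied by the caller, the callback `column_dics_for_length` is a hypothetical stand-in for the actual motif search at a given length, and the numeric values in the usage part are made up.

```python
from typing import Callable, List, Optional, Sequence


def classify_column(dic: float, thc: float, tlc: float) -> str:
    """Label a column 'high', 'low' or 'none' by comparing its DIC to Thc and Tlc."""
    if dic >= thc:
        return "high"
    if dic >= tlc:
        return "low"
    return "none"


def is_strong(column_dics: Sequence[float], thc: float, tlc: float) -> bool:
    """A motif is strong if every no-conservation column has a high-conservation
    column somewhere to its left and somewhere to its right."""
    labels: List[str] = [classify_column(d, thc, tlc) for d in column_dics]
    for i, label in enumerate(labels):
        if label == "none":
            if "high" not in labels[:i] or "high" not in labels[i + 1:]:
                return False
    return True


def first_strong_length(
    max_len: int,
    min_len: int,
    column_dics_for_length: Callable[[int], Sequence[float]],
    thc: float,
    tlc: float,
) -> Optional[int]:
    """Try lengths from max_len down to min_len; return the first strong one."""
    for length in range(max_len, min_len - 1, -1):
        if is_strong(column_dics_for_length(length), thc, tlc):
            return length
    return None


# Made-up thresholds and column scores, for illustration only.
THC, TLC = -0.5, -1.0
toy_results = {
    5: [-1.5, -0.1, -0.2, -1.4, -0.3],  # starts with a no-conservation column
    4: [-0.7, -0.1, -1.2, -0.3],        # low, high, none, high
}
print(first_strong_length(5, 4, lambda n: toy_results[n], THC, TLC))
```

In this toy input the length-5 candidate is rejected because its first column has no conservation and nothing to its left, so the search falls through to length 4, where the single no-conservation column is flanked by high-conservation columns.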
24DIC and its connection to the classic Is
25DIC and its connection to the entropy