1
Chipmunk: a fast DNA motif finder for ChIP data
and its application to data integration from
different experimental sources

Ivan V. Kulakovskiy
Valentina A. Boeva
Alexander V. Favorov
Vsevolod J. Makeev
Engelhardt Institute of Molecular Biology,
Russian Academy of Sciences, Vavilov str. 32,
Moscow 119991, Russia

2
Overview

general concepts and basic ideas
current implementation and applications
further enhancements

3
General concepts
Transcription factor
DNA Recognition motif DNA
Experimental methods
 - Classic (10): SELEX, DNase footprinting, etc.
 - High-throughput (1000): ChIP-based
Models incorporating positional dependencies
Motif finding
TF binding sites
..GGATTA..
..ttGCATTAaa..
..aactgtattcgtgatgctaggattaatgatcga
..acacgtagctagctagatcgatcgattagacac
Binding site model: consensus, magic word, ... P(C\P\W)M
Motif discovery
Position (Count, Probability, Frequency, Weight,
Specific Score) Matrix
4
P(C\P\W)M as the motif model
Aligned binding sites:
GCTCATAAAT
AACGATAATG
CGAAATAAAA
TGTCATAAAA
AATCATAAAA
GATCATAAAA
AGCCGTAAAA
[Figure: matrices with rows A, C, G, T and columns 1 to 10; pseudocount (10.254)]
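The count matrix for the sites on this slide can be recomputed directly from the alignment. The following is a minimal sketch, not Chipmunk code: the function name `position_count_matrix` is hypothetical, and the pseudocount step is left out because its exact form is not recoverable from the slide.

```python
from collections import Counter

# The seven aligned 10-bp sites listed on this slide.
ALIGNED_SITES = [
    "GCTCATAAAT", "AACGATAATG", "CGAAATAAAA", "TGTCATAAAA",
    "AATCATAAAA", "GATCATAAAA", "AGCCGTAAAA",
]


def position_count_matrix(sites):
    """Return {letter: [count in each column]} for equal-length aligned sites."""
    length = len(sites[0])
    if any(len(site) != length for site in sites):
        raise ValueError("all sites must have the same length")
    pcm = {letter: [0] * length for letter in "ACGT"}
    for column in range(length):
        # Count the letters observed in this alignment column.
        counts = Counter(site[column] for site in sites)
        for letter in "ACGT":
            pcm[letter][column] = counts[letter]
    return pcm


pcm = position_count_matrix(ALIGNED_SITES)
for letter in "ACGT":
    print(letter, *pcm[letter])
```

Each column sums to 7, the number of aligned sites.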
5
From GMLA to PWM

Gapless Multiple Local Alignment
PCM PP(F)M PW(SS)M

weighted PCM
Symbols [Lifanov, 2003]:
Sa: score of the letter a
n: total number of sequences
a: pseudocount
na: PCM element
i: position within the motif
Example:
aaGGATTAaaGCATTAaa: GGATTA (weight ½), GCATTA (weight ½)
ccGGATCAaa: GGATCA (weight 1)
wPCM
A  0  0    2  0  0  2
G  2  1.5  0  0  0  0
C  0  0.5  0  0  1  0
T  0  0    0  2  1  0
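The weighted PCM in this example can be reproduced by adding each site's weight, rather than 1, to the matrix cell of its letter. This is a small sketch under that reading of the slide; `weighted_pcm` and `EXAMPLE_SITES` are illustrative names, not part of Chipmunk.

```python
def weighted_pcm(weighted_sites):
    """Build a weighted PCM from (site, weight) pairs of equal-length sites."""
    length = len(weighted_sites[0][0])
    wpcm = {letter: [0.0] * length for letter in "ACGT"}
    for site, weight in weighted_sites:
        if len(site) != length:
            raise ValueError("all sites must have the same length")
        for column, letter in enumerate(site.upper()):
            # Each site contributes its weight instead of a unit count.
            wpcm[letter][column] += weight
    return wpcm


# Sites and weights from the slide: two sites share one sequence (weight 1/2
# each), the third site is alone in its sequence (weight 1).
EXAMPLE_SITES = [("GGATCA", 1.0), ("GCATTA", 0.5), ("GGATTA", 0.5)]

wpcm = weighted_pcm(EXAMPLE_SITES)
for letter in "AGCT":
    print(letter, *wpcm[letter])
```

Every column sums to 2, the total weight of the two input sequences.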
6
Estimating quality of the GMLA: the Discrete
Information Content
I = 0 corresponds to 100% conservation; m is the motif size
7
Chipmunk assumptions

WPCM DIC
 - very similar to PCM Is
one-occurrence-per-sequence (OOPS)
 - can have a negative effect on performance
each data source has the same impact on the
resulting motif, but within one dataset sequences
can have different weights
 - able to handle datasets of different sizes
EM accompanied by bootstrapping
 - explicit randomization, performance gain (up to 50%)

8
Chipmunk weighting procedure
9
Chipmunk algorithm
10
Chipmunk pseudocode
11
Chipmunk-on-the-web
http://line.imb.ac.ru/Chipmunk/
12
Results on Human ChIP-seq data
The tests were performed on an Intel Core Duo T2500
(2 GHz) running Ubuntu 8.10, with Sun Java 1.6 for
Chipmunk. MEME was used in the
one-occurrence-per-sequence mode with sequences
weighted according to procedure 2.2.1 and scaled to
fit the standard MEME [0, 1] weight interval.
Chipmunk was run with the standard parameters
(100, 10, 1). Sequences were gathered by CisGenome
[Ji et al., 2008] on data from [Johnson et al.,
2007; Valouev et al., 2008].
13
Results on Human ChIP-seq data
Motif logos for motifs that were identified in
ChIP-seq data by MEME and Chipmunk using the
weighting scheme described in 2.2.1. Sequence
weights for MEME were proportionally scaled to
fit the standard MEME [0, 1] weight interval.
The TRANSFAC motif represents the matrix available
in the TRANSFAC database. Letters are scaled
proportionally to the column DIC. For the GABP
motif the MEME result is missing because of the
excessive computation time.
14
Data integration for Human Sp1 TF
Sp1 motifs are obtained as is from TRANSFAC,
by Chipmunk from the TRRD Sp1 dataset (mostly
footprint-based), and by integrating GABP
ChIP-seq data with the TRRD sequences in the
standard weighting mode.
15
Drosophila TF binding data integration
The Bicoid motif was constructed from 329 sequences
in total: 48 footprints (Bergman et al., 2005),
35 + 24 SELEX sequences (from the curated sets of
C.Bergman and D.Papatsenko), 22 sequences from the
bacterial one-hybrid system (Noyes et al., 2008),
and the top 200 ChIP-chip regions (BDTNP project).
16
Different experimental data sources integration
(for D. melanogaster)

Data source                          Provided by  TFs       Sequences  Length, bp
ChIP-chip                            BDTNP        7         1000       1000
SELEX                                C.Bergman    26        10         10
Footprinting (2005)                  C.Bergman    87        1-100      1-10
Bacterial one-hybrid (Noyes, 2008)   M.Noyes      26 84 HD  10         10
For 34 TFs there are two or more data sources

17
iDMMPMM resource
http://line.imb.ac.ru/iDMMPMM/
18
Chipmunk enhancements

Already done
Multicore CPU support
2x faster using 2 threads, more than 5x faster
using 8 virtual threads (Sun Java 6 on Core i7)
Weight profiles
using complete positional information to ensure
correct motif extraction from ChIP-related data,
or some other constraints (like nucleosome
positioning)
Seeds support
use the given set of sequences to create starting
points for the EM
To be done
Get rid of the OOPS model
Improve convergence stability
Gather statistics for the algorithm parameters
limiting iteration numbers

19
Coauthors and affiliations

Valentina A. Boeva
Curie Institute, 26 rue d'Ulm, 75248, Paris cedex
05, France
State Scientific Institute of Genetics and
Selection of Industrial Microorganisms,
GosNIIgenetika, 1st Dorozhny proezd, 1, Moscow
117545, Russia
Alexander V. Favorov
The Sidney Kimmel Comprehensive Cancer Center at
Johns Hopkins, Baltimore, MD 21231, USA
State Scientific Institute of Genetics and
Selection of Industrial Microorganisms,
GosNIIgenetika, 1st Dorozhny proezd, 1, Moscow
117545, Russia
Vsevolod J. Makeev
State Scientific Institute of Genetics and
Selection of Industrial Microorganisms,
GosNIIgenetika, 1st Dorozhny proezd, 1, Moscow
117545, Russia
Engelhardt Institute of Molecular Biology,
Russian Academy of Sciences, Vavilov str. 32,
Moscow 119991, Russia

20
Thank you for your attention
21
Supplementary slides

EM meets a random subset
Chipmunk details
DIC and its connection to the classic information
content [Schneider, 1986]
DIC and its connection to the entropy

22
EM meets a random subset
EMUC: EM up to convergence; EMIL: EM with a
limited number of iterations
Initial optimization task and a simplified IC
landscape
EMUC on the total data set using a random starting
point
EMUC on a subset using a random starting point
EMIL on a subset using a random starting point
EMIL on a subset using a local maximum as the
starting point
23
Chipmunk details

Program parameters and length estimation
Chipmunk searches the motif space starting from
the maximum allowed motif length down to the
minimum, stopping at the first strong motif. Each
sequence in the set has an associated
floating-point weight, which determines its
contribution to the final motif.
sequence weighting model - standard, for equal
sequence quality in the set (footprinting, SELEX,
etc.); ordered, for a sequence set ordered by
sequence quality (each sequence receives a
weight inversely proportional to its number in
the input file; intended for ChIP-related data).
try-limit - the number of random matrices to
generate. 100 is suitable for good sets. May be
extended up to 1000 (and even more for the
stand-alone version) for precise motif search.
step-limit - the number of double-optimization
steps for each turn of the random matrix
generation. May be increased up to 100 (and even
more for the stand-alone version) for precise
motif search.
iteration-limit - the maximum number of iterations
on the random subset at each double-optimization
step. Should be one or, in rare cases, two. Larger
values will slow down the convergence of the
double-step optimization.
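The two sequence weighting models described above can be sketched as follows. This is only an illustration: the function name `sequence_weights` is hypothetical, and the proportionality constant of the ordered model is assumed to be 1 (weight 1/rank), which the slides do not specify.

```python
def sequence_weights(n_sequences, model="standard"):
    """Return one weight per sequence, in input-file order.

    standard: every sequence gets the same weight (equal sequence quality).
    ordered:  weight inversely proportional to the 1-based position in the
              file; the constant 1 used here is an assumption.
    """
    if n_sequences < 1:
        raise ValueError("need at least one sequence")
    if model == "standard":
        return [1.0] * n_sequences
    if model == "ordered":
        return [1.0 / rank for rank in range(1, n_sequences + 1)]
    raise ValueError("unknown weighting model: " + model)


print(sequence_weights(3, "standard"))
print(sequence_weights(4, "ordered"))
```

With the ordered model, the top-ranked ChIP region dominates and later regions contribute progressively less to the motif.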
Strong motif definition
We call a motif strong if it has high-conservation
alignment columns somewhere to the left and
somewhere to the right of every no-conservation
column. So we allow non-conservative regions
(i.e. fixed-length gaps) only if they are
surrounded by columns with high conservation. In
particular, this means that the first and the
last alignment columns of a strong motif have at
least low conservation.
A column has high conservation if its DIC is
greater than or equal to the high conservation
threshold Thc. A column has low conservation if
its DIC is between Thc and the low conservation
threshold Tlc. A column has no conservation if its
DIC is lower than Tlc. Thc is defined as the
discrete information content of a column where
only 3 of the 4 possible nucleotides are present,
i.e. N/3, N/3, N/3, 0. Tlc is defined as the
discrete information content of a column where
one pair of nucleotides is twice as frequent as
the other pair, i.e. 2N/6, 2N/6, N/6, N/6. Here N
is the total weight of the sequence set (the
total number of sequences).
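The strong-motif rule above can be sketched in code. The slides do not give the DIC formula, so `column_dic` below is an assumption: a per-column value of the form (sum of log x! minus log N!) / N, computed with `lgamma` so that fractional weights work. It is chosen only because it equals 0 for a fully conserved column. All function names are hypothetical.

```python
from math import lgamma


def column_dic(counts):
    """ASSUMED per-column DIC: (sum(log x!) - log N!) / N; 0 means full conservation."""
    total = sum(counts)
    if total <= 0:
        raise ValueError("column must have positive total weight")
    return (sum(lgamma(x + 1.0) for x in counts) - lgamma(total + 1.0)) / total


def thresholds(total):
    """Thc and Tlc as defined on the slide, for total weight N."""
    thc = column_dic([total / 3, total / 3, total / 3, 0.0])
    tlc = column_dic([2 * total / 6, 2 * total / 6, total / 6, total / 6])
    return thc, tlc


def classify_column(counts):
    """Label a column (A, C, G, T weights) as 'high', 'low' or 'none'."""
    thc, tlc = thresholds(sum(counts))
    dic = column_dic(counts)
    if dic >= thc:
        return "high"
    if dic >= tlc:
        return "low"
    return "none"


def is_strong(columns):
    """True if every no-conservation column has a high column on both sides."""
    labels = [classify_column(column) for column in columns]
    for i, label in enumerate(labels):
        if label == "none":
            if "high" not in labels[:i] or "high" not in labels[i + 1:]:
                return False
    return True


conserved, mixed, uniform = [12, 0, 0, 0], [6, 3, 2, 1], [3, 3, 3, 3]
print(classify_column(conserved), classify_column(mixed), classify_column(uniform))
print(is_strong([mixed, conserved, uniform, conserved, mixed]))
print(is_strong([uniform, conserved, conserved]))
```

Under this assumed formula, a no-conservation column is tolerated only as an internal gap flanked by high-conservation columns, so a motif that starts or ends with such a column is rejected.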