Chipmunk: a fast DNA motif finder for ChIP data and its application to data integration from different experimental sources

Provided by: VasyaS
Transcript and Presenter's Notes

1
Chipmunk: a fast DNA motif finder for ChIP data
and its application to data integration from
different experimental sources
  • Ivan V. Kulakovskiy
  • Valentina A. Boeva
  • Alexander V. Favorov
  • Vsevolod J. Makeev
  • Engelhardt Institute of Molecular Biology,
    Russian Academy of Sciences, Vavilov str. 32,
    Moscow 119991, Russia

2
Overview
  • general concepts and basic ideas
  • current realization and applications
  • further enhancements

3
General concepts
Transcription factor
DNA Recognition motif DNA
Experimental methods:
  • Classic (about 10 sequences): SELEX, DNase
    footprinting, etc.
  • High-throughput (about 1000 sequences):
    ChIP-based
Models incorporating positional dependencies
Motif finding
TF binding sites: ..GGATTA.. ..ttGCATTAaa..
..aactgtattcgtgatgctaggattaatgatcga
..acacgtagctagctagatcgatcgattagacac
Binding site model: consensus, magic word, ...,
P(C\P\W)M
Motif discovery
Position (Count, Probability, Frequency, Weight,
Specific Score) Matrix
4
P(C\P\W)M as the motif model
GCTCATAAAT
AACGATAATG
CGAAATAAAA
TGTCATAAAA
AATCATAAAA
GATCATAAAA
AGCCGTAAAA
[Figure: position count matrix for the aligned sites; rows A, C, G, T; columns 1 to 10; pseudocount (1 = 0.25*4)]
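
The count matrix for the seven aligned sites on this slide can be rebuilt with a minimal sketch like the one below. The function name `position_count_matrix` and the dictionary layout are illustrative assumptions, not part of Chipmunk.

```python
ALPHABET = "ACGT"

# Aligned sites copied from the slide (gapless, all of length 10).
SITES = [
    "GCTCATAAAT",
    "AACGATAATG",
    "CGAAATAAAA",
    "TGTCATAAAA",
    "AATCATAAAA",
    "GATCATAAAA",
    "AGCCGTAAAA",
]


def position_count_matrix(sites):
    """Return {letter: [count at position 1, ..., count at position m]}."""
    length = len(sites[0])
    if any(len(site) != length for site in sites):
        raise ValueError("all sites must have the same length")
    pcm = {letter: [0] * length for letter in ALPHABET}
    for site in sites:
        for i, letter in enumerate(site.upper()):
            pcm[letter][i] += 1
    return pcm


pcm = position_count_matrix(SITES)
for letter in ALPHABET:
    print(letter, pcm[letter])
```

Every column sums to the number of sites (7), which is a quick sanity check on the alignment.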
5
From GMLA to PWM
  • Gapless Multiple Local Alignment
  • PCM -> PP(F)M -> PW(SS)M

weighted PCM
Sa: score of the letter a; n: total number of
sequences; a: pseudocount; na: PCM element; i:
position within the motif [Lifanov, 2003]
Sequences: aaGGATTAaaGCATTAaa, ccGGATCAaa
Sites: GGATCA (weight 1), GCATTA (weight ½),
GGATTA (weight ½)
wPCM:
     1    2    3    4    5    6
A    0    0    2    0    0    2
G    2    1.5  0    0    0    0
C    0    0.5  0    0    1    0
T    0    0    0    2    1    0
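
The weighted count matrix above can be reproduced with a short sketch. The score formula itself did not survive in this transcript, so `pwm_from_wpcm` assumes the common log-odds form S = ln((n_a + a*q) / ((n + a)*q)) with a uniform background q = 0.25. Treat that formula, and both function names, as assumptions rather than the authors' exact definition.

```python
import math

ALPHABET = "ACGT"


def weighted_pcm(weighted_sites):
    """Sum site weights per letter and position: [(site, weight), ...] -> matrix."""
    length = len(weighted_sites[0][0])
    wpcm = {letter: [0.0] * length for letter in ALPHABET}
    for site, weight in weighted_sites:
        for i, letter in enumerate(site.upper()):
            wpcm[letter][i] += weight
    return wpcm


def pwm_from_wpcm(wpcm, pseudocount=1.0, background=0.25):
    """ASSUMED log-odds conversion; not taken from the slide."""
    length = len(wpcm["A"])
    total = sum(wpcm[letter][0] for letter in ALPHABET)  # total weight n
    return {
        letter: [
            math.log(
                (wpcm[letter][i] + pseudocount * background)
                / ((total + pseudocount) * background)
            )
            for i in range(length)
        ]
        for letter in ALPHABET
    }


# Sites and weights from the slide: two sites in one sequence share its weight.
SITES = [("GGATCA", 1.0), ("GCATTA", 0.5), ("GGATTA", 0.5)]
wpcm = weighted_pcm(SITES)
pwm = pwm_from_wpcm(wpcm)
for letter in ALPHABET:
    print(letter, wpcm[letter])
```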
6
Estimating quality of the GMLA: the Discrete
Information Content
I = 0 corresponds to 100% conservation; m is the
motif size
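
The DIC formula itself is not preserved in this transcript. The sketch below assumes a log-factorial form, per column (sum over letters of ln(x_a!) minus ln(N!)) / N, averaged over the m columns, with the gamma function standing in for factorials so that non-integer weighted counts work. This assumed form does give I = 0 for a fully conserved column, as the slide states. The names `column_dic` and `motif_dic` are hypothetical.

```python
import math


def column_dic(counts):
    """ASSUMED discrete information content of one column of (weighted) counts."""
    total = sum(counts)
    log_factorials = sum(math.lgamma(c + 1.0) for c in counts)
    return (log_factorials - math.lgamma(total + 1.0)) / total


def motif_dic(columns):
    """Average the column values over the motif size m."""
    return sum(column_dic(column) for column in columns) / len(columns)


conserved = [7, 0, 0, 0]  # every site has the same letter
mixed = [2, 2, 2, 1]      # close to uniform
print(column_dic(conserved), column_dic(mixed))
print(motif_dic([conserved, mixed]))
```

Under this assumption the value is never positive and decreases as a column gets closer to uniform.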
7
Chipmunk assumptions
  • WPCM + DIC
  • very similar to PCM + Is
  • one-occurrence-per-sequence (OOPS)
  • can have a negative effect on performance
  • each data source has the same impact on the
    resulting motif, but within one dataset sequences
    can have different weights
  • able to handle datasets of different sizes
  • EM accompanied by bootstrapping
  • explicit randomization: performance gain (up to
    50%)
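
One way to give every data source the same impact while keeping relative weights inside a dataset is to rescale each dataset so that its weights sum to 1. The slides do not show the actual weighting procedure, so the sketch below is only an illustration of the stated assumption. The function name, the data layout and the example sequences are hypothetical.

```python
def balance_datasets(datasets):
    """Rescale weights so each dataset sums to 1.

    datasets: {name: [(sequence, weight), ...]}
    Relative weights within a dataset are preserved.
    """
    balanced = {}
    for name, entries in datasets.items():
        total = sum(weight for _, weight in entries)
        if total <= 0:
            raise ValueError(f"dataset {name!r} has no positive weight")
        balanced[name] = [(seq, weight / total) for seq, weight in entries]
    return balanced


# Hypothetical input: a small footprint set and a larger ChIP set.
example = {
    "footprints": [("GGATTA", 1.0), ("GCATTA", 1.0)],
    "chip": [("TTGGATTACC", 1.0), ("ACGTGGATTA", 0.5), ("GGATTATTTT", 0.5), ("CCCGGATTAC", 0.5)],
}
balanced = balance_datasets(example)
for name, entries in balanced.items():
    print(name, [round(weight, 3) for _, weight in entries])
```

With this scaling a set of 10 footprints and a set of 1000 ChIP regions contribute the same total weight to the motif.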

8
Chipmunk weighting procedure
9
Chipmunk algorithm
10
Chipmunk pseudocode
11
Chipmunk-on-the-web
http://line.imb.ac.ru/Chipmunk/
12
Results on Human ChIP-seq data
The tests were performed on an Intel Core Duo
T2500 (2 GHz) running Ubuntu 8.10, with Sun Java
1.6 for Chipmunk. MEME was used in the
one-occurrence-per-sequence mode with sequences
weighted according to the procedure in 2.2.1 and
scaled to fit the standard MEME [0, 1] weight
interval. Chipmunk was run with the standard
parameters (100, 10, 1). Sequences were gathered
by CisGenome [Ji et al., 2008] from the data of
[Johnson et al., 2007; Valouev et al., 2008].
13
Results on Human ChIP-seq data
Motif logos for the motifs identified in ChIP-seq
data by MEME and Chipmunk using the weighting
scheme described in 2.2.1. Sequence weights for
MEME were proportionally scaled to fit the
standard MEME [0, 1] weight interval. The
TRANSFAC motif represents the matrix available in
the TRANSFAC database. Letters are scaled
proportionally to the column DIC. For the GABP
motif the MEME result is missing because of the
excessive computation time.
14
Data integration for Human Sp1 TF
Sp1 motifs were obtained as is from TRANSFAC, by
Chipmunk from the TRRD Sp1 dataset (mostly
footprint-based), and by integrating GABP
ChIP-seq data with the TRRD sequences in the
standard weighting mode.
15
Drosophila TF binding data integration
The Bicoid motif was constructed from 329
sequences in total: 48 footprints (Bergman et
al., 2005), 35 + 24 SELEX sequences (from the
curated sets of C. Bergman and D. Papatsenko), 22
sequences from a bacterial one-hybrid system
(Noyes et al., 2008) and the top 200 ChIP-chip
regions (BDTNP project).
16
Different experimental data sources integration
(for D. melanogaster)
  • ChIP-chip
    • BDTNP: 7 TFs, 1000 sequences, 1000 bp
  • SELEX
    • C. Bergman: 26 TFs, 10, 10
  • Footprinting (2005)
    • C. Bergman: 87, 1-100, 1-10
  • Noyes bacterial one-hybrid system (2008)
    • M. Noyes: 26 + 84 HD, 10, 10
  • For 34 TFs there are two or more data sources

17
iDMMPMM resource
http://line.imb.ac.ru/iDMMPMM/
18
Chipmunk enhancements
  • Already done
    • Multicore CPU support: 2x faster using 2
      threads, more than 5x faster using 8 virtual
      threads (Sun Java 6 on Core i7)
    • Weight profiles: using complete positional
      information to ensure correct motif extraction
      (ChIP-related data or some other constraints,
      like nucleosome positioning)
    • Seeds support: use the given set of sequences
      to create starting points for the EM
  • To be done
    • Get rid of the OOPS model
    • Improve convergence stability
    • Gather statistics for the algorithm parameters
      limiting iteration numbers

19
Coauthors and affiliations
  • Valentina A. Boeva
    • Curie Institute, 26 rue d'Ulm, 75248, Paris
      cedex 05, France
    • State Scientific Institute of Genetics and
      Selection of Industrial Microorganisms,
      GosNIIgenetika, 1st Dorozhny proezd, 1, Moscow
      117545, Russia
  • Alexander V. Favorov
    • The Sidney Kimmel Comprehensive Cancer Center
      at Johns Hopkins, Baltimore, MD 21231, USA
    • State Scientific Institute of Genetics and
      Selection of Industrial Microorganisms,
      GosNIIgenetika, 1st Dorozhny proezd, 1, Moscow
      117545, Russia
  • Vsevolod J. Makeev
    • State Scientific Institute of Genetics and
      Selection of Industrial Microorganisms,
      GosNIIgenetika, 1st Dorozhny proezd, 1, Moscow
      117545, Russia
    • Engelhardt Institute of Molecular Biology,
      Russian Academy of Sciences, Vavilov str. 32,
      Moscow 119991, Russia

20
Thank you for your attention
21
Supplementary slides
  • EM meets a random subset
  • Chipmunk details
  • DIC and its connection to the classic information
    content [Schneider, 1986]
  • DIC and its connection to the entropy

22
EM meets a random subset
EMUC: EM up to convergence; EMIL: EM with a
limited number of iterations
[Figure panels:]
  • Initial optimization task and a simplified IC
    landscape
  • EMUC on the total data set using a random
    starting point
  • EMUC on a subset using a random starting point
  • EMIL on a subset using a random starting point
  • EMIL on a subset using a local-maximum starting
    point
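
The idea of running a limited number of iterations on a random subset can be illustrated with a deliberately simplified sketch. It uses a hard-assignment step (pick the single best-scoring window per sequence, in line with the OOPS model) instead of a full EM update, and an assumed log-odds scoring matrix with a uniform background. It is not Chipmunk's actual procedure; all names and the toy sequences are hypothetical.

```python
import math
import random

ALPHABET = "ACGT"


def log_odds_matrix(sites, pseudocount=1.0, background=0.25):
    """Build an ASSUMED log-odds scoring matrix from equally weighted sites."""
    n = len(sites)
    matrix = []
    for i in range(len(sites[0])):
        column = {}
        for letter in ALPHABET:
            count = sum(1 for site in sites if site[i] == letter)
            column[letter] = math.log(
                (count + pseudocount * background) / ((n + pseudocount) * background)
            )
        matrix.append(column)
    return matrix


def best_site(sequence, matrix):
    """Return the highest-scoring window of the sequence (one per sequence)."""
    m = len(matrix)
    windows = (sequence[k:k + m] for k in range(len(sequence) - m + 1))
    return max(windows, key=lambda w: sum(matrix[i][c] for i, c in enumerate(w)))


def emil(sequences, seed_sites, subset_size, iteration_limit, rng):
    """Limited-iteration refinement: each round re-aligns a fresh random subset."""
    sites = list(seed_sites)
    for _ in range(iteration_limit):
        matrix = log_odds_matrix(sites)
        subset = rng.sample(sequences, subset_size)
        sites = [best_site(sequence, matrix) for sequence in subset]
    return sites


# Toy sequences with one planted GGATTA site each.
SEQUENCES = ["TTGGATTACC", "ACGTGGATTA", "GGATTATTTT", "CCCGGATTAC"]
result = emil(SEQUENCES, ["GGATTA"], subset_size=3, iteration_limit=1,
              rng=random.Random(0))
print(result)
```

Keeping `iteration_limit` small means each round is cheap, and the resampled subset changes the landscape between rounds, which is the intuition the panels above illustrate.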
23
Chipmunk details
  • Program parameters and length estimation
  • Chipmunk searches the motif space starting from
    the maximum allowed motif length down to the
    minimum, stopping at the first strong motif. Each
    sequence in the set has an associated
    floating-point weight which determines its
    contribution to the final motif.
  • sequence weighting model: "standard" for sets
    with equal sequence quality (footprinting, SELEX,
    etc.); "ordered" for sets ordered by sequence
    quality (each sequence receives a weight
    inversely proportional to its number in the
    input file; intended for ChIP-related data).
  • try-limit: the number of random matrices to
    generate. 100 is suitable for good sets. May be
    extended up to 1000 (and even more for the
    stand-alone version) for precise motif search.
  • step-limit: the number of double-optimization
    steps for each turn of the random matrix
    generation. May be increased up to 100 (and even
    more for the stand-alone version) for precise
    motif search.
  • iteration-limit: the maximum number of
    iterations on the random subset at each
    double-optimization step. Should be one or, in
    rare cases, two. Larger values will slow down the
    convergence of the double-step optimization.
  • Strong motif definition
  • We call a motif strong if it has
    high-conservation alignment columns somewhere to
    the left and somewhere to the right of any
    no-conservation column. So we allow
    non-conservative regions (i.e. fixed-length gaps)
    only if they are surrounded by columns with high
    conservation. In particular, this means that in a
    strong motif the first and the last alignment
    columns have at least low conservation.
  • A column has high conservation if its DIC is
    greater than or equal to the high conservation
    threshold Thc. A column has low conservation if
    its DIC is between Thc and the low conservation
    threshold Tlc. A column has no conservation if
    its DIC is lower than Tlc.
  • Thc is defined as the discrete information
    content of a column where only 3 of the 4
    possible nucleotides are present in equal
    amounts, i.e. N/3, N/3, N/3, 0. Tlc is defined as
    the discrete information content of a column
    where one pair of nucleotides is 2 times more
    frequent than the other pair, i.e. 2N/6, 2N/6,
    N/6, N/6. N here is the total weight of the
    sequence set (the total number of sequences).
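
The "ordered" weighting and the strong-motif test described above can be sketched as follows. The DIC formula is not preserved in this transcript, so `column_dic` assumes a log-factorial form (computed with the gamma function so that weighted counts work); the threshold columns and the classification rules follow the text. All function names are hypothetical.

```python
import math


def ordered_weights(count):
    """Weight inversely proportional to the sequence number in the input file."""
    return [1.0 / rank for rank in range(1, count + 1)]


def column_dic(counts):
    """ASSUMED discrete information content of one column of counts."""
    total = sum(counts)
    return (sum(math.lgamma(c + 1.0) for c in counts) - math.lgamma(total + 1.0)) / total


def thresholds(total):
    """Thc from (N/3, N/3, N/3, 0); Tlc from (2N/6, 2N/6, N/6, N/6)."""
    thc = column_dic([total / 3, total / 3, total / 3, 0.0])
    tlc = column_dic([2 * total / 6, 2 * total / 6, total / 6, total / 6])
    return thc, tlc


def conservation(counts, thc, tlc):
    dic = column_dic(counts)
    if dic >= thc:
        return "high"
    return "low" if dic >= tlc else "none"


def is_strong(columns):
    """Every no-conservation column needs a high-conservation column on both sides."""
    thc, tlc = thresholds(sum(columns[0]))
    levels = [conservation(column, thc, tlc) for column in columns]
    return all(
        "high" in levels[:i] and "high" in levels[i + 1:]
        for i, level in enumerate(levels)
        if level == "none"
    )


gapped = [[6, 0, 0, 0], [1.5, 1.5, 1.5, 1.5], [0, 6, 0, 0]]
print(ordered_weights(4))
print(is_strong(gapped), is_strong(gapped[1:]))
```

In the example, the uniform middle column is allowed as a fixed-length gap because conserved columns flank it; dropping the first column leaves the gap at the motif edge, so the motif is no longer strong.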

24
DIC and its connection to the classic Is
25
DIC and its connection to the entropy