Title: High-throughput SELEX-SAGE method for quantitative modeling of transcription-factor binding sites
1. High-throughput SELEX-SAGE method for quantitative modeling of transcription-factor binding sites
- Emmanuelle Roulet, Stephane Busso, Anamaria A. Camargo, Andrew J.G. Simpson, Nicolas Mermod, Philipp Bucher
- Nature Biotechnology, August 2002, Vol. 20, pp. 831-835
BioNetworks Journal Club, Feb. 24, 2003. Slides prepared by Alison Hottes
2. General Background
- Transcription factors bind to DNA and enhance or inhibit gene transcription.
- Transcription factor binding is DNA-sequence dependent.
3. General Problems
- Want to figure out which transcription factors bind where and under what conditions.
- This knowledge specifies a large part of the connectivity and topology of a cell's transcriptional network.
- From a protein's amino acid sequence or structure, we can't currently predict its binding specificity.
4. Common Approaches
- Take a few known binding sites for a transcription factor and generalize to a consensus sequence.
- Find genes likely regulated by a transcription factor (e.g. microarrays). Search the upstream DNA sequence of those genes for motifs which might be the binding site.
- Take the consensus and see where else it appears in a genome.
- In both methods, the accuracy is variable and often unknown.
5. How complex a model is needed?
- Example: Suppose a transcription factor binds to a 25 base pair DNA region.
- Option 1: Assign a score for the binding affinity of each possible 25mer.
- Need 4^25 scores (completely unrealistic).
- Option 2: Assume each base position acts independently and equally.
- Need 3 × 25 = 75 parameters. (Still a lot.)
- Gaps are more difficult.
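The two parameter counts above can be checked with a few lines of arithmetic. This is only an illustrative sketch: the function names are hypothetical, and the 3-per-position count assumes that one of the four base weights at each position is fixed as a reference, which is the usual reason a weight matrix needs 3 rather than 4 free values per column.

```python
def count_lookup_table_scores(length: int, alphabet_size: int = 4) -> int:
    """Option 1: one score for every possible sequence of the given length."""
    return alphabet_size ** length


def count_weight_matrix_parameters(length: int, alphabet_size: int = 4) -> int:
    """Option 2: independent positions, one base per position used as reference."""
    return (alphabet_size - 1) * length


if __name__ == "__main__":
    print(count_lookup_table_scores(25))       # 4^25, about 1.1e15
    print(count_weight_matrix_parameters(25))  # 3 * 25
```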
6. For any model
- Need a lot of data
- How can we get the data?
- What can we do with the data once we have it?
- This paper illustrates one approach for CTF/NFI
- Binds as a homodimer
- Recognizes sequences like TTGGC(N5)GCCAA
7. Initial Model
- Weight matrix
- Models multiple binding modes
- Given a sequence, can produce a score.
- How would we generate sequences consistent with
the model?
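As a rough illustration of "given a sequence, can produce a score", the sketch below sums per-position weights over a candidate site and scans a longer sequence for its best-scoring window. The matrix `TOY_WEIGHTS` is a made-up 3-position example, not the CTF/NFI matrix, and the sketch covers a single fixed-width binding mode only; the multiple binding modes mentioned above are not modeled here.

```python
# Hypothetical 3-position weight matrix: one dict of base -> weight per position.
TOY_WEIGHTS = [
    {"A": -1.0, "C": -1.0, "G": -1.0, "T": 2.0},
    {"A": -1.0, "C": -1.0, "G": 2.0, "T": -1.0},
    {"A": -1.0, "C": -1.0, "G": 2.0, "T": 0.0},
]


def score_site(weights, site):
    """Sum the weight of the observed base at each position of the site."""
    if len(site) != len(weights):
        raise ValueError("site length must equal the matrix width")
    return sum(column[base] for column, base in zip(weights, site))


def best_window(weights, sequence):
    """Return (score, start) of the highest-scoring window in the sequence."""
    width = len(weights)
    if len(sequence) < width:
        raise ValueError("sequence is shorter than the matrix")
    scored = [
        (score_site(weights, sequence[start:start + width]), start)
        for start in range(len(sequence) - width + 1)
    ]
    return max(scored)


if __name__ == "__main__":
    print(score_site(TOY_WEIGHTS, "TGG"))
    print(best_window(TOY_WEIGHTS, "ACTGGA"))
```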
8. Generating Synthetic Sequences
- Generate an HMM from the weights.
- The HMM depends on the desired average score, so there is one HMM per target score.
- Use each HMM to generate sequences.
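One simple way to picture "a generator that depends on the desired average score" is sketched below. It is not the HMM construction from the paper: it assumes independent positions and samples each base with probability proportional to exp(lambda × weight), then tunes lambda by bisection until the expected score matches the target. The toy matrix and all function names are hypothetical.

```python
import math
import random

# Hypothetical 3-position weight matrix (not the CTF/NFI matrix).
TOY_WEIGHTS = [
    {"A": -1.0, "C": -1.0, "G": -1.0, "T": 2.0},
    {"A": -1.0, "C": -1.0, "G": 2.0, "T": -1.0},
    {"A": -1.0, "C": -1.0, "G": 2.0, "T": 0.0},
]


def tilted_probs(column, lam):
    """Base probabilities proportional to exp(lam * weight) for one position."""
    shift = max(lam * w for w in column.values())  # avoids overflow in exp
    exps = {base: math.exp(lam * w - shift) for base, w in column.items()}
    total = sum(exps.values())
    return {base: e / total for base, e in exps.items()}


def expected_score(weights, lam):
    """Expected total score of a site sampled with tilt parameter lam."""
    return sum(
        sum(p * column[base] for base, p in tilted_probs(column, lam).items())
        for column in weights
    )


def solve_lambda(weights, target, lo=-50.0, hi=50.0, iterations=100):
    """Bisection on lam; the expected score is non-decreasing in lam."""
    if not expected_score(weights, lo) <= target <= expected_score(weights, hi):
        raise ValueError("target average score is out of reach")
    for _ in range(iterations):
        mid = (lo + hi) / 2.0
        if expected_score(weights, mid) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


def sample_sites(weights, target, n, rng):
    """Draw n synthetic sites whose expected score equals the target."""
    lam = solve_lambda(weights, target)
    sites = []
    for _ in range(n):
        site = ""
        for column in weights:
            probs = tilted_probs(column, lam)
            bases = list(probs)
            site += rng.choices(bases, weights=[probs[b] for b in bases])[0]
        sites.append(site)
    return sites


if __name__ == "__main__":
    print(sample_sites(TOY_WEIGHTS, target=3.0, n=5, rng=random.Random(0)))
```

Raising the target pushes lambda up and concentrates the samples on high-scoring bases; lowering it produces weaker, more varied sites.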
9. Experiment Design
- General plan: Find sequences that bind CTF/NFI and use them to refine the original model.
- Design question: What affinity (low/medium/high) should the sequences have for CTF/NFI?
- Approach: Use sets of synthetic training sequences to estimate weights. Compare to the original weights.
- Conclusion: Low-affinity sequences give more accurate models.
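The "estimate weights from training sequences" step can be pictured with a standard log-odds estimate. This is a minimal sketch under stated assumptions, not the estimation procedure used in the paper: it assumes pre-aligned sites of equal length, a uniform background of 0.25 per base, and a pseudocount of 1; `estimate_weights` is a hypothetical name.

```python
import math
from collections import Counter

BASES = "ACGT"


def estimate_weights(sites, pseudocount=1.0, background=0.25):
    """Per-position log2 odds of smoothed base frequency versus background."""
    if not sites:
        raise ValueError("need at least one site")
    width = len(sites[0])
    if any(len(site) != width for site in sites):
        raise ValueError("all sites must have the same length")
    total = len(sites) + len(BASES) * pseudocount
    weights = []
    for position in range(width):
        counts = Counter(site[position] for site in sites)
        weights.append({
            base: math.log2((counts[base] + pseudocount) / total / background)
            for base in BASES
        })
    return weights


if __name__ == "__main__":
    training = ["TTGG", "TTGG", "TTGC", "ATGG"]
    for column in estimate_weights(training):
        print({base: round(w, 2) for base, w in column.items()})
```

Comparing a matrix estimated this way against the matrix that generated the training sequences gives a direct measure of how well each affinity class recovers the original weights.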
10. SELEX: Systematic evolution of ligands by exponential enrichment
SAGE: Serial analysis of gene expression
11. Controlling Binding Affinity
- Add radioactive medium-affinity 25mer probe.
- Want 50% of the probe to be competed away by the library at each step.
- Stop when library and probe bind equally well → library has medium affinity.
12. SELEX: Systematic evolution of ligands by exponential enrichment
SAGE: Serial analysis of gene expression
13. Affinity of Each Cycle
- Fraction of binding sequences increases each repetition.
- Average affinity of binding sequences is maintained.
14. New Model
HMM from SELEX 3
Weight matrix model.
- Each half-site is now 6 bases.
- Model is less tolerant of spacing variation.
- Weights shifted (e.g. now more tolerant of
adenines in positions 2 and 4).
15. Cross-validation
- Model is reasonably stable to sampling variations
in the dataset.
16. Experimental Verification
- Scores correlate well with binding affinity.
- Used model to search Eukaryotic Promoter Database
for binding candidates. Showed CTF/NFI induction
of some candidates.
17. Non-independence Between Bases
- Checked SELEX sequences for dinucleotide correlations.
- Found dependencies, but they add less than 1 bit to the 12-bit model.
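One common way to quantify a dependency between two positions in bits is the mutual information of the two base columns across aligned sites. The slides do not say which statistic the authors used, so the sketch below is an assumption; `mutual_information` is a hypothetical helper, and the plug-in estimate it computes is biased upward for small samples.

```python
import math
from collections import Counter


def mutual_information(sites, i, j):
    """Plug-in mutual information (bits) between positions i and j."""
    n = len(sites)
    if n == 0:
        raise ValueError("need at least one site")
    pair_counts = Counter((site[i], site[j]) for site in sites)
    i_counts = Counter(site[i] for site in sites)
    j_counts = Counter(site[j] for site in sites)
    info = 0.0
    for (a, b), count in pair_counts.items():
        p_ab = count / n
        p_a = i_counts[a] / n
        p_b = j_counts[b] / n
        info += p_ab * math.log2(p_ab / (p_a * p_b))
    return info


if __name__ == "__main__":
    coupled = ["AA", "CC", "GG", "TT"]      # second base always copies the first
    independent = ["AA", "AC", "CA", "CC"]  # every combination equally often
    print(mutual_information(coupled, 0, 1))
    print(mutual_information(independent, 0, 1))
```

Summing such terms over the position pairs that show a dependency gives a figure that can be set against the information content of the independent-position model.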
18. Comments and Questions
- Did the improvement justify the work?
- Should the background DNA sequence be modeled?
- Could this be done for an arbitrary transcription factor?
- Is it necessary to know a medium-affinity sequence?
- Must know binding partners.
- Must know what modifications (e.g. phosphorylation) are needed for binding.
- Need extract with active protein.