Title: What directs a transcription factor to its target site: New insights learned from ChIPSeq data
1What directs a transcription factor to its target
site New insights learned from ChIP-Seq data
Philipp Bucher Wednesday June 10, 2009Les
Journées Ouvertes en Biologie,Nantes, pays de la
LoireFrance
2Program for Today
- Introduction to transcription factor binding
sites - Introduction to ChIP-Seq
- ChIP-Seq analysis for transcription factors
- in vivo versus in vitro binding specificity of
transcription factors - How many predicted TF binding sites are occupied
in vivo - Synergistic interactions between adjacent TF
binding sites - Sequence conservation and chromatin context of in
vivo binding sites
3Facts about transcription factor binding sites
- Length 6-20 bp
- Degenerate sequence motifs
- Chance occurrence one site in 250 to 50000
binding sites - Up to millions of binding sites in the genome
- Only a few thousand transcription factor
molecules per nucleus - Most predicted sites in the genome presumably not
functional - However, some weak binding sites were shown to be
functional - Synergistic interaction between adjacent binding
sites
4Representation of the binding specificity by a
scoring matrix (also referred to as weight
matrix)
Strong C T T T G A T
C T Binding site 5 5 5 5 5
5 5 3 5 43 Random A C
G T A C G T A Sequence -10
-10 -13 5 -10 -15 -13 -11 - 6 -83
5Physical interpretation of an weight matrix
Weight matrix elements represent relative binding
energies between DNA base-pairs and protein
surface areas (base-pair acceptor sites). A
weight matrix column describes the base
preferences of a base-pair acceptor site.
6(No Transcript)
7Berg-von Hippel model of protein-DNA interactions
The weight matrix score expresses the binding
free energy of a protein-DNA complex in arbitrary
units
It is convenient to express the binding free
energy in dimension-less -RT units
- On a relative scale, the binding constant for
sequence x is then given by
The energy terms of a weight matrix can be
computed from the base frequencies pi(b)
estimated from in vitro or in vivo selected
binding sites q(b) is the background frequency
of base b. ? is an unknown parameters related to
the stringency of the binding conditions.
8 ChIP-Seq Technique and Data Structure
Our representation SGA (simple genome
annotation) format
- Fields of SGA format
- sequence ID
- feature
- position
- strand
- Counts
- optional additional fields
NC_000001.9 stim 559139 -
1NC_000001.9 stim 559333
1NC_000001.9 stim 559356 -
1NC_000001.9 stim 559765 -
1NC_000001.9 stim 559766
3NC_000001.9 stim 559767
1NC_000001.9 stim 559768
1NC_000001.9 stim 559777
3NC_000001.9 stim 559778 2...
9Example STAT1 ChIP-Seq
Input data Stimulated 15.1 million tags mapped
to genome Unstimultated 12.9 million tags
mapped to genome Mapping software Eland
(Illumina) First control experiment Distribution
of 5 end 3 tags around 37 experimentally
defined STAT1 sites. Results see next
slides Conclusions The average length of
immunoprecipitated fragments is approx 140 bp
(distance between 5 and 3 peaks) Data source
Robertson et al. (2007). Genome-wide profiles of
STAT1 DNA association using chromatin
immunoprecipitation and massively parallel
sequencing. Nat Methods 4, 651-657.
10Title
Bla, bla, bla
11Title
Bla, bla, bla
12Method 1 ChIP-Cor
- Input
- genomic tag count distributions for two features
(reference, target) - features may be and - strand tags from same
experiments - applicable to other types of features, e.g. TSS
positions - Output
- a count correlation histogram
- computes of times tag pairs are found at a
particular distance (range) from each other
(large number of tag pairs !). - different normalization options
- count density of target feature (tags per bp)
- global ? relative target tag frequency
(normalized to 1) - Purpose
- identification of average fragment size
- reveals length distribution of enriched domains
- provides clues for choosing parameters for peak
and partitioning algorithms - Applications
- Exploratory analysis
- Rapid quality control
13Correlation plot Example 1
Input data Ref STAT1 5 tags Target STAT1
3 tags Analysis parameters Range
-400,600 Window width 10 Count cut-off
3 Y-axis count-density Observations Peak
center 140 Peak count density 0.03 Background
lt 0.007
14Correlation plot Example 2
Input data Ref CTCF 5 tags Target CTCF 3
tags Analysis parameters Range
-400,400 Window width 5 Count cut-off
3 Y-axis count-density Observations Peak
center 75 Peak count density 0.06 Background
lt 0.002
15Method 2 ChIP-Center
- Purpose
- To map tag counts to expected center position of
a protein-DNA complex - Input
- Oriented tag counts for a ChIP-Seq features
- Output
- centered, un-oriented tag counts
- Motivation
- 5 and 3 tag positions show relative
displacement to each other - best estimates for protein-binding site center
position - 5 end position ½ fragment length
- or 3 end position - ½ fragment length
- centered tag count distribution more useful for
- peak recognition and partitioning
- data viewing in genome browser
16Auto-correlation of centered ChIP-Seq tags
Input data Ref STAT1 centered-70 Target
STAT1 centered-70 Analysis parameters Range
-400,600 Window width 10 Count cut-off
3 Y-axis count-density Observations Peak
center 0 Peak count density 0.06 Peak volume
5 counts Background lt 0.014
17Method 3 ChIP-peak
- Purpose
- identification of peaks corresponding to in vivo
protein-DNA complexes - Input
- centered tag counts
- Output
- list of peak center positions
- Method
- consider only positions which have at least one
tag count. - for each position, determine cumulative tag count
in window of width w. - select as peaks those positions, which
- have cumulative tag count threshold t.
- are local maximum with range r.
- Optional, refinement of peak center (center of
gravity within window) - Interface to sequence analysis programs
- download of sequences around peak center positions
18Example ChIP-Peak Locating in vivo STAT1-binding
sites
Input data Robertson et al. (2007) Nature
Methods 4, 651-657. Cell material Interferon
?-stimulated HeLa S3 cells. About 15 million tags
in total Analysis parameters Centering 70bp,
window w200bp, exclusion range r200 bp,
threshold t100 counts Result 4446 peaks,
sequence extraction range for downstream
analysis -1000, 1000
Donwstream sequence analysis Distribution of
TTCNNNGAA around STAT1 peak Sliding window size
50 Figure produced with OPROF (SSA server)
19Peak extraction how to choose the parameters
How to choose the peak width Based on
auto-correlation plots How to choose the
threshold ? Based on autocorrelation plot Based
on a statistical model (requires some
assumptions) By comparison of the results with a
control set (see below) By measuring the
enrichment of consensus binding sites
20ChIP-Seq applications to histone modifications
Hypothetical association with functional
chromatin domains H3K4me3 promoters H2AZ promote
rs H3K4me1 enhancers H3K36me3 transcribed
regions H3K27me3 gene silencing
21ChIP-Seq data histone modifications has
nucleosome resolution
Based on data from Barski et al. 2007, Cell 129,
823-837. ChIP-Seq tags from both strands centered
by 70 bp . WIG file resolution H3K4me1 50 bp,
H3K4me3 10 bp, H2A.Z 25 bp.
22Chromatin domains are differentially occupied in
different cell lines
Mouse Nanog region, H3K4me3 profiles. ES, ESHyb
embryonic stem cell lines, MEF embryonic
fibroblasts, NP neural progenitors Data from
Mikkelsen et al. (2007). Genome-wide maps of
chromatin state in pluripotent and
lineage-committed cells. Nature 448, 553-560.
23From ChIP-Seq peaks to TF binding site matrices
Some considerations Ab initio motif discovery or
consensus sequence-initiated refinement ? Speed
(time complexity) of downstream
programs Assumptions about binding site fixed
length or variable lengths Assumptions about
number of different motifs to be found Some
software tools MEME Seed search, EM
refinement in one program Time complexity O(N
2) Can find multiple motifs in on run HMM
training (e.g. MAMOT) Needs initial model
Flexible, can handle palindromes with variable
spacing between half-sites
24STAT1 Sequence Motif Defined by ChIP-Seq data
Input 4446 ChIP peak regions, 200 bp ab initio
motif discovery MEME (zoops)
Matrix from experimental in vivo sites
Matrix from SELEX
Overall similar matrices, but old data sets too
small for definite conclusions
25In vivo and in vitro specificity of CTF/NF1
(Data from Milos Pjanic and Nicolas Mermod,
University of Lausanne)
In vivo specificity ChIP-Seq Cellular
source Mouse embryonic fibroblasts Mapping
software Eland (Illumina) ChIP
protocol fixation (formaldehyde) within
cells Sonication Chromatin-IP DNA
extraction Input data for sequence
analysis Total 9.7 million tags within
repeats 3.4 million tags unique regions 6.3
million tags
In vitro specificity HTP-SELEX Weight matrix
from over gt5000 sites
26Identical in vivo and in vitro binding
specificity for CTF/NF1
Training set
Training set
Synthetic DNA, SAGE/SELEX, 5579 sequences, Length
25 bp
Mouse genome. ChIP-seq peaks, 1265
sequences, Length 200 bp
EM training
EM training
In vitro specificity
In vivo specificity
27Identical in vivo and in vitro binding
specificity for CTF/NF1
Training set
Training set
Synthetic DNA, SAGE/SELEX, 5579 sequences, Length
25 bp
Mouse genome. ChIP-seq peaks, 1265
sequences, Length 200 bp
EM training
EM training
In vitro specificity
In vivo specificity
28Selective occupancy of genomic TF target sites
General data processing pipeline
ChIP-Seq data
Annotated list of target sites - genome
position - matrix score - count coverage
Peak finding algorithm
Peak positions
Motif finding
genome scan
Binding site weight matrix
Potential target sites in genome
29Only a small fraction of potential STAT1 sites
occupied in vivo!
A 0 -12 -12 -1 -6 0 0 -8 5 5 2 C
0 -9 -10 5 4 0 -13 -12 -4 -13 -2 G -2
-13 -4 -12 -13 0 4 5 -10 -9 0 T 2 5
5 -8 0 0 -6 -1 -12 -12 0
STAT1 matrix
30Open questionDoes STAT1 recognize different
types of motifs in vivo!
Another group has identified two different motifs
in the same data sets.
Jothi et al. (2008) Genome-wide identification of
in vivo protein-DNA binding sites from ChIP-Seq
data. Nucleic Acids Res. 365221.
31The helper motif hypothesis
- Helper motifs
- occur in the vicinity in vivo occupied motif
matches - bind to the same or a different transcription
factor - synergistic effects through protein interactions
- spacing between motifs may be critical
- symmetric relationship motifs help each other
- interacting motifs are part of cis-regulatory
modules (CRMs)
32How to demonstrate synergistic interactions
between motifs
- Strategy
- define matrix for motif A from ChIP-Seq data
- search genome for motif A matches
- compile genomic motif A sites with high ChIP tag
coverage - as control set, compile high-scoring matches with
low coverage - Compute distance correlation plot
- motif B (predicted) against motif A in vivo
occupied - Motif B (predicted) against motif A non-occupied
33Synergistic motif interactions STAT1 site pairs
- Biological background
- STAT proteins can form tetramers
- known examples of STAT site pairs in functional
enhancers - Hypothesis
- STAT1 motif pairs preferentially attract STAT1
protein - the optimal distance can be inferred from
ChIP-Seq data - Data sets
- genomic sites with score 30, coverage 30
- control genomic sites with score 40, coverage
0.
Example Il2ra enhancer with 2 STAT5 sites
(Lécine et al. 1997, Mol Cell Biol)
34Occurrence of STAT1 motifs downstream of occupied
STAT1 motifs
- Observation
- 6 of in vivo STAT1 sites have second STAT1 motif
21 bp downstream - very strong preference for exact spacing of 21
(center-to-center) - no second STAT1 sites in control sequences
35Too good to be true! The MER41B story
- A number of strange findings led to following
discovery - A rare LTR repetitive element (MER41B) contains
STAT1 motifs pairs - Most STAT1 motif pairs with 21 bp distance come
from MER41B - Most of the STAT1 site pairs in MER41B are
occupied in vivo - MER41B explains motif M2 found by Jothi et al.
(2008). - A sizable fraction (gt5) of in STAT1 sites reside
in MER41B elements
MER41B ...TGAGTTTTCCGGGAAAGGGGTGGGCAATTCCCGGAACTGA
GGGTT
36Lesson to be learned mask repeats before
analyzing motifs
Total set of in vivo STAT motif pairs 4307
sites, after masking 2901. A small peak around 20
remains in the repeat masked set!
37STAT1 motif pair analysis in motif repeat-masked
sequences
- Observations
- a clear but lower peak remains after repeat
masking - optimal distance between STAT1 pairs 18-21 bp.
- weaker over-representation of STAT1 sites up to
70 bp downstream
38Correlation of in vivo occupied TFBS conservation
with genomic and epigenetic properties
Data ChIP peak positions for 15 TFs in mouse ES
cells (Chen et al. 2008, Cell) Histone
modificationsfor mouse ES cells (Mikkelsen et al.
2007, Nature) Cross-species conservation scores
(UCSC genome browser database) Transcription
start annotations (ENSEMBL release
50) Methods Feature correlation analysis with
TFBS peaks as reference positions Hierarchical
clustering of TFBS Goal To identify and assess
explanatory variable for in vivo occupancy To
classify transcription factors based on in vivo
binding preferences
39Cross-Species Conservation Patterns
- Observations
- peak size Suz12 gtgt c-Myc, Zfx gt Klf4 gt Sox2,
Nanog, Smad1 gt Ctcf - CTCF has very narrow peak ? stand-alone elements
?
40Promoter association
- Observations
- C-Myc, Zfx, Khlf4 strongly promoter associated
(same peak shape) - Suz12 also promoter associated but with broader
peak - other factors show no or only weak promoter
association
41H3K4me3 association
- Observations
- Same trends as for TSS analysis but broader peak
42H3K27me3 association repressive chromatin mark
- Observations
- Strong and broad peak for Suz12
- No peaks of negative correlations for all other
factors
43Clustering of TF binding sites based by genomic
features
- TF classes observed
- Promoter-associated c-Myc, n-Myc, E2f1, Zfx,
Klf1 - Weakly conserved Stat3, Tcfcp2l1, Esrrb
- Nanog/Sox2-like Nanog, Sox2, Smad1, p300, Ctcf
- Outliers
- Suz12 (repressive function)
- Oct4 (clustering artifact ?)
44Some conclusions
- ChIP-Seq makes in vivo TF binding sites visible
- ChIP-Seq combined with motif search reaches
single bp resolution - Transcription factors appear to have same
specificity in vitro and in vivo - Only a small fraction of high-scoring binding
sites are occupied in vivo - Analysis of ChIP-Seq data can identify helper
motifs - Always use repeat-masked sequences for motif
analysis - In vivo sites of transcription factors have
distinct - cross-species conservation patterns
- Preferences for different types of control
regions (promoters, others) - chromatin context (H3K4me3, H3K27me3, etc)
45Many thanks to
ChIP-Seq server specifically Giovanna
Ambrosini, Christoph Schmid Other web
servers, data management infrastructure Peter
Sperisen Viviane Praz Important ongoing
collaborations Milos Pjanic, Nicolas Mermod
(UNIL) ChIP-Seq for CTF/NF1 in mouse cells All
researchers who made their ChIP-Seq data publicly
available!