What directs a transcription factor to its target site: New insights learned from ChIP-Seq data

1
What directs a transcription factor to its target
site: New insights learned from ChIP-Seq data
Philipp Bucher
Wednesday, June 10, 2009
Les Journées Ouvertes en Biologie, Nantes, Pays de la Loire, France
2
Program for Today
  • Introduction to transcription factor binding
    sites
  • Introduction to ChIP-Seq
  • ChIP-Seq analysis for transcription factors
  • in vivo versus in vitro binding specificity of
    transcription factors
  • How many predicted TF binding sites are occupied
    in vivo
  • Synergistic interactions between adjacent TF
    binding sites
  • Sequence conservation and chromatin context of in
    vivo binding sites

3
Facts about transcription factor binding sites
  • Length: 6-20 bp
  • Degenerate sequence motifs
  • Chance occurrence: one site in 250 to 50000
    binding sites
  • Up to millions of binding sites in the genome
  • Only a few thousand transcription factor
    molecules per nucleus
  • Most predicted sites in the genome presumably not
    functional
  • However, some weak binding sites were shown to be
    functional
  • Synergistic interaction between adjacent binding
    sites

4
Representation of the binding specificity by a
scoring matrix (also referred to as weight
matrix)
Strong binding site:    C   T   T   T   G   A   T   C   T
Score:                  5   5   5   5   5   5   5   3   5   = 43
Random sequence:        A   C   G   T   A   C   G   T   A
Score:                -10 -10 -13   5 -10 -15 -13 -11  -6   = -83

5
Physical interpretation of a weight matrix
Weight matrix elements represent relative binding
energies between DNA base-pairs and protein
surface areas (base-pair acceptor sites). A
weight matrix column describes the base
preferences of a base-pair acceptor site.
6
(No Transcript)
7
Berg-von Hippel model of protein-DNA interactions
The weight matrix score expresses the binding
free energy of a protein-DNA complex in arbitrary
units.
It is convenient to express the binding free
energy in dimensionless -RT units.
  • On a relative scale, the binding constant for
    sequence x is then given by

The energy terms of a weight matrix can be
computed from the base frequencies p_i(b)
estimated from in vitro or in vivo selected
binding sites. q(b) is the background frequency
of base b. λ is an unknown parameter related to
the stringency of the binding conditions.
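The formulas of this slide are not present in the transcript. One common way to write the Berg-von Hippel relations that agrees with the definitions above is sketched below. It is an assumption, not a recovery of the original slide: the symbols S(x) for the matrix score, w_i(b) for the matrix element of base b at position i, L for the motif length, and λ for the unknown stringency parameter are chosen here for illustration.

```latex
% Score of sequence x = x_1 ... x_L: sum of the matrix elements (energies in -RT units)
S(x) = \sum_{i=1}^{L} w_i(x_i)

% Relative binding constant: higher score means stronger binding
K(x) \propto e^{S(x)}

% Matrix elements from base frequencies p_i(b) in selected sites
% and background frequencies q(b)
w_i(b) = \frac{1}{\lambda} \, \ln \frac{p_i(b)}{q(b)}
```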
8
ChIP-Seq Technique and Data Structure
Our representation: SGA (simple genome
annotation) format
  • Fields of SGA format
      • sequence ID
      • feature
      • position
      • strand
      • counts
      • optional additional fields

NC_000001.9  stim  559139  -  1
NC_000001.9  stim  559333  +  1
NC_000001.9  stim  559356  -  1
NC_000001.9  stim  559765  -  1
NC_000001.9  stim  559766  +  3
NC_000001.9  stim  559767  +  1
NC_000001.9  stim  559768  +  1
NC_000001.9  stim  559777  +  3
NC_000001.9  stim  559778  +  2
...
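A minimal sketch of reading such records, assuming whitespace-separated fields in the order listed on the slide and `+`/`-` as strand symbols. The names `SgaRecord` and `parse_sga` are hypothetical and are not part of any ChIP-Seq tool.

```python
from typing import Iterable, Iterator, NamedTuple


class SgaRecord(NamedTuple):
    seq_id: str
    feature: str
    position: int
    strand: str
    count: int


def parse_sga(lines: Iterable[str]) -> Iterator[SgaRecord]:
    """Yield one record per line; optional extra fields are ignored."""
    for line in lines:
        fields = line.split()
        if len(fields) < 5:
            continue  # skip blank or truncated lines
        seq_id, feature, position, strand, count = fields[:5]
        yield SgaRecord(seq_id, feature, int(position), strand, int(count))


example = """\
NC_000001.9 stim 559139 - 1
NC_000001.9 stim 559333 + 1
NC_000001.9 stim 559766 + 3
"""
records = list(parse_sga(example.splitlines()))
total_tags = sum(r.count for r in records)
```

Keeping the count as a separate field means several tags mapping to the same position and strand are stored as one record.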
9
Example: STAT1 ChIP-Seq
Input data:
  • Stimulated: 15.1 million tags mapped to genome
  • Unstimulated: 12.9 million tags mapped to genome
  • Mapping software: Eland (Illumina)
First control experiment: distribution of 5' and 3'
tags around 37 experimentally defined STAT1 sites.
Results: see next slides.
Conclusions: the average length of
immunoprecipitated fragments is approx. 140 bp
(distance between 5' and 3' peaks).
Data source: Robertson et al. (2007). Genome-wide
profiles of STAT1 DNA association using chromatin
immunoprecipitation and massively parallel
sequencing. Nat Methods 4, 651-657.
12
Method 1: ChIP-Cor
  • Input
      • genomic tag count distributions for two features
        (reference, target)
      • features may be + and - strand tags from the same
        experiment
      • applicable to other types of features, e.g. TSS
        positions
  • Output
      • a count correlation histogram
      • computes the number of times tag pairs are found
        at a particular distance (range) from each other
        (large number of tag pairs!)
      • different normalization options
          • count density of target feature (tags per bp)
          • global: relative target tag frequency
            (normalized to 1)
  • Purpose
      • identification of average fragment size
      • reveals length distribution of enriched domains
      • provides clues for choosing parameters for peak
        and partitioning algorithms
  • Applications
      • Exploratory analysis
      • Rapid quality control
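The count correlation idea can be sketched as follows. This is an illustrative implementation under stated assumptions, not the ChIP-Cor program itself: tags are given as position-to-count dictionaries for a single chromosome, the function name `count_correlation` is hypothetical, and only the count-density normalization (target tags per bp per reference tag) is shown.

```python
from bisect import bisect_left


def count_correlation(ref, target, lo, hi, window):
    """Histogram of target tags at distances lo..hi (exclusive) from ref tags.

    ref, target: dicts mapping position -> tag count.
    Returns one count density per bin of width `window`.
    """
    n_bins = (hi - lo) // window
    pair_counts = [0] * n_bins
    target_positions = sorted(target)
    for p, ref_count in ref.items():
        # first target position at distance >= lo from p
        i = bisect_left(target_positions, p + lo)
        while i < len(target_positions):
            d = target_positions[i] - p
            if d >= lo + n_bins * window:
                break
            # every ref tag pairs with every target tag at this distance
            pair_counts[(d - lo) // window] += ref_count * target[target_positions[i]]
            i += 1
    total_ref = sum(ref.values())
    return [c / (total_ref * window) for c in pair_counts]


# Toy data: every + strand tag has - strand tags 140 bp downstream
plus_tags = {100: 1, 500: 2}
minus_tags = {240: 1, 640: 2, 900: 1}
histogram = count_correlation(plus_tags, minus_tags, lo=0, hi=200, window=10)
peak_distance = 10 * histogram.index(max(histogram))
```

In this toy data `peak_distance` is the start of the bin that contains the distance 140, which is how a plot of this kind would point to the average fragment size.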

13
Correlation plot: Example 1
Input data: Ref: STAT1 5' tags; Target: STAT1 3' tags
Analysis parameters: Range: -400, 600; Window width: 10;
Count cut-off: 3; Y-axis: count density
Observations: Peak center: 140; Peak count density: 0.03;
Background: < 0.007
14
Correlation plot: Example 2
Input data: Ref: CTCF 5' tags; Target: CTCF 3' tags
Analysis parameters: Range: -400, 400; Window width: 5;
Count cut-off: 3; Y-axis: count density
Observations: Peak center: 75; Peak count density: 0.06;
Background: < 0.002
15
Method 2: ChIP-Center
  • Purpose
      • To map tag counts to the expected center position
        of a protein-DNA complex
  • Input
      • Oriented tag counts for a ChIP-Seq feature
  • Output
      • centered, un-oriented tag counts
  • Motivation
      • 5' and 3' tag positions show relative
        displacement to each other
      • best estimates for protein-binding site center
        position
          • 5' end position + ½ fragment length
          • or 3' end position - ½ fragment length
      • centered tag count distribution more useful for
          • peak recognition and partitioning
          • data viewing in genome browser
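The centering step amounts to a strand-dependent shift by half the fragment length. A minimal sketch, assuming tags are given as `(position, strand, count)` tuples with `+` for 5' tags and `-` for 3' tags; `center_tags` is a hypothetical name.

```python
from collections import Counter


def center_tags(tags, fragment_length):
    """Shift oriented tags to the estimated complex center and merge strands."""
    shift = fragment_length // 2
    centered = Counter()
    for position, strand, count in tags:
        # + strand tags move downstream, - strand tags move upstream
        center = position + shift if strand == "+" else position - shift
        centered[center] += count
    return centered


# Two tag positions flanking one 140 bp fragment population
tags = [(1000, "+", 2), (1140, "-", 3)]
centered = center_tags(tags, fragment_length=140)
```

With a fragment length of 140 bp both tag positions collapse onto position 1070, so the two displaced strand-specific peaks become a single un-oriented one.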

16
Auto-correlation of centered ChIP-Seq tags
Input data: Ref: STAT1 centered-70; Target: STAT1 centered-70
Analysis parameters: Range: -400, 600; Window width: 10;
Count cut-off: 3; Y-axis: count density
Observations: Peak center: 0; Peak count density: 0.06;
Peak volume: 5 counts; Background: < 0.014
17
Method 3: ChIP-Peak
  • Purpose
      • identification of peaks corresponding to in vivo
        protein-DNA complexes
  • Input
      • centered tag counts
  • Output
      • list of peak center positions
  • Method
      • consider only positions which have at least one
        tag count
      • for each position, determine the cumulative tag
        count in a window of width w
      • select as peaks those positions which
          • have cumulative tag count ≥ threshold t
          • are local maxima within range r
      • Optional: refinement of peak center (center of
        gravity within window)
  • Interface to sequence analysis programs
      • download of sequences around peak center positions
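The method steps above can be sketched as follows. This is an illustration under assumptions, not the ChIP-Peak program: the window is taken as symmetric around each position, the threshold test is "at least t", ties between equal neighbours are resolved in favour of the leftmost position, and the optional center-of-gravity refinement is left out. `find_peaks` is a hypothetical name.

```python
from bisect import bisect_left, bisect_right


def find_peaks(counts, w, t, r):
    """counts: dict of centered position -> tag count. Returns peak positions."""
    positions = sorted(counts)  # only positions with at least one tag
    prefix = [0]
    for p in positions:
        prefix.append(prefix[-1] + counts[p])

    def window_sum(center):
        # cumulative tag count in [center - w/2, center + w/2]
        i = bisect_left(positions, center - w // 2)
        j = bisect_right(positions, center + w // 2)
        return prefix[j] - prefix[i]

    cumulative = {p: window_sum(p) for p in positions}
    peaks = []
    for p in positions:
        if cumulative[p] < t:
            continue
        i = bisect_left(positions, p - r)
        j = bisect_right(positions, p + r)
        # local maximum within range r; on ties the leftmost position wins
        if all(cumulative[q] < cumulative[p]
               or (cumulative[q] == cumulative[p] and q >= p)
               for q in positions[i:j]):
            peaks.append(p)
    return peaks


counts = {100: 1, 105: 4, 112: 2, 500: 1, 900: 2, 912: 5}
peaks = find_peaks(counts, w=20, t=5, r=200)
```

The prefix sums keep each window query logarithmic in the number of occupied positions, which matters when millions of tag positions are scanned.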

18
Example ChIP-Peak: Locating in vivo STAT1-binding
sites
Input data: Robertson et al. (2007) Nature
Methods 4, 651-657.
Cell material: Interferon γ-stimulated HeLa S3 cells.
About 15 million tags in total.
Analysis parameters: centering 70 bp, window w = 200 bp,
exclusion range r = 200 bp, threshold t = 100 counts
Result: 4446 peaks; sequence extraction range for
downstream analysis: -1000, 1000
Downstream sequence analysis: distribution of
TTCNNNGAA around STAT1 peaks. Sliding window size:
50. Figure produced with OPROF (SSA server).
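A positional profile of the consensus TTCNNNGAA around peak centers can be approximated as sketched here. This is not OPROF: it uses non-overlapping bins instead of a sliding window, the function name `motif_profile` is hypothetical, and the three input sequences are made-up toy data of equal length.

```python
import re

# Lookahead so that overlapping matches are also reported
STAT1_CONSENSUS = re.compile(r"(?=TTC[ACGT]{3}GAA)")


def motif_profile(sequences, window):
    """Fraction of sequences with a consensus match starting in each bin."""
    n_bins = len(sequences[0]) // window
    hits = [0] * n_bins
    for seq in sequences:
        bins_with_match = {m.start() // window
                           for m in STAT1_CONSENSUS.finditer(seq.upper())}
        for b in bins_with_match:
            if b < n_bins:
                hits[b] += 1
    return [h / len(sequences) for h in hits]


# Toy sequences of length 50; two of them carry a match near the center
sequences = [
    "A" * 20 + "TTCCGGGAA" + "A" * 21,
    "C" * 22 + "TTCAAAGAA" + "C" * 19,
    "G" * 50,
]
profile = motif_profile(sequences, window=10)
```

Because TTCNNNGAA is its own reverse complement, scanning one strand is enough for this particular consensus.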
19
Peak extraction: how to choose the parameters
How to choose the peak width?
  • Based on auto-correlation plots
How to choose the threshold?
  • Based on autocorrelation plot
  • Based on a statistical model (requires some
    assumptions)
  • By comparison of the results with a control set
    (see below)
  • By measuring the enrichment of consensus binding
    sites
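One possible statistical model, given purely as an illustration, is a uniform Poisson background: tags fall independently and evenly over the genome, and the threshold is the smallest count for which fewer than one false peak is expected genome-wide. The talk does not say which model is meant, so the model itself, the function names, and the round genome size of 3 billion bp are all assumptions.

```python
import math


def poisson_tail(k, lam):
    """P(X >= k) for X ~ Poisson(lam), lam > 0, summed directly over the upper tail."""
    term = math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))
    total = 0.0
    i = k
    while term > total * 1e-17:
        total += term
        i += 1
        term *= lam / i
    return total


def min_threshold(n_tags, genome_size, window, max_false_peaks=1.0):
    """Smallest t such that the expected number of background windows
    with at least t tags stays below max_false_peaks."""
    lam = n_tags * window / genome_size  # expected tags per window
    t = 1
    while poisson_tail(t, lam) * genome_size > max_false_peaks:
        t += 1
    return t


# 15.1 million tags, 200 bp window, hypothetical 3e9 bp genome
threshold = min_threshold(n_tags=15_100_000, genome_size=3_000_000_000, window=200)
```

Real ChIP-Seq background is far from uniform, so a threshold obtained this way is best read as a lower bound to be checked against a control set.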
20
ChIP-Seq applications to histone modifications
Hypothetical association with functional
chromatin domains:
  • H3K4me3: promoters
  • H2AZ: promoters
  • H3K4me1: enhancers
  • H3K36me3: transcribed regions
  • H3K27me3: gene silencing
21
ChIP-Seq data for histone modifications have
nucleosome resolution
Based on data from Barski et al. 2007, Cell 129,
823-837. ChIP-Seq tags from both strands centered
by 70 bp. WIG file resolution: H3K4me1 50 bp,
H3K4me3 10 bp, H2A.Z 25 bp.
22
Chromatin domains are differentially occupied in
different cell lines
Mouse Nanog region, H3K4me3 profiles. ES, ESHyb:
embryonic stem cell lines; MEF: embryonic
fibroblasts; NP: neural progenitors. Data from
Mikkelsen et al. (2007). Genome-wide maps of
chromatin state in pluripotent and
lineage-committed cells. Nature 448, 553-560.
23
From ChIP-Seq peaks to TF binding site matrices
Some considerations:
  • Ab initio motif discovery or consensus
    sequence-initiated refinement?
  • Speed (time complexity) of downstream programs
  • Assumptions about binding site: fixed length or
    variable lengths
  • Assumptions about number of different motifs to be
    found
Some software tools:
  • MEME
      • Seed search, EM refinement in one program
      • Time complexity O(N^2)
      • Can find multiple motifs in one run
  • HMM training (e.g. MAMOT)
      • Needs initial model
      • Flexible, can handle palindromes with variable
        spacing between half-sites
24
STAT1 Sequence Motif Defined by ChIP-Seq data
Input: 4446 ChIP peak regions, 200 bp
Ab initio motif discovery: MEME (zoops)
Matrix from experimental in vivo sites
Matrix from SELEX
Overall similar matrices, but old data sets too
small for definite conclusions
25
In vivo and in vitro specificity of CTF/NF1
(Data from Milos Pjanic and Nicolas Mermod,
University of Lausanne)
In vivo specificity: ChIP-Seq
  • Cellular source: mouse embryonic fibroblasts
  • Mapping software: Eland (Illumina)
  • ChIP protocol: fixation (formaldehyde) within
    cells, sonication, chromatin IP, DNA extraction
  • Input data for sequence analysis: total 9.7
    million tags; within repeats 3.4 million tags;
    unique regions 6.3 million tags
In vitro specificity: HTP-SELEX
  • Weight matrix from >5000 sites
26
Identical in vivo and in vitro binding
specificity for CTF/NF1
  • Initial model
  • In vitro specificity
      • Training set: synthetic DNA, SAGE/SELEX, 5579
        sequences, length 25 bp
      • EM training
  • In vivo specificity
      • Training set: mouse genome, ChIP-seq peaks, 1265
        sequences, length 200 bp
      • EM training
28
Selective occupancy of genomic TF target sites
General data processing pipeline:
  • ChIP-Seq data
  • Peak finding algorithm: peak positions
  • Motif finding: binding site weight matrix
  • Genome scan: potential target sites in genome
  • Annotated list of target sites: genome position,
    matrix score, count coverage
29
Only a small fraction of potential STAT1 sites
occupied in vivo!
STAT1 matrix:
A    0 -12 -12  -1  -6   0   0  -8   5   5   2
C    0  -9 -10   5   4   0 -13 -12  -4 -13  -2
G   -2 -13  -4 -12 -13   0   4   5 -10  -9   0
T    2   5   5  -8   0   0  -6  -1 -12 -12   0
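Scanning a sequence with this matrix can be sketched as below. The matrix values are read off the slide, assuming 11 columns per base in the order A, C, G, T. The test sequence is the MER41B fragment quoted on a later slide of this talk. The cut-off of 30 and the names `score` and `scan` are illustrative assumptions.

```python
STAT1_MATRIX = {
    "A": [0, -12, -12, -1, -6, 0, 0, -8, 5, 5, 2],
    "C": [0, -9, -10, 5, 4, 0, -13, -12, -4, -13, -2],
    "G": [-2, -13, -4, -12, -13, 0, 4, 5, -10, -9, 0],
    "T": [2, 5, 5, -8, 0, 0, -6, -1, -12, -12, 0],
}
WIDTH = 11


def score(site):
    """Sum of the matrix elements selected by the bases of an 11 bp site."""
    return sum(STAT1_MATRIX[base][i] for i, base in enumerate(site))


def scan(sequence, cutoff):
    """Return (start, score) for every window scoring at least `cutoff`."""
    seq = sequence.upper()
    hits = []
    for start in range(len(seq) - WIDTH + 1):
        window = seq[start:start + WIDTH]
        if set(window) <= set("ACGT"):  # skip windows containing N etc.
            s = score(window)
            if s >= cutoff:
                hits.append((start, s))
    return hits


mer41b = "TGAGTTTTCCGGGAAAGGGGTGGGCAATTCCCGGAACTGAGGGTT"
hits = scan(mer41b, cutoff=30)
spacing = hits[1][0] - hits[0][0]
```

The matrix is symmetric under reverse complementation (the A row reversed equals the T row, the C row reversed equals the G row), so a site and its reverse complement get the same score and one strand suffices.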
30
Open question: Does STAT1 recognize different
types of motifs in vivo?
Another group has identified two different motifs
in the same data sets.
Jothi et al. (2008) Genome-wide identification of
in vivo protein-DNA binding sites from ChIP-Seq
data. Nucleic Acids Res. 36, 5221.
31
The helper motif hypothesis
  • Helper motifs
      • occur in the vicinity of in vivo occupied motif
        matches
      • bind to the same or a different transcription
        factor
      • synergistic effects through protein interactions
      • spacing between motifs may be critical
      • symmetric relationship: motifs help each other
      • interacting motifs are part of cis-regulatory
        modules (CRMs)

32
How to demonstrate synergistic interactions
between motifs
  • Strategy
      • define matrix for motif A from ChIP-Seq data
      • search genome for motif A matches
      • compile genomic motif A sites with high ChIP tag
        coverage
      • as control set, compile high-scoring matches with
        low coverage
  • Compute distance correlation plot
      • motif B (predicted) against motif A in vivo
        occupied
      • motif B (predicted) against motif A non-occupied
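The compilation of the occupied and control sets in this strategy can be sketched as follows. The data layout (motif matches as `(position, score)` pairs, centered tags as a position-to-count dictionary), the window used to measure coverage, the separate score cut-off for controls, and the name `split_sites` are all assumptions made for illustration.

```python
def split_sites(sites, centered_counts, half_window,
                min_score, min_coverage, control_min_score):
    """Split motif matches into in vivo occupied sites and unoccupied controls."""
    occupied, control = [], []
    for position, score in sites:
        # tag coverage: centered tags within half_window of the motif match
        coverage = sum(count for p, count in centered_counts.items()
                       if abs(p - position) <= half_window)
        if score >= min_score and coverage >= min_coverage:
            occupied.append(position)
        elif score >= control_min_score and coverage == 0:
            control.append(position)
    return occupied, control


sites = [(1000, 35), (5000, 41), (9000, 12)]
centered_counts = {990: 20, 1010: 15, 9000: 50}
occupied, control = split_sites(sites, centered_counts, half_window=100,
                                min_score=30, min_coverage=30,
                                control_min_score=40)
```

The coverage loop is quadratic and only meant for small examples. The two position lists can then serve as reference features for a distance histogram against predicted motif B sites.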

33
Synergistic motif interactions: STAT1 site pairs
  • Biological background
      • STAT proteins can form tetramers
      • known examples of STAT site pairs in functional
        enhancers
  • Hypothesis
      • STAT1 motif pairs preferentially attract STAT1
        protein
      • the optimal distance can be inferred from
        ChIP-Seq data
  • Data sets
      • genomic sites with score ≥ 30, coverage ≥ 30
      • control: genomic sites with score ≥ 40, coverage
        = 0

Example: Il2ra enhancer with 2 STAT5 sites
(Lécine et al. 1997, Mol Cell Biol)
34
Occurrence of STAT1 motifs downstream of occupied
STAT1 motifs
  • Observation
      • 6% of in vivo STAT1 sites have a second STAT1
        motif 21 bp downstream
      • very strong preference for exact spacing of 21
        (center-to-center)
      • no second STAT1 sites in control sequences

35
Too good to be true! The MER41B story
  • A number of strange findings led to the following
    discovery
      • A rare LTR repetitive element (MER41B) contains
        STAT1 motif pairs
      • Most STAT1 motif pairs with 21 bp distance come
        from MER41B
      • Most of the STAT1 site pairs in MER41B are
        occupied in vivo
      • MER41B explains motif M2 found by Jothi et al.
        (2008).
      • A sizable fraction (>5%) of in vivo STAT1 sites
        reside in MER41B elements

MER41B: ...TGAGTTTTCCGGGAAAGGGGTGGGCAATTCCCGGAACTGAGGGTT
36
Lesson to be learned: mask repeats before
analyzing motifs
Total set of in vivo STAT motif pairs: 4307
sites, after masking 2901. A small peak around 20
remains in the repeat-masked set!
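A minimal sketch of the masking step, assuming the input sequences are soft-masked, meaning repeats are written in lowercase as in UCSC genome FASTA files. The talk does not say how masking was done, so this is only one possible route, and `hard_mask` is a hypothetical helper name.

```python
def hard_mask(sequence):
    """Replace soft-masked (lowercase) repeat bases by N."""
    return "".join("N" if base.islower() else base for base in sequence)


# Toy sequence: a repeat-derived STAT1-like site in lowercase between unique DNA
soft_masked = "ACGTttccgggaaACGT"
masked = hard_mask(soft_masked)
```

A motif scanner that skips windows containing N will then ignore repeat-derived sites such as those in MER41B elements.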
37
STAT1 motif pair analysis in repeat-masked
sequences
  • Observations
      • a clear but lower peak remains after repeat
        masking
      • optimal distance between STAT1 pairs: 18-21 bp
      • weaker over-representation of STAT1 sites up to
        70 bp downstream

38
Correlation of in vivo occupied TFBS conservation
with genomic and epigenetic properties
Data:
  • ChIP peak positions for 15 TFs in mouse ES cells
    (Chen et al. 2008, Cell)
  • Histone modifications for mouse ES cells
    (Mikkelsen et al. 2007, Nature)
  • Cross-species conservation scores (UCSC genome
    browser database)
  • Transcription start annotations (ENSEMBL release
    50)
Methods:
  • Feature correlation analysis with TFBS peaks as
    reference positions
  • Hierarchical clustering of TFBS
Goal:
  • To identify and assess explanatory variables for
    in vivo occupancy
  • To classify transcription factors based on in vivo
    binding preferences
39
Cross-Species Conservation Patterns
  • Observations
      • peak size: Suz12 >> c-Myc, Zfx > Klf4 > Sox2,
        Nanog, Smad1 > Ctcf
      • CTCF has a very narrow peak: stand-alone
        elements?

40
Promoter association
  • Observations
      • c-Myc, Zfx, Klf4 strongly promoter associated
        (same peak shape)
      • Suz12 also promoter associated but with broader
        peak
      • other factors show no or only weak promoter
        association

41
H3K4me3 association
  • Observations
  • Same trends as for TSS analysis but broader peak

42
H3K27me3 association (repressive chromatin mark)
  • Observations
      • Strong and broad peak for Suz12
      • No peaks or negative correlations for all other
        factors

43
Clustering of TF binding sites based on genomic
features
  • TF classes observed
      • Promoter-associated: c-Myc, n-Myc, E2f1, Zfx,
        Klf4
      • Weakly conserved: Stat3, Tcfcp2l1, Esrrb
      • Nanog/Sox2-like: Nanog, Sox2, Smad1, p300, Ctcf
  • Outliers
      • Suz12 (repressive function)
      • Oct4 (clustering artifact?)

44
Some conclusions
  • ChIP-Seq makes in vivo TF binding sites visible
  • ChIP-Seq combined with motif search reaches
    single bp resolution
  • Transcription factors appear to have same
    specificity in vitro and in vivo
  • Only a small fraction of high-scoring binding
    sites are occupied in vivo
  • Analysis of ChIP-Seq data can identify helper
    motifs
  • Always use repeat-masked sequences for motif
    analysis
  • In vivo sites of transcription factors have
    distinct
      • cross-species conservation patterns
      • preferences for different types of control
        regions (promoters, others)
      • chromatin context (H3K4me3, H3K27me3, etc.)

45
Many thanks to
  • ChIP-Seq server specifically: Giovanna
    Ambrosini, Christoph Schmid
  • Other web servers, data management
    infrastructure: Peter Sperisen, Viviane Praz
  • Important ongoing collaborations: Milos Pjanic,
    Nicolas Mermod (UNIL), ChIP-Seq for CTF/NF1 in
    mouse cells
  • All researchers who made their ChIP-Seq data
    publicly available!