Title: Discovering cis-regulatory motifs using genome-wide sequence and expression data
1Discovering cis-regulatory motifs using genome-wide sequence and expression data
- Chaim Linhart, Yonit Halperin, Igor Ulitsky, Ron Shamir
2Gene expression regulation
- Transcription is regulated mainly by transcription factors (TFs): proteins that bind to DNA subsequences, called binding sites (BSs)
- TFBSs are located mainly in the gene's promoter: the DNA sequence upstream of the gene's transcription start site (TSS)
- TFs can promote or repress transcription
- Other regulators: micro-RNAs (miRNAs)
3TFBS models
- The BSs of a particular TF share a common pattern, or motif, which is often modeled using:
- Degenerate string: GGWATB (W = A,T; B = C,G,T)
  Example sequences containing a match: ATCGGAATTCTGCAG, GGCAATTCGGGAATG, AGGTATTCTCAGATTA
- PWM: Position weight matrix (cutoff: 0.009)
     1    2    3    4    5    6
  A  0.1  0.8  0    0.7  0.2  0
  C  0    0.1  0.5  0.1  0.4  0.6
  G  0    0    0.5  0.1  0.4  0.1
  T  0.9  0.1  0    0.1  0    0.3
  Example sequences and their PWM scores:
- AGCTACACCCATTTAT 0.06
- AGTAGAGCCTTCGTG 0.06
- CGATTCTACAATATGA 0.01
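The two motif models above can be illustrated with a minimal sketch. It assumes that a PWM score is the product of the per-position probabilities of a 6-bp window and that a sequence's score is its best window; this reading reproduces the 0.06 and 0.01 values on the slide, but the function names and the exact scoring rule are assumptions, not taken from the talk.

```python
import re

# IUPAC codes needed for the slide's degenerate string (W = A/T, B = C/G/T).
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T", "W": "[AT]", "B": "[CGT]"}

def degenerate_hits(pattern, seq):
    """Start offsets (0-based) where the degenerate string matches seq."""
    regex = re.compile("(?=(" + "".join(IUPAC[c] for c in pattern) + "))")
    return [m.start() for m in regex.finditer(seq)]

# PWM from the slide, one dict per motif position 1..6.
PWM = [
    {"A": 0.1, "C": 0.0, "G": 0.0, "T": 0.9},
    {"A": 0.8, "C": 0.1, "G": 0.0, "T": 0.1},
    {"A": 0.0, "C": 0.5, "G": 0.5, "T": 0.0},
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.2, "C": 0.4, "G": 0.4, "T": 0.0},
    {"A": 0.0, "C": 0.6, "G": 0.1, "T": 0.3},
]

def pwm_score(window):
    """Product of per-position probabilities (assumed scoring rule)."""
    score = 1.0
    for column, base in zip(PWM, window):
        score *= column[base]
    return score

def best_hit(seq):
    """(score, offset) of the best-scoring window in seq."""
    k = len(PWM)
    return max((pwm_score(seq[i:i + k]), i) for i in range(len(seq) - k + 1))

CUTOFF = 0.009  # cutoff shown on the slide

if __name__ == "__main__":
    print(degenerate_hits("GGWATB", "ATCGGAATTCTGCAG"))
    for s in ("AGCTACACCCATTTAT", "AGTAGAGCCTTCGTG", "CGATTCTACAATATGA"):
        score, offset = best_hit(s)
        print(s, round(score, 3), offset, score >= CUTOFF)
```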
4Motif discovery: The typical two-step pipeline
(Figure: pipeline diagram. Labels: Co-regulated gene set; Promoter/3' UTR sequences)
5Motif discovery: Goals and challenges
- Goal: Reverse-engineer the transcriptional regulatory network
- Challenges:
- BSs are short and degenerate (non-specific)
- Promoters are long and complex (hard to model)
- Search space is huge (motif and sequence)
- Data is noisy
- What to look for? (enriched? localized? conserved?)
- Problem is still considered very difficult despite extensive research [Tompa 05]
6Amadeus: A Motif Algorithm for Detecting Enrichment in mUltiple Species
- Supports diverse motif discovery tasks:
- Finding over-represented motifs in one or more given sets of genes.
- Identifying motifs with global spatial features given only the genomic sequences.
- Simultaneous inference of motifs and their associated expression profiles given genome-wide expression datasets.
- How?
- A general pipeline architecture for enumerating motifs.
- Different statistical scoring schemes of motifs for different motif discovery tasks.
7Motif search algorithm
- Pipeline of refinement phases
- Each phase receives the best candidates of the previous phase, and refines them
- First phases are simple and fast (e.g., try all k-mers); last phases are more complex (e.g., optimize PWM)
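The idea of chained refinement phases can be sketched on a toy example. Everything below is illustrative: the score (target hit rate minus background hit rate), the two phases, and all function names are assumptions chosen for brevity, not the phases or scores that Amadeus itself uses.

```python
from itertools import product

def hit_rate(kmers, seqs):
    """Fraction of sequences containing at least one of the k-mers."""
    return sum(any(k in s for k in kmers) for s in seqs) / len(seqs)

def score(kmers, target, background):
    # Toy score (assumption): target hit rate minus background hit rate.
    return hit_rate(kmers, target) - hit_rate(kmers, background)

def keep_best(cands, target, background, keep):
    """Pass only the top-scoring candidates on to the next phase."""
    ranked = sorted(cands, key=lambda c: score(c, target, background), reverse=True)
    return ranked[:keep]

def phase_all_kmers(k, target, background, keep):
    """Fast first phase: score every k-mer on its own."""
    cands = [frozenset({"".join(p)}) for p in product("ACGT", repeat=k)]
    return keep_best(cands, target, background, keep)

def phase_add_neighbor(cands, target, background, keep):
    """Slower phase: try adding one single-mismatch neighbor to each candidate."""
    refined = []
    for cand in cands:
        options = [cand]
        for kmer in cand:
            for i, base in product(range(len(kmer)), "ACGT"):
                if kmer[i] != base:
                    options.append(cand | {kmer[:i] + base + kmer[i + 1:]})
        refined.append(keep_best(options, target, background, 1)[0])
    return keep_best(refined, target, background, keep)

target = ["AAGGTACC", "TTGGTAAA", "CCGGAACC"]
background = ["AAAAAAAA", "CCCCCCCC", "TTTTTTTT", "ACACACAC"]
seeds = phase_all_kmers(4, target, background, keep=5)
top = phase_add_neighbor(seeds, target, background, keep=1)[0]

if __name__ == "__main__":
    print(sorted(top), score(top, target, background))
```

Keeping only a few candidates per phase is what makes the later, more expensive phases affordable.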
8PWM optimization phase
10Task I: Over-represented motifs in given target set
- Input: Target set (T) = co-regulated genes; Background (BG) set (B) = entire genome
- No sequence model is assumed!
- Motif scoring: Hypergeometric (HG) enrichment score
- b, t = BG/Target genes containing a hit
! BG set should be of the same nature as the target set, and much larger. E.g., all genes on microarray
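The slide's formula did not survive extraction. A common way to compute a hypergeometric enrichment p-value from the four quantities named above is the upper tail P(X >= t), where the |T| target genes are treated as a draw without replacement from the |B| background genes, of which b contain a hit. The sketch below assumes that definition; the function name is hypothetical.

```python
from math import comb

def hg_pvalue(B, T, b, t):
    """Upper-tail hypergeometric p-value.

    B: number of background genes, T: number of target genes,
    b: background genes containing a motif hit,
    t: target genes containing a motif hit.
    """
    total = comb(B, T)
    tail = sum(comb(b, i) * comb(B - b, T - i) for i in range(t, min(b, T) + 1))
    return tail / total

if __name__ == "__main__":
    # 10 background genes, 4 with a hit; 3 target genes, 2 with a hit.
    print(hg_pvalue(10, 3, 4, 2))
```

Exact integer binomial coefficients keep the computation stable even for genome-sized backgrounds, at the cost of speed.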
12Drawback of the HG score
- Length/GC-content distribution in the target set might significantly differ from the distribution in the BG set
- Very common in practice due to correlation between the expression/function of genes and the length/GC-content of their promoters and 3' UTRs
- The HG score might fail to discover the correct motif or detect many spurious motifs
- -> Use the binned enrichment score
- Slightly less sensitive than HG score
- but takes into account length/GC-content biases
13Drawback of the HG score
- Length/GC-content distribution in the target set might significantly differ from the distribution in the BG set
- Very common in practice due to correlation between the expression/function of genes and the length/GC-content of their promoters and 3' UTRs
- H0 assumes uniform sampling
- The HG score might fail to discover the correct motif or detect many spurious motifs
14Binned enrichment score
- Key idea: Binning sequences (figure axes: length, GC-content)
- B_i, T_i = BG/Target genes in i-th bin
- b_i = motif hits in i-th bin; t = |b ∩ T|
- Bin's sampling probability
- Assume uniform sampling per bin
- p_m = prob. of a target set gene to contain a hit
- Assume that |T| target genes are sampled with replacement from B
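The formulas on this slide were lost, so the following is only one reading of the bullets, stated as an assumption: a bin is chosen with probability |T_i|/|T|, a gene is drawn uniformly within the bin (hit probability |b_i|/|B_i|), so p_m is the weighted sum of these; because genes are sampled with replacement, the p-value is the binomial tail P(X >= t) with X ~ Bin(|T|, p_m). The function name and input layout are hypothetical.

```python
from math import comb

def binned_pvalue(bins, t):
    """Binned enrichment sketch.

    bins: list of (B_i, T_i, b_i) = (BG genes, target genes, BG genes with
          a hit) for each length/GC-content bin.
    t:    number of target genes containing a hit.
    Returns (p_m, p_value).
    """
    T = sum(Ti for _, Ti, _ in bins)
    # Probability that one sampled target gene contains a hit.
    p_m = sum((Ti / T) * (bi / Bi) for Bi, Ti, bi in bins if Bi > 0)
    # Binomial upper tail: sampling |T| genes with replacement.
    p_value = sum(comb(T, k) * p_m**k * (1 - p_m) ** (T - k) for k in range(t, T + 1))
    return p_m, p_value

if __name__ == "__main__":
    # Two bins: the target set is spread evenly, but hits are far more
    # frequent in the second bin of the background.
    print(binned_pvalue([(100, 2, 10), (100, 2, 50)], t=4))
```

Under this reading, a target set concentrated in hit-rich bins (for example long or GC-rich promoters) gets a larger p_m, which is how the length/GC-content bias is absorbed.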
15Test case: Human G2M cell-cycle genes
- Input: 350 genes expressed in the human G2M cell-cycle phases [Whitfield et al. 02]
(Figure: discovered motifs: CHR, NF-Y (CCAAT-box); pairs analysis)
- These motifs form a module associated with G2M [Elkon et al. 03, Tabach et al. 05, Linhart et al. 05]
16Results: Human G2M cell-cycle genes
- 350 genes expressed in the human G2M cell-cycle phases [Whitfield et al. 02].
(Figure: discovered motifs: CHR, CCAAT-box)
- Both motifs are associated with G2M [Elkon et al. 03, Tabach et al. 05, Linhart et al. 05].
17Benchmark I: Yeast TF target sets
- Source: ChIP-chip [Harbison et al., 04]
- Data: 173 target-sets of 83 TFs with known BS motifs
- Average set size: 58 genes (35 Kbps)
- Success rates (for top 2 motifs of lengths 8 and 10)
18Benchmark: Real-life metazoan datasets
- We constructed the first motif discovery benchmark that is based on a large compendium of experimental studies
- Source: Various (expression, ChIP-chip, Gene Ontology, ...)
- Data: 42 target-sets of 26 TFs and 8 miRNAs from 29 publications
- Species: human, mouse, fly, worm
- Average set size: 400 genes (383 Kbps)
(Figure: success rates; binned score improvement)
19Similarity between two motifs
- Euclidean
- Pearson correlation coefficient
- Kullback-Leibler divergence (relative entropy)
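The three measures listed above can be sketched for PWM columns. The slide does not say how columns are aligned or combined, so the sketch assumes two PWMs of equal length that are already aligned and averages a per-column measure; the small `eps` in the KL divergence is an added assumption to avoid division by zero.

```python
from math import log, sqrt

def euclidean(p, q):
    """Euclidean distance between two PWM columns (A, C, G, T order)."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def pearson(p, q):
    """Pearson correlation coefficient; 0.0 if a column has no variance."""
    mp, mq = sum(p) / len(p), sum(q) / len(q)
    cov = sum((a - mp) * (b - mq) for a, b in zip(p, q))
    var = sum((a - mp) ** 2 for a in p) * sum((b - mq) ** 2 for b in q)
    return cov / sqrt(var) if var > 0 else 0.0

def kl_divergence(p, q, eps=1e-9):
    """Relative entropy D(p || q); eps guards against zero entries."""
    return sum(a * log((a + eps) / (b + eps)) for a, b in zip(p, q) if a > 0)

def motif_measure(pwm1, pwm2, column_measure):
    """Average a per-column measure over two aligned, equal-length PWMs."""
    return sum(column_measure(c1, c2) for c1, c2 in zip(pwm1, pwm2)) / len(pwm1)

if __name__ == "__main__":
    m1 = [[0.7, 0.1, 0.1, 0.1], [0.0, 0.5, 0.5, 0.0]]
    m2 = [[0.6, 0.2, 0.1, 0.1], [0.1, 0.4, 0.4, 0.1]]
    for f in (euclidean, pearson, kl_divergence):
        print(f.__name__, motif_measure(m1, m2, f))
```

Note that Euclidean distance and KL divergence are dissimilarities (0 for identical columns), while Pearson correlation is a similarity (1 for identical columns), and KL divergence is not symmetric.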
20Metazoan benchmark: Other assessment methods for success rate
21Metazoan benchmark: Detailed results
22Binned score - examples
- Mef2 fly target-set [Blais et al. 05]
- Promoters longer than average (972bp vs. 840bp)
- Promoters have higher GC-content (53% vs. 49%)
- None of the programs discovered the correct motif
- Binned score -> Mef2 is the top-scoring motif
- hsa-miR-16 target-set [Linsley et al. 07]
- 3' UTRs longer than average (1700bp vs. 960bp)
- HG score: hsa-miR-16 signature is the top-scoring motif, but 10 more motifs have p < 1E-14
- Binned score: the correct motif has p = 1.7E-33. No spurious motifs.
- 198 mouse odorant receptor promoters [Michaloski et al. 06]
- Highly AT-rich (35 vs. 25 in the BG)
- HG score: Olf-1 was the third best motif, after AT-rich motifs
- Binned score: Olf-1 is the top-scoring motif
24Amadeus: Global spatial analysis
(Figure: pipeline diagram. Labels: Gene expression microarrays; Location analysis (ChIP-chip, ...); Functional group (e.g., GO term); Co-regulated gene set; Promoter sequences; Output: Motif(s))
25Task II: Global analyses
- Input: Sequences (no target-set / expression data)
- Motif scoring: Scores for spatial features of motif occurrences
- Localization w.r.t. the TSS
- Strand-bias
- Chromosomal preference
26Global analysis I: Localized human and mouse motifs
- Input: All human and mouse promoters (2 x 20,000)
- Score: localization
27Global analysis II: Chromosomal preference in C. elegans
- Input: All worm promoters (18,000)
- Score: chromosomal preference
- Results: Novel motif on chrom IV
28Amadeus is available at
- "Transcription factor and microRNA motif discovery: The Amadeus platform and a compendium of metazoan target sets"
- C. Linhart, Y. Halperin, R. Shamir, Genome Research 18:7, 2008
- (equal contribution)
http://acgt.cs.tau.ac.il/amadeus
30Amadeus - Allegro
(Figure: pipeline diagram. Labels: Gene expression microarrays; Expression data; Clustering; Cluster I, Cluster II, Cluster III; Co-regulated gene set; Promoter sequences; Output: Motif(s))
31Task III: Simultaneous inference of motifs and their associated expression profiles
- Input: Genome-wide expression profiles
- Motif scoring algorithm: Allegro (A Log-Likelihood based mEthod for Gene expression Regulatory motifs Over-representation discovery)
- Generalization of single condition analysis
- Outline:
- Learns an expression model that describes the expression pattern of the motif's putative targets
- The motif is scored for over-representation in the set of genes whose expression profiles match the expression model
32Allegro expression model
- Discretization of expression values: Expression pattern -> Discrete expression Pattern (DEP)
- Levels: e1 = Up (U): >= 1.0; e2 = Same (S): (-1.0, 1.0); e3 = Down (D): <= -1.0
- Example for gene g over conditions c1, c2, ..., cm: (-2.3, -0.8, ..., 1.5) -> (D, S, ..., U)
- Expression data should be (partially) pre-processed, e.g.:
- Time series -> log ratio relative to time 0
- Several tissues/mutations/... -> standardization
- Do NOT filter out non-responsive genes
- Expression model: CWM (Condition Weight Matrix)
- Non-parametric, log-likelihood based model, analogous to PWM for sequence motifs
- Sensitive, robust against extreme values, performs well in practice
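The discretization step can be sketched as follows. The thresholds +1.0 and -1.0 come from the slide; treating the boundaries as inclusive for U and D is an assumption (the comparison symbols were lost), and the two pre-processing helpers are hypothetical illustrations of "log ratio relative to time 0" and "standardization".

```python
from math import log2
from statistics import mean, pstdev

def log_ratio_to_time0(values):
    """Time series: log2 ratio of each time point relative to time 0."""
    return [log2(v / values[0]) for v in values]

def standardize(values):
    """Several tissues/mutations: zero mean, unit variance per gene.

    Assumes the gene's values are not all identical.
    """
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

def discretize(values, up=1.0, down=-1.0):
    """Map an expression pattern to a discrete expression pattern (DEP)."""
    return "".join("U" if v >= up else "D" if v <= down else "S" for v in values)

if __name__ == "__main__":
    print(discretize([-2.3, -0.8, 1.5]))            # slide example for gene g
    print(discretize(log_ratio_to_time0([1.0, 2.0, 0.5])))
    print(discretize(standardize([1.0, 2.0, 3.0])))
```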
33Allegro expression model
- Discretization of expression patterns: Expression pattern -> Discrete expression Pattern (DEP)
- Levels: e1 = Up (U): >= 1.0; e2 = Same (S): (-1.0, 1.0); e3 = Down (D): <= -1.0
- Example for gene g over conditions c1, c2, ..., cm: (-2.3, -0.8, ..., 1.5) -> (D, S, ..., U)
- Condition frequency matrix (CFM)
      c1    c2    cm
  U   0.05  0.1   0.78
  S   0.9   0.2   0.14
  D   0.05  0.7   0.08
- Condition weight matrix (CWM) (R = {r_ij} is the BG CFM)
-> Log-likelihood ratio (LLR) score
34Features of the CWM expression model
- Analogous to PWM for sequence motifs
- Non-parametric: Does not assume a specific type of distribution (e.g., Gaussian) for expression values
- Robust against extreme values
- Sensitivity:
- Can describe expression profiles that differ from the BG only in a small subset of the conditions
- Can describe the regulatory effect of TFs that act both as repressors and activators in the same condition
- Performance: Describes known modules (GO, ChIP-chip targets) better than commonly used metrics (Pearson/Spearman correlation, Euclidean distance)
35Allegro overview
36Learning a CWM of a motif
- Cross-validation-like procedure to avoid overfitting
(Figure labels: Motif; Microarray's genes; Motif target genes; CWM training set; CWM with rows e1, e2, e3 and columns c1, c2, c3, c4, c6; motif enrichment p-values shown: 2.0E-5, 1.5E-9, 3.5E-7)
37Compute expression LLR of all genes
- Input: (A) CWM F(w); (B) Discretized genome-wide expression profiles
(Figure: steps (1), (2), (3). Gene DEPs, e.g. g1 = UUSD, g2 = UDSU, g3 = UDSU; distinct patterns, e.g. p1 = UUSD, p2 = UDSU; minimum spanning tree over the patterns UUSD, UDSD, UDSU, DDSS with edge weights between 1 and 3. Labels: G, P, C)
- Example (Mouse TLRs dataset): G = 10000, P = 1442, C = 38, 1.6
38Human cell cycle [Whitfield et al., 02]
- Large dataset: 15,000 genes, 111 conditions, promoter region -1000 to +200 bps
(Figure: discovered motifs: E2F, CHR, CCAAT-box; p-values as listed: 1.3E-19, 6.6E-18, 3.9E-15; cell-cycle phase labels: G1/S, S, G2, G2/M)
- Allegro recovers the major regulators of the human cell cycle [Elkon et al. 03, Tabach et al. 05, Linhart et al. 05].
39Yeast HOG pathway [O'Rourke et al. 04]
- 6,000 genes, 133 conditions
- Allegro can discover multiple motifs with diverse expression patterns, even if the response is in a small fraction of the conditions
- Extant two-step techniques recovered only 4 of the above motifs:
- K-means/CLICK + Amadeus/Weeder: RRPE, PAC, MBF, STRE
- Iclust + FIRE: RRPE, PAC, Rap1, STRE
41Yeast HOG pathway: Comparison with the two-step pipeline
(+ = motif recovered, - = not recovered; columns give clustering method + motif finder)
Biological process                           Motif/TF  K-means/CLICK + Amadeus/Weeder  Iclust + FIRE  Allegro
General stress response                      RRPE      +                               +              +
General stress response                      PAC       +                               +              +
General stress response                      Rap1      -                               +              +
HOG and pheromone response pathways          Sko1      -                               -              +
HOG and pheromone response pathways          Ste12     -                               -              +
HOG and pheromone response pathways          MBF       +                               -              +
HOG and pheromone response pathways          Smp1      -                               -              -
HOG and pheromone response pathways          Skn7      -                               -              -
General stress response and HOG pathway      STRE      +                               +              +
42Immune response induced by Toll-like receptors
- 10000 genes, 38 conditions
- Our findings from [Elkon et al. 07] were recovered
(Figure: discovered motifs: ISRE, NF-κB, E2F; p-values as listed: 2.0E-22, 4.2E-17, 2.8E-17)
43 3' UTR analysis: Human stem cells [Mueller 08]
- 14,000 genes, 124 conditions (various types of proliferating cells)
- Biases in length / GC-content of 3' UTRs, e.g.:
  100 highly-expressed genes in   3' UTR length   GC
  Embryoid bodies                 584             47
  Undifferentiated ESCs           774             44
  ESC-derived fibroblasts         1240            39
  Fetal NSCs                      1422            43
- (ESCs = embryonic stem cells, NSCs = neural stem cells)
- Extant methods / Allegro with HG score report only false positives
44Human stem cells: results using binned score
(Figure columns: miRNA expression; targets' expression; current knowledge)
- Current knowledge:
- Most highly expressed miRNAs in human/mouse ESCs
- Abundant and functional in neural cell lineage
- Expressed specifically in neural lineage; active role in neurogenesis
- miRNA expression from [Laurent 08]
45C. elegans germline dataset [Reinke et al. 03]
- 12,000 genes in 20 different conditions
- Co-occurrence: p = 1.3E-54
(Figure: condition groups: hermaphrodite development; mutants. Contrasts: germline vs. somatic; hermaphrodite vs. male; oogenesis vs. spermatogenesis; adult hermaphrodite vs. L2-L3 hermaphrodite)
46Motif pair features (I)
- Co-occur on the same strand (112 genes vs. 53, p = 2.5E-6)
- Order-bias (104 genes vs. 8, p = 1E-22)
- Distance-bias (p = 1.12E-34)
- Gap not conserved. Short flanking regions are conserved
- Over-represented in chromosome I (p = 1.6E-8) and under-represented in chromosome X (p = 1E-4)
- GO enrichment: embryonic development (sensu Metazoa) (p = 1.4E-11), reproduction (p = 1.1E-8), hermaphrodite genitalia development (p = 4.9E-5), etc.
47Motif pair features (II)
Motif pair is specific to the Caenorhabditis genus
48Amadeus/Allegro - Additional features
- Motif pairs analysis
- Joint analysis of multiple datasets
- Evaluation of motifs using several scores
- Bootstrapping: get fixed p-value
- Sequence redundancy elimination: ignore sequences with long identical subsequence
- User-friendly and informative (most tools are textual and supply limited information!)
49Co-occurrence of motif pairs
- After postprocess phase
- T = target set; t1, t2 = target genes that contain a hit of the first and second motif, respectively; t12 = target genes that contain hits for both motifs
- [Elkon et al., 03]
- PWMs and their cutoffs are tuned to optimize the score
50Combining p-values: the weighted Z-transform
- Input: p-values from k independent tests
- H0: all the p-values are uniformly distributed
- Transform P_i into standard normal deviates Z_i
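The combination step is not spelled out on the slide. The sketch below uses the standard weighted Z-transform (weighted Stouffer method): Z_i = -Phi^{-1}(P_i), Z = sum(w_i Z_i) / sqrt(sum(w_i^2)), combined p = Phi(-Z). How the weights are chosen in Amadeus/Allegro is not stated in the talk, so equal weights are used here as a default assumption.

```python
from math import sqrt
from statistics import NormalDist

def weighted_z(pvalues, weights=None):
    """Combine independent one-sided p-values with the weighted Z-transform.

    Each p-value must lie strictly between 0 and 1.
    """
    nd = NormalDist()
    if weights is None:
        weights = [1.0] * len(pvalues)  # assumption: equal weights
    # Small p-values map to large positive standard normal deviates.
    z = [-nd.inv_cdf(p) for p in pvalues]
    combined = sum(w * zi for w, zi in zip(weights, z)) / sqrt(sum(w * w for w in weights))
    return nd.cdf(-combined)

if __name__ == "__main__":
    print(weighted_z([0.05, 0.05]))
    print(weighted_z([0.01, 0.5], weights=[3.0, 1.0]))
```

Using `-inv_cdf(p)` rather than `inv_cdf(1 - p)` avoids losing precision when a p-value is very small.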
51Allegro is available at
- "Allegro: Analyzing expression and sequence in concert to discover regulatory programs"
- Y. Halperin, C. Linhart, I. Ulitsky, R. Shamir, Nucleic Acids Research, 2009
- (equal contribution)
http://acgt.cs.tau.ac.il/allegro
52Summary
- Developed Amadeus motif discovery platform
- Broad range of applications
- Target gene set
- Spatial features (sequence only)
- Expression analysis - Allegro
- Sensitive and efficient
- Easy to use, feature-rich, informative
- New over-representation score to handle biases in length/GC-content of sequences
- Novel expression model - CWM
- Constructed a large, real-life, heterogeneous benchmark for testing motif finding tools
53Acknowledgements
- Tel-Aviv University: Chaim Linhart, Yonit Halperin, Igor Ulitsky, Adi Maron-Katz, Ron Shamir
- The Hebrew University of Jerusalem: Gidi Weber