1
Discovering cis-regulatory motifs using
genome-wide sequence and expression data

Chaim Linhart, Yonit Halperin,
Igor Ulitsky, Ron Shamir

2
Gene expression regulation

Transcription is regulated mainly by
transcription factors (TFs): proteins that bind
to DNA subsequences called binding sites (BSs)
TFBSs are located mainly in the gene's promoter:
the DNA sequence upstream of the gene's
transcription start site (TSS)
TFs can promote or repress transcription
Other regulators: micro-RNAs (miRNAs)

3
TFBS models

The BSs of a particular TF share a common
pattern, or motif, which is often modeled using:
Degenerate string: GGWATB (W = A,T; B = C,G,T)
PWM: position weight matrix

Sequences containing a GGWATB site:
ATCGGAATTCTGCAG
GGCAATTCGGGAATG
AGGTATTCTCAGATTA

PWM hits (cutoff 0.009), sequence and score:
AGCTACACCCATTTAT 0.06
AGTAGAGCCTTCGTG 0.06
CGATTCTACAATATGA 0.01

PWM (rows: bases; columns: motif positions):
    1    2    3    4    5    6
A   0.1  0.8  0    0.7  0.2  0
C   0    0.1  0.5  0.1  0.4  0.6
G   0    0    0.5  0.1  0.4  0.1
T   0.9  0.1  0    0.1  0    0.3
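
The two models on this slide can be sketched in a few lines of Python. This is only an illustration, not the Amadeus implementation: the function names are hypothetical, the IUPAC table covers just the codes used on the slide, and the PWM score is assumed to be the product of the per-position frequencies (an assumption that reproduces the 0.06 shown for the site TACACC).

```python
import re

# IUPAC codes used on the slide (W = A/T, B = C/G/T); plain bases map to themselves.
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T", "W": "[AT]", "B": "[CGT]"}

def degenerate_hits(motif, seq):
    """Return (start, site) for every, possibly overlapping, match of a degenerate string."""
    body = "".join(IUPAC[c] for c in motif)
    pattern = re.compile("(?=(" + body + "))")  # lookahead allows overlapping hits
    return [(m.start(), m.group(1)) for m in pattern.finditer(seq)]

# PWM from the slide, one dict per motif position (positions 1..6).
PWM = [
    {"A": 0.1, "C": 0.0, "G": 0.0, "T": 0.9},
    {"A": 0.8, "C": 0.1, "G": 0.0, "T": 0.1},
    {"A": 0.0, "C": 0.5, "G": 0.5, "T": 0.0},
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.2, "C": 0.4, "G": 0.4, "T": 0.0},
    {"A": 0.0, "C": 0.6, "G": 0.1, "T": 0.3},
]

def pwm_score(pwm, site):
    """Assumed scoring rule: product of the frequencies of the observed bases."""
    score = 1.0
    for column, base in zip(pwm, site):
        score *= column[base]
    return score

def pwm_hits(pwm, seq, cutoff):
    """Return (start, site, score) for every window whose score reaches the cutoff."""
    k = len(pwm)
    hits = []
    for i in range(len(seq) - k + 1):
        site = seq[i:i + k]
        score = pwm_score(pwm, site)
        if score >= cutoff:
            hits.append((i, site, score))
    return hits
```

For example, `degenerate_hits("GGWATB", "ATCGGAATTCTGCAG")` finds the site GGAATT, and `pwm_hits(PWM, "AGCTACACCCATTTAT", 0.009)` includes the site TACACC.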
4
Motif discovery: The typical two-step pipeline
[Figure: a co-regulated gene set and its promoter/3' UTR sequences]
5
Motif discovery: Goals and challenges

Goal: Reverse-engineer the transcriptional
regulatory network
Challenges:
BSs are short and degenerate (non-specific)
Promoters are long and complex (hard to model)
Search space is huge (motif and sequence)
Data is noisy
What to look for? (enriched?, localized?,
conserved?)
The problem is still considered very difficult
despite extensive research [Tompa 05]

6
Amadeus: A Motif Algorithm for Detecting
Enrichment in mUltiple Species

Supports diverse motif discovery tasks:
Finding over-represented motifs in one or more
given sets of genes.
Identifying motifs with global spatial features
given only the genomic sequences.
Simultaneous inference of motifs and their
associated expression profiles given genome-wide
expression datasets.
How?
A general pipeline architecture for enumerating
motifs.
Different statistical scoring schemes of motifs
for different motif discovery tasks.

7
Motif search algorithm

Pipeline of refinement phases:
Each phase receives the best candidates of the
previous phase and refines them
First phases are simple and fast (e.g., try all
k-mers)
Last phases are more complex (e.g.,
optimize PWM)
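
A toy sketch of such a pipeline is given here, under loudly stated assumptions: the scoring function is a placeholder (target hit rate minus background hit rate), the second phase merely tries single-base variants instead of optimizing a PWM, and every function name is hypothetical. It is meant only to show the "enumerate cheaply, keep the best, refine" structure, not how Amadeus implements its phases.

```python
def hit_fraction(motif, seqs):
    """Fraction of sequences that contain the motif at least once."""
    return sum(motif in s for s in seqs) / len(seqs)

def toy_score(motif, targets, background):
    """Placeholder score (assumption): target hit rate minus background hit rate."""
    return hit_fraction(motif, targets) - hit_fraction(motif, background)

def keep_best(candidates, targets, background, keep):
    """Rank candidates by score (ties broken alphabetically) and keep the top ones."""
    ranked = sorted(candidates, key=lambda m: (-toy_score(m, targets, background), m))
    return ranked[:keep]

def enumerate_kmers(targets, k):
    """Fast first phase: every k-mer that occurs in some target sequence."""
    return {s[i:i + k] for s in targets for i in range(len(s) - k + 1)}

def single_base_variants(motifs):
    """Slower refinement phase: each surviving motif plus all single-base variants."""
    out = set(motifs)
    for m in motifs:
        for i in range(len(m)):
            for base in "ACGT":
                out.add(m[:i] + base + m[i + 1:])
    return out

def discover(targets, background, k, keep=5):
    """Run the two phases in order; each phase only sees the previous phase's survivors."""
    phase1 = keep_best(enumerate_kmers(targets, k), targets, background, keep)
    phase2 = keep_best(single_base_variants(phase1), targets, background, keep)
    return phase2
```

Because each phase passes on only `keep` candidates, the expensive later phases run on a small list rather than on the full motif space.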

8
PWM optimization phase
10
Task I: Over-represented motifs in a given
target set

Input: Target set (T) = co-regulated genes
Background (BG) set (B) = entire genome
No sequence model is assumed!
Motif scoring: Hypergeometric (HG) enrichment
score
b, t = BG/target genes containing a hit

! The BG set should be of the same nature as the
target set, and much larger. E.g., all genes on
the microarray
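
The formula for the HG score did not survive in the transcript. A minimal sketch of the standard hypergeometric tail is given here, assuming the score is the probability of seeing at least t target genes with a hit when |T| genes are drawn without replacement from a background of |B| genes, b of which contain a hit. The function name and argument order are illustrative.

```python
from math import comb

def hg_pvalue(B_size, T_size, b, t):
    """P(X >= t) for X ~ Hypergeometric(population B_size, successes b, draws T_size)."""
    total = comb(B_size, T_size)
    upper = min(b, T_size)
    tail = sum(comb(b, i) * comb(B_size - b, T_size - i) for i in range(t, upper + 1))
    return tail / total
```

Usage: with 10 background genes, 4 of which contain a hit, and a target set of 3 genes that all contain a hit, `hg_pvalue(10, 3, 4, 3)` evaluates to 4/120.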
12
Drawback of the HG score

Length/GC-content distribution in the target set
might significantly differ from the distribution
in the BG set
Very common in practice due to correlation
between the expression/function of genes and the
length/GC-content of their promoters and 3' UTRs
H0 assumes uniform sampling
The HG score might fail to discover the correct
motif or detect many spurious motifs
-> Use the binned enrichment score:
Slightly less sensitive than the HG score,
but takes into account length/GC-content biases

14
Binned enrichment score

Key idea: binning sequences by length and GC-content
B_i, T_i = BG/target genes in the i-th bin
b_i = motif hits in the i-th bin; t = |b ∩ T|
Bin's sampling probability
Assume uniform sampling per bin
p_m = prob. of a target-set gene to contain a hit
Assume that |T| target genes are sampled with
replacement from B
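
The formulas on this slide were lost in extraction. One reading of the definitions that remain is sketched here, and it is an assumption rather than the authors' exact score: a bin is sampled with probability T_i/|T|, a gene within the bin is sampled uniformly (so it contains a hit with probability b_i/B_i), and, because sampling is with replacement, the number of target genes with a hit is binomial with success probability p_m. The function name and the input layout are hypothetical.

```python
from math import comb

def binned_enrichment(bins, t):
    """bins: list of (B_i, T_i, b_i) per length/GC bin; t: target genes with a hit.

    Returns (p_m, p_value) under the assumed model:
      p_m     = sum_i (T_i / |T|) * (b_i / B_i)
      p_value = P(X >= t) for X ~ Binomial(|T|, p_m)
    """
    T_size = sum(T_i for _, T_i, _ in bins)
    # Bins with no background genes cannot contribute a hit probability.
    p_m = sum((T_i / T_size) * (b_i / B_i) for B_i, T_i, b_i in bins if B_i > 0)
    p_value = sum(
        comb(T_size, k) * p_m ** k * (1.0 - p_m) ** (T_size - k)
        for k in range(t, T_size + 1)
    )
    return p_m, p_value
```

Under this reading, a target set concentrated in bins where hits are common anyway (say, long or GC-rich promoters) gets a larger p_m, so the same t is judged less surprising than under the plain HG score.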

15
Test case: Human G2M cell-cycle genes

Input: 350 genes expressed in the human G2M
cell-cycle phases [Whitfield et al. 02]

Motifs found:
CHR
NF-Y (CCAAT-box)
Pairs analysis

These motifs form a module associated with G2M
[Elkon et al. 03, Tabach et al. 05, Linhart et
al. 05]

16
Results: Human G2M cell-cycle genes

350 genes expressed in the human G2M cell-cycle
phases [Whitfield et al. 02]

CHR
CCAAT-box

Both motifs are associated with G2M [Elkon et al.
03, Tabach et al. 05, Linhart et al. 05]

17
Benchmark I: Yeast TF target sets

Source: ChIP-chip [Harbison et al., 04]
Data: 173 target sets of 83 TFs with known BS
motifs
Average set size: 58 genes (35 Kbps)
Success rates (for top 2 motifs of lengths 8
and 10)

18
Benchmark: Real-life metazoan datasets

We constructed the first motif discovery
benchmark that is based on a large compendium of
experimental studies
Source: Various (expression, ChIP-chip, Gene
Ontology, ...)
Data: 42 target sets of 26 TFs and 8 miRNAs from
29 publications
Species: human, mouse, fly, worm
Average set size: 400 genes (383 Kbps)

[Figure: binned score improvement]
19
Similarity between two motifs

Euclidean
Pearson correlation coefficient
Kullback-Leibler divergence (relative entropy)
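
The three measures listed above can be sketched for two aligned PWMs of equal length, each given as a list of per-position frequency columns in A, C, G, T order. This is an illustration only: the slide does not say how motifs are aligned or how zero frequencies are handled, so the pseudocount `eps` and all function names are assumptions.

```python
from math import log, sqrt

def _flatten(pwm):
    """Concatenate all columns of a PWM into one flat list of frequencies."""
    return [x for column in pwm for x in column]

def euclidean_distance(p, q):
    return sqrt(sum((a - b) ** 2 for a, b in zip(_flatten(p), _flatten(q))))

def pearson_correlation(p, q):
    """Pearson correlation of the flattened matrices (undefined if either is constant)."""
    x, y = _flatten(p), _flatten(q)
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def kl_divergence(p, q, eps=1e-6):
    """Sum over positions of KL(p_col || q_col); eps is an assumed pseudocount."""
    total = 0.0
    for pc, qc in zip(p, q):
        # Add the pseudocount and renormalize so each column still sums to 1.
        pn = [(a + eps) / (sum(pc) + len(pc) * eps) for a in pc]
        qn = [(b + eps) / (sum(qc) + len(qc) * eps) for b in qc]
        total += sum(a * log(a / b) for a, b in zip(pn, qn))
    return total
```

Note that the Kullback-Leibler divergence is not symmetric, so `kl_divergence(p, q)` and `kl_divergence(q, p)` generally differ.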

20
Metazoan benchmark: Other assessment methods for
success rate
21
Metazoan benchmark: Detailed results
22
Binned score: examples

Mef2 fly target set [Blais et al. 05]:
Promoters are longer than average (972 bp vs. 840 bp)
Promoters have higher GC-content (53% vs. 49%)
None of the programs discovered the correct motif
Binned score -> Mef2 is the top-scoring motif

hsa-miR-16 target set [Linsley et al. 07]:
3' UTRs are longer than average (1700 bp vs. 960 bp)
HG score: the hsa-miR-16 signature is the top-scoring
motif, but 10 more motifs have p < 1E-14
Binned score: the correct motif has p = 1.7E-33, with
no spurious motifs

198 mouse odorant receptor promoters [Michaloski
et al. 06]:
Highly AT-rich (35 vs. 25 in the BG)
HG score: Olf-1 was the third-best motif, after
AT-rich motifs
Binned score: Olf-1 is the top-scoring motif

24
Amadeus: Global spatial analysis
[Figure: pipeline diagram with the following labels]
Co-regulated gene set
Gene expression microarrays
Location analysis (ChIP-chip, ...)
Promoter sequences
Functional group (e.g., GO term)
Output: Motif(s)
25
Task II: Global analyses
Scores for spatial features of motif
occurrences
Input: Sequences (no target set /
expression data)
Motif scoring:

Localization w.r.t. the TSS
Strand bias
Chromosomal preference

26
Global analysis I: Localized human and mouse motifs

Input:
All human and mouse promoters (2 x 20,000)
Score: localization

27
Global analysis II: Chromosomal preference in C.
elegans

Input:
All worm promoters (18,000)
Score: chromosomal preference

Results: Novel motif on chromosome IV
28
Amadeus is available at
http://acgt.cs.tau.ac.il/amadeus

Transcription factor and microRNA motif
discovery: The Amadeus platform and a compendium
of metazoan target sets,
C. Linhart, Y. Halperin, R. Shamir, Genome
Research 18(7), 2008
(equal contribution)
30
Amadeus - Allegro
[Figure: pipeline diagram with the following labels]
Gene expression microarrays
Expression data
Clustering: Cluster I, Cluster II, Cluster III
Co-regulated gene set
Promoter sequences
Output: Motif(s)
31
Task III: Simultaneous inference of motifs and
their associated expression profiles

Input: Genome-wide expression profiles
Motif scoring algorithm: Allegro (A
Log-Likelihood based mEthod for Gene expression
Regulatory motifs Over-representation discovery)
Generalization of single-condition analysis
Outline:
Learns an expression model that describes the
expression pattern of the motif's putative
targets
The motif is scored for over-representation in
the set of genes whose expression profiles match
the expression model

32
Allegro expression model

Discretization of expression values

Expression pattern -> Discrete expression pattern (DEP)
e1 = Up (U): >= 1.0
e2 = Same (S): (-1.0, 1.0)
e3 = Down (D): <= -1.0

Expression pattern of gene g:
c1    c2    ...  cm
-2.3  -0.8  ...  1.5

DEP of gene g:
c1  c2  ...  cm
D   S   ...  U

Expression data should be (partially)
pre-processed, e.g.:
Time series -> log ratio relative to time 0
Several tissues/mutations/... -> standardization
Do NOT filter out non-responsive genes
Expression model: CWM (Condition Weight Matrix)
Non-parametric, log-likelihood based model,
analogous to PWM for sequence motifs
Sensitive, robust against extreme values,
performs well in practice

33
Allegro expression model

Discretization of expression patterns into
discrete expression patterns (DEPs): Up (U),
Same (S), Down (D)

Condition frequency matrix (CFM)

    c1    c2   ...  cm
U   0.05  0.1  ...  0.78
S   0.9   0.2  ...  0.14
D   0.05  0.7  ...  0.08

Condition weight matrix (CWM)

(R = {r_ij} is the BG CFM)
-> Log-likelihood ratio (LLR) score
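
The CWM formula itself is missing from the transcript. By analogy with a PWM, a natural reading is w_ij = log(f_ij / r_ij), where f_ij is the target CFM and r_ij the BG CFM, with a gene's LLR score being the sum of the weights selected by its DEP. That reading is an assumption, as are the pseudocount and all function names in the sketch given here; the discretization thresholds (1.0 and -1.0) are the ones on the slide.

```python
from math import log

LEVELS = "USD"  # Up, Same, Down

def discretize(values, up=1.0, down=-1.0):
    """Map an expression pattern to a DEP string, one letter per condition."""
    return "".join("U" if v >= up else "D" if v <= down else "S" for v in values)

def condition_frequency_matrix(deps, pseudo=1.0):
    """One dict per condition with the frequency of U/S/D (pseudocount is an assumption)."""
    n_conditions = len(deps[0])
    matrix = []
    for j in range(n_conditions):
        counts = {e: pseudo for e in LEVELS}
        for dep in deps:
            counts[dep[j]] += 1
        total = sum(counts.values())
        matrix.append({e: c / total for e, c in counts.items()})
    return matrix

def condition_weight_matrix(target_cfm, bg_cfm):
    """Assumed CWM: w_ij = log(f_ij / r_ij)."""
    return [{e: log(f[e] / r[e]) for e in LEVELS} for f, r in zip(target_cfm, bg_cfm)]

def llr_score(cwm, dep):
    """LLR of one gene: sum of the weights picked out by its DEP."""
    return sum(weights[e] for weights, e in zip(cwm, dep))
```

For the gene on the slide, `discretize([-2.3, -0.8, 1.5])` gives `"DSU"` for conditions c1, c2, cm. Genes whose DEPs resemble the target set get positive LLR scores, while genes resembling the background get negative ones.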
34
Features of the CWM expression model

Analogous to PWM for sequence motifs
Non-parametric: Does not assume a specific type
of distribution (e.g., Gaussian) for expression
values
Robust against extreme values
Sensitivity:
Can describe expression profiles that differ from
the BG only in a small subset of the conditions
Can describe the regulatory effect of TFs that
act both as repressors and activators in the same
condition
Performance: Describes known modules (GO,
ChIP-chip targets) better than commonly used
metrics: Pearson/Spearman correlation, Euclidean
distance

35
Allegro overview
36
Learning a CWM of a motif
Cross-validation-like procedure to avoid
overfitting
[Figure: microarray genes, motif target genes, the CWM training set, and the resulting CWM (rows e1, e2, e3; columns c1 ... c6); motif enrichment p-values shown: 2.0E-5, 1.5E-9, 3.5E-7]
37
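Compute expression LLR of all genes

Input: (A) CWM F(w)
(B) Discretized genome-wide expression
profiles

[Figure, steps (1)-(3): a CWM (rows e1, e2, e3; columns c1 ... c6); genes with their DEPs (g1 = UUSD, g2 = UDSU, g3 = UDSU) mapped to distinct patterns (p1 = UUSD, p2 = UDSU); a minimum spanning tree over patterns such as UUSD, UDSD, UDSU, DDSS with edge weights 1-3; labels G, P, C]
Example (Mouse TLRs dataset): G = 10000,
P = 1442, C = 38; value shown: 1.6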
38
Human cell cycle [Whitfield et al., 02]

Large dataset: 15,000 genes, 111 conditions,
promoter region -1000 to 200 bps

Motif       p-value   Phases
E2F         1.3E-19   G1/S, S
CHR         6.6E-18   G2, G2/M
CCAAT-box   3.9E-15   G2, G2/M

Allegro recovers the major regulators of the
human cell cycle [Elkon et al. 03, Tabach et al.
05, Linhart et al. 05]
39
Yeast HOG pathway [O'Rourke et al. 04]

6,000 genes, 133 conditions

Allegro can discover multiple motifs with diverse
expression patterns, even if the response is in a
small fraction of the conditions
Extant two-step techniques recovered only 4 of
the above motifs:
K-means/CLICK + Amadeus/Weeder: RRPE, PAC, MBF,
STRE
Iclust + FIRE: RRPE, PAC, Rap1, STRE

41
Yeast HOG pathway: Comparison with the two-step
pipeline
(+ = motif recovered, - = not recovered)

Biological process                            Motif/TF  K-means/CLICK +   Iclust +  Allegro
                                                        Amadeus/Weeder    FIRE
General stress response                       RRPE      +                 +         +
General stress response                       PAC       +                 +         +
General stress response                       Rap1      -                 +         +
HOG and pheromone response pathways           Sko1      -                 -         +
HOG and pheromone response pathways           Ste12     -                 -         +
HOG and pheromone response pathways           MBF       +                 -         +
HOG and pheromone response pathways           Smp1      -                 -         -
HOG and pheromone response pathways           Skn7      -                 -         -
General stress response and HOG pathway       STRE      +                 +         +
42
Immune response induced by Toll-like receptors

10,000 genes, 38 conditions
Our findings from [Elkon et al. 07] were
recovered

Motif   p-value
ISRE    2.0E-22
NF-κB   4.2E-17
E2F     2.8E-17
43
3' UTR analysis: Human stem cells [Mueller 08]

14,000 genes, 124 conditions (various types of
proliferating cells)
Biases in length / GC-content of 3' UTRs, e.g.:

100 highly-expressed genes in   3' UTR length   GC
Embryoid bodies                 584             47
Undifferentiated ESCs           774             44
ESC-derived fibroblasts         1240            39
Fetal NSCs                      1422            43
(ESCs = embryonic stem cells, NSCs = neural
stem cells)

Extant methods / Allegro with HG score report
only false positives

44
Human stem cells: results using binned score
[Figure: miRNA expression, targets' expression, and current knowledge for each reported miRNA]

Current knowledge:
Most highly expressed miRNAs in human/mouse ESCs
Abundant and functional in neural cell lineage
Expressed specifically in neural lineage; active
role in neurogenesis
miRNA expression from [Laurent 08]
45
C. elegans germline dataset [Reinke et al. 03]

12,000 genes in 20 different conditions

Co-occurrence: p = 1.3E-54
[Figure: conditions grouped as hermaphrodite development and mutants; comparisons shown: germline vs. somatic, hermaphrodite vs. male, oogenesis vs. spermatogenesis, adult hermaphrodite vs. L2-L3 hermaphrodite]
46
Motif pair features (I)

Co-occur on the same strand (112 genes vs. 53,
p = 2.5E-6)
Order bias (104 genes vs. 8, p = 1E-22)
Distance bias (p = 1.12E-34)
Gap is not conserved; short flanking regions are
conserved
Over-represented in chromosome I (p = 1.6E-8) and
under-represented in chromosome X (p = 1E-4)
GO enrichment: embryonic development (sensu
Metazoa) (p = 1.4E-11), reproduction (p = 1.1E-8),
hermaphrodite genitalia development (p = 4.9E-5),
etc.
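
The slide does not state which test produced the strand-bias and order-bias p-values. One plausible reading, sketched here as an assumption, is a one-sided binomial test in which each gene falls in either category with probability 0.5. Under that assumption the counts on the slide give values close to the reported ones (about 2.5E-6 for 112 vs. 53, and about 1E-22 for 104 vs. 8). The function name is illustrative.

```python
from math import comb

def binomial_tail(n, k, p=0.5):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p ** i * (1.0 - p) ** (n - i) for i in range(k, n + 1))

# Counts taken from the slide; the test itself is an assumption.
strand_bias = binomial_tail(112 + 53, 112)  # same strand vs. opposite strands
order_bias = binomial_tail(104 + 8, 104)    # one motif order vs. the other
```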

47
Motif pair features (II)
Motif pair is specific to the Caenorhabditis genus
48
Amadeus/Allegro: Additional features

Motif pairs analysis
Joint analysis of multiple datasets
Evaluation of motifs using several scores
Bootstrapping: get fixed p-value
Sequence redundancy elimination: ignore
sequences with long identical subsequences
User-friendly and informative (most tools are
textual and supply limited information!)
49
Co-occurrence of motif pairs

After postprocess phase
T = target set; t1, t2 = target genes that
contain a hit of the first and second motif,
respectively; t12 = target genes that contain
hits for both motifs

[Elkon et al., 03]
PWMs and their cutoffs are tuned to optimize the
score

50
Combining p-values: the weighted Z-transform

Input: p-values from k independent tests
H0: all the p-values are uniformly distributed
Transform P_i into standard normal deviates Z_i

Combined p-value
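
The formulas for this slide are missing from the transcript. The sketch given here follows the usual textbook form of the weighted Z-transform: Z_i is the standard normal deviate with upper-tail probability P_i, the combined statistic is Z_w = sum(w_i * Z_i) / sqrt(sum(w_i^2)), and the combined p-value is the upper tail of Z_w. How the weights are chosen is not stated on the slide, so equal weights are used by default as an assumption; the function name is hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def weighted_z_combine(pvalues, weights=None):
    """Combine independent p-values (each strictly between 0 and 1) into one p-value."""
    normal = NormalDist()  # standard normal: mean 0, standard deviation 1
    if weights is None:
        weights = [1.0] * len(pvalues)  # assumption: equal weights
    # Z_i has upper-tail probability P_i; using -inv_cdf(p) avoids computing 1 - p,
    # which would round to 1.0 for very small p-values.
    z = [-normal.inv_cdf(p) for p in pvalues]
    z_w = sum(w * zi for w, zi in zip(weights, z)) / sqrt(sum(w * w for w in weights))
    return normal.cdf(-z_w)  # upper-tail probability of the combined deviate
```

Usage: two independent tests that each give p = 0.01 combine to a p-value well below 0.01, whereas a single p-value is returned unchanged.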

51
Allegro is available at
http://acgt.cs.tau.ac.il/allegro

Allegro: Analyzing expression and sequence in
concert to discover regulatory programs,
Y. Halperin, C. Linhart, I. Ulitsky, R. Shamir,
Nucleic Acids Research, 2009
(equal contribution)
52
Summary

Developed the Amadeus motif discovery platform:
Broad range of applications:
Target gene set
Spatial features (sequence only)
Expression analysis: Allegro
Sensitive and efficient
Easy to use, feature-rich, informative
New over-representation score to handle biases
in length/GC-content of sequences
Novel expression model: CWM
Constructed a large, real-life, heterogeneous
benchmark for testing motif finding tools

53
Acknowledgements
Tel-Aviv University: Chaim Linhart, Yonit
Halperin, Igor Ulitsky, Adi Maron-Katz, Ron
Shamir
The Hebrew University of Jerusalem: Gidi
Weber
