Discovering cis-regulatory motifs using genome-wide sequence and expression data

Chaim Linhart, Yonit Halperin, Igor Ulitsky, Ron Shamir

1
Discovering cis-regulatory motifs using
genome-wide sequence and expression data
  • Chaim Linhart, Yonit Halperin,
  • Igor Ulitsky, Ron Shamir

2
Gene expression regulation
  • Transcription is regulated mainly by
    transcription factors (TFs) - proteins that bind
    to DNA subsequences, called binding sites (BSs)
  • TFBSs are located mainly in the gene's promoter:
    the DNA sequence upstream of the gene's
    transcription start site (TSS)
  • TFs can promote or repress transcription
  • Other regulators: microRNAs (miRNAs)

3
TFBS models
  • The BSs of a particular TF share a common
    pattern, or motif, which is often modeled using
  • Degenerate string: GGWATB (W = A or T; B = C, G
    or T)

    Example sequences containing a match:
    ATCGGAATTCTGCAG   GGCAATTCGGGAATG   AGGTATTCTCAGATTA

  • PWM: Position weight matrix

         1     2     3     4     5     6
    A   0.1   0.8   0     0.7   0.2   0
    C   0     0.1   0.5   0.1   0.4   0.6
    G   0     0     0.5   0.1   0.4   0.1
    T   0.9   0.1   0     0.1   0     0.3

  • Cutoff: 0.009
  • AGCTACACCCATTTAT  0.06
  • AGTAGAGCCTTCGTG   0.06
  • CGATTCTACAATATGA  0.01
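
The PWM scores on this slide can be reproduced with a short sketch. It assumes that the matrix columns are read in position order 1 to 6, that a window's score is the product of its per-position probabilities, and that a sequence's score is its best window; the names window_score, best_hit and has_hit are illustrative, not taken from the presentation.

```python
# Position weight matrix from the slide: one row per base, one column per
# motif position (positions 1..6). Assumption: score = product of entries.
PWM = {
    "A": [0.1, 0.8, 0.0, 0.7, 0.2, 0.0],
    "C": [0.0, 0.1, 0.5, 0.1, 0.4, 0.6],
    "G": [0.0, 0.0, 0.5, 0.1, 0.4, 0.1],
    "T": [0.9, 0.1, 0.0, 0.1, 0.0, 0.3],
}
WIDTH = 6
CUTOFF = 0.009  # cutoff shown on the slide


def window_score(window):
    """Product of the PWM probabilities of each base at its position."""
    score = 1.0
    for pos, base in enumerate(window):
        score *= PWM[base][pos]
    return score


def best_hit(sequence):
    """Return (score, offset, window) for the best-scoring window."""
    return max(
        (window_score(sequence[i:i + WIDTH]), i, sequence[i:i + WIDTH])
        for i in range(len(sequence) - WIDTH + 1)
    )


def has_hit(sequence):
    """A sequence contains a hit if its best window reaches the cutoff."""
    return best_hit(sequence)[0] >= CUTOFF


for seq in ("AGCTACACCCATTTAT", "AGTAGAGCCTTCGTG", "CGATTCTACAATATGA"):
    score, offset, window = best_hit(seq)
    print(seq, window, round(score, 3), has_hit(seq))
```

Under these assumptions the first two sequences score about 0.06 (windows TACACC and TAGAGC) and the third about 0.015, which is consistent with the rounded values on the slide; all three pass the 0.009 cutoff.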
4
Motif discovery: The typical two-step pipeline
[Figure labels: co-regulated gene set; promoter/3'
UTR sequences]
5
Motif discovery: Goals and challenges
  • Goal: Reverse-engineer the transcriptional
    regulatory network
  • Challenges:
      • BSs are short and degenerate (non-specific)
      • Promoters are long and complex (hard to model)
      • Search space is huge (motif and sequence)
      • Data is noisy
      • What to look for? (enriched?, localized?,
        conserved?)
  • The problem is still considered very difficult
    despite extensive research [Tompa 05]

6
Amadeus: A Motif Algorithm for Detecting
Enrichment in mUltiple Species
  • Supports diverse motif discovery tasks
  • Finding over-represented motifs in one or more
    given sets of genes.
  • Identifying motifs with global spatial features
    given only the genomic sequences.
  • Simultaneous inference of motifs and their
    associated expression profiles given genome-wide
    expression datasets.
  • How?
  • A general pipeline architecture for enumerating
    motifs.
  • Different statistical scoring schemes of motifs
    for different motif discovery tasks.

7
Motif search algorithm
  • Pipeline of refinement phases
  • Each phase receives best candidates of previous
    phase, and refines them
  • First phases are simple and fast (e.g., try all
    k-mers); last phases are more complex (e.g.,
    optimize PWM)
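
The idea of a pipeline of refinement phases can be illustrated with a toy sketch. This is not the Amadeus implementation: the two phases, the function names, and the stand-in score (difference in hit fraction between target and background) are all assumptions chosen to keep the example small.

```python
from collections import Counter


def kmers_in(seq, k):
    """Set of distinct k-mers occurring in one sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def first_phase(target, k, keep):
    """Cheap phase: try all k-mers, rank by number of target sequences hit."""
    counts = Counter()
    for seq in target:
        counts.update(kmers_in(seq, k))
    # Sort by (-count, k-mer) so that ties are broken deterministically.
    return sorted(counts, key=lambda m: (-counts[m], m))[:keep]


def refine_phase(candidates, target, background):
    """Costlier phase: re-rank the surviving candidates with a stand-in
    score, the hit fraction in the target minus that in the background."""
    def frac(motif, seqs):
        return sum(motif in s for s in seqs) / len(seqs)

    scored = [(frac(m, target) - frac(m, background), m) for m in candidates]
    return [m for _, m in sorted(scored, key=lambda x: (-x[0], x[1]))]


def run_pipeline(target, background, k=4, keep=3):
    """Each phase receives the best candidates of the previous phase."""
    candidates = first_phase(target, k, keep * 5)
    return refine_phase(candidates, target, background)[:keep]


target = ["ACGGATTC", "TTGGATAA", "CGGATGCA"]
background = ["ACGTACGT", "TTTTAAAA", "CCGGACCA"]
print(run_pipeline(target, background))
```

In a real pipeline the later phases would use richer motif models (degenerate strings, PWMs) and a statistical enrichment score rather than a raw difference of fractions.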

8
PWM optimization phase
10
Task I: Over-represented motifs in a given
target set
  • Input: Target set (T) = co-regulated genes;
    Background (BG) set (B) = entire genome
  • No sequence model is assumed!
  • Motif scoring: Hypergeometric (HG) enrichment
    score
  • b, t = number of BG/target genes containing a hit

Note: The BG set should be of the same nature as
the target set, and much larger, e.g., all genes on
the microarray
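
The HG formula itself did not survive in the transcript. A minimal sketch, assuming the score is the standard hypergeometric upper tail (the probability of seeing at least t hit-containing genes when T genes are drawn without replacement from B genes, b of which contain a hit); the function name is illustrative.

```python
from math import comb


def hg_enrichment_pvalue(B, T, b, t):
    """P(X >= t) for X ~ Hypergeometric(population B, successes b, draws T).

    B: number of BG genes, T: number of target genes,
    b: BG genes containing a hit, t: target genes containing a hit.
    """
    upper = min(b, T)
    tail = sum(comb(b, i) * comb(B - b, T - i) for i in range(t, upper + 1))
    return tail / comb(B, T)


# Toy numbers: 10 BG genes, 4 with a hit; 3 target genes, 2 with a hit.
print(hg_enrichment_pvalue(B=10, T=3, b=4, t=2))
```

Smaller p-values indicate stronger over-representation of the motif in the target set relative to the background.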
12
Drawback of the HG score
  • The length/GC-content distribution in the target
    set might differ significantly from the
    distribution in the BG set
  • Very common in practice, due to correlation
    between the expression/function of genes and the
    length/GC-content of their promoters and 3' UTRs
  • H0 assumes uniform sampling
  • The HG score might fail to discover the correct
    motif, or detect many spurious motifs
  • → Use the binned enrichment score
      • Slightly less sensitive than the HG score
      • but takes into account length/GC-content biases

14
Binned enrichment score
  • Key idea: Binning sequences by length and
    GC-content
  • Bi, Ti = BG/target genes in the i-th bin
  • bi = motif hits in the i-th bin; t = b ∩ T
  • Bin's sampling probability
  • Assume uniform sampling per bin
  • pm = probability that a target-set gene contains
    a hit
  • Assume that the T target genes are sampled with
    replacement from B
15
Test case: Human G2M cell-cycle genes
  • Input: 350 genes expressed in the human G2M
    cell-cycle phases [Whitfield et al. 02]

[Figure labels: CHR; NF-Y (CCAAT-box); pairs
analysis]
  • These motifs form a module associated with G2M
    [Elkon et al. 03, Tabach et al. 05, Linhart et
    al. 05]

16
Results: Human G2M cell-cycle genes
  • 350 genes expressed in the human G2M cell-cycle
    phases [Whitfield et al. 02]

[Figure labels: CHR; CCAAT-box]
  • Both motifs are associated with G2M [Elkon et
    al. 03, Tabach et al. 05, Linhart et al. 05]

17
Benchmark I: Yeast TF target sets
  • Source: ChIP-chip [Harbison et al., 04]
  • Data: 173 target-sets of 83 TFs with known BS
    motifs
  • Average set size: 58 genes (35 Kbp)
  • Success rates (for top 2 motifs of lengths 8 and
    10)

18
Benchmark: Real-life metazoan datasets
  • We constructed the first motif discovery
    benchmark that is based on a large compendium of
    experimental studies
  • Source: Various (expression, ChIP-chip, Gene
    Ontology, ...)
  • Data: 42 target-sets of 26 TFs and 8 miRNAs from
    29 publications
  • Species: human, mouse, fly, worm
  • Average set size: 400 genes (383 Kbp)

[Figure: binned score improvement]
19
Similarity between two motifs
  • Euclidean
  • Pearson correlation coefficient
  • Kullback-Leibler divergence (relative entropy)
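
The three measures listed above can be sketched for a pair of PWM columns, each a probability distribution over A, C, G, T. The pseudocount in the KL divergence and the column-averaging helper are assumptions added to make the example usable; the slide does not say how columns are aligned or combined.

```python
from math import log, sqrt


def euclidean(p, q):
    """Euclidean distance between two PWM columns."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def pearson(p, q):
    """Pearson correlation coefficient; 0.0 if either column is constant."""
    mp, mq = sum(p) / len(p), sum(q) / len(q)
    cov = sum((a - mp) * (b - mq) for a, b in zip(p, q))
    var_p = sum((a - mp) ** 2 for a in p)
    var_q = sum((b - mq) ** 2 for b in q)
    if var_p == 0 or var_q == 0:
        return 0.0
    return cov / sqrt(var_p * var_q)


def kl_divergence(p, q, pseudo=1e-6):
    """Relative entropy D(p || q); a small pseudocount avoids log(0)."""
    norm = 1 + pseudo * len(p)
    ps = [(a + pseudo) / norm for a in p]
    qs = [(b + pseudo) / norm for b in q]
    return sum(a * log(a / b) for a, b in zip(ps, qs))


def mean_column_score(motif1, motif2, metric):
    """Average a column metric over two already-aligned, equal-length motifs."""
    return sum(metric(c1, c2) for c1, c2 in zip(motif1, motif2)) / len(motif1)


col_a = [0.7, 0.1, 0.1, 0.1]  # order: A, C, G, T
col_b = [0.1, 0.1, 0.1, 0.7]
print(euclidean(col_a, col_b), pearson(col_a, col_b), kl_divergence(col_a, col_b))
```

Note that KL divergence is asymmetric, so comparisons often use a symmetrized form; Euclidean distance and KL are dissimilarities (lower is more similar), while Pearson correlation is a similarity.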

20
Metazoan benchmark: Other assessment methods for
success rate
21
Metazoan benchmark: Detailed results
22
Binned score - examples
  • Mef2 fly target-set [Blais et al. 05]
      • Promoters longer than average (972bp vs. 840bp)
      • Promoters have higher GC-content (53% vs. 49%)
      • None of the programs discovered the correct
        motif
      • Binned score → Mef2 is the top-scoring motif
  • hsa-miR-16 target-set [Linsley et al. 07]
      • 3' UTRs longer than average (1700bp vs. 960bp)
      • HG score: hsa-miR-16 signature is the
        top-scoring motif, but 10 more motifs with
        p < 1E-14
      • Binned score: the correct motif, p = 1.7E-33; no
        spurious motifs
  • 198 mouse odorant receptor promoters [Michaloski
    et al. 06]
      • Highly AT-rich (35 vs. 25 in the BG)
      • HG score: Olf-1 was the third-best motif, after
        AT-rich motifs
      • Binned score: Olf-1 is the top-scoring motif

24
Amadeus: Global spatial analysis
[Figure labels: co-regulated gene set, from gene
expression microarrays, location analysis
(ChIP-chip, ...) or a functional group (e.g., GO
term); promoter sequences; output: motif(s)]
25
Task II: Global analyses
  • Scores for spatial features of motif occurrences
  • Input: Sequences (no target-set / expression
    data)
  • Motif scoring:
      • Localization w.r.t. the TSS
      • Strand-bias
      • Chromosomal preference

26
Global analysis I: Localized human & mouse motifs
  • Input: All human & mouse promoters (2 x 20,000)
  • Score: localization

27
Global analysis II: Chromosomal preference in C.
elegans
  • Input: All worm promoters (18,000)
  • Score: chromosomal preference

Results: Novel motif on chromosome IV
28
Amadeus is available at
http://acgt.cs.tau.ac.il/amadeus
  • "Transcription factor and microRNA motif
    discovery: The Amadeus platform and a compendium
    of metazoan target sets",
    C. Linhart, Y. Halperin, R. Shamir, Genome
    Research 18:7, 2008
    (equal contribution)
30
Amadeus - Allegro
[Figure labels: gene expression microarrays;
expression data; clustering (Cluster I, Cluster
II, Cluster III); co-regulated gene set; promoter
sequences; output: motif(s)]
31
Task III: Simultaneous inference of motifs and
their associated expression profiles
  • Input: Genome-wide expression profiles
  • Motif scoring algorithm: Allegro (A
    Log-Likelihood based mEthod for Gene expression
    Regulatory motifs Over-representation discovery)
  • Generalization of single-condition analysis
  • Outline:
      • Learns an expression model that describes the
        expression pattern of the motif's putative
        targets
      • The motif is scored for over-representation in
        the set of genes whose expression profiles
        match the expression model

32
Allegro expression model
  • Discretization of expression values

Expression pattern → Discrete expression pattern (DEP)

    e1 = Up (U):    ≥ 1.0
    e2 = Same (S):  (-1.0, 1.0)
    e3 = Down (D):  ≤ -1.0

         c1     c2    ...   cm
    g   -2.3   -0.8         1.5
    g    D      S           U
  • Expression data should be (partially)
    pre-processed, e.g.
      • Time series → log ratio relative to time 0
      • Several tissues/mutations/... → standardization
  • Do NOT filter out non-responsive genes
  • Expression model: CWM (Condition Weight Matrix)
  • Non-parametric, log-likelihood based model,
    analogous to PWM for sequence motifs
  • Sensitive, robust against extreme values,
    performs well in practice
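
The discretization step on this slide can be sketched in a few lines. The thresholds 1.0 and -1.0 come from the slide; treating values exactly on a threshold as Up or Down is an assumption, as is the function name.

```python
def discretize(profile, up=1.0, down=-1.0):
    """Map an expression profile to a discrete expression pattern (DEP).

    U (Up) for values >= up, D (Down) for values <= down, S (Same) otherwise.
    Inclusion of the boundary values is an assumption.
    """
    return "".join(
        "U" if value >= up else "D" if value <= down else "S"
        for value in profile
    )


# The gene g from the slide, read in condition order c1, c2, ..., cm.
print(discretize([-2.3, -0.8, 1.5]))
```

For the slide's example gene this gives the pattern D, S, U across the conditions c1, c2 and cm.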

33
Allegro expression model
  • Discretization of expression patterns

Expression pattern → Discrete expression pattern (DEP),
with levels e1 = Up (U): ≥ 1.0, e2 = Same (S): (-1.0, 1.0),
e3 = Down (D): ≤ -1.0;
e.g., g = (-2.3, -0.8, ..., 1.5) → (D, S, ..., U)
  • Condition frequency matrix (CFM)

         c1     c2    ...   cm
    U   0.05   0.1         0.78
    S   0.9    0.2         0.14
    D   0.05   0.7         0.08

  • Condition weight matrix (CWM)

    (R = {rij} is the BG CFM)
    → Log-likelihood ratio (LLR) score
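
The CWM formula is missing from the transcript. By analogy with PWMs, a plausible reading (an assumption here) is that each weight is the log of the CFM frequency over the background frequency, and a gene's LLR score is the sum of the weights selected by its DEP. The CFM values below are those on the slide; the background CFM is hypothetical.

```python
from math import log

# CFM from the slide: rows are expression levels, columns are the
# conditions c1, c2, cm (only these three columns appear on the slide).
CFM = {
    "U": [0.05, 0.1, 0.78],
    "S": [0.9, 0.2, 0.14],
    "D": [0.05, 0.7, 0.08],
}
# Hypothetical background CFM R = {r_ij}; not taken from the presentation.
BG_CFM = {
    "U": [0.2, 0.2, 0.2],
    "S": [0.6, 0.6, 0.6],
    "D": [0.2, 0.2, 0.2],
}


def build_cwm(cfm, bg_cfm):
    """Assumed definition: w_ij = log(f_ij / r_ij)."""
    return {
        level: [log(f / r) for f, r in zip(cfm[level], bg_cfm[level])]
        for level in cfm
    }


def llr_score(dep, cwm):
    """Sum the weights picked out by the gene's level in each condition."""
    return sum(cwm[level][j] for j, level in enumerate(dep))


cwm = build_cwm(CFM, BG_CFM)
print(llr_score("SDU", cwm), llr_score("UUD", cwm))
```

A gene whose pattern follows the most frequent level in each condition (S, D, U here) gets a positive score, while a gene with the opposite pattern scores below zero.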
34
Features of the CWM expression model
  • Analogous to PWM for sequence motifs
  • Non-parametric: does not assume a specific type
    of distribution (e.g., Gaussian) for expression
    values
  • Robust against extreme values
  • Sensitivity:
      • Can describe expression profiles that differ
        from the BG only in a small subset of the
        conditions
      • Can describe the regulatory effect of TFs that
        act both as repressors and activators in the
        same condition
  • Performance: describes known modules (GO,
    ChIP-chip targets) better than commonly used
    metrics (Pearson/Spearman correlation, Euclidean
    distance)

35
Allegro overview
36
Learning a CWM of a motif
  • Cross-validation-like procedure to avoid
    overfitting

[Figure labels: motif; microarray genes; motif
target genes; CWM training set; CWM (rows e1-e3,
columns c1...c6); motif enrichment p-values
2.0E-5, 1.5E-9, 3.5E-7]
37
Compute expression LLR of all genes
  • Input: (A) CWM F(w)
    (B) Discretized genome-wide expression profiles

[Figure: steps (1)-(3) over a CWM with rows e1-e3
and columns c1...c6. Genes g1 = UUSD, g2 = UDSU,
g3 = UDSU; patterns p1 = UUSD, p2 = UDSU; a min.
spanning tree over patterns such as UUSD, UDSD,
UDSU and DDSS, with edge labels 1-3. Symbols in
the figure: G, P, C.]
Example (mouse TLRs dataset): G = 10000, P = 1442,
C = 38 (1.6)
38
Human cell cycle [Whitfield et al., 02]
  • Large dataset: 15,000 genes, 111 conditions,
    promoter region -1000 to +200 bp

    Phases       Motif        p-value
    G1/S + S     E2F          1.3E-19
    G2 + G2/M    CHR          6.6E-18
    G2 + G2/M    CCAAT-box    3.9E-15

Allegro recovers the major regulators of the
human cell cycle [Elkon et al. 03, Tabach et al.
05, Linhart et al. 05].
39
Yeast HOG pathway [O'Rourke et al. 04]
  • 6,000 genes, 133 conditions
  • Allegro can discover multiple motifs with diverse
    expression patterns, even if the response is in a
    small fraction of the conditions
  • Extant two-step techniques recovered only 4 of
    the above motifs:
      • K-means/CLICK + Amadeus/Weeder: RRPE, PAC,
        MBF, STRE
      • Iclust + FIRE: RRPE, PAC, Rap1, STRE

41
Yeast HOG pathway: Comparison with the two-step
pipeline
(+ motif recovered, - motif not recovered)

Biological process                       Motif/TF   K-means/CLICK      Iclust   Allegro
                                                    + Amadeus/Weeder   + FIRE
General stress response                  RRPE       +                  +        +
General stress response                  PAC        +                  +        +
General stress response                  Rap1       -                  +        +
HOG and pheromone response pathways      Sko1       -                  -        +
HOG and pheromone response pathways      Ste12      -                  -        +
HOG and pheromone response pathways      MBF        +                  -        +
HOG and pheromone response pathways      Smp1       -                  -        -
HOG and pheromone response pathways      Skn7       -                  -        -
General stress response and HOG pathway  STRE       +                  +        +
42
Immune response induced by Toll-like receptors
  • 10000 genes, 38 conditions
  • Our findings from [Elkon et al. 07] were
    recovered

    Motif    p-value
    ISRE     2.0E-22
    NF-κB    4.2E-17
    E2F      2.8E-17
43
3' UTR analysis: Human stem cells [Mueller 08]
  • 14,000 genes, 124 conditions (various types of
    proliferating cells)
  • Biases in length / GC-content of 3' UTRs, e.g.

    100 highly-expressed genes in   3' UTR length (bp)   GC (%)
    Embryoid bodies                 584                  47
    Undifferentiated ESCs           774                  44
    ESC-derived fibroblasts         1240                 39
    Fetal NSCs                      1422                 43
    (ESCs = embryonic stem cells, NSCs = neural stem cells)

  • Extant methods / Allegro with HG score report
    only false positives

44
Human stem cells: results using binned score
[Figure labels: miRNA expression; targets
expression; current knowledge]
Current knowledge:
  • Most highly expressed miRNAs in human/mouse ESCs
  • Abundant & functional in neural cell lineage
  • Expressed specifically in neural lineage; active
    role in neurogenesis
miRNA expression from [Laurent 08]
45
C. elegans germline dataset [Reinke et al. 03]
  • 12,000 genes in 20 different conditions

Co-occurrence: p = 1.3E-54
[Figure: conditions grouped as hermaphrodite
development and mutants; contrasts shown: germline
vs. somatic, hermaphrodite vs. male, oogenesis vs.
spermatogenesis, adult hermaphrodite vs. L2-L3
hermaphrodite]
46
Motif pair features (I)
  • Co-occur on the same strand (112 genes vs. 53,
    p = 2.5E-6)
  • Order-bias (104 genes vs. 8, p = 1E-22)
  • Distance-bias (p = 1.12E-34)
  • Gap not conserved; short flanking regions are
    conserved
  • Over-represented in chromosome I (p = 1.6E-8) and
    under-represented in chromosome X (p = 1E-4)
  • GO enrichment: embryonic development (sensu
    Metazoa) (p = 1.4E-11), reproduction (p = 1.1E-8),
    hermaphrodite genitalia development (p = 4.9E-5),
    etc.


47
Motif pair features (II)
Motif pair is specific to the Caenorhabditis genus
48
Amadeus/Allegro - Additional features
  • Motif pairs analysis
  • Joint analysis of multiple datasets
  • Evaluation of motifs using several scores
  • Bootstrapping: get fixed p-value
  • Sequence redundancy elimination: ignore
    sequences with long identical subsequence
  • User-friendly and informative (most tools are
    textual and supply limited information!)

49
Co-occurrence of motif pairs
  • After the post-processing phase
  • T = target set; t1, t2 = target genes that
    contain a hit of the first and second motif,
    respectively; t12 = target genes that contain
    hits for both motifs
  • [Elkon et al., 03]
  • PWMs and their cutoffs are tuned to optimize the
    score

50
Combining p-values: the weighted Z-transform
  • Input: p-values from k independent tests
  • H0: all the p-values are uniformly distributed
  • Transform Pi into standard normal deviates Zi
  • Combined p-value
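
The combination formula is not in the transcript. A minimal sketch of the weighted Z-transform in its usual form, which is assumed here: each p-value is mapped to a standard normal deviate, the deviates are combined as a weighted sum divided by the square root of the sum of squared weights, and the result is mapped back to a p-value. How the weights are chosen is not stated on the slide; equal weights are the default in this sketch.

```python
from math import sqrt
from statistics import NormalDist


def combine_pvalues(pvalues, weights=None):
    """Weighted Z-transform of independent one-sided p-values (0 < p < 1)."""
    std_normal = NormalDist()
    if weights is None:
        weights = [1.0] * len(pvalues)  # assumption: equal weights
    # Z_i = -Phi^{-1}(P_i), so that small p-values give large positive Z.
    z = [-std_normal.inv_cdf(p) for p in pvalues]
    z_combined = (sum(w * zi for w, zi in zip(weights, z))
                  / sqrt(sum(w * w for w in weights)))
    # Combined p-value: upper tail of the standard normal at z_combined.
    return std_normal.cdf(-z_combined)


print(combine_pvalues([0.05, 0.05]))
print(combine_pvalues([0.01, 0.4], weights=[2.0, 1.0]))
```

Under H0 each Zi is standard normal, so the weighted combination is also standard normal, which is what justifies reading the combined p-value from the normal tail.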

51
Allegro is available at
http://acgt.cs.tau.ac.il/allegro
  • "Allegro: Analyzing expression and sequence in
    concert to discover regulatory programs",
    Y. Halperin, C. Linhart, I. Ulitsky, R. Shamir,
    Nucleic Acids Research, 2009
    (equal contribution)
52
Summary
  • Developed Amadeus motif discovery platform
  • Broad range of applications:
      • Target gene set
      • Spatial features (sequence only)
      • Expression analysis - Allegro
  • Sensitive & efficient
  • Easy to use, feature-rich, informative
  • New over-representation score to handle biases
    in length/GC-content of sequences
  • Novel expression model - CWM
  • Constructed a large, real-life, heterogeneous
    benchmark for testing motif finding tools

53
Acknowledgements
Tel-Aviv University: Chaim Linhart, Yonit
Halperin, Igor Ulitsky, Adi Maron-Katz, Ron
Shamir
The Hebrew University of Jerusalem: Gidi Weber