Title: Deciphering Gene Regulatory Networks by in silico approaches
1Deciphering Gene Regulatory Networks by in silico
approaches
Sridhar Hannenhalli
Penn Center for Bioinformatics, Department of Genetics,
University of Pennsylvania
2Transcriptional Regulation
Transcription Start Site
Interactions and Modules
TF-DNA binding
3Overview
Core promoter prediction, TF-DNA binding, TF-TF
interactions, Transcriptional Modules, Applications
4Overview
Identification Representation Discovery
(motif-discovery) Search Ambiguity/Redundancy
Core promoter prediction, TF-DNA binding, TF-TF
interactions, Transcriptional Modules, Applications
5Binding site identification
SELEX
ATACGGT ATACCGT ATCGGCA AAAGGCT
CONSENSUS A T A S G S T
ChIP-chip
Deletion/Mutation
Specificity
WEIGHT MATRIX (rows: A, C, G, T; columns: positions 1-7)
A   1.2   0.0   0.96  -1.6  -1.6  -1.6   0.0
C  -1.6  -1.6   0.0    0.59  0.0   0.59 -1.6
G  -1.6  -1.6  -1.6    0.59  0.96  0.59 -1.6
T  -1.6   0.96 -1.6   -1.6  -1.6  -1.6   0.96
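A minimal sketch of how a weight matrix like the one above can be derived from the four SELEX sites. The scoring rule used here (natural-log odds against a uniform background of 0.25, with a pseudocount of 0.25 per base) is an assumption: it is not stated on the slide, but it happens to reproduce the slide's values to the shown precision. The function names are illustrative.

```python
import math

BASES = "ACGT"

def log_odds_matrix(sites, pseudocount=0.25, background=0.25):
    """Return {base: [score per position]} of natural-log odds scores."""
    n = len(sites)
    width = len(sites[0])
    matrix = {b: [] for b in BASES}
    for j in range(width):
        column = [s[j] for s in sites]
        for b in BASES:
            # Smoothed frequency; the pseudocount (an assumption) avoids log(0).
            freq = (column.count(b) + pseudocount) / (n + 4 * pseudocount)
            matrix[b].append(math.log(freq / background))
    return matrix

def score(site, matrix):
    """Sum the per-position scores of a candidate site."""
    return sum(matrix[b][j] for j, b in enumerate(site))

# The four aligned sites shown on the slide.
sites = ["ATACGGT", "ATACCGT", "ATCGGCA", "AAAGGCT"]
pwm = log_odds_matrix(sites)
for b in BASES:
    print(b, " ".join(f"{v:6.2f}" for v in pwm[b]))
```

Scoring a candidate site is then a sum over positions, for example `score("ATACGGT", pwm)`.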
6Binding site search
- TFs often bind to short and degenerate DNA
sequences, leading to false positives
- Evolutionary conservation (phylogenetic
footprinting/shadowing) can help reduce the false
positives
- About half of the functional binding sites are
not conserved
- A combination of evolutionary conservation and
binding site score can detect 70% of the
experimentally verified binding sites at a false
positive rate of 1/50kb per PWM (Levy and
Hannenhalli, Mammalian Genome, 2002)
TRANSFAC/JASPAR PWM
Human genome
Multi-species conservation
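The search-plus-conservation idea can be sketched as follows. This is an illustration only, not the published method: the toy 3-column matrix, the score threshold, the gap-free ortholog alignment, and the identity cutoff are all assumptions chosen to keep the example small.

```python
def scan(sequence, pwm, threshold):
    """Return (offset, window, score) for every window scoring >= threshold."""
    width = len(next(iter(pwm.values())))
    hits = []
    for i in range(len(sequence) - width + 1):
        window = sequence[i:i + width]
        s = sum(pwm[b][j] for j, b in enumerate(window))
        if s >= threshold:
            hits.append((i, window, s))
    return hits

def conserved(hits, aligned_ortholog, min_identity=0.8):
    """Keep hits whose aligned ortholog window is at least min_identity identical."""
    kept = []
    for i, window, s in hits:
        other = aligned_ortholog[i:i + len(window)]
        identity = sum(a == b for a, b in zip(window, other)) / len(window)
        if identity >= min_identity:
            kept.append((i, window, s))
    return kept

# Toy matrix that favours the word ACG (hypothetical values).
toy_pwm = {
    "A": [1.0, -1.0, -1.0],
    "C": [-1.0, 1.0, -1.0],
    "G": [-1.0, -1.0, 1.0],
    "T": [-1.0, -1.0, -1.0],
}
human = "TTACGTTACGTT"
mouse = "TTACGTTCATTT"  # assumed to be aligned to `human` without gaps

hits = scan(human, toy_pwm, threshold=3.0)
kept = conserved(hits, mouse)
print([h[0] for h in hits], [h[0] for h in kept])
```

Both ACG occurrences pass the score threshold, but only the one that is also present in the ortholog survives the conservation filter.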
7Non-Independence of binding site positions
- Bacteriophage Mnt prefers binding to C, instead
of wild-type A, at position 16 when wild-type C
at position 17 is changed to other bases. (Man
and Stormo, 2001, NAR)
- Barash, Elidan, Friedman, Kaplan, 2003, RECOMB
- Osada, Zaslavsky and Singh, 2004, Bioinformatics
8Binding site representation
ATACGGT ATACCGT CGCGGCA CGAGCCT
WEIGHT MATRIX (rows: A, C, G, T; columns: positions 1-7)
A   1.2   0.0   0.96  -1.6  -1.6  -1.6   0.0
C  -1.6  -1.6   0.0    0.59  0.0   0.59 -1.6
G  -1.6  -1.6  -1.6    0.59  0.96  0.59 -1.6
T  -1.6   0.96 -1.6   -1.6  -1.6  -1.6   0.96
Assumption of positional independence
ATACGGT ATACCGT CGCGGCA CGAGCCT
A PSPA or Variable length Markov Model of binding
sites is superior to the PWM model
- For 95 JASPAR PWMs, PSPAM is better in 48 cases
and worse in 6 cases at a significance level of 0.05.
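One simple way to see what the independence assumption throws away is to measure the mutual information between two columns of the aligned sites. This is only a diagnostic sketch, not the PSPA or variable-length Markov model named on the slide, and with four sites the estimate is purely illustrative.

```python
import math
from collections import Counter

def mutual_information(sites, i, j):
    """Mutual information (in bits) between columns i and j of aligned sites."""
    n = len(sites)
    pi = Counter(s[i] for s in sites)
    pj = Counter(s[j] for s in sites)
    pij = Counter((s[i], s[j]) for s in sites)
    mi = 0.0
    for (a, b), c in pij.items():
        joint = c / n
        mi += joint * math.log2(joint / ((pi[a] / n) * (pj[b] / n)))
    return mi

# The four sites shown on the slide: positions 1 and 2 are always AT or CG.
sites = ["ATACGGT", "ATACCGT", "CGCGGCA", "CGAGCCT"]
mi_12 = mutual_information(sites, 0, 1)
mi_15 = mutual_information(sites, 0, 4)
print(mi_12, mi_15)
```

Columns 1 and 2 are perfectly coupled (1 bit), while columns 1 and 5 carry no information about each other. A PWM scores both situations identically because it uses only the per-column frequencies.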
9Conservation patterns in cis-elements reveal
inter-position dependence
Human .ACCGTGT.ACCTTCT..
Chimp .AGCGTGT.ACCTTGT..
Mouse .TCGGTGA.TGCTTCT..
Rat   .CCCGTGA.AGCTTGT..
Dog   .TCGGTCT.ACCCTCT..
G G G G C
C C G C G
10[Figure: alignment columns X and Y on the species tree, for binding sites 1, 2, 3, ..., N]
Pr(X): probability of X using the standard tree
Markov process
Pr(X|Y): probability of X dependent on the
corresponding Y branches
Compensatory mutation score S_{X,Y}: fraction of
sites for which Pr(X|Y) > Pr(X)
Scope: |X - Y|
11S_{X,X+1} for 79 vertebrate PWMs from JASPAR
Control-1: Randomly select (i, j) pairs.
Control-2: Randomly select i and then select j = i + s.
Control-3: Construct PWM M_r with the same width as M
by randomly sampling columns from the 79
vertebrate PWMs in JASPAR.
Control-4: Construct PWM M_r from M by randomly
shuffling the compositions at each column (position).
12S_{X,X+s} decreases with increasing scope s.
However, it remains significantly greater than the
respective Control-4 value up to scope 6
13Functional relevance of positions with
compensatory mutation
14Evans, Donahue, Hannenhalli, RECOMB-Comparative
Genomics 2006
15Binding site Ambiguity/Redundancy
- Several transcription factors have distinct PWMs
- Several distinct transcription factors have very
similar PWMs
ACCGTGTTT ACCGACTTT ACCGTGAAT ACCGTGTTT TCCGTGTTT
TCAGTGTTT TCTGTGTTT TCGGTGTTT
PWM1
PWM
PWM2
16Enhancing Positional Weight Matrices using
Mixture models
- A mixture model allowing an arbitrary number of
base PWMs
Given the mixture, the probability of observing
sequence X_i = (X_i1, ..., X_in) is a weighted sum
of the base-PWM probabilities.
Use the EM algorithm to estimate subclasses.
We use k = 2 base-class PWMs (due to lack of data and
lack of knowledge of the appropriate number of classes)
Hannenhalli and Wang, Bioinformatics, 2005
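A minimal sketch of EM for a mixture of PWMs, assuming the standard mixture form Pr(X_i) = sum over classes c of w_c * product over positions j of M_c[X_ij, j], where w_c are mixing weights and M_c are column-wise base probabilities. The notation, the random soft initialization, the pseudocount, and the fixed iteration count are all assumptions of this sketch and are not taken from the cited paper. The example sites are the ones shown on the Ambiguity/Redundancy slide.

```python
import math
import random

BASES = "ACGT"

def em_pwm_mixture(sites, k=2, iters=50, pseudo=0.1, seed=0):
    """Fit k probability PWMs to aligned sites; returns weights, PWMs,
    per-site responsibilities, and the log-likelihood per iteration."""
    rng = random.Random(seed)
    n, width = len(sites), len(sites[0])
    # Random soft assignment of each site to the k classes.
    resp = []
    for _ in sites:
        w = [rng.random() + 1e-3 for _ in range(k)]
        total = sum(w)
        resp.append([x / total for x in w])
    history = []
    for _ in range(iters):
        # M-step: mixing weights and smoothed column frequencies per class.
        weights = [sum(r[c] for r in resp) / n for c in range(k)]
        pwms = []
        for c in range(k):
            cols = []
            for j in range(width):
                counts = {b: pseudo for b in BASES}
                for s, r in zip(sites, resp):
                    counts[s[j]] += r[c]
                total = sum(counts.values())
                cols.append({b: counts[b] / total for b in BASES})
            pwms.append(cols)
        # E-step: posterior probability of each class for each site.
        loglik = 0.0
        new_resp = []
        for s in sites:
            joint = [
                weights[c] * math.prod(pwms[c][j][s[j]] for j in range(width))
                for c in range(k)
            ]
            total = sum(joint)
            loglik += math.log(total)
            new_resp.append([x / total for x in joint])
        resp = new_resp
        history.append(loglik)
    return weights, pwms, resp, history

sites = ["ACCGTGTTT", "ACCGACTTT", "ACCGTGAAT", "ACCGTGTTT",
         "TCCGTGTTT", "TCAGTGTTT", "TCTGTGTTT", "TCGGTGTTT"]
weights, pwms, resp, history = em_pwm_mixture(sites)
for s, r in zip(sites, resp):
    print(s, "class", max(range(len(r)), key=r.__getitem__))
```

EM only finds a local optimum, so in practice one would rerun from several seeds and keep the fit with the best likelihood.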
17Sequence conservation of binding sites using
Mixture model
48
39
23
- Based on 64 Vertebrate TF entries in JASPAR
database
18Subclass Dissimilarity vs Prediction Improvement
Less dissimilar
More dissimilar
19[Figure data; x-axis: relative entropy between two base PWMs]
39 36 30 23 15 13
64 57 44 32 20 16
20Expression Coherence of target genes using
mixture model
EC of a set of genes is the fraction of
gene pairs whose expression profiles across several
tissues/conditions are very similar
PWM1
PWM2
Is the intra-class EC higher than the inter-class EC?
In 44 of the 55 cases (80%), the average
expression coherence within subclass-PWM targets
was higher than the expression coherence across
subclass targets. In all but one case (98%), at
least one of the two subclass PWMs had a
coherence score higher than the cross-coherence
score.
Hannenhalli and Wang, Bioinformatics, 2005
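The EC definition above can be sketched directly. The slide only says "very similar"; using a Pearson correlation of at least 0.8 as the similarity criterion is an assumption of this sketch, as are the toy gene names and profiles.

```python
import math
from itertools import combinations

def pearson(x, y):
    """Pearson correlation of two equal-length expression profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0

def expression_coherence(profiles, min_corr=0.8):
    """Fraction of gene pairs whose profiles correlate at >= min_corr."""
    pairs = list(combinations(profiles, 2))
    if not pairs:
        return 0.0
    similar = sum(pearson(profiles[a], profiles[b]) >= min_corr for a, b in pairs)
    return similar / len(pairs)

# Hypothetical expression profiles across four tissues.
profiles = {
    "g1": [1.0, 2.0, 3.0, 4.0],
    "g2": [2.0, 4.0, 6.0, 8.0],
    "g3": [4.0, 3.0, 2.0, 1.0],
}
ec = expression_coherence(profiles)
print(ec)
```

Only the g1/g2 pair is coherent here, so one of the three pairs counts and the EC is one third.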
21LEU3 Dataset (Liu et al., 2002)
- Free energy of binding is available for 46 observed
binding sites of LEU3 (Liu et al., 2002)
- The two clusters from the EM algorithm have
significantly different binding energies.
22Yeast Reb1
- Using mixture modeling on the 15 known REB1
sites from TRANSFAC, we find the last position to
be such that
- 1st subclass-PWM has Pr(G) = 0.85, Pr(T) = 0.15
- 2nd subclass-PWM has Pr(G) = 0 and Pr(T) = 0.5
Tanay et al., 2004, Genome Res; Wang and Warner, 1998,
Mol Cell Biol
23Bi-clustering based modeling
Vertical Partitioning
Vertical partitioning
ACCGTCTCAA ACCGTGTGAA AGCGTGCCCT ACGGTGCCCA TGGCCGCCGA
TCGCACTCTT TGCCCCTGCT TGGCCCTCTT
ATACGGT ATACCGT CGCGGCA CGAGCCT
III
I
IV
Horizontal Partitioning
ACCGTGTTT ACCGACTTT ACCGTGAAT ACCGTGTTT TCCGTGTTT
TCAGTGTTT TCTGTGTTT TCGGTGTTT
II
V
Horizontal partitioning
24Context-dependent binding specificity
X
Y
X
Z
X
25Binding site Ambiguity/Redundancy
- Several transcription factors have distinct PWMs
- Several distinct transcription factors have very
similar PWMs
26TESS
27Transcription factor classification hierarchy (each level illustrated with a weight matrix):
32 Classes
80 Families
117 Subfamilies
1034 Factors
28Once upon a time a transcription factor gene was
duplicated
DNA Binding Domain
Interaction Domain
Promoter
Conserved DBD
Divergent nDBD
Redundant paralogs
Divergent Expression
Divergent Promoter
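29Hypothesis: Homologous TF pairs with similar DBDs
have diverged in expression.
Controls: homologous non-TF pairs; homologous TF pairs
with dissimilar DBDs
D(X,Y): expression divergence computed from E_X and E_Y
(the expression of TF X and TF Y)
Tissue samples: T_1, ..., T_i, ..., T_158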
30Of 416 homologous TF pairs (BLAST E-value < E-10),
125 have similar binding (p-value < 0.02)
TFs with similar binding are more similar
overall. Thus a greater expression divergence is
surprising.
In thyroid tissue the hypothesis holds
(Mann-Whitney p-value = 0.00156)
31In human: 416 homologous TF pairs, 125 with similar
binding, in a total of 158 samples (Novartis)
p-value   % of human tissues (count), MW test
0.1       91.7 (145)
0.05      87.3 (138)
0.01      74.7 (118)
In yeast: 219 homologous TF pairs, 35 with similar
binding, in a total of 57 samples (Spellman)
p-value   % of yeast samples (count)
0.1       49.1 (28)
0.05      33.3 (19)
0.01      1.8 (1)
32Overview
Core promoter prediction, TF-DNA binding, TF-TF
interactions, Transcriptional Modules, Applications
33Transcription Factor cooperation/interaction
Expression Coherence
Pilpel et al. (2001). Nat Genet, Banerjee and
Zhang (2003) NAR
Positional Coherence
Hannenhalli and Levy (2002). NAR.
Interaction-dependent binding
34Interaction-dependent binding
ChIP-chip
Set of gene promoters bound by F
DNA binding motif M of F
Transcription Factor F
Can M discriminate between P and B?
Bound promoters (P)
Unbound promoters (B)
The answer is NO for a large fraction of
transcription factors
Perhaps binding of F depends (synergistic or
antagonistic) on other motifs
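One common way to quantify whether motif M separates bound promoters (P) from unbound promoters (B) is the area under the ROC curve of the per-promoter motif scores. This is offered as an illustration, not as the test used in the talk, and the score values below are made up.

```python
def auc(bound_scores, unbound_scores):
    """Probability that a random bound promoter outscores a random unbound one
    (ties count as one half). 0.5 means no discrimination, 1.0 is perfect."""
    wins = 0.0
    for p in bound_scores:
        for b in unbound_scores:
            if p > b:
                wins += 1.0
            elif p == b:
                wins += 0.5
    return wins / (len(bound_scores) * len(unbound_scores))

# Hypothetical best-PWM-hit scores per promoter.
bound = [5.1, 4.2, 3.9, 1.0]
unbound = [4.0, 2.2, 1.5, 0.7]
value = auc(bound, unbound)
print(value)
```

An AUC close to 0.5 for a factor would correspond to the "NO" answer on the slide: the motif alone does not explain which promoters are bound.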
35PWM based occupancy probability
Binding probability (ChIP)
Interaction coefficient
- The ChIP-chip data for a majority of TFs are
better explained using interaction-dependent
binding.
- Almost all of the yeast cell cycle interactions
were detected at a 10% prediction rate.
- When applied to genome-wide CREB binding in rat,
15 of the 18 detected interactions have varying
degrees of support.
- Wang, Jensen, Hannenhalli, RECOMB-Regulation 2005
36Overview
Core promoter prediction, TF-DNA binding, TF-TF
interactions, Transcriptional Modules, Applications
37Co-regulated genes have common binding sites in
their promoters
Apoptosis Pathway
BCL2-antagonist(BAD)
68 TFs
37 TFs in common
B-cell CLL/lymphoma 2(BCL2)
89 TFs
AP-2, CREB, E2F, cMyc, NF-Kappa-b, c-ETS, Egr-1
etc.
Venn diagram: 68 TFs (BAD), 89 TFs (BCL2), 37 in common, 374 TFs in total
Hypergeometric p-val E-11
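The hypergeometric p-value for the overlap can be sketched with exact integer arithmetic. The reading of the figure's numbers is an assumption here: 374 is taken to be the total number of TFs considered, with 68 and 89 predicted for the two genes and 37 shared.

```python
from math import comb

def hypergeom_tail(total, set_a, set_b, overlap):
    """P(X >= overlap) when set_b items are drawn without replacement from
    `total` items, of which set_a are marked."""
    denom = comb(total, set_b)
    upper = min(set_a, set_b)
    num = sum(
        comb(set_a, k) * comb(total - set_a, set_b - k)
        for k in range(overlap, upper + 1)
    )
    return num / denom

# Assumed reading of the slide: 374 TFs in total, 68 for BAD, 89 for BCL2, 37 shared.
p = hypergeom_tail(374, 68, 89, 37)
print(f"{p:.2e}")
```

The expected overlap under independence is about 68 * 89 / 374, roughly 16, so an observed overlap of 37 lies far in the tail.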
38Interacting proteins have greater similarity in
their promoter regions
Hannenhalli and Levy (2003). Mamm Genome
39Transcriptional module discovery
TFs
Singular Value Decomposition
1 1 1 0 0 1 1 1 0 0 1 0 1 1 1 0 0 0 0 0 0 1 0 1
1 0 0 0 1 1 1 0 1 1 0 1 0 1 0 1 0 0 1 0 1 0 0 0
Genes
Clique enumeration in bipartite graphs
Cluster of genes and discriminating TF
Distance Matrix
K-means Clustering
40Tissue-Specific Transcriptional Module
Tissue specificity by expression level (Schug et
al., 2005)
Binding prediction
Transcriptional-Module specific to a tissue type
Everett, Wang, Hannenhalli, ISMB 2006
41Overview
Core promoter prediction, TF-DNA binding, TF-TF
interactions, Transcriptional Modules, Applications
42Transcriptional Regulation in Cardiac Myocytes
Frey N, Olson EN. Annu Rev Physiol.
2003;65:45-79.
43Expression profiling in advanced heart failure
- Large tissue bank from Temple and Penn
- Failing explanted hearts (n = 173)
- Non-failing hearts from unused donors (n = 16)
- Each hybridized with an HU133A (n = 189)
- Conservative analysis: RMA (Bioconductor), SAM
3000 dysregulated genes in advanced human HF
with FDR < 5%.
Is there any evidence that specific transcription
factors are directing these changes?
44Transcriptional Genomics
45Differentially expressed Genes (G)
Score(x) = freq(x) in G / freq(x) in B
Statistical significance is computed using 1000
random samplings of genes from the background set
Background Set (B)
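The enrichment score and its sampling-based significance can be sketched as below. The data structures are hypothetical: `targets` maps each gene to the set of TFs with a predicted site in its promoter, and the empirical p-value is taken as the fraction of random background samples (same size as G) whose frequency is at least as high as that of G.

```python
import random

def enrichment(tf, foreground, background, targets, n_samples=1000, seed=0):
    """Return (fold enrichment, empirical p-value) for one TF."""
    def freq(genes):
        return sum(tf in targets.get(g, ()) for g in genes) / len(genes)

    fg = freq(foreground)
    bg = freq(background)
    score = fg / bg if bg > 0 else float("inf")
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_samples):
        # Random gene set of the same size as G, drawn from B.
        sample = rng.sample(background, len(foreground))
        if freq(sample) >= fg:
            hits += 1
    return score, hits / n_samples

# Toy data: 100 background genes, the TF has sites in the first 20,
# and the differentially expressed set G is the first 10.
background = [f"g{i}" for i in range(100)]
targets = {f"g{i}": {"MEF-2"} for i in range(20)}
foreground = background[:10]

score, p_value = enrichment("MEF-2", foreground, background, targets)
print(score, p_value)
```

With 1000 samples the smallest nonzero p-value is 0.001, which matches the three-decimal p-values reported in the enrichment table.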
46Transcription Factors enriched in differentially
up-regulated genes
TRANSFAC ID Fold enrichment p-value Factor
M00471 1.70 0.000 TBP
M00318 1.63 0.001 Lentiviral_Poly_A
M00062 1.52 0.000 IRF-1
M00138 1.50 0.004 Octamer
M00291 1.48 0.000 Freac-3
M00403 1.48 0.001 aMEF-2
M00103 1.48 0.000 Clox
M00216 1.47 0.000 TATA
M01000 1.46 0.001 AIRE
M00109 1.46 0.000 C/EBPbeta
M00405 1.45 0.001 MEF-2
M00451 1.45 0.004 NKX3A
M00972 1.44 0.001 IRF
M00249 1.43 0.002 CHOPC/EBPalpha
M00102 1.43 0.002 CDP
M00302 1.43 0.000 NF-AT
M00729 1.42 0.003 Cdx-2
M00622 1.41 0.001 C/EBPgamma
M00078 1.41 0.005 Evi-1
M00407 1.40 0.003 RSRFC4
M00616 1.39 0.004 AFP1
M00310 1.35 0.000 APOLYA
M00770 1.35 0.002 C/EBP
M00485 1.34 0.002 Nkx2-2
M00432 1.34 0.004 TTF1
M00346 1.34 0.002 GATA-1
M00478 1.34 0.003 Cdc5
M00724 1.33 0.005 HNF-3alpha
M00699 1.32 0.002 ICSBP
M00394 1.31 0.002 Msx-1
M00088 1.28 0.005 Ik-3
M00238 1.27 0.005 Barbie_Box
47What about early events?
- The differentially upregulated genes have a
greater number (32) of enriched TFs compared to
downregulated genes (6).
- The ischemic and idiopathic cases are consistent.
- Validation of GATA, MEF2, Nkx, NFAT transcription
factors in human heart failure
- Potential role for FOX factors and IRF
Mice with infarcts and sham-operated controls were
sacrificed at varying times after surgery (1, 4,
8, 24 hrs, 8 wks). Analysis of differentially
co-regulated gene clusters reveals a consistent set
of transcription factors.
48FOX factor Summary
- FOX targets change substantially in advanced
human HF and in early HF in mice.
- FOX factors are present in human heart at
physiologic levels: FOXP1, P4, C1, C2, J2
- FOXP1 is localized to nuclei of human cardiac
myocytes.
- Do FOX factors mediate cardiac hypertrophy?
Hannenhalli et al., Circulation, 2006
49Gene Regulation in Learning and Memory
Naïve (N) Conditioned Stimulus only (CS) Fear
Conditioned (FC)
Hippocampus Amygdala
Keeley et al., Learning & Memory, 2006
50Immediate Early Gene Expression is Regulated by
Many Transcription Factors
http://web1.tch.harvard.edu/research/greenberg/oldsite/Pathways.html
51The 50 Most Significantly Regulated Genes Were Used
for Further Analysis
Hippocampus
Amygdala
52Hippocampus- and Amygdala-specific promoter
modeling
- Hippocampus
- CREB, E2F1, Pax4, Sp1, GATA1, AP2, ZF5, Nrf-1
- Amygdala
- CREB, E2F1, Pax4, Sp1, GATA1, AP2, ZF5, Ets1,
Elk1, Myc/Max, USF
53Promoter models were able to predict regulation
of less significant genes with some system
specificity
54Overview
Core promoter prediction, TF-DNA binding, TF-TF
interactions, Transcriptional Modules, Applications
55Core Promoter: Minimal DNA sequence required for
the assembly of the pre-initiation complex (100
bp flanking the TSS)
Goal: Determine sequence properties responsible
for precise Pol-II localization
56[Figure: timeline of promoter prediction tools; axis ticks 1990, 1995, 2000, 2006]
Tools shown: CpG island line, PromoterInspector,
PromoterScan, Hannenhalli, PromFind, FirstEF,
Promoter1.0, NNPP, TSSG, PSPA, CorePromoter, TATA,
Claverie, Autogene, Dragon
57CpG Islands
Unmethylated GC-rich regions (experimental)
GC-rich regions (≥ 200 bp) on the genome with
high CG di-nucleotide frequency (computational)
Gardiner-Garden and Frommer, 1987
About half of all genes have a CpG island
overlapping the first exon.
Antequera and Bird, 1993
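A minimal sketch of the computational definition, using the commonly quoted Gardiner-Garden and Frommer thresholds (length of at least 200 bp, G+C content above 50%, observed/expected CpG ratio above 0.6). Only the 200 bp figure appears on the slide; the other two thresholds and the function names are assumptions of this sketch, and a real scanner would slide a window along the genome rather than test one sequence.

```python
def cpg_stats(seq):
    """Return (G+C fraction, observed/expected CpG ratio) for a sequence."""
    seq = seq.upper()
    n = len(seq)
    c, g = seq.count("C"), seq.count("G")
    cpg = seq.count("CG")
    gc_content = (c + g) / n if n else 0.0
    # Expected CpG count if C and G were placed independently.
    expected = c * g / n if n else 0.0
    obs_exp = cpg / expected if expected else 0.0
    return gc_content, obs_exp

def looks_like_cpg_island(seq, min_len=200, min_gc=0.5, min_obs_exp=0.6):
    """Apply the three thresholds to a single candidate region."""
    gc_content, obs_exp = cpg_stats(seq)
    return len(seq) >= min_len and gc_content > min_gc and obs_exp > min_obs_exp

island = "CG" * 100              # GC-rich and CpG-rich
depleted = "G" * 100 + "C" * 100 # GC-rich but without any CpG
print(looks_like_cpg_island(island), looks_like_cpg_island(depleted))
```

The second sequence shows why G+C content alone is not enough: it is entirely G and C but contains no CG dinucleotide.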
58Categories of DNA sequence signals used in
promoter prediction
Generalization of Markov Models (Wang and
Hannenhalli, BMC Bioinformatics, 2005)
- Long-range sequence characteristics (10 kb)
- Short genomic sub-regional signals, e.g. CpG
island (0.5-2 kb)
- Specific cis elements (e.g. TATA)
TSS
59Position Specific Propensity Analysis (PSPA)
PSPA based Model
Use -100bp around TSS as training
Wang and Hannenhalli, BBRC, 2006
60Overlap between prediction tools
61Carninci et al. (2006). "Genome-wide analysis of
mammalian promoter architecture and evolution."
Nat Genet 38(6): 626-635.
62- CpG-poor promoters have greater conservation and
fewer aTSS, and are mostly involved in extra-cellular
and stress-response activities.
- By including position-specific motifs and their
co-occurrence, PSPA improves transcription
start site localization.
- Many position-specific elements are associated
with target gene function.
- There is little overlap among various
state-of-the-art prediction tools.
- Alternative promoters have tissue-specific usage.
63Acknowledgement
Junwen Wang: PCBI, UPenn
Larry Singh: PCBI, UPenn
Li-San Wang: Biology, UPenn
Shane Jensen: Statistics, Wharton, UPenn
Perry Evans, Greg Donahue: Genomics and Comp Bio, UPenn
Tom Cappola: Cardiology, UPenn
Mike Keeley: Biology, UPenn
Ted Abel: Biology, UPenn