Title: Review: RECOMB Satellite Workshop on Regulatory Genomics
1Review RECOMB Satellite Workshop on Regulatory
Genomics
2Workshop Themes/Trends
- More comprehensive evaluations of motif-detection
algorithms
- Making more effective use of comparative
mapping/evolution data
- Models that explain rather than just describe
- Moving from binding motifs to entire regulatory
modules
- Methods are simple, not sophisticated
3Outline
- Jim Kadonaga, University of California, San
Diego: The MTE, a New Core Promoter Element for
Transcription by RNA Polymerase II
- Rotem Sorek, Compugen and Tel Aviv University: The
"promoters" of splicing: Intronic sequences that
regulate alternative splicing
- Yitzhak Pilpel, Weizmann Institute: Revealing the
architecture of genetic backup circuits through
inspection of transcription regulatory networks
- Ron Shamir, Tel Aviv University: Revealing
selection patterns in the evolution of yeast
transcription regulation
- Michael Eisen, Lawrence Berkeley National Lab:
Evolutionary Signatures of Regulatory Sequences
4A New Core Promoter Element for Transcription by
RNA Polymerase II (Jim Kadonaga)
- The majority of transcription activity is
regulated by sequence-specific DNA-binding
factors, which are thus the focus of the bulk of
current research on regulation; however...
- The ultimate target of all of this action is the
core promoter, which also plays a part in
regulation
5- Core promoter
- Encompasses TSS
- Directs RNA polymerase II
- Most well-known component is the TATA box
Only about 30-40% of promoters contain a TATA
box! What's going on the rest of the time?
7Finding Novel Promoter Elements
- Experimentally investigated binding in those
promoters with no TATA box
- found novel promoter element: DPE
- Large-scale motif detection on 2000 core
promoters in Drosophila (Ohler et al., 2002)
- Plotted distance of top 10 motifs to TSS
- four motifs had a clear peak: TATA, Inr, DPE and
...
- a novel promoter element: MTE
8The Core Promoter gets a new look
MTE: Motif Ten Promoter Element
(Kadonaga, powerpoint slides)
9DPE and MTE: Two Newly Identified Promoter Elements
- Conserved from Drosophila to human (unknown
whether they occur in yeast)
- Very sensitive to spacing relative to the Inr motif
- experimentally found TSS (papers not reliable)
- a single insertion/deletion between motifs causes
a 7-fold reduction in transcription
- Inr and DPE (or MTE) bound cooperatively by TFIID
- first step in transcription initiation
10TATA gets top billing but...
- In Drosophila (out of 205 core promoters)
- TATA and DPE: 14%
- TATA only: 29%
- DPE only: 26%
- Neither: 31%
- TATA, DPE, and MTE can all
- independently support transcription
- compensate for a mutation in one of the others
11And finally... regulation.
- NC2, previously known to repress TATA-dependent
transcription, unexpectedly found to activate
DPE-dependent transcription
- Studied 18 enhancers and estimate that about 25%
exhibit some specificity for DPE or TATA
- Similar work in progress for MTE
12The Promoters of Splicing (Rotem Sorek)
- In general it is not known how alternative
splicing (AS) is regulated
- A few known splicing regulatory proteins
- like TFs they are sequence-specific, but they
bind to RNA, not DNA
- binding motif (usually 4-10 nt) can be located in
exon or intron
- can act as enhancers or silencers
- Evidence for combinatorial regulation
13The typical motif in a haystack
- Most work on finding splicing
factor motifs focuses on exons
- short enough that mutation studies are feasible
- Introns are too long and require a computational
approach
- Compiled training dataset
- 250 AS exons, AS in both mouse and human
- large set of constitutively spliced (CS) exons,
conserved across human/mouse
ATTCA
14Sorek and Ast, Genome Research 2003
15- Their primary finding: there tends to be
significantly more conservation in introns
surrounding AS exons than CS exons
- On average about 100 bases on either side of each
AS exon are conserved, compared to around 7 bases
for constitutively spliced exons
- What's the explanation?
- multiple binding motifs?
- helping to determine secondary structure in RNA,
which helps lead to correct splicing?
16Predicting Alternative Splicing
- Additional predictive features
- Higher conservation around exon
- Higher conservation of exon itself (motifs?)
- Shorter exons
- Exons whose length is a multiple of 3
- Method somehow chose one threshold for each
feature?
- Performance: scanned human genome, predicted 1000
AS exons (incl. training data?)
- 70% had EST evidence of AS vs. a 6-7% baseline
- Lab test showed that 7/15 (randomly?) selected
from the remaining 30% are AS in at least one of 15
tissues
- Significance: estimate splicing promoters cover
3x10^6 bp
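The "one threshold per feature" idea on this slide can be sketched as a simple rule-based filter. This is only an illustration: the `Exon` fields and all three cutoff values are hypothetical, since the talk did not say how the thresholds were chosen.

```python
from dataclasses import dataclass


@dataclass
class Exon:
    length: int              # exon length in nt
    exon_identity: float     # human/mouse identity within the exon, 0..1
    flank_conserved_bp: int  # conserved intronic bases flanking the exon


# Hypothetical cutoffs, chosen only for illustration (not from the talk).
MAX_LENGTH = 200
MIN_EXON_IDENTITY = 0.95
MIN_FLANK_CONSERVED_BP = 50


def looks_alternative(exon: Exon) -> bool:
    """Return True only if the exon passes every per-feature threshold."""
    return (
        exon.length <= MAX_LENGTH                 # shorter exons
        and exon.length % 3 == 0                  # length is a multiple of 3
        and exon.exon_identity >= MIN_EXON_IDENTITY
        and exon.flank_conserved_bp >= MIN_FLANK_CONSERVED_BP
    )


candidate = Exon(length=96, exon_identity=0.98, flank_conserved_bp=100)
print(looks_alternative(candidate))
```

Requiring all features at once trades recall for precision, which would fit a genome-wide scan where false positives are expensive to test in the lab.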
17Genetic Backup Circuits(Kafri and Pilpel)
- Fact: single-gene knockouts often have little or
no phenotypic effect
- 10% lethal in worm
- 27% lethal in yeast
- Question: Can we better understand the mechanisms
of genetic backup?
- Task: Predict whether a knockout will be lethal
or not
18Duplicates Suggest Redundancy
- Genes with duplicates are less likely to be
essential
- But clearly this doesn't tell the whole story
- lethal genes can have duplicates
- nonessential genes often have no duplicate
(Gu, Z. et al., Nature 2003)
19Function of Duplicate Matters
- Compute dispensability of yeast genes
- growth rate after knockout compared to mean
growth rate, averaged over many conditions
- Compared GO functional annotations of highly
similar genes. Found higher dispensability when
- higher functional similarity (Resnik info
content)
- little functional similarity but high sequence
similarity (BLAST E-values)
20Similarity of Expression
- 40 time series, 500 timepoints
- In each condition calculated correlation of
expression profiles of each pair of paralogous
genes
- Average correlation suggests
- backup is best provided by genes which do not
share expression patterns
21How can we explain this unexpected result?
- Classify pairs into
- negative correlation
- never similarly expressed
- positive correlation
- always similarly expressed
- no correlation
- never similarly expressed or
- similarly expressed in certain conditions
22Variability of Expression
- Use stdDev to quantify consistency of correlation
across conditions
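A minimal sketch of the two summary statistics used here, the mean and the standard deviation of the per-condition correlation for one paralog pair. The data layout (a dict from condition name to a list of expression values) and the toy numbers are assumptions for illustration, not the authors' data.

```python
from math import sqrt
from statistics import mean, pstdev


def pearson(x, y):
    """Pearson correlation of two equal-length profiles (neither constant)."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def correlation_profile(gene_a, gene_b):
    """Return (mean, std dev) of the correlation across shared conditions."""
    rs = [pearson(gene_a[c], gene_b[c]) for c in gene_a if c in gene_b]
    return mean(rs), pstdev(rs)


# Toy pair: co-expressed in one condition, anti-correlated in the other.
gene_a = {"heat": [1, 2, 3], "starvation": [1, 2, 3]}
gene_b = {"heat": [2, 4, 6], "starvation": [3, 2, 1]}
avg_r, sd_r = correlation_profile(gene_a, gene_b)
print(avg_r, sd_r)
```

A pair like this has an average correlation near zero but a high standard deviation, which is the "similar in only some conditions" case that the average alone cannot distinguish from "never similar".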
23Goldilocks and the three little paralogs
Expression correlated in only a subset of
conditions: Just Right
Always Same Expression: Too Similar
Never Same Expression: Too Diverged
- Optimal backup requires the ability to switch
between similar and dissimilar expression in a
condition-dependent manner
24Predictions about the Past...
- Hypothesized duplication mechanism
- duplication occurs
- leads to unstable redundancy
- quickly followed by either
- mutation and loss of one of the duplicates
- subfunctionalization leading to stable redundancy
- Hypothesize two distinct types of
subfunctionalization
- mutation of coding region leading to functional
divergence
- mutation of control region leading to divergence
of expression
25Need for Regulatory Flexibility
- This second type of subfunctionalization would
entail a quite significant regulatory challenge
if the paralogs are to provide backup for one
another
- Upon mutation of B, A must be turned on in the
conditions that would normally require B
- Postulate that
- this regulatory challenge is met when a gene has
a significant amount of regulatory diversity
(i.e. different TF motifs)
- backup asymmetry arises when one of the genes has
few motifs (Kellis suggests otherwise?)
26Experiments, but no hard numbers
- Claim: the capacity of genes to respond at the
transcriptional level when their counterpart is
deleted is central to their ability to provide
backup
- Most paralogs downregulated when other gene is
knocked out (cross-hybridization?)
- lower stdev -> downregulation
- Claim that asymmetry of backup capability can be
predicted based on number of transcription factor
binding sites.
- Gene that has the larger number of motifs is the
one that is capable of providing a backup to the
other
- Genes with few motifs are parasites: can't
provide backup
- Claim an improved ability to predict effect of
double knockouts
27A Question
- They claim that only when the genes diverge in
function will they be maintained in evolution.
- But if the duplicated pair can compensate for
each other's function then won't there be little
selection pressure to maintain both copies?
28From General Conservation to Specific Motifs
- Searched conserved intronic regions for
overrepresented hexamers
- literature search for the most significant hexamer
shows that it is mentioned as an AS motif in
six papers
- Next steps
- identify the consensus sequences of additional
motifs
- learn tissue/developmental specificity for each
motif
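A search for overrepresented hexamers could look roughly like the sketch below: count every 6-mer in the conserved intronic sequences and compare its frequency with a background set. The enrichment score (a pseudocounted frequency ratio) and the toy sequences are assumptions; the talk did not describe the statistic actually used.

```python
from collections import Counter


def kmer_counts(seqs, k=6):
    """Count every overlapping k-mer across a list of sequences."""
    counts = Counter()
    for seq in seqs:
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts


def enrichment(fg_seqs, bg_seqs, k=6, pseudocount=1.0):
    """Rank foreground k-mers by a crude foreground/background frequency ratio."""
    fg, bg = kmer_counts(fg_seqs, k), kmer_counts(bg_seqs, k)
    fg_total, bg_total = sum(fg.values()), sum(bg.values())
    scores = {}
    for kmer, n in fg.items():
        fg_freq = (n + pseudocount) / (fg_total + pseudocount)
        bg_freq = (bg[kmer] + pseudocount) / (bg_total + pseudocount)
        scores[kmer] = fg_freq / bg_freq
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Toy data: GATTAC is planted in every foreground sequence.
foreground = ["AAGATTACAA", "CCGATTACCC", "TTGATTACTT"]
background = ["AAAAAAAAAA", "CCCCCCCCCC", "TTTTTTTTTT", "ACGTACGTAC"]
print(enrichment(foreground, background)[0])
```

In practice the background would be introns flanking constitutively spliced exons, and a proper significance test would replace the raw ratio.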
29Revealing Selection Patterns in the Evolution of
Yeast Transcription Regulation (Amos Tanay, Irit
Gat-Viks and Ron Shamir)
- Identifying TF binding sites is hard
- Even harder to predict more complex interactions
- rarely a binary switch
- not a linear relation between affinity and
activation
- different binding affinities can lead to
different results (e.g. p53 can lead to apoptosis
or rescue)
- Conservation indicates functionality
- Evolution dynamics disclose details of
functionality
30An Analogy: Imagine we didn't know the genetic
code, but just the length of the codes
- We know that synonymous substitutions are more
common in coding regions than nonsynonymous
substitutions
- build a network where each 3-letter nt string is
represented by one node
- put an edge between nodes where the thickness of
the edge represents the frequency of mutations in
aligned coding regions of related organisms
- see strongly connected components comprised of
nodes which all code for the same amino acid
31A Simple Approach
- Chose to use the four recent genomes of simple
yeasts (promoter regions are relatively short)
- Identified 4000 promoters and aligned them using
ClustalW
- Use simple window scanning method to identify all
motifs of size 8
- Simple parsimony method to infer ancestral
sequences at each node in the phylogeny
32A Simple Approach (2)
- Calculate background substitution rate
- 16-parameter background model for each branch in
the phylogeny
- For each motif, compute 8 tables of site-specific
substitution rates
- simply count observed substitutions at each site,
summed over all branches of the tree and all
instances of the motif
- normalized substitution rate: log of the ratio of
observed substitutions to expected substitutions
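The counting step can be sketched as follows, under two simplifying assumptions that are not from the talk: the expected count comes from a single background substitution probability rather than the 16-parameter per-branch model, and each site gets one number instead of a full table. A small pseudocount keeps the log finite when no substitutions are observed.

```python
import math


def site_substitution_counts(pairs, k=8):
    """Count mismatches at each motif position over (ancestor, descendant) pairs."""
    counts = [0] * k
    for ancestor, descendant in pairs:
        for i in range(k):
            if ancestor[i] != descendant[i]:
                counts[i] += 1
    return counts


def normalized_rates(pairs, background_rate, k=8, pseudocount=0.5):
    """Log ratio of observed to expected substitutions at each site."""
    observed = site_substitution_counts(pairs, k)
    expected = background_rate * len(pairs)  # simplified background model
    return [
        math.log((obs + pseudocount) / (expected + pseudocount))
        for obs in observed
    ]


# Toy motif instances along tree branches (ancestral 8-mer, descendant 8-mer).
pairs = [
    ("ACGTACGT", "ACGTACGA"),
    ("ACGTACGT", "ACGTACGC"),
    ("ACGTACGT", "ACGTACGT"),
    ("ACGTACGT", "TCGTACGT"),
]
print(normalized_rates(pairs, background_rate=0.25))
```

Positive values mark sites that change faster than background, negative values mark sites under apparent constraint.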
33Building a Selection Network
- Each node represents an 8mer motif
- Connect all motifs that are 1 substitution apart
- if substitution rate is positive, dark edge
- if substitution rate is negative, light edge
- if not enough data, very thin edge
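A rough sketch of the network construction described on this slide. The `rates` lookup (motif pair to normalized rate and observation count), the `min_obs` cutoff, and the choice of an undirected graph are all assumptions made for illustration.

```python
from itertools import combinations


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def build_selection_network(motifs, rates, min_obs=5):
    """Connect motifs one substitution apart and label each edge.

    rates maps a sorted motif pair to (normalized_rate, n_observations).
    """
    edges = {}
    for a, b in combinations(sorted(motifs), 2):
        if len(a) != len(b) or hamming(a, b) != 1:
            continue
        rate, n_obs = rates.get((a, b), (0.0, 0))
        if n_obs < min_obs:
            edges[(a, b)] = "thin"    # not enough data
        elif rate > 0:
            edges[(a, b)] = "dark"    # positive normalized rate
        else:
            edges[(a, b)] = "light"   # negative (or zero) normalized rate
    return edges


motifs = ["ACGTACGT", "ACGTACGA", "ACGTACGC", "TTTTTTTT"]
rates = {
    ("ACGTACGA", "ACGTACGT"): (1.2, 40),
    ("ACGTACGC", "ACGTACGT"): (-0.8, 30),
}
for edge, style in build_selection_network(motifs, rates).items():
    print(edge, style)
```

Clusters of motifs joined by dark edges would then play the role that synonymous codons play in the genetic-code analogy.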
34Images taken from http://www.cs.tau.ac.il/amos/promoter_evo/
35- Did some larger-scale evaluations based on ChIP
and gene expression data
- Also some anecdotal results
36Matrix of Substitutions from the Motif Consensus
37Evolutionary Signatures of Regulatory Sequences
(Michael Eisen)
- Examples of evolutionary signatures
- coding sequence: conserved, conserved, variable
- structural RNA: nt that base-pair are coevolving
- What are the evolutionary constraints
imposed on sequences by TF binding?
- Aligned 4 yeast species
- for each base in genome, estimate evolutionary
rate (very noisy estimates)
38Analyze the pattern of rate variation across the
entire binding site
Moses et al Evol Biol 2003
39Position-specific Rate Variation
- The pattern of rate variation across the entire
binding site for a particular TF
- within one genome
- across genomes
- Clearly due to structural constraints
- protein contacts
- even when we know there's no contact, there are DNA
bending issues...
Highly Correlated
41These signatures are missing from current
motif-prediction programs
- Although this isn't a particularly surprising
result, many predicted motifs (e.g. from MEME
etc.) do not display this TFBS signature
- could use as a filter, or incorporate it more
directly (they're working on this currently?)
- Different families of TFs have different
signatures
- Eisen thinks the community is still
underutilizing this information
42Make better use of comparative data by using an
explicit evolutionary model
- Is there likely to have been a TFBS in the
ancestor?
- build a PSSM representing the chemical
contribution of each base to the binding
specificity
- use the Halpern and Bruno model to predict how the
TFBS will evolve given the proposed selection model
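For reference, the Halpern and Bruno rate for a substitution from base a to base b at one site can be sketched as below, where `q_ab` and `q_ba` are background mutation rates and `f_a`, `f_b` are the equilibrium base frequencies at that site (here assumed to be read off one PSSM column). How the speakers combined this with their PSSM was not detailed, so treat the usage as an assumption.

```python
import math


def halpern_bruno_rate(q_ab, q_ba, f_a, f_b):
    """Substitution rate a -> b at a site with equilibrium frequencies f_a, f_b.

    All four arguments must be strictly positive.
    """
    x = (f_b * q_ba) / (f_a * q_ab)
    if abs(x - 1.0) < 1e-12:
        return q_ab  # neutral limit: selection does not distinguish a and b
    return q_ab * math.log(x) / (1.0 - 1.0 / x)


# Hypothetical PSSM column strongly preferring A over T, symmetric mutation.
away = halpern_bruno_rate(q_ab=1.0, q_ba=1.0, f_a=0.7, f_b=0.1)   # A -> T
back = halpern_bruno_rate(q_ab=1.0, q_ba=1.0, f_a=0.1, f_b=0.7)   # T -> A
print(away, back)
```

Substitutions away from the preferred base come out slower than neutral and substitutions toward it faster, which is the position-specific rate pattern the earlier slides describe.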
43Make better use of comparative data by using an
explicit evolutionary model
Moses et al Evol Biol 2003
44Larger Cis-Regulatory Sequences
- Known binding patterns in Drosophila have low
information content
- find a sequence match for each TFBS before almost
every gene in the genome
- Build a statistical model to identify significant
clusters of binding sites in windows of arbitrary
size
- improved detection of cis-regulatory modules
- experimental results still show many false
positives
- Use comparative data to discriminate real
clusters from false ones
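One simple way to flag dense clusters of predicted sites is a sliding window with a Poisson tail test, sketched below. The Poisson null, the `site_density` parameter, and the `alpha` cutoff are assumptions for illustration; the talk did not specify the statistical model used.

```python
import math


def poisson_tail(k, lam):
    """P(X >= k) for X ~ Poisson(lam)."""
    cdf = sum(math.exp(-lam) * lam ** i / math.factorial(i) for i in range(k))
    return 1.0 - cdf


def find_clusters(positions, window, site_density, alpha=1e-3):
    """Report windows holding more sites than the background density predicts.

    Returns (start, end, site_count, p_value) tuples; windows may overlap.
    """
    positions = sorted(positions)
    expected = site_density * window
    clusters = []
    left = 0
    for right, pos in enumerate(positions):
        # Shrink the window until it spans less than `window` bases.
        while pos - positions[left] >= window:
            left += 1
        count = right - left + 1
        p_value = poisson_tail(count, expected)
        if p_value < alpha:
            clusters.append((positions[left], pos, count, p_value))
    return clusters


# Toy predicted binding-site coordinates with one dense stretch.
sites = [100, 5000, 5050, 5100, 5150, 5200, 20000]
for cluster in find_clusters(sites, window=700, site_density=5e-4):
    print(cluster)
```

Because low-information motifs match almost everywhere, a test like this still yields false positives, which is why the following slides turn to comparative data.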
45How to use comparative data
- Conservation in Drosophila pseudoobscura isn't a
good indicator of functionality
- all real and fake clusters have very high overall
sequence conservation, including their flanking
regions (a surprise)
- However...
- the actual binding sites are often not conserved
- even one or two mutations can destroy a binding
site
- conservation of binding site density
is a useful indicator of function
46An Impassioned Speech on the Evolution of the
Scientific Journal
- If you publish your work in a journal like
Science, which fewer and fewer people in the world
have access to, you run a really big risk of being
the next Mendel, and your work will languish
in obscurity
- Don't publish in a journal that takes your
writing, your ideas, thoughts and paper and
claims ownership of them and then only doles them
out to a relatively narrow bunch of people who
have enough money to pay for them... solely to
promote the financial health of the journal...
- Don't be like Microsoft... publish in Public
Library of Science or another freely available
journal
47For More Information
- Most of the talks I picked were invited talks
- For the workshop there is often only an
abstract
- Video feed is available online
http://www.calit2.net/multimedia/recomb2004videos.html
- Many have papers that have just come out or are
about to come out with additional details...
check the authors' webpages
48Variability of Expression
- Best backup provided by duplicates which have
similar expression patterns in only a subset of
conditions
49Evolution and Larger Cis-Regulatory Sequences
- what are enhancers? whole regions of binding
sites?
- how are Drosophila enhancers organized?
- only 5 binding sites whose specificities are well
characterized from experimental studies
- low information content
- find them all over the genome
- Clusters of binding sites -> surrogate for
regulatory function
- Shown previously that if you look for clusters of
these sites
- all identified regions overlap known enhancers
- don't find anything else
- then I don't understand the next study with 39
clusters
50- Found 39 clusters
- 9 overlap known enhancers
- 28 tested experimentally
- 6 clearly regulating nearby gene
- 3 shown some regulatory role perhaps
- remainder don't appear to be real (but could have
wrong promoter? look back at Kadonaga talk)
- What's the difference between real and fake?
- use comparative mapping
51- Used two flies (which ones?)
- distant enough based on coding region
conservation that expect to see conservation only
of functionally conserved regions
- not the case
- all real and fake clusters have very high overall
sequence conservation, including their flanking
regions (why?)
52- However,
- binding sites not conserved
- one or two mutations enough to destroy a binding
site
- measure conservation of binding site density
- show graph (3718)
- summary (3921)
- In more distantly related species
- alignment more of an issue
- binding sites will move around more
- been shown that with huge binding site turnover
there can be 2 separate ways to make the same enhancer
- no sequence identity but in experimental studies
can replace each other?