Title: Computational Models of Function and Evolution of cis-Regulatory Sequences
1. Computational Models of Function and Evolution of
cis-Regulatory Sequences
- Xin He (Advisor Saurabh Sinha)
- Department of Computer Science
- University of Illinois, Urbana-Champaign
2. cis-Regulatory Modules
The spatiotemporal expression pattern of a gene
is controlled by its cis-regulatory module (CRM).
A CRM contains binding sites for one or more
transcription factors (TFs) and drives expression
of the target gene.
3. Motivations
- Evolution of cis-regulatory modules
  - Evolutionary changes in CRMs are believed to
    give rise to new morphology
  - Better understanding of CRM evolution will lead
    to better discovery of CRMs through comparative
    genomics
- Sequence-function relationship of CRMs
  - Important for analysis of genomic data:
    sequences, ChIP-binding, gene expression
  - Important for understanding how CRMs carry out
    their function
4. Existing Approaches
- Existing approaches for cross-species CRM
  analysis
  - Assume the orthologous sequences are accurately
    aligned, but alignments are error-prone over
    large evolutionary distances.
  - Assume a transcription factor binding site
    (TFBS) must be conserved in all aligned
    sequences, but binding site gain and loss are
    common even in relatively closely related
    species.
- Existing quantitative approaches to the CRM
  sequence-function relationship
  - Often based on general formalisms from
    statistics and machine learning that do not
    reflect the underlying biological process
  - Miss important biochemical mechanisms of gene
    regulation, such as interactions among TFs
5. Pairwise Model of CRM Evolution - EMMA
An evolutionary model on an entire cis-regulatory
module
6. Model of TFBS Gain and Loss - EMMA
[Figure: timeline of a binding site, time axis from 0 to t]
- A functional site is initially under constraint
- At some moment, a mutation disrupts the site
  (binding energy no longer satisfies threshold)
- No longer constrained afterwards
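The loss event described above can be sketched in Python. This is a minimal illustration, not the EMMA implementation: the matrix `PWM`, the log-odds score used as a stand-in for binding energy, the uniform `BACKGROUND`, and the `THRESHOLD` value are all hypothetical.

```python
import math

# Hypothetical 4-position motif: per-position base probabilities (assumption).
PWM = [
    {"A": 0.85, "C": 0.05, "G": 0.05, "T": 0.05},
    {"A": 0.05, "C": 0.05, "G": 0.85, "T": 0.05},
    {"A": 0.05, "C": 0.85, "G": 0.05, "T": 0.05},
    {"A": 0.05, "C": 0.05, "G": 0.30, "T": 0.60},
]
BACKGROUND = 0.25  # uniform background base frequency (assumption)
THRESHOLD = 4.0    # hypothetical score cutoff for a functional site


def site_score(site):
    """Log-odds score against background; higher means stronger binding."""
    return sum(math.log2(col[base] / BACKGROUND) for col, base in zip(PWM, site))


def is_functional(site):
    """A site counts as functional while its score satisfies the threshold."""
    return site_score(site) >= THRESHOLD


def disrupting_mutations(site):
    """List single-base substitutions (position, new_base) that disrupt the site."""
    hits = []
    for i, old in enumerate(site):
        for new in "ACGT":
            if new != old and not is_functional(site[:i] + new + site[i + 1:]):
                hits.append((i, new))
    return hits
```

With these made-up numbers, most substitutions in the consensus site `AGCT` push the score below the threshold, while T to G at the last position is tolerated. This mirrors the idea that a site stays under constraint only until a disrupting mutation occurs.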
7. Alignment of LAGAN vs EMMA
A. LAGAN alignment: TFBSs are misaligned at the
boundary, shifted by a few bp, or completely
unaligned.
B. EMMA alignment: all of these alignment errors
are fixed.
8. Regulatory Target Prediction - EMMA
- Classification of sequences: known targets vs
  random
- Two Drosophila species: D. melanogaster (Mel)
  and D. pseudoobscura (Pse)
9. Multi-species CRM Model - STEMMA
- Constant rate of loss, µ, at each existing
  functional site
- Constant rate of gain, λ, at each background
  nucleotide
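If a single locus is treated as a two-state continuous-time Markov chain (functional site vs background), the two constant rates above give closed-form transition probabilities. The sketch below makes that simplifying assumption. It is not the STEMMA likelihood, which applies gain at each background nucleotide and loss at each existing site. The names `mu`, `lam`, `t`, and `p_site_at_t` are illustrative.

```python
import math


def p_site_at_t(mu, lam, t, starts_functional):
    """Probability that a locus is a functional site at time t.

    Two-state continuous-time Markov chain (a simplification):
    functional -> background at rate mu (loss),
    background -> functional at rate lam (gain).
    """
    total = mu + lam
    if total == 0:
        # No gain and no loss: the locus keeps its initial state.
        return 1.0 if starts_functional else 0.0
    stationary = lam / total      # long-run probability of being functional
    decay = math.exp(-total * t)  # how much of the initial state is remembered
    start = 1.0 if starts_functional else 0.0
    return stationary + (start - stationary) * decay
```

With `lam = 0` the expression reduces to pure exponential loss, `exp(-mu * t)`. For large `t` it approaches `lam / (mu + lam)` regardless of the starting state.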
10. Regulatory Target Prediction - STEMMA
- STUBB-mel: no conservation information
- STUBB-avg: heuristic modeling of sequence
  conservation
- STEMMA-NT: evolutionary model without binding
  site turnover (gain and loss)
11. Evolution of CRM in 12 Drosophila Species (I)
Epistatic interactions among different positions
of a TFBS. The HB model assumes that different
positions of a TFBS evolve independently. The SS
model assumes that binding sites evolve as a
single unit with possible dependence among
positions. The SS model has a smaller sum of
squared errors (SSE) when compared with the
observed data.
X axis: evolutionary change of binding energy.
Y axis: frequency of that change.
12. Evolution of CRM in 12 Drosophila Species (II)
The binding site loss process roughly follows a
molecular clock.
X axis: divergence between D. melanogaster and
another Drosophila species.
Y axis: the fraction of conserved binding sites.
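One way to read the clock-like behaviour, under assumptions not stated on the slide: if each site is lost independently at a constant rate (written \mu here) per unit divergence d, and lost sites are not regained, then the expected conserved fraction f(d) decays exponentially and its logarithm is linear in d.

```latex
f(d) = e^{-\mu d}, \qquad \log f(d) = -\mu d
```

Whether the plotted data follow exactly this form is not stated in the slides. The expression is only meant to show what a constant loss rate would predict.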
13. A Biophysical Model of TF-DNA Interactions
- Configurations: a sequence with n binding sites
  has 2^n configurations, each one corresponding to
  the occupancy states of all sites (each site is
  either occupied or not).
- Probabilities of configurations: a configuration
  exists with a certain probability, determined by
  1) TF-DNA binding: stronger binding, larger
     probability (qA, qB terms)
  2) TF-TF interactions: more interaction, larger
     probability (wAB term)
- The total binding affinity of the sequence to a
  factor A is the number of A molecules bound in
  the sequence, averaged over all configurations
  weighted by their probabilities.
- For each configuration, the probability
  (relative value) and the number of bound A
  molecules in that configuration are shown.
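The averaging over configurations described above can be sketched by brute-force enumeration. This is a minimal illustration under stated assumptions: each site contributes a weight q when occupied, and an interaction weight applies only when two neighbouring sites in the list are both occupied. The function name `expected_bound` and its argument layout are hypothetical, and enumeration is only practical for small n.

```python
from itertools import product


def expected_bound(sites, w, factor):
    """Expected number of bound `factor` molecules over all 2^n configurations.

    sites:  list of (tf_name, q) pairs in sequence order; q is the binding weight.
    w:      dict mapping (tf_left, tf_right) to an interaction weight for two
            neighbouring sites that are both occupied (1.0 if absent).
    """
    z = 0.0         # partition function: sum of all configuration weights
    weighted = 0.0  # sum of weight * number of bound `factor` molecules
    for config in product((0, 1), repeat=len(sites)):
        weight = 1.0
        for (_, q), occupied in zip(sites, config):
            if occupied:
                weight *= q
        # Assumption: only neighbouring sites in the list interact.
        for i in range(len(sites) - 1):
            if config[i] and config[i + 1]:
                weight *= w.get((sites[i][0], sites[i + 1][0]), 1.0)
        count = sum(o for (tf, _), o in zip(sites, config) if tf == factor)
        z += weight
        weighted += weight * count
    return weighted / z
```

For two sites with qA = 2, qB = 3 and wAB = 5, the four configurations have relative weights 1, 2, 3 and 30, so the expected number of bound A molecules is 32/36. Without the interaction term it is 8/12, which shows how cooperativity raises the predicted binding of A.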
14. Analysis of ChIP-binding Data by Biophysical
Modeling
Cooperative interactions among TFs are important
for explaining DNA binding.
- Coop/Non-coop: model with/without cooperative
  interactions.
- The numbers are Pearson correlations between
  predicted and observed binding affinities.
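For reference, the Pearson correlation used as the evaluation measure can be computed as below. This is a plain-Python sketch. The function name `pearson` and the argument names are illustrative, and no data from the slides are reproduced.

```python
import math


def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences of numbers."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 values")
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(vx * vy)
```

Here `xs` would hold the predicted binding affinities and `ys` the observed ones, one pair per sequence.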
15. Co-localization of TF Molecules
Insights from applying our model:
- Two TF molecules can be co-localized with only
  one molecule binding to DNA.
- Two molecules of different TFs can bind to DNA
  independently, without interaction.
- Two TF molecules can bind to DNA cooperatively:
  binding of one may facilitate binding of the
  other.
16. Predicting Expression Patterns from Regulatory
Sequences
An integrated biophysical model
- Cooperative binding of two adjacent TF molecules
- Quenching (short-range repression) of an
  activator molecule by a nearby repressor
  molecule
- Transcriptional synergism between two activator
  molecules in simultaneous contact with the basal
  transcriptional machinery (BTM)
- DNA binding of multiple TFs
- Interaction of bound TF molecules with BTM
  determines gene activation
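The last two points, TF binding followed by interaction with the BTM, can be illustrated with a simplified thermodynamic sketch. The formulation here is an assumption, not taken from the slides: expression is the probability that the BTM is bound, and each bound TF multiplies the BTM weight by its own factor. Cooperative binding, quenching and synergism are left out. The names `expression`, `alpha` and `q_btm` are hypothetical.

```python
from itertools import product


def expression(sites, alpha, q_btm):
    """Probability that the BTM is bound, summed over all TF occupancy states.

    sites:  list of (tf_name, q) pairs; q is the DNA-binding weight of the site.
    alpha:  dict mapping tf_name to its interaction weight with the BTM
            (>1 for an activator, <1 for a repressor; 1.0 if absent).
    q_btm:  basal weight of BTM binding when no TF is bound.
    """
    z_off = 0.0  # total weight of configurations with the BTM unbound
    z_on = 0.0   # total weight of configurations with the BTM bound
    for config in product((0, 1), repeat=len(sites)):
        w_bind = 1.0
        w_btm = q_btm
        for (tf, q), occupied in zip(sites, config):
            if occupied:
                w_bind *= q
                w_btm *= alpha.get(tf, 1.0)
        z_off += w_bind
        z_on += w_bind * w_btm
    return z_on / (z_on + z_off)
```

In this sketch, adding an activator site raises the returned value above the basal level `q_btm / (1 + q_btm)`, and adding a repressor site lowers it.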