Title: Insights from Boolean Modeling of Genetic Regulatory Networks
1Insights from Boolean Modeling of Genetic
Regulatory Networks
2Part I
- Discover and understand the underlying gene regulatory mechanisms by inferring them from data.
- Using the inferred model, endeavor to make useful predictions by mathematical analysis and computer simulations.
3genetic networks
- Complex regulatory networks among genes and their products control cell behaviors such as:
- cell cycle
- apoptosis
- cell differentiation
- communication between cells in tissues
- A paramount problem is to understand the
dynamical interactions among these genes,
transcription factors, and signaling cascades,
which govern the integrated behavior of the cell.
Analogy: circuit diagram
4Clinical Impact
- Model-based and computational analysis can:
- open up a window on the physiology of an organism and disease progression
- translate into accurate diagnosis, target identification, drug development, and treatment.
5What class of models should be chosen?
- The selection should be made in view of
- data requirements
- goals of modeling and analysis.
Goals
Data
Model
6Classical tradeoff
- A fine model with many parameters
- may be able to capture detailed low-level phenomena (protein concentrations, reaction kinetics)
- requires large amounts of data for inference
- A coarse model with low complexity
- may succeed in capturing only high-level phenomena (e.g. which genes are ON/OFF)
- requires smaller amounts of data
7Ockham's Razor
- Underlies all scientific theory building.
- Model complexity should never be made higher than what is necessary to faithfully explain the data.
- What kind of data do we have and how much?
William of Ockham (1280-1349)
8Boolean Networks
- To what extent do such models represent reality?
- Do we have the right type of data to infer these models?
- What do we hope to learn from them?
9Basic Structure of Boolean Networks
1 means active/expressed; 0 means inactive/unexpressed.
[Diagram: genes A and B are inputs to gene X.]
Boolean function (truth table):
A B | X
0 0 | 1
0 1 | 1
1 0 | 0
1 1 | 1
In this example, two genes (A and B) regulate
gene X. In principle, any number of input genes
are possible. Positive/negative feedback is also
common (and necessary for homeostasis).
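The truth table on this slide can be written out as a small lookup, as a minimal sketch. The closed form X = (NOT A) OR B is our reading of the four rows, not something stated on the slide, and the names `TRUTH_TABLE` and `regulate_x` are illustrative only.

```python
from itertools import product

# Truth table from the slide: inputs (A, B) -> output X.
TRUTH_TABLE = {(0, 0): 1, (0, 1): 1, (1, 0): 0, (1, 1): 1}

def regulate_x(a, b):
    """Assumed closed form of the table: X = (NOT A) OR B."""
    return int((not a) or b)

# Check the assumed closed form against every row of the table.
matches = all(
    regulate_x(a, b) == TRUTH_TABLE[(a, b)]
    for a, b in product((0, 1), repeat=2)
)
print(matches)
```

A dictionary lookup generalizes directly to any number of input genes, at the cost of 2^k entries for k inputs.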
10Dynamics of Boolean Networks
[Figure: a network of six genes (A-F) whose binary states, e.g. 0 1 1 0 0 1, are updated over time.]
11State Space of Boolean Networks
- Equate cellular states (or fates) with attractors.
- Attractor states are stable under small perturbations.
- Most perturbations cause the network to flow back to the attractor.
- Some genes are more important, and changing their activation can cause the system to transition to a different attractor.
Picture generated using the program DDLab.
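For a very small network, the attractors can be found by brute force: follow every state under synchronous updating until a state repeats. The sketch below uses a hypothetical three-gene network invented for illustration (it is not the network in the picture); `step` and `attractor_from` are illustrative names.

```python
from itertools import product

def step(state):
    """Synchronous update of a hypothetical 3-gene network (a, b, c)."""
    a, b, c = state
    return (int(b and c), a, int(a or c))

def attractor_from(state):
    """Iterate until a state repeats; return the cycle that was reached."""
    seen = []
    while state not in seen:
        seen.append(state)
        state = step(state)
    return frozenset(seen[seen.index(state):])

# Enumerate all 2^3 states and collect the distinct attractors.
attractors = {attractor_from(s) for s in product((0, 1), repeat=3)}
for att in attractors:
    print(sorted(att))
```

In this toy network, flipping one gene of the fixed point (0, 0, 0) to get (0, 1, 0) flows back to the same attractor, while other perturbations can land in a different basin. Exhaustive enumeration is only feasible for small networks because the state space grows as 2^n.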
12Boolean model of the yeast filamentation network
Taylor, Galitski
13But can we extract meaningful biological
information from gene expression data entirely in
the binary domain?
- We reasoned that if genes quantized to only two levels (1 or 0) were not informative in separating known subclasses of tumors, then there would be little hope for Boolean inference of real genetic networks.
14Gene expression analysis in the binary domain
- Using binary gene expression data with Hamming distance as the dissimilarity measure, multidimensional scaling reveals a clear separation between different subtypes of gliomas.
Shmulevich, I. and Zhang, W. (2002)
Bioinformatics 18(4), 555-565.
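As a minimal sketch of the first step of such an analysis, the pairwise Hamming distances between binarized expression profiles can be computed as follows. The three toy profiles are made up for illustration; the resulting matrix is what a multidimensional scaling routine would take as input (that step is not shown here).

```python
def hamming(u, v):
    """Number of positions at which two equal-length binary profiles differ."""
    if len(u) != len(v):
        raise ValueError("profiles must have the same length")
    return sum(a != b for a, b in zip(u, v))

def distance_matrix(profiles):
    """Symmetric matrix of pairwise Hamming distances."""
    return [[hamming(p, q) for q in profiles] for p in profiles]

# Hypothetical binarized samples (rows) over five genes (columns).
samples = [
    [1, 0, 1, 1, 0],
    [1, 0, 1, 0, 0],
    [0, 1, 0, 0, 1],
]
print(distance_matrix(samples))
```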
15Boolean Framework
- Limited amounts of data and the noisy nature of the measurements can make useful quantitative inferences problematic, so a coarse-scale qualitative modeling approach seems justified.
- Boolean idealization enormously simplifies the modeling task.
- We wish to study the collective regulatory behavior without specific quantitative details.
- Boolean networks qualitatively capture typical genetic behavior.
- Albert, R. & Othmer, H.G. (2003) J. Theor. Biol. 223, 1-18.
- Mendoza, L., Thieffry, D. & Alvarez-Buylla, R.E. (1999) Bioinformatics 15, 593-606.
- Huang, S. & Ingber, D.E. (2000) Exp. Cell Res. 261, 91-103.
- Li, F., Long, T., Lu, Y., Ouyang, Q. & Tang, C. (2004) PNAS 101(14), 4781-4786.
17Probabilistic Boolean Networks (PBN)
- Share the appealing rule-based properties of Boolean networks.
- Robust in the face of uncertainty.
- Dynamic behavior can be studied in the context of Markov chains.
- Boolean networks are just special cases.
- Close relationship to (dynamic) Bayesian networks
- Explicitly represent probabilistic relationships between genes. (Lähdesmäki et al. (2006) Sig. Proc. 86(4), 814-834)
- Can represent the same joint probability distribution.
- Allow quantification of influence of genes on other genes (stay tuned for examples)
Shmulevich et al. (2002) Proceedings of the IEEE 90(11), 1778-1792.
18Basic structure of PBNs
If we have several good competing predictors
(functions) for a given gene and each one has
determinative power, don't put all our faith in
one of them!
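The idea can be sketched as follows: each gene keeps several candidate predictors, and at every time step one of them is drawn according to its selection probability. The two-gene network, its functions, and the probabilities below are hypothetical, and the sketch assumes that predictors are drawn independently for each gene at every step.

```python
import random

# Hypothetical PBN: gene index -> list of (selection probability, predictor).
PREDICTORS = {
    0: [(0.7, lambda s: s[0] and s[1]), (0.3, lambda s: s[1])],
    1: [(1.0, lambda s: int(not s[0]))],
}

def pbn_step(state, rng):
    """One synchronous update: draw a predictor per gene, then apply it."""
    nxt = []
    for gene in sorted(PREDICTORS):
        probs, funcs = zip(*PREDICTORS[gene])
        f = rng.choices(funcs, weights=probs, k=1)[0]
        nxt.append(int(f(state)))
    return tuple(nxt)

rng = random.Random(0)  # seeded only to make the run repeatable
state = (0, 1)
for _ in range(5):
    state = pbn_step(state, rng)
    print(state)
```

With a single predictor of probability 1 for every gene, the update reduces to an ordinary Boolean network.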
19Model Inference from Gene Expression Data
- Two approaches:
- Coefficient of Determination (Dougherty et al. 2000)
- Best-Fit Extensions
Lähdesmäki et al. (2003) Machine Learning 52, 147-167.
20Coefficient of Determination (COD)
- COD is used to discover associations between variables.
- It measures the degree to which the expression levels of an observed gene set can be used to improve the prediction of the expression of a target gene relative to the best possible prediction in the absence of observations.
- Using the COD, one can find sets of genes related multivariately to a given target gene.
21COD Definition
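For a target gene $x_i$, a set of observed genes, and the optimal predictor $f$ of $x_i$ from the observed genes, the COD is

$$\theta = \frac{\varepsilon_i - \varepsilon_{\mathrm{opt}}}{\varepsilon_i}$$

where $\varepsilon_i$ is the error of the best (constant) estimate of $x_i$ in the absence of any conditional variables, and $\varepsilon_{\mathrm{opt}}$ is the optimal error achieved by $f$.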
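A minimal sketch of estimating the COD from binary data is given below. It assumes the standard form (error of the best constant estimate minus optimal error, divided by the error of the best constant estimate), takes the optimal predictor to be a majority vote for each observed input pattern, and measures both errors on the same samples (a resubstitution estimate, which is optimistic for small data sets). The function name `cod` and the toy data are illustrative.

```python
from collections import Counter, defaultdict

def cod(observed, target):
    """Resubstitution estimate of the coefficient of determination.

    observed: list of tuples of binary values for the observed genes
    target:   list of 0/1 values of the target gene, one per sample
    """
    n = len(target)
    ones = sum(target)
    eps_i = min(ones, n - ones) / n      # best constant predictor
    if eps_i == 0:
        return 0.0                       # convention: nothing left to improve
    counts = defaultdict(Counter)
    for x, y in zip(observed, target):
        counts[tuple(x)][y] += 1
    # Majority vote per input pattern; the minority count is the error.
    eps_opt = sum(min(c[0], c[1]) for c in counts.values()) / n
    return (eps_i - eps_opt) / eps_i

inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]
print(cod(inputs, [1, 1, 0, 1]))
```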
22Constraints During Inference
- Constraining the class of predictors can have advantages:
- lessening the data requirements for reliable estimation
- incorporating prior knowledge of the class of functions representing genetic interactions
- certain classes of functions are more plausible from the point of view of evolution, noise resilience, network dynamics, etc.
23Example of Constraint: Post Classes
Shmulevich et al. (2003) PNAS 100(19),
10734-10739.
Emil Post (1897-1954)
- The class is sufficiently large (this is important for inference).
- An abundance of functions from this class will tend to prevent chaotic behavior in networks.
- Eukaryotic cells are not chaotic! (Shmulevich et al. (2005) PNAS 102(38), 13439-13444.)
- Functions from this class have a natural way to ensure robustness against noise and uncertainty.
24Post Class Constraints During Inference
- We compared the Post classes to the class of all Boolean functions (i.e. no constraint) by estimating the corresponding prediction error for a set of target genes, using available gene expression data.
- We found that the optimal error of Post functions compares favorably with optimal error without constraint.
- A hypothesis testing-based study gives no statistically significant evidence against the use of constrained function classes (i.e. cost of constraint).
- Thus, Post classes are also plausible in light of experimental data.
25Subnetworks: Theory and Examples
- Aim: discover relatively small subnetworks
- whose genes interact significantly and
- whose genes are not strongly influenced by genes outside the subnetwork.
- Principle of Autonomy
- Start with a seed gene set and iteratively adjoin new genes so as to enhance subnetwork autonomy.
26Growing Algorithm
To achieve network autonomy, both of these
strengths of connections should be high.
The sensitivity of Y from the outside should be
small.
Various stopping criteria can be used
Hashimoto et al. (2004) Bioinformatics 20(8), 1241-1247.
27Cancer tissues need nutrients. Gliomas are highly
angiogenic. Expression of VEGF is often elevated.
28VEGF is elevated in advanced stages of gliomas. Confirmation and localization by tissue microarray.
29VEGF protein is secreted outside the cells and
binds to its receptor on the endothelial cells to
promote their growth.
30[Figure: gene network; node labels follow]
FGF7 (member of the fibroblast growth factor family)
VEGF
PTK7 (tyrosine kinase receptor)
GRB2
- The protein products of all four genes are part of signal transduction pathways that involve surface tyrosine kinase receptors.
- These receptors, when activated, recruit a number of adaptor proteins to relay the signal to downstream molecules.
- GRB2 is one of the most crucial adaptors that have been identified.
- GRB2 is also a target for cancer intervention because of its link to multiple growth factor signal transduction pathways.
FSHR (follicle-stimulating hormone receptor)
32- Such relationships should also be validated experimentally.
- The networks built from our models provide valuable theoretical guidance for further experiments.
33- IGFBP2 is overexpressed in high-grade gliomas
- IGFBP2 contributes to increased cell invasion.
34IGFBP2 is elevated in advanced stages of gliomas. Confirmation and localization by tissue microarray.
35IGFBP2 promotes glioma cell invasion in vitro
High IGFBP2 clone 1
Vector
Low IGFBP2 clone
High IGFBP2 clone 2
36A. Niemistö, L. Hu, O. Yli-Harja, W. Zhang, I.
Shmulevich, "Quantification of in vitro cell
invasion through image analysis," International
Conference of the IEEE Engineering in Medicine
and Biology Society (EMBS'04), San Francisco,
California, USA, Sep. 1-5, 2004.
37- A review of the literature showed that Cazals et al. (1999) indeed demonstrated that NF-κB activated the IGFBP2 promoter in lung alveolar epithelial cells.
IGFBP2
NF-κB
38- Higher NF-κB activity in IGFBP2-overexpressing cells was also found.
- Transient transfection of an IGFBP2-expressing vector together with an NF-κB promoter reporter gene construct did not lead to increased NF-κB activity, suggesting an indirect effect of IGFBP2 on NF-κB.
IGFBP2
TNFR2
- Our real-time PCR data showed that in stable IGFBP2-overexpressing cell lines, IGFBP2 indeed enhances ILK expression.
- In addition, IGFBP2 contains an RGD domain, implying its interaction with integrin molecules.
- ILK is in the integrin signal transduction pathway.
ILK
NF-κB
- Studies also showed that IGFBP2 affects cell apoptosis, and TNFR2 is a known regulator of apoptosis.
39PBN web page
http://personal.systemsbiology.net/ilya/PBN/PBN.htm
- Reprints
- Software (BN/PBN MATLAB Toolbox)
- Posters/Presentations
- Workshops
- Links
- PBN People
40PBN Collaborators
Wei Zhang, Harri Lähdesmäki, Olli Yli-Harja, Jaakko Astola, Edward Dougherty, Ronaldo Hashimoto, Marcel Brun, Seungchan Kim, Edward Suh, Huai Li, Michael Bittner
Support: NIH/NIGMS R21 GM070600-01, NIH/NIGMS R01 GM072855-01
41Part II
42Joint work with
Stu Kauffman
Max Aldana
43Order/Chaos
- A broad body of work over the past 35 years has
shown that a variety of model genetic regulatory
networks behave in two broad regimes, ordered and
chaotic, with an analytically and numerically
demonstrated phase transition between the two.
44Edge of chaos
- The boundary between order and chaos is called the complex regime or the critical phase.
- The system can undergo a kind of phase transition.
- Networks are most evolvable at the edge of chaos.
- Living system in a variable environment
- Strike a balance: malleability vs. stability
- Must be stable, but not so stable that it remains forever static.
- Must be malleable, but not so malleable that it is fragile in the face of perturbations.
45Plausible and long-standing hypothesis: real cells lie in the ordered regime or are critical.
"Life at the edge of chaos"
There have been no experimental data supporting this hypothesis.
46Ordered networks
- Homeostasis
- A modest number of small recurrent patterns of gene activity (attractors)
- plausible models of the diverse cell types (or cell fates) of an organism
- the phenotypic traits of the organism are encoded in the dynamical attractors of its underlying genetic regulatory network
- Confined avalanches of gene activity changes following transient perturbations in the activity of single genes
- i.e. confined damage spreading
47Chaotic networks
- Nearby states lie on trajectories that diverge
- hence, fail to exhibit a natural basis for homeostasis
- Have enormous attractors whose sizes scale exponentially with the number of genes
- Exhibit vast avalanches of gene activity alterations following transient perturbations to single gene activities
48The model class
- Random Boolean Networks (RBNs) (Kauffman, 1969)
- ensemble approach
- One of the most intensively studied models of discrete dynamical systems.
- Sustained interest from biology and physics communities.
- Considered for many years as prototypes of nonlinear dynamical systems.
- RBNs are structurally simple yet capable of remarkably rich complex behavior!
49Connectivity
Mean number of input variables
(e.g. scale-free)
50Bias
- The bias p of a random function is the probability that it takes on the value 1.
- If p = 0.5, then the function is unbiased.
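A random Boolean function with a given bias can be sketched by filling each row of a truth table independently. This is an illustrative construction under that independence assumption, and `random_boolean_function` is a hypothetical name.

```python
import random
from itertools import product

def random_boolean_function(k, p, rng):
    """Truth table over k inputs; each output is 1 with probability p."""
    return {
        inputs: int(rng.random() < p)
        for inputs in product((0, 1), repeat=k)
    }

rng = random.Random(42)  # seeded only to make the run repeatable
table = random_boolean_function(3, 0.5, rng)
print(sum(table.values()) / len(table))  # realized fraction of ones
```

For small k the realized fraction of ones can deviate noticeably from p; the bias is a property of the ensemble rather than of any single function.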
51Connectivity, bias, and the phase transition
Average Network Sensitivity
Chaos
Critical Phase
Order
Shmulevich & Kauffman (2004) Physical Review Letters 93(4), 048701.
52Phase transition
- RBNs can be tuned to undergo a phase transition by
- tuning the connectivity K
- tuning the bias p
- tuning the scale-free exponent γ
- Aldana & Cluzel (2003) PNAS 100(15), 8710-8714.
- tuning abundance of functional classes
- Shmulevich et al. (2003) PNAS 100(19), 10734-10739.
53Our approach
- Measure and compare the complexity of time series data of HeLa cells with that of mock data generated by RBNs operating in the ordered, critical, and chaotic regimes.
- We use the Lempel-Ziv (LZ) measure of complexity.
- Dataset: Whitfield et al. (2002) Mol. Biol. Cell 13, 1977-2000.
- synchronized HeLa cells; 48 time points at 1-hour time intervals; 29,621 distinct genes
54Lempel-Ziv Complexity
The algorithm parses the sequence into shortest
words that have not occurred previously and the
complexity is defined as the number of such
words. Words are unique, except possibly the last
one.
01100101101100100110: LZ complexity 7
01010101010101010101: LZ complexity 3
55Lempel-Ziv Complexity Example
01100101101100100110
LZ Complexity 7
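The parsing rule can be sketched in a few lines. The sketch assumes that a word counts as having "occurred previously" if it appears anywhere in the sequence before its own last symbol (overlaps allowed); under that reading the example sequence parses as 0 | 1 | 10 | 010 | 1101 | 100100 | 110, giving 7 words. The function name `lz_complexity` is illustrative.

```python
def lz_complexity(s):
    """Number of words in the Lempel-Ziv parsing of the string s."""
    n = len(s)
    i = 0          # start of the current word
    words = 0
    while i < n:
        length = 1
        # Grow the word while it already occurs in the text that precedes
        # its last symbol; stop at the first new word or at the end of s.
        while i + length <= n and s[i:i + length] in s[:i + length - 1]:
            length += 1
        words += 1
        i += length
    return words

print(lz_complexity("01100101101100100110"))
print(lz_complexity("01010101010101010101"))
```

The repeated substring search makes this sketch quadratic or worse in the sequence length, which is acceptable for 48-point time series but not for long sequences.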
56Lempel-Ziv Complexity some remarks
- Universal complexity measure
- Basis of powerful lossless compression schemes (ZIP, GIF, etc.)
- by replacing words with a pointer to a previous occurrence of the same word
- Optimal compression rate approaches the entropy of the random sequence
- Asymptotically Gaussian: can be used for statistical tests of randomness.
57Intuition
- Genes in ordered networks have low LZ complexities.
- Genes in chaotic networks have high LZ complexities.
58Binarization
- We used the well-known k-means algorithm with two
groups, corresponding to the two binary values
(0,1).
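For a single gene's time series, two-group k-means reduces to a one-dimensional problem. The sketch below is one simple way to do it, assuming the two centers are initialized at the minimum and maximum values and that the higher-valued group is labeled 1; the slides do not specify these details, and `binarize` is an illustrative name.

```python
def binarize(values, iters=100):
    """Two-group 1-D k-means; returns 0/1 labels (1 = higher-valued group)."""
    lo, hi = min(values), max(values)
    if lo == hi:
        return [0] * len(values)   # constant profile: nothing to separate
    labels = []
    for _ in range(iters):
        # Assign each value to the nearer center (ties go to the low group).
        labels = [int(abs(v - hi) < abs(v - lo)) for v in values]
        low = [v for v, lab in zip(values, labels) if lab == 0]
        high = [v for v, lab in zip(values, labels) if lab == 1]
        new_lo, new_hi = sum(low) / len(low), sum(high) / len(high)
        if (new_lo, new_hi) == (lo, hi):
            break                  # centers stopped moving
        lo, hi = new_lo, new_hi
    return labels

print(binarize([0.1, 0.2, 0.15, 2.0, 2.2, 1.9]))
```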
59Lempel-Ziv complexity distributions of binarized
HeLa data vs. random binary data
60HeLa time-series data
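[Flowchart: the HeLa time-series data (29,621 genes by 48 time points) are binarized and their LZ complexities computed; RBNs in the ordered, critical, and chaotic regimes generate binary time series whose LZ complexities are computed in the same way; the distance between the two LZ distributions is computed and the minimum found. Example binary series shown: 01101001101001101011 and 10011001100100110110.]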
61Distance between LZ distributions
Kullback-Leibler (KL) distance
Euclidean distance
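Both distances operate on the empirical distributions of LZ complexities. The sketch below is illustrative only: the small constant `eps` that guards against empty bins in the KL distance is an assumption of this sketch, and the slides do not say how zero counts were handled or in which direction the KL distance was taken.

```python
import math
from collections import Counter

def lz_distribution(complexities, support):
    """Relative frequency of each complexity value in `support`."""
    counts = Counter(complexities)
    n = len(complexities)
    return [counts[c] / n for c in support]

def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def kl(p, q, eps=1e-12):
    """Kullback-Leibler distance D(p || q); eps avoids log of zero."""
    return sum(a * math.log((a + eps) / (b + eps))
               for a, b in zip(p, q) if a > 0)

support = range(3, 6)
data = lz_distribution([3, 3, 4, 5], support)    # toy "real" complexities
model = lz_distribution([3, 4, 4, 5], support)   # toy "model" complexities
print(euclidean(data, model), kl(data, model))
```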
62Three techniques to tune ordered, critical, and
chaotic regimes.
- Fix p = 0.5, let K = 1, 2, 3, 4.
- Fix K = 4, let p = 0.93301, 0.85355, 0.75, 0.5.
- Scale-free topology with connectivity K(γ). Vary the scale-free exponent γ such that the average network sensitivity is equal to that in the cases above. (Aldana & Cluzel (2003) PNAS 100(15), 8710-8714)
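The parameter values in the first two settings line up when expressed as average sensitivities. The check below assumes the standard result for random Boolean networks that the expected average sensitivity is s = 2Kp(1 - p), with s < 1 ordered, s = 1 critical, and s > 1 chaotic; this formula is not written on the slide, and the function name is illustrative.

```python
def average_sensitivity(k, p):
    """Assumed RBN formula: s = 2 * K * p * (1 - p)."""
    return 2 * k * p * (1 - p)

# Setting 1: fix p = 0.5 and vary K.
by_k = [average_sensitivity(k, 0.5) for k in (1, 2, 3, 4)]

# Setting 2: fix K = 4 and vary p.
by_p = [average_sensitivity(4, p) for p in (0.93301, 0.85355, 0.75, 0.5)]

for s_k, s_p in zip(by_k, by_p):
    regime = "ordered" if s_k < 1 else "critical" if s_k == 1 else "chaotic"
    print(f"{s_k:.2f}  {s_p:.4f}  {regime}")
```

Under this assumption both settings sweep the same four sensitivities (0.5, 1, 1.5, 2), that is, one ordered, one critical, and two chaotic cases.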
63But what about noise?
- Wouldn't noise make things look more chaotic?
- There are two issues:
- In the binary domain, the compound effect of noise amounts to a certain percentage of values in the time series data being flipped from zero to one or vice versa.
- Many genes are expressed at levels that are below those corresponding to pure noise.
- Fortunately, using the HeLa data, it is possible to estimate both the binary noise probability and the global noise floor level as follows.
64Estimate the noise floor
- There are 963 empty spots on the HeLa microarrays.
- As a conservative estimate, for each of the 48 microarrays, we used the 95th percentile of the values of the empty spots as the noise floor level for that array.
- Only those genes whose values exceed this global threshold at all time points are included for further analysis.
- Hence our criteria are very stringent.
65Estimate the noise probability q
- We made use of the replicated probes available on the arrays.
- 2001 duplicate gene profiles of 48 time points.
- Keeping only those that exceeded the global threshold, we binarized each of the duplicate profiles and computed the normalized Hamming distance.
with a 95% bootstrap confidence interval of [0.32, 0.38].
66Euclidean (fix p = 0.5, tune K)
Shmulevich et al. (2005) PNAS 102(38), 13439.
67Kullback-Leibler (fix p = 0.5, tune K)
Shmulevich et al. (2005) PNAS 102(38), 13439.
68Euclidean (fix K = 4, tune p)
Shmulevich et al. (2005) PNAS 102(38), 13439.
69Kullback-Leibler (fix K = 4, tune p)
Shmulevich et al. (2005) PNAS 102(38), 13439.
70Euclidean, Scale-free (tune γ)
Shmulevich et al. (2005) PNAS 102(38), 13439.
71Kullback-Leibler, Scale-free (tune γ)
Shmulevich et al. (2005) PNAS 102(38), 13439.
72Concluding remarks
- The results strongly suggest that HeLa cells are in the ordered regime or are critical, but not chaotic.
- We cannot statistically distinguish between ordered and critical with these data.
- Critical networks appear to predict the distribution of genes whose activities are altered in several hundred knock-out mutants of yeast. (Serra et al. (2004) J. Theor. Biol. 227, 149-157)
- It will be important to use more realistic ensembles of model genetic networks to test whether our conclusions hold.