Title: CS 5263 Bioinformatics
1. CS 5263 Bioinformatics: Reverse-engineering Gene Regulatory Networks
2. Genes and Proteins
- Gene (DNA) -> transcription (also called expression) -> mRNA; controlled by transcriptional regulation
- mRNA -> translation -> protein; controlled by translational regulation and mRNA degradation
- Protein activity controlled by (de)activation and other post-translational regulation
3. Gene Regulatory Networks
- The functioning of the cell is controlled by interactions between genes and proteins
- Genetic regulatory network: genes, proteins, and their mutual regulatory interactions
(Figure: three genes connected by activator and repressor interactions.)
4. Reverse-engineering GRNs
- GRNs are large, complex, and dynamic
- Goal: reconstruct the network from observed gene expression behaviors
- Experimental methods focus on a few genes only
- Computer-assisted analysis: large scale
- Studied since the 1960s, mostly theoretically
- Attracting much attention since the advent of microarray technology
- Emerging large-scale assay techniques (ChIP-chip, ChIP-seq, etc.) are making it even more feasible
5. Problem Statement
- Assumption: the expression value of a gene depends on the expression values of a set of other genes
- Given: a set of gene expression values under different conditions
- Goal: a function for each gene that predicts its expression value from the expression of other genes
  - Probabilistic: Bayesian network
  - Boolean functions: Boolean network
  - Linear functions: linear model
  - Other possibilities: decision trees, SVMs, etc.
6. Characteristics
- Gene expression data is often noisy, with missing values
- Only measures mRNA level
- Many genes are regulated not only at the transcriptional level
- # genes >> # experiments: an underdetermined problem!
- Correlation ≠ causality
- Good news: network structure is sparse (scale-free)
7. Methods for GRN Inference
- Directed and undirected graphs
  - E.g., KEGG, EcoCyc
- Boolean networks
  - Kauffman (1969), Liang et al. (1999), Shmulevich et al. (2002), Lähdesmäki et al. (2003)
- Bayesian networks
  - Friedman et al. (2000), Murphy and Mian (1999), Hartemink et al. (2002)
- Linear/non-linear regression models
  - D'Haeseleer et al. (1999), Yeung et al. (2002)
- Differential equations
  - Chen, He, and Church (1999)
- Neural networks
  - Weaver, Workman, and Stormo (1999)
8. Boolean Networks
- Genes are either on or off (expressed or not expressed)
- The state of gene Xi at time t is a Boolean function of the states of some other genes at time t-1

State transition table (X, Y, Z at time t-1 -> X', Y', Z' at time t):

X Y Z | X' Y' Z'
0 0 0 | 0  0  0
0 0 1 | 0  0  0
0 1 0 | 1  0  1
0 1 1 | 0  0  1
1 0 0 | 0  1  0
1 0 1 | 0  1  0
1 1 0 | 1  1  1
1 1 1 | 0  1  1

Update rules: X' = Y AND (NOT Z); Y' = X; Z' = Y
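The update rules above can be simulated directly; a minimal sketch (the function names are mine):

```python
# Synchronous update of the example network:
# X' = Y AND (NOT Z), Y' = X, Z' = Y; states are 0/1 tuples (X, Y, Z).
def step(state):
    """Advance the network one time step."""
    x, y, z = state
    return (int(y and not z), x, y)

def trajectory(state, steps):
    """States visited from `state`, updating synchronously `steps` times."""
    seq = [state]
    for _ in range(steps):
        seq.append(step(seq[-1]))
    return seq
```

Starting from (0, 1, 0) the trajectory alternates (0, 1, 0) -> (1, 0, 1) -> (0, 1, 0): a length-2 attractor cycle, the kind of behavior discussed under state spaces and attractor basins below.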
9. Learning Boolean Networks for Gene Expression
- Assumptions
  - Deterministic (wiring does not change)
  - Synchronized update
  - All Boolean functions are possible
- Data needed: 2^N samples for N genes (in comparison, N needed for linear models)
- General technique: limit the # of inputs per gene to k; the data required is reduced to 2^k log(N)
10. Learning Boolean Networks
- Consistency problem
  - Given examples S = {<In, Out>}, where In ∈ {0,1}^k, Out ∈ {0,1}
  - Goal: learn a Boolean function f such that for every <In, Out> ∈ S, f(In) = Out
- Note
  - Given the same input, the output must be unique
  - For k input variables, there are at most 2^k distinct input configurations
- Example
  - <001,1> <101,1> <110,1> <010,0> <011,0> <101,0>
  - In decimal: <1,1> <5,1> <6,1> <2,0> <3,0> <5,0>
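A consistency check is a one-pass scan over the examples; a minimal sketch (the helper name is mine):

```python
def is_consistent(examples):
    """True iff no input pattern appears with two different outputs."""
    seen = {}
    for inp, out in examples:
        if seen.setdefault(inp, out) != out:
            return False
    return True

# The example set above clashes on input 101 (paired with both 1 and 0).
examples = [("001", 1), ("101", 1), ("110", 1),
            ("010", 0), ("011", 0), ("101", 0)]
```

`is_consistent(examples)` returns False here because input 101 appears with both outputs, which is exactly what motivates the best-fit formulation on the next slide.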
11. Learning Boolean Networks
- Examples: <001,1> <101,1> <110,1> <010,0> <011,0> <101,0>
- No clash -> consistent
- Question marks -> undetermined entries of the truth table
(Figure: the partially determined truth table, with entries fixed by the observed inputs and ? elsewhere.)
- Checking consistency for one set of k variables: O(Mk), where M is the # of examples
- For N genes, choosing k inputs out of N: N * C(N, k) * O(Mk)
- Best-fit problem: find a function f with the minimum # of errors
- Limited error-size problem: find all functions with error size within ε_max
Lähdesmäki et al., Machine Learning 2003;52:147-167.
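For small k, the best-fit problem can be solved by brute force over all 2^(2^k) truth tables; a sketch (names are mine, and this is plain enumeration, not the algorithm of Lähdesmäki et al.):

```python
from itertools import product

def best_fit(examples, k):
    """Return (truth_table, errors) minimizing the number of clashing examples.

    examples: list of (input_tuple, output) pairs over k binary inputs.
    """
    inputs = list(product((0, 1), repeat=k))
    best, best_err = None, len(examples) + 1
    for outputs in product((0, 1), repeat=2 ** k):  # all 2^(2^k) functions
        f = dict(zip(inputs, outputs))
        err = sum(f[inp] != out for inp, out in examples)
        if err < best_err:
            best, best_err = f, err
    return best, best_err
```

On the clashing example set above (input 101 paired with both outputs), no function can achieve fewer than one error, so the returned error count is 1.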
12. (No transcript)
13. State space and attractor basins
14. What are some biological interpretations of basins and attractors?
15. (No transcript)
16. Linear Models
- The expression level of a gene at time t depends linearly on the expression levels of some genes at time t-1
- Basic model: X_i(t) = Σ_j W_ij X_j(t-1)
- Continuous version: X'_i(t) = Σ_j A_ij X_j(t), where X_i(t) can be measured and X'_i(t) can be estimated from X_i(t)
- In matrix form: X'_{N×M} = A_{N×N} X_{N×M}, where M is the number of time points and N is the number of genes
(Figure: genes X1, X2, X3 at time t-1 connected to their values at time t by weights W_ij.)
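With enough time points, the weights of the discrete model can be recovered by least squares; a synthetic sketch (all data simulated, sizes arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
N, M = 3, 12                        # genes, time points (MN >= N^2: determined)
W_true = 0.5 * rng.normal(size=(N, N))

# Simulate X(t) = W X(t-1) from a random initial state.
X = np.empty((N, M))
X[:, 0] = rng.normal(size=N)
for t in range(1, M):
    X[:, t] = W_true @ X[:, t - 1]

# Least squares over the M-1 observed transitions recovers W.
W_est, *_ = np.linalg.lstsq(X[:, :-1].T, X[:, 1:].T, rcond=None)
W_est = W_est.T
```

Here `np.allclose(W_est, W_true)` holds because the system is exactly linear and determined; in the realistic case N² >> MN of the next slide, `lstsq` instead returns one of infinitely many solutions.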
17. Linear Models (cont'd)
- X'_{N×M} = A_{N×N} X_{N×M}
- A_{N×N}: connectivity matrix; A_ij describes the type and strength of the influence of the jth gene on the ith gene
- To solve for A, we have MN linear equations in N^2 unknowns
- In general N^2 >> MN, so the system is underdetermined => an infinite number of solutions
18. Getting Around the Curse of Dimensionality
- Non-linear interpolation to increase the # of time points
- Cluster genes to reduce the # of genes
- Singular Value Decomposition (SVD)
  - A = A0 + C_{N×N} V^T_{N×N}, where c_ij = 0 unless j > M (only columns corresponding to zero singular values may be nonzero)
  - Take A0 as a solution; it is guaranteed to have the smallest sum of squares
- Robust regression
  - Minimize the # of edges in the network
  - Biological networks are sparse (scale-free)
(Figure: the matrix C_{N×N}, with its nonzero entries c_ij confined to the columns j > M.)
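The particular solution A0 with the smallest sum of squared entries can be computed with the Moore-Penrose pseudoinverse (which numpy computes via SVD); a synthetic sketch of the underdetermined case:

```python
import numpy as np

rng = np.random.default_rng(1)
N, M = 6, 3                        # more genes than time points: underdetermined
X = rng.normal(size=(N, M))        # measured expression levels
Xdot = rng.normal(size=(N, M))     # estimated derivatives (synthetic here)

A0 = Xdot @ np.linalg.pinv(X)      # minimum-norm solution of Xdot = A X
```

Because X has full column rank here, A0 reproduces the data exactly (`A0 @ X` equals `Xdot` up to round-off), and any other exact solution has a larger Frobenius norm.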
19. Robust Regression
- A = A0 + C_{N×N} V^T_{N×N}
- Minimize the # of non-zero entries in A by selecting C
- Set A = 0; then C V^T = -A0; solve for C
  - Overdetermined (N^2 equations, MN free variables)
- Robust regression
  - Fit a hyperplane to a set of points so that it passes through as many points as possible
20. Simulation Experiments
(Figures: SVD alone vs. SVD + robust regression.)
Yeung et al., PNAS 2002;99:6163-8.
21. Simulation Experiments (cont'd)
(Figures: a nonlinear system close to steady state; a linear system.)
- Does not work for nonlinear systems that are not close to steady state
- The scale-free property does not hold on small networks
22. Bayesian Networks
- A DAG G = (V, E), where
  - Vertex: a random variable
  - Edge: a conditional distribution for a variable, given its parents in G
- Markov assumption
  - ∀i, I(Xi; non-descendants(Xi) | Pa_G(Xi))
  - E.g., I(X3; X4 | X2), I(X1; X5 | X3)
(Figure: a five-node DAG with edges X1 -> X3, X2 -> X3, X2 -> X4, X3 -> X5.)
- Chain rule: P(X1, X2, ..., Xn) = Π_i P(Xi | Pa_G(Xi)), i = 1..n
  - P(X1, X2, X3, X4, X5) = P(X1) P(X2) P(X3 | X1, X2) P(X4 | X2) P(X5 | X3)
- Learning: argmax_G P(G | D) = P(D | G) P(G) / C
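The chain-rule factorization for the five-node example can be checked numerically; the conditional probability values below are made up purely for illustration:

```python
from itertools import product

# P(Xi = 1 | parents) -- arbitrary illustrative numbers.
p1, p2 = 0.6, 0.3
p3 = lambda x1, x2: 0.9 if (x1 and x2) else 0.2   # P(X3=1 | X1, X2)
p4 = lambda x2: 0.8 if x2 else 0.1                # P(X4=1 | X2)
p5 = lambda x3: 0.5 if x3 else 0.25               # P(X5=1 | X3)

def bern(p, v):
    """P(V = v) for a binary variable with P(V = 1) = p."""
    return p if v else 1.0 - p

def joint(x1, x2, x3, x4, x5):
    """P(X1) P(X2) P(X3 | X1, X2) P(X4 | X2) P(X5 | X3)."""
    return (bern(p1, x1) * bern(p2, x2) * bern(p3(x1, x2), x3)
            * bern(p4(x2), x4) * bern(p5(x3), x5))

total = sum(joint(*v) for v in product((0, 1), repeat=5))  # sums to 1
```

Summing over all 2^5 assignments gives 1, confirming a valid joint distribution; the full joint table would need 2^5 - 1 = 31 free parameters, while this factorization needs only 1 + 1 + 4 + 2 + 2 = 10.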
23. Bayesian Networks (cont'd)
- Equivalence classes of Bayesian networks
  - Same topology, different edge directions
  - Cannot be distinguished from observations
- Causality
  - A Bayesian network does not directly imply causality
  - It can be inferred from observations under certain assumptions, e.g., no hidden common cause
(Figure: networks such as A -> C -> B and A <- C -> B encode the same independence I(A; B | C) and form one equivalence class, summarized by the PDAG A - C - B; a hidden variable C can also act as an unobserved common cause of A and B.)
24. Bayesian Networks for Gene Expression
- Deal with noisy data well; reflect the stochastic nature of gene expression
- Give an indication of causality
- Practical issues
  - Learning is NP-hard
  - Over-fitting
  - Equivalence classes of graphs
- Solutions
  - Heuristic search, sparse candidates
  - Model averaging
  - Learning partial models
(Figure: a network over genes A-E; each local distribution, e.g., P(D | E), can be multinomial or linear.)
- Other variables can be added, such as promoter sequences, experimental conditions, and time
25. Learning Bayesian Nets
- Find G to maximize Score(G : D), where
  - Score(G : D) = Σ_i Score(Xi, Pa_G(Xi) : D)
- Hill-climbing
  - Edge addition, edge removal, edge reversal
- Divide-and-conquer
  - Solve for sub-graphs
- Sparse candidate algorithm
  - Limit the number of candidate parents for each variable (biological implication: sparse graph)
  - Iteratively modify the candidate set
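A bare-bones version of hill-climbing over single-edge operations, with the scoring function left abstract (all names here are mine; real implementations exploit the decomposable score above so that only the changed family is rescored):

```python
from itertools import permutations

def acyclic(nodes, edges):
    """Kahn-style check: repeatedly peel off nodes with no incoming edge."""
    remaining, es = set(nodes), set(edges)
    while remaining:
        free = {n for n in remaining if not any(v == n for _, v in es)}
        if not free:
            return False               # every remaining node sits on a cycle
        remaining -= free
        es = {(u, v) for u, v in es if u in remaining and v in remaining}
    return True

def hill_climb(nodes, data, score):
    """Greedy local search: add/remove/reverse one edge while score improves."""
    graph = frozenset()                # current set of directed edges
    best = score(graph, data)
    improved = True
    while improved:
        improved = False
        for u, v in permutations(nodes, 2):
            candidates = [graph | {(u, v)}, graph - {(u, v)}]
            if (u, v) in graph:        # edge reversal
                candidates.append((graph - {(u, v)}) | {(v, u)})
            for cand in candidates:
                if cand != graph and acyclic(nodes, cand):
                    s = score(cand, data)
                    if s > best:
                        graph, best, improved = cand, s, True
    return graph, best
```

Each move keeps the graph a DAG; the search stops at a local maximum of the score, which is why restarts or the sparse candidate restriction are used in practice.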
26. Partial Models (Features)
- Model averaging
  - Learn many models; common sub-graphs are more likely to be true
  - Confidence measure: # of times a sub-graph appeared
  - Method: bootstrap
- Markov relations
  - A is in B's Markov blanket iff A and B participate in some joint biological interaction
(Figure: Markov relation -- A -> B, or A and B sharing a common child C; order relation -- A -> B, i.e., A is a cause of B.)
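The bootstrap confidence estimate can be sketched generically; `learn_network` below is a stand-in for any structure-learning procedure (all names are mine):

```python
import random

def bootstrap_confidence(data, learn_network, n_boot=200, seed=0):
    """Fraction of bootstrap replicates in which each feature appears.

    data: list of samples; learn_network: callable returning an iterable
    of features (edges, Markov relations, ...) for a data set.
    """
    rng = random.Random(seed)
    counts = {}
    for _ in range(n_boot):
        resample = [rng.choice(data) for _ in data]   # sample with replacement
        for feature in learn_network(resample):
            counts[feature] = counts.get(feature, 0) + 1
    return {f: c / n_boot for f, c in counts.items()}
```

Features that reappear in most of the (e.g., 200) replicates are taken as high-confidence parts of the partial model.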
27. Experimental Results
- Real biological data set: yeast cell cycle data
- 800 genes, 76 experiments, 200-fold bootstrap
- Tested Markov relations for significance and robustness
- More high-scoring features in real data than in randomized data
- Order relations are more robust than Markov relations with respect to local probability models
Friedman et al., J Comput Biol. 2000;7:601-20.
28. Transcriptional Regulatory Networks
- Who regulates whom? When? Where? How?
(Figure: TFs A and B and RNA polymerase acting on the promoters of genes g1-g4, implementing the logic functions "A and not B", "A or B", "not (A and B)", and "A and B".)
PNAS 2003;100(9):5136-41.
29. Data-driven vs. Model-driven Methods
- Data-driven: cluster the gene × condition expression matrix, then run motif finding (MF) on each cluster; descriptive, with biological insights obtained by post-processing
- Model-driven: learn a model that is explanatory and predictive -- a description of a process that could have generated the observed data
30. Data-driven Approaches
(Figure: a genes × experiments matrix is clustered (hierarchical, K-means, ...), then motifs are found in each cluster (MEME, Gibbs sampler, AlignACE, ...).)
- Assumption
  - Co-expressed genes are likely co-regulated -- not necessarily true
- Limitations
  - Clustering is subjective
  - Statistically over-represented but non-functional "junk" motifs
  - Hard to find combinatorial motifs
31. Model-based Approaches
- Intuition: find motifs that are not only statistically over-represented, but also associated with the expression patterns
  - E.g., a motif that appears in many up-regulated genes but very few other genes => likely a real motif
- Model: gene expression = f(TF binding motifs, TF activities)
- Goal: find the function that
  - Can explain the observed data and predict future data
  - Captures the true relationships among motifs, TFs, and the expression of genes
32. Transcription Modeling
- e = f(m1, m2, m3, m4)
(Figure: promoters of genes g1-g8 scored against motif variables m1-m4, with expression values e as gene labels.)
- Assume that gene expression levels under a certain condition are a function of some TF binding motifs in the genes' promoters
33. Different Modeling Approaches
- Many different models, each with its own limitations
- Classification models
  - Decision tree, support vector machine (SVM), naïve Bayes, ...
- Regression models
  - Linear regression, regression tree, ...
- Probabilistic models
  - Bayesian networks, probabilistic Boolean networks, ...
34. Decision Tree
- e = f(m1, m2, m3, m4)
(Figure: a motif matrix over m1-m4 for genes g1-g8; the learned tree first splits on m1 (yes/no), then on m2 and m4, yielding leaves A = {g1, g2, g5}, B = {g3, g6}, C = {g4}, D = {g7, g8}.)
- The tree structure is learned from data
- Only relevant variables (motifs) are used
- Among the many possible trees, the smallest one is preferred
- Advantages
  - Easy to interpret
  - Can represent complex logic relationships
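A toy greedy learner over binary motif features illustrates the idea (a minimal sketch of my own, splitting on misclassification count rather than the usual information gain):

```python
def grow(rows, labels, feats):
    """Grow a tree over binary features.

    rows: list of dicts feature -> 0/1; returns a nested (feat, yes, no)
    tuple, or a plain label at a leaf.
    """
    if len(set(labels)) <= 1 or not feats:
        return max(set(labels), key=labels.count)     # majority label
    def err(f):
        yes = [l for r, l in zip(rows, labels) if r[f]]
        no = [l for r, l in zip(rows, labels) if not r[f]]
        misses = lambda ls: len(ls) - max(map(ls.count, set(ls))) if ls else 0
        return misses(yes) + misses(no)
    f = min(feats, key=err)                           # best single split
    yes_idx = [i for i, r in enumerate(rows) if r[f]]
    no_idx = [i for i, r in enumerate(rows) if not r[f]]
    if not yes_idx or not no_idx:                     # degenerate split
        return max(set(labels), key=labels.count)
    rest = [g for g in feats if g != f]
    return (f,
            grow([rows[i] for i in yes_idx], [labels[i] for i in yes_idx], rest),
            grow([rows[i] for i in no_idx], [labels[i] for i in no_idx], rest))

def predict(tree, row):
    """Follow yes/no branches until a leaf label is reached."""
    while isinstance(tree, tuple):
        f, yes, no = tree
        tree = yes if row[f] else no
    return tree
```

Because the split criterion is evaluated per feature, irrelevant motifs never enter the tree, matching the point above about only relevant variables being used.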
35. A Real Example: Transcriptional Regulation of Yeast Stress Response
- 52 genes up-regulated in heat shock (positive)
- 156 random unresponsive genes (negative)
- 356 known motifs
- The resulting small tree uses only 4 motifs; all 4 are well known to be stress-related, and the RRPE-PAC combination is well known
36. Application to Yeast Cell-cycle Genes
(Figure: the model network from Science 2002;298(5594):799-804, compared with the network produced by our method; Ruan et al., BMC Genomics, 2009.)
37. Regression Tree
- e = f(m1, m2, m3, m4)
(Figure: the same motif matrix for genes g1-g8; the tree splits on m1, then m2 and m4, and each terminal node predicts an expression range: e >= 2, e <= -2, 0 < e < 2, or 0 > e > -2.)
- Similar to a decision tree
- Difference: each terminal node predicts a range of real values instead of a label
38. Multivariate Regression Tree
- Multivariate labels: use multiple experiments simultaneously
- Use motifs to classify genes into co-expressed groups
- Does not need clustering in advance
(Figure: a tree splitting on m1, m2, and m4 partitions genes g1-g8 into groups {g3, g6, g8}, {g1, g2, g5}, {g7}, and {g4}, each with its own expression profile across experiments e1-e5.)
Phuong, T., et al., Bioinformatics, 2004.
39. Modeling with TF Activities
- Gene expression = f(binding motifs, TF activities)
- g = f(tf1, tf2, tf3, tf4)
(Figure: the expression matrix is rotated -- each gene g becomes an example over experiments e1-e5, with the expression levels of TFs tf1-tf4 as features.)
Soinov et al., Genome Biol, 2003.
40. A Decision Tree Model
(Figure: a decision tree model of gene expression, with gene and experiment attributes.)
Segal et al., Nat Genet. 2003;34(2):166-76.
41. Algorithm BDTree
- Gene expression = f(binding motifs, TF activities)
- Ruan & Zhang, Bioinformatics 2006
- Basic idea
  - Iteratively partition an expression matrix by splitting genes or experiments
  - Genes are split according to motif scores
  - Conditions are split according to TF expression levels
  - The algorithm decides the best motifs or TFs to use
42. Transcriptional Regulation of Yeast Stress Response
- 173 experiments under 20 stress conditions
- 1411 differentially expressed genes
- 1200 putative binding motifs
  - A combination of ChIP-chip data, PWMs, and over-represented k-mers (k = 5, 6, 7)
- 466 TFs
43. (Figure: the partitioned genes × experiments expression matrix.)
44. Biological Validation
- Most motifs and TFs selected by the tree are well known to be stress-related
  - E.g., motifs RRPE, PAC, FHL1; TFs Tpk1 and Ppt1
- 42 / 50 blocks are significantly enriched with some Gene Ontology (GO) functional terms
- 45 / 50 blocks are significantly enriched with some experimental conditions
45. Enriched blocks
- RRPE + PAC: ribosome biogenesis (60/94, p < 1e-65)
- RRPE only: ribosome biogenesis (28/99, p < 1e-18)
- FHL1: protein biosynthesis (98/105, p < 1e-87)
- STRE (agggg): carbohydrate metabolism (p < 1e-20)
- PAC: nitrogen metabolism
46. Relationship Between Methods
(Figure: an expression matrix (genes g1-g8 × conditions c1-c5), motif scores m1-m4, and TF levels t1-t4, with four modeling settings labeled A-D.)
- A, C: from promoter to expression
  - A: single condition
  - C: multiple conditions
- B, D: from expression to expression
  - B: single gene
  - D: multiple genes