CS 5263 Bioinformatics - PowerPoint PPT Presentation

Title: CS 5263 Bioinformatics
Description: Reverse-engineering Gene Regulatory Networks
Slides: 47
Provided by: Jianhu8
Learn more at: http://www.cs.utsa.edu

Transcript and Presenter's Notes
1
CS 5263 Bioinformatics
  • Reverse-engineering Gene Regulatory Networks

2
Genes and Proteins
Gene (DNA)
Transcriptional regulation
Transcription (also called expression)
mRNA
mRNA degradation
Translational regulation
Translation
(De)activation
Protein
Post-translational regulation
3
Gene Regulatory Networks
  • Functioning of the cell is controlled by interactions
    between genes and proteins
  • Genetic regulatory network: genes, proteins, and
    their mutual regulatory interactions

(Diagram: three genes linked by activator and repressor interactions.)
4
Reverse-engineering GRNs
  • GRNs are large, complex, and dynamic
  • Reconstruct the network from observed gene
    expression behaviors
  • Experimental methods focus on a few genes only
  • Computer-assisted analysis: large scale
  • Studied since the 1960s, mostly theoretically
  • Attracting much attention since the advent of
    microarray technology
  • Emerging large-scale assay techniques are making it
    even more feasible (ChIP-chip, ChIP-seq, etc.)

5
Problem Statement
  • Assumption: the expression value of a gene depends on
    the expression values of a set of other genes
  • Given: a set of gene expression values under
    different conditions
  • Goal: a function for each gene that predicts its
    expression value from the expression of other genes
  • Probabilistic: Bayesian network
  • Boolean functions: Boolean network
  • Linear functions: linear model
  • Other possibilities, such as decision trees and SVMs

6
Characteristics
  • Gene expression data is often noisy, with missing
    values
  • Only measures mRNA level
  • Many genes are regulated not only at the
    transcriptional level
  • # genes >> # experiments: an underdetermined
    problem!
  • Correlation ≠ causality
  • Good news: network structure is sparse
    (scale-free)

7
Methods for GRN inference
  • Directed and undirected graphs
  • E.g. KEGG, EcoCyc
  • Boolean networks
  • Kauffman (1969), Liang et al (1999), Shmulevich
    et al (2002), Lähdesmäki et al (2003)
  • Bayesian networks
  • Friedman et al (2000), Murphy and Mian (1999),
    Hartemink et al (2002)
  • Linear/non-linear regression models
  • D'Haeseleer et al (1999), Yeung et al (2002)
  • Differential equations
  • Chen, He and Church (1999)
  • Neural networks
  • Weaver, Workman and Stormo (1999)

8
Boolean Networks
  • Genes are either on or off (expressed or not
    expressed)
  • State of gene Xi at time t is a Boolean function
    of the states of some other genes at time t-1

     t-1            t
  X  Y  Z   |   X' Y' Z'
  0  0  0   |   0  0  0
  0  0  1   |   0  0  0
  0  1  0   |   1  0  1
  0  1  1   |   0  0  1
  1  0  0   |   0  1  0
  1  0  1   |   0  1  0
  1  1  0   |   1  1  1
  1  1  1   |   0  1  1

X' = Y AND (NOT Z),  Y' = X,  Z' = Y
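The wiring on this slide can be simulated directly; a minimal Python sketch, assuming the synchronous update described on the next slide:

```python
# Boolean network from this slide: X' = Y AND (NOT Z), Y' = X, Z' = Y,
# updated synchronously.

def step(state):
    """One synchronous update of the network state (X, Y, Z)."""
    x, y, z = state
    return (int(y and not z), x, y)

# Enumerate the truth table: state at time t-1 -> state at time t.
for state in [(x, y, z) for x in (0, 1) for y in (0, 1) for z in (0, 1)]:
    print(state, '->', step(state))
```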
9
Learning Boolean Networks for Gene Expression
  • Assumptions
  • Deterministic (wiring does not change)
  • Synchronized update
  • All Boolean functions are equally probable
  • Data needed: 2^N state transitions for N genes (in
    comparison, N needed for linear models)
  • General technique: limit the # of inputs per
    gene (k). Data required is reduced to O(2^k log N).

10
Learning Boolean Networks
  • Consistency Problem
  • Given: examples S = {<In, Out>}, where
  • In ∈ {0,1}^k, Out ∈ {0,1}
  • Goal: learn a Boolean function f such that for
    every <In, Out> ∈ S, f(In) = Out.
  • Note
  • Given the same input, the output must be unique.
  • For k input variables, there are at most 2^k
    distinct input configurations.
  • Example
  • <001,1> <101,1> <110,1> <010,0> <011,0> <101,0>
  • In decimal: (1,1) (5,1) (6,1) (2,0) (3,0) (5,0)
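The consistency check amounts to verifying that no input appears with two different outputs. A short sketch (the function name is mine), run on this slide's example set, where input 101 occurs with both outputs:

```python
# Consistency problem: does any Boolean function agree with all examples?
# Equivalent to checking that no input appears with two different outputs.

def consistent(examples):
    seen = {}
    for inp, out in examples:
        if seen.setdefault(inp, out) != out:
            return False  # same input, two different outputs: a clash
    return True

# The example set from this slide; input '101' occurs with outputs 1 and 0.
S = [('001', 1), ('101', 1), ('110', 1), ('010', 0), ('011', 0), ('101', 0)]
print(consistent(S))
```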

11
Learning Boolean Networks
<001,1> <101,1> <110,1> <010,0> <011,0> <101,0>

(Figure: a 2^k-entry truth table partially filled in from the examples;
question marks mark entries left undetermined.)

  • No clash -> consistency
  • Question marks -> undetermined entries
  • Checking one candidate function: O(Mk), where M is the
    # of experiments
  • N genes, choose k inputs from N:
    N · C(N, k) · O(Mk) overall
  • Best-fit problem: find a function f with the minimum
    # of errors
  • Limited error-size problem: find all functions
    with error-size within ε_max

Lähdesmäki et al, Machine Learning 2003;52:147-167.
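For tiny k the best-fit problem can be solved by brute force: with k = 3 there are only 2^(2^3) = 256 candidate Boolean functions. A sketch of that exhaustive search (only feasible for small k):

```python
# Best-fit problem: enumerate all 256 Boolean functions of 3 inputs and
# keep one with the fewest errors on the (inconsistent) example set.

examples = [(0b001, 1), (0b101, 1), (0b110, 1),
            (0b010, 0), (0b011, 0), (0b101, 0)]

best_errors, best_table = None, None
for table in range(256):           # bit i of `table` is f evaluated at input i
    errors = sum(((table >> inp) & 1) != out for inp, out in examples)
    if best_errors is None or errors < best_errors:
        best_errors, best_table = errors, table

print(best_errors)  # the clash on input 101 forces at least one error
```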
12
(No Transcript)
13
State space and attractor basins
14
What are some biological interpretations of
basins and attractors?
15
(No Transcript)
16
Linear Models
  • Expression level of a gene at time t depends
    linearly on the expression levels of some genes
    at time t-1
  • Basic model: X_i(t) = Σ_j W_ij X_j(t-1)
  • Continuous version: X'_i(t) = Σ_j A_ij X_j(t), where
    X_j(t) can be measured and X'_i(t) can be estimated
    from X_i(t)
  • In matrix form: X'_{N×M} = A_{N×N} X_{N×M}, where M is
    the number of time points and N is the number of
    genes

(Diagram: genes X1-X3 at time t-1 connected to X1-X3 at time t by
weights W_ij.)
17
Linear Models (contd)
  • X'_{N×M} = A_{N×N} X_{N×M}
  • A_{N×N}: connectivity matrix; A_ij describes the type
    and strength of the influence of the jth gene on
    the ith gene
  • To solve for A, need to solve MN linear equations in
    N² unknowns
  • In general N² >> MN, therefore under-determined
    => infinitely many solutions
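The underdetermined setting can be illustrated with NumPy (toy sizes and random data, invented for illustration; the pseudoinverse gives a minimum-norm particular solution):

```python
# Underdetermined linear model X' = A X: with N genes and M < N time
# points there are N^2 unknowns but only N*M equations, so many A fit
# the data exactly. Toy sizes and random data for illustration.
import numpy as np

rng = np.random.default_rng(0)
N, M = 5, 3                       # N^2 = 25 unknowns, N*M = 15 equations
A_true = rng.normal(size=(N, N))
X = rng.normal(size=(N, M))       # expression levels
Xp = A_true @ X                   # "measured" derivatives X'

A0 = Xp @ np.linalg.pinv(X)       # minimum-norm least-squares solution
print(np.allclose(A0 @ X, Xp))    # A0 reproduces the data exactly...
print(np.allclose(A0, A_true))    # ...but does not recover A_true
```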

18
Get Around The Curse of Dimensionality
  • Non-linear interpolation to increase the # of time
    points
  • Cluster genes to reduce the # of genes
  • Singular Value Decomposition (SVD)
  • A = A0 + C_{N×N} V^T_{N×N}, where c_ij ≠ 0 only if j > M
  • Take A0 as a solution; it is guaranteed to have the
    smallest sum of squares
  • Robust regression
  • Minimize the # of edges in the network
  • Biological networks are sparse (scale-free)

(Diagram: C_{N×N} is zero except in columns j > M.)
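The SVD characterization can be verified numerically; a sketch (toy sizes, random data, invented for illustration) showing that adding C V^T with C nonzero only in the columns beyond M leaves the fit unchanged:

```python
# General solution A = A0 + C V^T: rows j > M of V^T span the null
# space of X^T (zero singular values), so a C with nonzero entries only
# in those columns changes A without changing A X.
import numpy as np

rng = np.random.default_rng(1)
N, M = 5, 3
X = rng.normal(size=(N, M))
Xp = rng.normal(size=(N, N)) @ X

U, w, Vt = np.linalg.svd(X.T)          # X^T = U diag(w) V^T, Vt is N x N
A0 = Xp @ np.linalg.pinv(X)            # particular (smallest-norm) solution

C = np.zeros((N, N))
C[:, M:] = rng.normal(size=(N, N - M)) # columns beyond the M singular values
A = A0 + C @ Vt

print(np.allclose(A @ X, Xp))          # same fit as A0
```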
19
Robust Regression
  • A = A0 + C_{N×N} V^T_{N×N}
  • Minimize the # of non-zero entries in A by
    selecting C
  • Set A = 0; then C V^T = -A0; solve for C
  • Over-determined (N² equations, MN free
    variables)
  • Robust regression
  • Fit a hyper-plane to a set of points, passing
    through as many points as possible

20
Simulation Experiments
(Figure: simulated networks recovered by SVD alone vs. SVD + robust
regression.)
Yeung et al, PNAS 2002;99:6163-8.
21
Simulation Experiments (contd)
(Figure: a nonlinear system close to steady state vs. a linear system.)
  • Does not work for nonlinear systems that are not close
    to steady state
  • The scale-free property does not hold for small
    networks

22
Bayesian Networks
  • A DAG G = (V, E), where
  • Vertex: a random variable
  • Edge: conditional distribution for a variable,
    given its parents in G
  • Markov assumption
  • ∀i, I(X_i; non-descendants(X_i) | Pa_G(X_i))
  • e.g. I(X3; X4 | X2), I(X1; X5 | X3)

(Diagram: X1 → X3, X2 → X3, X2 → X4, X3 → X5.)

Chain rule: P(X1, X2, …, Xn) = Π_i P(X_i | Pa_G(X_i)),
i = 1..n
P(X1, X2, X3, X4, X5) = P(X1) P(X2) P(X3 | X1, X2)
P(X4 | X2) P(X5 | X3)
Learning: argmax_G P(G | D) = P(D | G) P(G) / C
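The chain-rule factorization on this slide can be checked numerically. A small Python sketch for the five-node example (the CPT numbers are made up purely for illustration):

```python
# Chain rule for the 5-node example:
# P(X1..X5) = P(X1) P(X2) P(X3|X1,X2) P(X4|X2) P(X5|X3).
# CPT values below are invented for illustration only.

p_x1 = {0: 0.6, 1: 0.4}
p_x2 = {0: 0.7, 1: 0.3}
p_x3 = {(x1, x2): {0: 0.9 - 0.3 * (x1 + x2), 1: 0.1 + 0.3 * (x1 + x2)}
        for x1 in (0, 1) for x2 in (0, 1)}
p_x4 = {x2: {0: 0.8 - 0.4 * x2, 1: 0.2 + 0.4 * x2} for x2 in (0, 1)}
p_x5 = {x3: {0: 0.5 + 0.3 * x3, 1: 0.5 - 0.3 * x3} for x3 in (0, 1)}

def joint(x1, x2, x3, x4, x5):
    return (p_x1[x1] * p_x2[x2] * p_x3[(x1, x2)][x3]
            * p_x4[x2][x4] * p_x5[x3][x5])

# The factorized joint still sums to 1 over all 2^5 configurations.
total = sum(joint(a, b, c, d, e)
            for a in (0, 1) for b in (0, 1) for c in (0, 1)
            for d in (0, 1) for e in (0, 1))
print(round(total, 10))
```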
23
Bayesian Networks (Contd)
  • Equivalence classes of Bayesian networks
  • Same topology, different edge directions
  • Cannot be distinguished from observation
  • Causality
  • A Bayesian network does not directly imply
    causality
  • Can be inferred from observation under certain
    assumptions
  • e.g., no hidden common cause

(Diagram: A → C → B and A ← C ← B both encode I(A; B | C); the PDAG
A - C - B represents their equivalence class; a hidden variable C with
C → A and C → B yields the same independence.)
24
Bayesian Networks for Gene Expression
  • Deals with noisy data well; reflects the stochastic
    nature of gene expression
  • Indication of causality
  • Practical issues
  • Learning is NP-hard
  • Over-fitting
  • Equivalence classes of graphs
  • Solutions
  • Heuristic search, sparse candidate
  • Model averaging
  • Learning partial models

(Diagram: a small network over genes A-E; P(D | E) is modeled as
multinomial or linear. Other variables can be added, such as promoter
sequences, experimental conditions, and time.)
25
Learning Bayesian Nets
  • Find G to maximize Score(G : D), where
  • Score(G : D) = Σ_i Score(X_i, Pa_G(X_i) : D)
  • Hill-climbing
  • Edge addition, edge removal, edge reversal
  • Divide-and-conquer
  • Solve for sub-graphs
  • Sparse candidate algorithm
  • Limit the number of candidate parents for each
    variable (biological implication: sparse
    graph)
  • Iteratively modify the candidate set
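A minimal sketch of score-based hill-climbing, under stated simplifications: the BIC-style local score and toy data are invented for illustration, and the search is restricted to edge additions under a fixed node order, which sidesteps the acyclicity check (the slide's full neighborhood also removes and reverses edges):

```python
import itertools, math, random

random.seed(0)

# Toy binary data: X2 copies X1 90% of the time; X3 is independent.
data = []
for _ in range(500):
    x1 = random.randint(0, 1)
    x2 = x1 if random.random() < 0.9 else 1 - x1
    x3 = random.randint(0, 1)
    data.append((x1, x2, x3))
N = len(data)

def family_score(child, parent_tuple):
    """Decomposable BIC-style score of one (child, parents) family."""
    counts = {}
    for row in data:
        key = tuple(row[p] for p in parent_tuple)
        cc = counts.setdefault(key, [0, 0])
        cc[row[child]] += 1
    ll = sum(c * math.log(c / (c0 + c1))
             for c0, c1 in counts.values() for c in (c0, c1) if c)
    return ll - 0.5 * math.log(N) * (2 ** len(parent_tuple))

def total_score(parents):
    return sum(family_score(i, ps) for i, ps in parents.items())

# Greedy hill-climbing over edge additions respecting the order 0 < 1 < 2.
parents = {0: (), 1: (), 2: ()}
improved = True
while improved:
    improved = False
    for u in range(3):
        for v in range(u + 1, 3):
            if u in parents[v]:
                continue
            trial = dict(parents)
            trial[v] = tuple(sorted(parents[v] + (u,)))
            if total_score(trial) > total_score(parents):
                parents, improved = trial, True

print(parents)
```

The BIC penalty keeps the spurious X3 edges out; the strong X1-X2 dependence in the toy data is picked up as the edge 0 → 1.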

26
Partial Models (Features)
  • Model Averaging
  • Learn many models; common sub-graphs are more
    likely to be true
  • Confidence measure: # of times a sub-graph
    appeared
  • Method: bootstrap
  • Markov relations
  • A is in B's Markov blanket iff A and B take part in
    some joint biological interaction
    (A → B, A ← B, or A → C ← B)
  • Order relations
  • A is an ancestor of B: A is a cause of B
27
Experimental Results
Markov Relations
  • Real biological data set Yeast cell cycle data
  • 800 genes, 76 experiments, 200-fold bootstrap
  • Test for significance and robustness
  • More high-scoring features in real data than in
    randomized data
  • Order relations are more robust than Markov
    relations with respect to local probability
    models.

Friedman et al, J Comput Biol. 2000;7:601-20.
28
Transcriptional regulatory network
  • Who regulates whom?
  • When?
  • Where?
  • How?

(Diagram: transcription factors A and B regulate genes g1-g4 through
different promoter logic read by RNA-Pol: "A and not B", "A or B",
"not (A and B)", "A and B".)

PNAS 2003;100(9):5136-41
29
Data-driven vs. model-driven methods
(Diagram: data-driven route: gene × condition matrix → clustering →
motif finding (MF) → descriptive output. Model-driven route: learning
a model → post-processing → biological insights.)

Explanatory, predictive: a description of a process that could have
generated the observed data.
30
Data-driven approaches
Clustering of the genes × experiments matrix (hierarchical, K-means, …)
followed by motif finding on each cluster (MEME, Gibbs sampler,
AlignACE, …)
  • Assumption
  • Co-expressed genes are likely co-regulated (not
    necessarily true)
  • Limitations
  • Clustering is subjective
  • Statistically over-represented but non-functional
    "junk" motifs
  • Hard to find combinatorial motifs

31
Model-based approaches
  • Intuition: find motifs that are not only
    statistically over-represented, but are also
    associated with the expression patterns
  • E.g., a motif that appears in many up-regulated genes
    but very few other genes => likely a real motif
  • Model: gene expression = f(TF binding motifs, TF
    activities)
  • Goal: find the function that
  • Can explain the observed data and predict future
    data
  • Captures true relationships among motifs, TFs, and
    expression of genes

32
Transcription modeling
e = f(m1, m2, m3, m4)

(Diagram: promoters of genes g1-g8 with motif occurrences m1-m4 as
variables and expression values as gene labels.)

Assume that gene expression levels under a
certain condition are a function of some TF
binding motifs in their promoters.
33
Different modeling approaches
  • Many different models, each with its own
    limitations
  • Classification models
  • Decision tree, support vector machine (SVM),
    naïve Bayes, …
  • Regression models
  • Linear regression, regression tree,
  • Probabilistic models
  • Bayesian networks, probabilistic Boolean
    networks,

34
Decision tree
e = f(m1, m2, m3, m4)

(Diagram: a decision tree splits genes g1-g8 on motifs m1, m2, and m4
(yes/no at each node); the leaves are A = {1, 2, 5}, B = {3, 6},
C = {4}, D = {7, 8}.)

  • Tree structure is learned from data
  • Only relevant variables (motifs) are used
  • Among many possible trees, the smallest one is
    preferred
  • Advantages
  • Easy to interpret
  • Can represent complex logic relationships
35
A real example: transcriptional regulation of
yeast stress response
  • 52 genes up-regulated in heat-shock (positive)
  • 156 random unresponsive genes (negative)
  • 356 known motifs

The small tree used only 4 motifs. All 4 motifs are
well known to be stress-related; the RRPE-PAC
combination is well known.
36
Application to yeast cell-cycle genes
Model network in Science, 2002;298(5594):799-804.
Network by our method: Ruan et al., BMC Genomics,
2009.
37
Regression tree
e = f(m1, m2, m3, m4)

(Diagram: a regression tree splits genes g1-g8 on motifs m1, m2, and
m4; each leaf predicts an expression range: e ≥ 2, 0 < e < 2,
0 > e > -2, or e ≤ -2.)

  • Similar to a decision tree
  • Difference: each terminal node predicts a range
    of real values instead of a label
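The regression-tree variant scores splits by variance reduction instead of label purity; a toy sketch with invented numbers:

```python
# Regression-tree split criterion: pick the motif whose split most
# reduces the variance of a real-valued expression vector e.
# Toy numbers, analogous to the decision-tree example.

def variance(xs):
    if not xs:
        return 0.0
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

def var_reduction(col, e):
    left = [v for c, v in zip(col, e) if c]
    right = [v for c, v in zip(col, e) if not c]
    n = len(e)
    return variance(e) - (len(left) / n * variance(left)
                          + len(right) / n * variance(right))

motifs = [(1, 0), (1, 1), (0, 0), (0, 1)]  # genes x motifs {m1, m2}
e = [2.1, 1.9, -2.0, -1.8]                 # expression values per gene

reductions = [var_reduction([row[j] for row in motifs], e) for j in range(2)]
print(reductions)  # m1 separates high from low expression, m2 does not
```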

38
Multivariate regression tree
  • Multivariate labels: use multiple experiments
    simultaneously
  • Use motifs to classify genes into co-expressed
    groups
  • Does not need clustering in advance

(Diagram: a multivariate regression tree splits genes g1-g8 on motifs
m1, m2, and m4; the leaves {3, 6, 8}, {1, 2, 5}, {7}, and {4} each
predict an expression profile over experiments e1-e5.)

Phuong, T., et al., Bioinformatics, 2004
39
Modeling with TF activities
  • Gene expression = f(binding motifs, TF
    activities)

g = f(tf1, tf2, tf3, tf4)

(Diagram: rotate the expression matrix so that a gene g's expression
across experiments e1-e5 is predicted from the expression levels of
TFs tf1-tf4.)

Soinov et al., Genome Biol, 2003
40
A Decision Tree Model
Segal et al. Nat Genet. 2003;34(2):166-76.
(Figure: a decision tree model of gene expression over a gene ×
experiment matrix.)
41
Algorithm BDTree
  • Gene expression = f(binding motifs, TF
    activities)
  • Ruan and Zhang, Bioinformatics 2006
  • Basic idea
  • Iteratively partition an expression matrix by
    splitting genes or experiments
  • Genes are split according to motif scores
  • Conditions are split according to TF expression
    levels
  • The algorithm decides the best motifs or TFs to
    use

42
Transcriptional regulation of yeast stress
response
  • 173 experiments under 20 stress conditions
  • 1411 differentially expressed genes
  • 1200 putative binding motifs
  • Combination of ChIP-chip data, PWMs, and
    over-represented k-mers (k = 5, 6, 7)
  • 466 TFs

43
(Figure: the genes × experiments expression matrix partitioned into
blocks by BDTree.)
44
Biological validation
  • Most motifs and TFs selected by the tree are
    well-known to be stress-related
  • E.g., motifs RRPE, PAC, FHL1; TFs Tpk1 and Ppt1
  • 42 / 50 blocks are significantly enriched with
    some Gene Ontology (GO) functional terms
  • 45 / 50 blocks are significantly enriched with
    some experimental conditions

45
  • RRPE + PAC: ribosome biogenesis (60/94, p < e-65)
  • RRPE only: ribosome biogenesis (28/99, p < e-18)
  • FHL1: protein biosynthesis (98/105, p < e-87)
  • STRE (agggg): carbohydrate metabolism (p < e-20)
  • PAC: nitrogen metabolism
46
Relationship between methods
  • A, C: from promoter to expression
  • A: single condition
  • C: multiple conditions
  • B, D: from expression to expression
  • B: single gene
  • D: multiple genes

(Diagram: an expression matrix (genes g1-g8 × conditions c1-c5), a
motif matrix (m1-m4), and TF levels (t1-t4), with arrows A-D marking
the four modeling directions.)
C