Title: A Bayesian Statistical Approach to Modeling Gene Regulatory Pathways in Human Placental Data
1 A Bayesian Statistical Approach to Modeling Gene Regulatory Pathways in Human Placental Data
- Elinor Velasquez
- Dept. of Biology
- San Francisco State University
2Outline of talk
- Introduction
- The experimental approach: Obtaining placenta data
- The experimental approach: Modeling gene regulatory networks
- Results from experiments
- Conclusions and future work
- Acknowledgements
3Introduction
4Overall goal
To use a bioinformatics model to better understand the human placenta
http://www.biotechnologycenter.org/hio/assets/hisimages/placenta/placenta44.jpg
5The human placenta
http://www.uchsc.edu/winnlab/index.html
6The basal plate in the placenta
Site of known anatomical abnormalities in
preeclampsia
http://www.uchsc.edu/winnlab/projects.html
7EGFR pathway
- EGFR, cell surface receptor for epidermal growth factors
- Potentially important gene for the placenta
British Journal of Cancer (2006) 94, 184-188
8EGFR regulates gene expression
Diagram nodes: EGFR, CSPG2, DCN, ANGPT2
9Causal relationships
Diagram nodes: EGFR, CSPG2, DCN, ANGPT2
10Example of a gene regulatory network
Diagram nodes: Gene 1, Gene 2, Gene 3, Gene 4, Gene 5, Gene 6
11Definition of a Bayesian network
- There exist nodes (disks)
- There are edges (arrows) between some of the nodes
- Causality is implied by the edges
- Acyclic
Diagram nodes: Gene 1, Gene 2, Gene 3, Gene 4, Gene 5, Gene 6
12 The experimental approach: Obtaining placenta data
13Data collected from microarrays
- Data comes from 36 experiments conducted by Virginia Winn et al. at the SJ Fisher lab, UCSF
- Gene expression profiling experiments
cRNA hybridization
45,000 dots (25-mer oligo probe sets) representing the human genome
14 Traditional spotted arrays
15What is a probe set?
- Several oligonucleotides designed to hybridize to
various parts of the mRNA generated from a single
gene
Probe set
mRNA
gene
16 Affymetrix GeneChips
17Microarray data
- The normalized log2 intensity values were centered to the median value of each probe set by Virginia Winn et al.
5 time segments: 1, 2, 3, 4, 5
A probe set: x1 ... x6, y1 ... y9, z1 ... z6, w1 ... w6, s1 ... s9 (one group of values per time segment)
36 data points per probe set
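The median centering described on this slide can be sketched in a few lines. This is only an illustration of the arithmetic, not the preprocessing code used by Winn et al.; the function name `median_center` and the example values are hypothetical.

```python
from statistics import median

def median_center(log2_values):
    """Subtract the probe set's median from each of its log2 intensity values."""
    m = median(log2_values)
    return [v - m for v in log2_values]

# Hypothetical log2 intensities for one probe set (not values from the study).
example = [7.0, 8.5, 9.0, 10.0, 12.5]
centered = median_center(example)
print(centered)  # [-2.0, -0.5, 0.0, 1.0, 3.5]
```

After centering, positive values sit above the probe set's median and negative values sit below it, which is the comparison the red/green heat map relies on.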
18Microarray data
- Red denotes up-regulated expression and green denotes down-regulated expression relative to the median value
- Genes differentially expressed in the basal plate of placentas. Each row contains data from a single basal plate cRNA sample, and each column corresponds to a single probe set.
http://www.uchsc.edu/winnlab/index.html
19Summary of data used in bioinformatics experiments
Gene egfr
- 36 placentas
- 45,000 probe sets
- Time-series data from 14-16 weeks to term
20 The experimental approach: Modeling gene regulatory networks
21Outline of bioinformatics experimental design
Diagram nodes: PS 1, PS 2, PS 3, PS 4
- Step 1. Create a naïve Bayesian network using the probe set data
- Step 2. Score the naïve Bayesian network
- Step 3. Randomly add/delete an edge and rescore the Bayesian network
- Step 4. Continue until best score reached
- Step 5. Combine probe sets to create the gene regulatory network
22Four probe sets (Three genes)
23Define naïve Bayesian network
- Choose a root node
- All other nodes branch off of the root node
- PS1 is the parent node
PS 1
PS 2
PS 4
PS 3
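A minimal sketch of this definition, assuming the network is stored as a map from each node to its list of parents. The helper names `naive_network` and `edge_list` are made up for illustration and are not from the talk or from any of the packages it mentions.

```python
def naive_network(root, others):
    """Return a parent map: the root has no parent, every other node has the root as its only parent."""
    parents = {root: []}
    for node in others:
        parents[node] = [root]
    return parents

def edge_list(parents):
    """List the (parent, child) edges implied by a parent map."""
    return sorted((p, child) for child, ps in parents.items() for p in ps)

net = naive_network("PS1", ["PS2", "PS3", "PS4"])
print(edge_list(net))  # [('PS1', 'PS2'), ('PS1', 'PS3'), ('PS1', 'PS4')]
```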
24Step 1 Create a naïve Bayesian network using
probe set data
PS1
PS3
PS4
PS2
- Use data from one time segment
- Choose Weeks 23-24 data (6 placentas)
- Choose 4 probe sets
25Placenta data for Weeks 23-24
- PS1 corresponds to 201984, which corresponds to EGFR
- PS2 corresponds to 236034, which corresponds to ANGPT2
- PS3 corresponds to 211148, which corresponds to ANGPT2
- PS4 corresponds to 204620, which corresponds to CSPG2
26Step 2 Score the naïve Bayesian network
- We want to score this network
PS1
PS4
PS3
PS2
27The network score is a function of conditional
probabilities
- Conditional probability, Prob(N | Pa(N)), is the probability of child node N given the parent of N
- Example: Given that the parent node PS1 has an expression value of 10, what is the probability that its child node, PS4, has an expression value of 8?
PS1
PS4
28Conditional probability
PS1
- EGFR (PS1) is the parent node and has value 10.
- CSPG2 (PS4) is the child node and has value 8 two times.
- Conditional probability: 2/6
PS4
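The counting behind the 2/6 can be sketched as follows. The slides do not say how expression values were discretized, so the six value pairs below are hypothetical, chosen so that the parent equals 10 in all six placentas and the child equals 8 in two of them. The function name is also an assumption.

```python
from fractions import Fraction

def conditional_probability(parent_values, child_values, parent_value, child_value):
    """Estimate P(child = child_value | parent = parent_value) by counting samples."""
    matches = [c for p, c in zip(parent_values, child_values) if p == parent_value]
    if not matches:
        raise ValueError("parent value never observed")
    return Fraction(matches.count(child_value), len(matches))

# Hypothetical discretized expression values for six placentas.
ps1 = [10, 10, 10, 10, 10, 10]   # EGFR (parent)
ps4 = [8, 7, 8, 9, 6, 5]         # CSPG2 (child)
p = conditional_probability(ps1, ps4, 10, 8)
print(p)  # 1/3, i.e. 2/6 in lowest terms
```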
29Score for a Bayesian network
- The score of the naïve network equals the product of all the nonzero conditional probabilities associated with the network:
- $P(N_1, N_2, N_3, N_4) = \prod_{i=1}^{4} P(N_i \mid \mathrm{pa}(N_i))$
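As a rough sketch of this scoring rule, the product over nonzero conditional probabilities might be computed as below. The input probabilities are hypothetical and are not the values behind the scores reported in the talk.

```python
from fractions import Fraction
from math import prod

def network_score(conditional_probabilities):
    """Multiply the nonzero conditional probabilities, skipping any zeros."""
    nonzero = [p for p in conditional_probabilities if p != 0]
    return prod(nonzero, start=Fraction(1))

# Hypothetical conditional probabilities for a small network.
score = network_score([Fraction(1, 2), Fraction(1, 3), 0, Fraction(1, 6)])
print(score)  # 1/36
```

Skipping zeros follows the wording on the slide. Without that rule a single unobserved parent/child combination would force the whole score to zero.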
30Score for the naïve Bayesian network
- $P(N_1, N_2, N_3, N_4) = 1/39366 \approx 2.54 \times 10^{-5}$
PS1
PS4
PS2
PS3
31Step 3 Randomly add/delete an edge and rescore
the Bayesian network
PS1
- The score becomes $1/78732 \approx 1.27 \times 10^{-5}$.
PS2
PS3
PS4
32Step 4. Continue until best score reached
- Since the score is a probability, we want the score to be high.
- The naïve network is the better choice between the two networks, so we pick it as our final network.
PS1
PS4
PS2
PS3
33Step 5. Combine probe sets to create the gene
regulatory network
EGFR
CSPG2
ANGPT2
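One way Step 5 could work is sketched below, using the probe set to gene mapping from slide 25. The slides do not state the merging rule, so this assumes that probe set edges are relabeled by gene, duplicates are merged, and edges within a single gene are dropped. The function name `gene_edges` is hypothetical.

```python
# Probe set to gene mapping as given on slide 25.
PROBE_TO_GENE = {"PS1": "EGFR", "PS2": "ANGPT2", "PS3": "ANGPT2", "PS4": "CSPG2"}

def gene_edges(probe_edges, probe_to_gene):
    """Relabel probe set edges by gene, merging duplicates and dropping same-gene edges."""
    merged = set()
    for parent, child in probe_edges:
        g_parent, g_child = probe_to_gene[parent], probe_to_gene[child]
        if g_parent != g_child:
            merged.add((g_parent, g_child))
    return sorted(merged)

naive = [("PS1", "PS2"), ("PS1", "PS3"), ("PS1", "PS4")]
print(gene_edges(naive, PROBE_TO_GENE))  # [('EGFR', 'ANGPT2'), ('EGFR', 'CSPG2')]
```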
34 40 probe sets (26 genes)
35 Gene regulatory pathway for 26 genes
- Step 1. Create a naïve Bayesian network using 40 probe sets for each time segment
- Step 2. Score the naïve Bayesian network
- Step 3. Randomly add/delete an edge and rescore the Bayesian network
- Step 4. Continue until best score reached
- Step 5. Combine probe sets to create the gene regulatory network for the placenta
36Step 1. Create a naïve Bayesian network using
40 probe sets for each time segment
37Create a naïve Bayesian network
Diagram nodes: PS 1, PS 2, PS 3, PS 4, PS 5, PS 6, PS 7, PS 8, PS 9
38 Step 2. Score the naïve Bayesian network
39Score for a Bayesian network
- The score of the naïve network equals the product of all the nonzero conditional probabilities associated with the network:
- $P(N_1, \ldots, N_{40}) = \prod_{i=1}^{40} P(N_i \mid \mathrm{pa}(N_i))$
40
- Step 3. Randomly add/delete an edge and rescore the Bayesian network
- Step 4. Continue until best score reached
41With four probe sets, at least two Bayesian
networks were constructed
Diagram: two networks, each over the nodes PS1, PS2, PS3, PS4
42Exhaustive search
- To be certain that we have the best scoring network, we need to construct all possible networks from our naïve networks
- With four probe sets, we constructed only one network other than the naïve network
- How to construct all possible networks?
43How do we construct all possible networks?
- 1 probe set: 1 Bayesian network
- 2 probe sets: 2 possible Bayesian networks
- 3 probe sets: 12 possible Bayesian networks
- 4 probe sets: 144 possible Bayesian networks
- 5 probe sets: 4800 possible Bayesian networks!
- 6 probe sets: ??
- And so on
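The counts on this slide follow a convention that the transcript does not spell out. As a point of comparison (an addition here, not the talk's own calculation), counting every directed acyclic graph on n labeled nodes with Robinson's recurrence gives 1, 3, 25, 543, 29281 for n = 1 to 5. The totals differ from the slide's, but the growth is just as rapid.

```python
from functools import lru_cache
from math import comb

@lru_cache(maxsize=None)
def count_dags(n):
    """Number of directed acyclic graphs on n labeled nodes (Robinson's recurrence)."""
    if n == 0:
        return 1
    return sum(
        (-1) ** (k + 1) * comb(n, k) * 2 ** (k * (n - k)) * count_dags(n - k)
        for k in range(1, n + 1)
    )

for n in range(1, 7):
    print(n, count_dags(n))
```

Under this convention six nodes already allow 3,781,503 networks, which is why the search has to be heuristic rather than exhaustive for 40 probe sets.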
44Welcome to Modern Heuristics
- Step 1. Representation of a model
- Step 2. The scoring function
- Step 3. Defining the search problem
- Step 4. Consider local optima
score
local change
45Step 1 Representation of the model
- The model is a gene regulatory pathway.
- We are going to assume a Bayesian model for our probe sets
- The number of possible pathways is so large as to forbid an exhaustive search for the best Bayesian network.
PS 1
PS 2
PS 4
PS 3
46Step 2 The scoring function
- The fair coin: $p(X = \text{heads}) = 1/2$
- What happens if the coin is unfairly weighted?
- We need to re-think probability
- $p(X) = \int p(x)\, r(x)\, dx$
- $r(x)$ is a weight function.
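To make the weighted integral concrete, here is a small numerical sketch for the coin. It assumes the weight function r(x) is a Beta density over the coin's bias x, with p(x) = x the probability of heads for that bias. The Beta choice and its parameters `a` and `b` are illustrative assumptions, picked because the Beta is the two-outcome case of the Dirichlet distribution.

```python
from math import gamma

def beta_density(x, a, b):
    """Beta(a, b) density, used here as the weight function r(x)."""
    return gamma(a + b) / (gamma(a) * gamma(b)) * x ** (a - 1) * (1 - x) ** (b - 1)

def prob_heads(a, b, steps=10_000):
    """Midpoint-rule approximation of the integral of x * r(x) over [0, 1]."""
    h = 1.0 / steps
    total = 0.0
    for i in range(steps):
        x = (i + 0.5) * h
        total += x * beta_density(x, a, b)
    return total * h

print(round(prob_heads(2, 2), 4))  # symmetric weight: 0.5
print(round(prob_heads(3, 1), 4))  # weight tilted toward heads: 0.75
```

For a Beta weight the integral has the closed form a / (a + b), which the two printed values match.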
47Step 2. The scoring function
- The scoring function is a probability
- Assume the network has a Dirichlet distribution
which is the weight function used to weight the
conditional probabilities.
www.wikipedia.com
48Step 2. The scoring function
- Probability of a fixed network equals the product of conditional probabilities times the Dirichlet distribution:
$P(N) = \prod_{i=1}^{40} P(N_i \mid \mathrm{pa}(N_i))\, D(N_i)$
such that
$D(N_i) \propto \theta_i^{\alpha_i - 1}(N_i)$
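For reference, the Dirichlet density that serves as the weight can be evaluated as below. This is a generic sketch of the standard Dirichlet log-density, not the scoring code of Weka or any other package named in the talk. The parameter vector `alpha` and the probability vector `theta` are hypothetical inputs.

```python
from math import lgamma, log

def dirichlet_log_density(theta, alpha):
    """Log of the Dirichlet(alpha) density at the probability vector theta."""
    if len(theta) != len(alpha):
        raise ValueError("theta and alpha must have the same length")
    if min(theta) <= 0 or abs(sum(theta) - 1.0) > 1e-9:
        raise ValueError("theta must be positive and sum to 1")
    log_norm = lgamma(sum(alpha)) - sum(lgamma(a) for a in alpha)
    return log_norm + sum((a - 1) * log(t) for a, t in zip(alpha, theta))

# With alpha = (1, 1, 1) the density is flat and equals 2 everywhere on the simplex.
print(dirichlet_log_density([0.2, 0.3, 0.5], [1, 1, 1]))
```

Working in logs keeps the computation stable when many small factors are multiplied, as happens with 40 probe sets.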
49Step 3 Defining the search problem
- What it means to search
- a. Construct a first network (use a naïve Bayesian network)
- b. Score the first network using the scoring function
- c. Perform the hill-climbing algorithm.
50 Step 3. Defining the search problem: The Hill-climbing Algorithm
- Randomly choose a node
- Search in the neighborhood of that node for the
best scoring network
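The two bullets above can be sketched as a small search loop. This is an illustrative sketch under several assumptions: a "neighborhood" is taken to mean all acyclic networks that differ by one edge touching the chosen node, and `toy_score` with its `TARGET` edge set is a made-up stand-in for the real scoring function.

```python
import random

def is_acyclic(nodes, edges):
    """Kahn's algorithm: True if the directed graph has no cycle."""
    indegree = {n: 0 for n in nodes}
    for _, child in edges:
        indegree[child] += 1
    ready = [n for n in nodes if indegree[n] == 0]
    seen = 0
    while ready:
        node = ready.pop()
        seen += 1
        for parent, child in edges:
            if parent == node:
                indegree[child] -= 1
                if indegree[child] == 0:
                    ready.append(child)
    return seen == len(nodes)

def neighbors_at(node, nodes, edges):
    """Acyclic networks that differ from `edges` by one edge touching `node`."""
    for other in nodes:
        if other == node:
            continue
        for edge in ((node, other), (other, node)):
            candidate = edges ^ {edge}  # delete the edge if present, otherwise add it
            if is_acyclic(nodes, candidate):
                yield candidate

def hill_climb(nodes, start_edges, score, rng, rounds=200):
    """Repeatedly pick a random node and move to its best neighbor if the score improves."""
    current = frozenset(start_edges)
    for _ in range(rounds):
        node = rng.choice(nodes)
        best = max(neighbors_at(node, nodes, current), key=score, default=current)
        if score(best) > score(current):
            current = best
    return current

# Hypothetical target structure; the toy score counts edge mismatches against it.
TARGET = frozenset({("PS1", "PS2"), ("PS2", "PS3"), ("PS1", "PS4")})

def toy_score(edges):
    return -len(edges ^ TARGET)

nodes = ["PS1", "PS2", "PS3", "PS4"]
naive = {("PS1", "PS2"), ("PS1", "PS3"), ("PS1", "PS4")}
result = hill_climb(nodes, naive, toy_score, random.Random(0))
print(sorted(result), toy_score(result))
```

The toy score has no local maxima, so this run ends at the target. A real network score can trap the search, which is the concern Step 4 addresses.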
51Step 4. Consider local optima
- Hill-climbing is a traditional search method
- Can get caught on local maxima
- Step 4 is to keep choosing random nodes.
Plot labels: score, local change; the randomly chosen node is the origin
From http://content.answers.com/
52Software
- Weka software package, written by members of the University of Waikato, New Zealand, http://www.cs.waikato.ac.nz/ml/people.html
- DEAL, R package, written by Susanne G. Bøttcher and Claus Dethlefsen, http://www.math.auc.dk/novo/deal
- BayesNet Toolbox, Matlab package, written by Kevin Murphy, http://www.cs.ubc.ca/murphyk/Software/BNT/bnt.html
- ExpressionNet, written by Jingchun Zhu, http://expressionnet.sourceforge.net/
53Results from experiments
54 26 genes
55 Ingenuity network
Genes: COL5A2, COL5A1, COL3A1, DCN, COL1A2, CSPG2, SPP1, INHBA, ANGPT2, BAMBI, SFRP1, P4HA1, IGFBP1, RAP2B, PLAU, CCNG2, MRC2, GLB1, ATP5E, ADAM9, EGFR, ERG, USP6NL, PECAM1, IL2RB, CECAM1, CYP19A1
56 Results for 26 genes
- 40 probe sets (26 genes)
- Data comes from five different time intervals
- 1. 14-16 gestational weeks
- 2. 18-19 gestational weeks
- 3. 21 gestational weeks
- 4. 23-24 gestational weeks
- 5. 37-40 gestational weeks
57 Time segment: 14-16 weeks
Network diagram over the genes listed on slide 55
58 Time segment: 18-19 weeks
Network diagram over the genes listed on slide 55
59 Time segment: 21 weeks
Network diagram over the genes listed on slide 55
60 Time segment: 23-24 weeks
Network diagram over the genes listed on slide 55
61 Time segment: 37-40 weeks
Network diagram over the genes listed on slide 55
62How to display data
- One of the most pressing questions in bioinformatics research is how to display the data effectively
- We have two solutions
- 1. An interaction map
- 2. Geometrical considerations
63An interaction map for 26 genes
64Geometrical considerations
- We illustrate with the gene egfr
- egfr encodes the epidermal growth factor receptor
- Functions on the cell surface
- Activated by binding of its specific ligands
- Responsible for many pathways in animal models
65Gene egfr regulated by
66 Genes on a dodecahedron: Gene regulatory network for egfr
Shown on front: CSPG2, CCNG2, COL1A2, PLAU, INHBA, DCN
On backside: PECAM1, ANGPT2, IGFBP1, MRC2, SPP1, USP6NL
Adapted from http://www.math.cornell.edu/mec/2003-2004/geometry/platonic/dodecahedron.jpg
67Conclusions
- We can predict gene regulatory networks using Bayesian networks as an intermediate step
- When we leave arrows in the network, we are able to show causal relationships between the genes
- Interaction maps and use of geometry are novel ways to display gene behavior
68Future Directions
- A three-dimensional viewer with numerical values will be implemented to use with the Weka software
- Use molecular genetics techniques to validate a portion of the results
- Design a genetic programming algorithm (evolutionary algorithm) to create a Bayesian network
69Acknowledgements
- San Francisco State University
- Leticia Márquez-Magaña, Chris Smith, Frank Bayliss, Juan Castellon, Ernesto Flores, Rebecca Garcia, Alba Gutierrez, Jainee Lewis, Rebecca Mendez, Cylyn Cruz, Jasmin Reyes, Jackie Robinson, Peter Thorsen, My family
- UC San Francisco
- Susan Fisher, Matthew Gormley
- M.B.R.S.-R.I.S.E. Grant 5 - R25-GM59298