Title: Inference of Transcriptional Regulation Network with Gene Expression Data
1Inference of Transcriptional Regulation Network
with Gene Expression Data
3Role of Proteins
- Both functional and structural
- Main agents of cellular functions
- Each protein has a specific function
- The amount of each protein in the cell must be
controlled carefully
- Elaborate Regulatory Network
4Gene Regulatory Network
- Fundamental mechanism by which protein production
and cellular functions are controlled
- Complex input-output system made of proteins and
genes for controlling cellular functions
- Key to understanding many important problems,
including medical ones
5Cell Cycle
- After a certain amount of growth, the cell divides
into two identical cells
- Need to duplicate cellular components and divide
them equally between the daughter cells
- Different regulators act in concert at different
parts and stages to control the cell cycle
6Types of Regulation
- Activation
- Increase in protein A leads to increase in gene
B's transcription
- Inhibition
- Increase in protein A leads to decrease in gene
B's transcription
- Not a simple binary relationship
- Many genes could act on a particular gene at once
- Complexes
- Feedback and Self-Regulation
7Example of Regulatory Network
S phase control in yeast
9Microarray
- Each spot contains a specific probe designed for
a single cDNA
- When more cDNA binds to a spot, the red intensity
increases
- Allows large-scale study of gene expression
10Which Genes Are Related?
- Goal: find out which pairs of genes have a
direct regulatory relationship
11Correlation Method
- Standard correlation coefficient
- Widely used method for sequence similarity
comparisons
- Tests for degree of linear relationship between
two variables
- Cannot take into account the time delay involved
in gene regulation
- Strongly favours global over local similarities
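
The time-delay weakness of the correlation method can be illustrated with a small sketch. The profiles below are hypothetical toy data (a sine wave and a copy delayed by 2 time points), not values from any real microarray; the `pearson` helper is an illustrative implementation of the standard correlation coefficient.

```python
import math

def pearson(x, y):
    """Standard (Pearson) correlation coefficient of two equal-length profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical toy profiles: a periodic signal and a copy delayed by 2 time points
regulator = [math.sin(2 * math.pi * t / 8) for t in range(16)]
target = [math.sin(2 * math.pi * (t - 2) / 8) for t in range(16)]

r_no_delay = pearson(regulator, regulator)  # identical profiles
r_delayed = pearson(regulator, target)      # same shape, shifted in time
```

In this toy case the delay equals a quarter of the period, so the coefficient falls to roughly zero even though the target is an exact delayed copy of the regulator.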
12Edge Detection Method (1)
- By Filkov et al.
- Focus on improving local similarity detection
- Scan through gene expression curves and determine
where major edges occur, and remove spurious
edges
- Construct primary edges using local minima and
maxima
- Filter out those edges whose height does not meet
the pre-determined threshold
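
A minimal sketch of the primary-edge step described above, assuming an edge is the segment between two consecutive turning points of the profile. This is not Filkov et al.'s implementation; the function name `primary_edges` and the parameter `min_height` are hypothetical, and the later grouping of edges is not covered.

```python
def primary_edges(profile, min_height):
    """Return (start, end, direction) for each edge at least min_height tall."""
    # Turning points: both endpoints plus every interior local minimum/maximum.
    # Flat stretches (zero slope on one side) are not treated as turning points.
    turning = [0]
    for i in range(1, len(profile) - 1):
        left = profile[i] - profile[i - 1]
        right = profile[i + 1] - profile[i]
        if left * right < 0:
            turning.append(i)
    turning.append(len(profile) - 1)

    edges = []
    for start, end in zip(turning, turning[1:]):
        height = profile[end] - profile[start]
        if abs(height) >= min_height:  # filter out spurious (small) edges
            edges.append((start, end, 1 if height > 0 else -1))
    return edges

# Hypothetical profile with a small dip at the top that should be filtered out
example = primary_edges([0, 1, 3, 2.8, 3, 1, 0], min_height=1.0)
```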
13Edge Detection Method (2)
- Group those edges with similar direction
- Now left with edges depicting the major features
only
- Compare the edge profiles between two genes by
summing up closely located edges from two genes
with the same direction
14Edge Detection Method (3)
- Scoring Formula
- d: agreement of slopes of edges (-1 or 1)
- n: number of edges
- a, b: two genes being compared
- ?: gap between edges
- ?max: maximum allowable time difference between
two edges
15Edge Detection Method (4)
- Does not differentiate between the directions of
regulation
- Cannot be used to find inhibitory relationships
- Allows for negative time delays between two
corresponding edges on the basis that there is
not enough data resolution
- Detects strong local matches only
16Bayesian Networks
- Consists of two parts
- Directed Acyclic Graph (Structure of GRN)
- Set of parameters for the DAG (Statistical
Hypothesis)
- DAG represents the causal relations among a set
of random variables (gene expression levels)
- X causes Y if and only if there is a direct edge
from X to Y
17Bayesian Networks (2)
- Must learn the network using observed data
- Perform a series of conditional independence
tests and construct the most likely set of DAGs
based on the results
- Assign a score to each DAG based on the sample
data, and search for the highest scoring one
18Bayesian Networks (3)
- Need large sample size for accuracy
- Representing Time
- Increases the number of variables dramatically,
if one is to represent the time in the Bayesian
network
- Dynamic Bayesian Network
- High complexity
19Event Method
- Need a method that balances global and local
similarity
- Need to make use of temporal evidence
- Need to account for directionality of regulation
- Need to be computationally efficient
20Hypotheses on Regulation
- Hypothesis 1: A activates B
- Rise in expression of A followed by rise in
expression of B
- Fall in expression of A followed by fall in
expression of B
- Hypothesis 2: A inhibits B
- Rise in A followed by fall in B
- Fall in A followed by rise in B
- Time delay between 2 corresponding events
21Events
- Directional changes in expression profile
- State of gene expression at an instant
- 3 possible states
- Rise, Constant, Fall (R, C, F)
- Event state/type determined by the slope of the
expression profile
22Event Conversion
- Microarray data is quite noisy
- Perform smoothing to reduce noise before
calculating slopes
- Select the flat region around slope of 0
- Classify into R, C, F based on the slope values
- Any value falling in the flat region → C
- Result: 2 event strings
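
The conversion steps above can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: smoothing is taken to be a simple moving average, the slope is the difference between consecutive smoothed points, and the names `smooth`, `to_events`, `flat`, and `window` are all hypothetical.

```python
def smooth(values, window=3):
    """Moving-average smoothing; the window shrinks at the ends of the profile."""
    half = window // 2
    out = []
    for i in range(len(values)):
        lo, hi = max(0, i - half), min(len(values), i + half + 1)
        out.append(sum(values[lo:hi]) / (hi - lo))
    return out

def to_events(values, flat=0.1, window=3):
    """Convert an expression profile into an event string over R, C, F."""
    smoothed = smooth(values, window)
    events = []
    for prev, cur in zip(smoothed, smoothed[1:]):
        slope = cur - prev
        if slope > flat:
            events.append("R")   # rise
        elif slope < -flat:
            events.append("F")   # fall
        else:
            events.append("C")   # slope lies in the flat region around 0
    return "".join(events)

# Hypothetical profile: rises, stays level, then falls
raw_events = to_events([0, 1, 2, 2, 2, 1, 0], flat=0.1, window=1)
smoothed_events = to_events([0, 1, 2, 2, 2, 1, 0], flat=0.1, window=3)
```

With `window=1` no smoothing is applied, so the level stretch shows up as C events; with `window=3` the moving average blurs the short plateau into the neighbouring rise and fall, which shows how the smoothing and flat-region choices interact.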
23Event String Alignment
- Need to best match 2 event strings with noise and
time delay in mind
- Use the Needleman-Wunsch global sequence alignment
algorithm
- Handling of time delay
- Events that do not occur at the same time may
still be related to each other
- No negative time delay
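
For reference, a minimal sketch of standard Needleman-Wunsch global alignment applied to two event strings. It deliberately leaves out the time-delay handling described above, and the scoring function `simple_score` and the gap penalty are placeholder assumptions rather than the event method's actual scoring.

```python
def needleman_wunsch(a, b, score, gap=-0.5):
    """Return the best global alignment score of sequences a and b."""
    rows, cols = len(a) + 1, len(b) + 1
    dp = [[0.0] * cols for _ in range(rows)]
    for i in range(1, rows):            # a prefix aligned against gaps only
        dp[i][0] = dp[i - 1][0] + gap
    for j in range(1, cols):            # b prefix aligned against gaps only
        dp[0][j] = dp[0][j - 1] + gap
    for i in range(1, rows):
        for j in range(1, cols):
            dp[i][j] = max(
                dp[i - 1][j - 1] + score(a[i - 1], b[j - 1]),  # align the pair
                dp[i - 1][j] + gap,                            # gap in b
                dp[i][j - 1] + gap,                            # gap in a
            )
    return dp[-1][-1]

def simple_score(x, y):
    """Placeholder scoring: C is neutral, equal events match, R-F mismatches."""
    if x == "C" or y == "C":
        return 0.0
    return 1.0 if x == y else -1.0

same = needleman_wunsch("RRCFF", "RRCFF", simple_score)
shifted = needleman_wunsch("RRCFF", "CRRCF", simple_score)
```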
24Scoring Matrix (1)
- Scoring Method for Event Method
      R          C    F
R     S(dT)      0    -βS(dT)
C     0          0    0
F     -βS(dT)    0    αS(dT)
- 0 < S(dT) ≤ 1, 0 ≤ α ≤ 1, 0 ≤ β ≤ 1
- dT: time delay between two events
- If dT < 0, match penalty = ∞
25Scoring Matrix (2)
- R-R matches weighted more than F-F matches
- Decreases in mRNA levels less indicative
- Any match with C assigned neutral score of 0
- C: region of uncertainty
- Could be due to any number of reasons
- Penalty for R-F matches
- Scores are a function of time delay dT
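
The scoring matrix can be written as a small function. The slides do not give the form of S(dT), so the exponential decay used here, along with the names `event_score` and `decay`, is purely an assumption for illustration; the default weights are example values only.

```python
import math

def event_score(x, y, dt, alpha=0.7, beta=0.3, decay=0.5):
    """Score a pair of events x, y (each 'R', 'C' or 'F') separated by delay dt."""
    if dt < 0:
        return -math.inf             # negative time delay: infinite penalty
    if x == "C" or y == "C":
        return 0.0                   # any match with C is neutral
    s = math.exp(-decay * dt)        # ASSUMED S(dT): equals 1 at dT = 0, decays toward 0
    if x == "R" and y == "R":
        return s                     # R-R: full weight
    if x == "F" and y == "F":
        return alpha * s             # F-F: weighted less than R-R
    return -beta * s                 # R-F or F-R: penalty

rr = event_score("R", "R", 0)
ff = event_score("F", "F", 0)
rf = event_score("R", "F", 0)
```

Any other S(dT) that stays within (0, 1] and decreases with the delay could be substituted without changing the structure of the function.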
26Example
27Event vs. Correlation
- Event scores high, but correlation scores low
- Time delay lowers the correlation coefficient
28Event vs. Edge Detection
- Event scores high, edge detection scores low
- Bolded edges: what edge detection finds
- Only edges A and B are close enough to be added
to the score
29Spellman's Data Sets
- Snapshots of yeast cellular mRNA levels at
regular time intervals using cDNA microarrays
- 4 separate data sets based on different cell
arresting methods used
- α-arrest, elutriation, CDC15, CDC28 temp.
sensitive mutants
- Yeast genome: 6200 genes
- Too many: need to reduce search space
30Selecting Genes to Study
- Want to restrict to genes related to cell-cycle
regulation
- Filkov et al. searched for known transcriptional
regulation pairs in Yeast Proteome Database
- 888 transcriptional regulations
- 486 genes
- 647 activations, 241 inhibitions
31Pre-Processing Data
- Microarray data by Spellman contains many missing
points
- Experimental errors
- Use linear interpolation to fill in for the
missing points
- If the ratio of the missing points to valid
points is greater than the threshold, ignore the
gene data in question
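
A minimal sketch of this pre-processing step, assuming missing points are marked as `None`. The function name `fill_missing`, the parameter `max_missing_ratio`, and the choice to copy the nearest valid value when a gap sits at either end of the profile are assumptions, not details given in the slides.

```python
def fill_missing(values, max_missing_ratio=0.5):
    """Linearly interpolate None entries; return None if too much is missing."""
    valid = [i for i, v in enumerate(values) if v is not None]
    missing = len(values) - len(valid)
    if not valid or missing / len(valid) > max_missing_ratio:
        return None  # ignore this gene: too many missing points

    filled = list(values)
    for i, v in enumerate(values):
        if v is not None:
            continue
        left = max((j for j in valid if j < i), default=None)
        right = min((j for j in valid if j > i), default=None)
        if left is None:            # gap at the start (assumed handling)
            filled[i] = values[right]
        elif right is None:         # gap at the end (assumed handling)
            filled[i] = values[left]
        else:                       # interior gap: linear interpolation
            weight = (i - left) / (right - left)
            filled[i] = values[left] + weight * (values[right] - values[left])
    return filled

repaired = fill_missing([1.0, None, 3.0])
rejected = fill_missing([None, None, 1.0])
```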
32Analysis of the Test Set (1)
- α and CDC28 data sets analyzed
Data Set   ORFs   Genes
α          4489   348
CDC28      6103   458
- Need to compare each gene with all the others
- >120,000 comparisons for alpha
- >200,000 comparisons for CDC28
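
The comparison counts follow from the gene counts in the table if each comparison is taken to be an ordered pair (A regulating B counted separately from B regulating A). That reading is an assumption, but it reproduces the figures quoted above; the helper names are illustrative.

```python
def ordered_pairs(n_genes):
    """Number of directed comparisons: every gene against every other gene."""
    return n_genes * (n_genes - 1)

def unordered_pairs(n_genes):
    """Number of comparisons when direction is ignored."""
    return n_genes * (n_genes - 1) // 2

alpha_directed = ordered_pairs(348)    # 120,756
cdc28_directed = ordered_pairs(458)    # 209,306
alpha_undirected = unordered_pairs(348)  # 60,378
cdc28_undirected = unordered_pairs(458)  # 104,653
```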
33Analysis of the Test Set (2)
- Correlation and edge detection methods: no
directionality of regulation
- Only ½ as many comparisons as the event method
- To make comparison possible, remove the
directionality aspect from the event method as
well
34Analysis Results (1)
- Overlapping results among 3 methods (all results)
Methods               Alpha   CDC28
Event & Correlation   3367    2916
Event & Edge          2081    3362
Correlation & Edge    1989    2252
- α = 0.7, -β = 0.3 used for scoring matrix
- Top-10,000 rankings
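
An overlap count of this kind can be computed by intersecting the top-k pairs of two rankings. The sketch below assumes each method's results are held in a dictionary mapping a gene pair to its score; the function names and the toy scores are hypothetical.

```python
def top_k_pairs(scores, k):
    """Return the set of the k highest-scoring gene pairs."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return set(ranked[:k])

def overlap(scores_a, scores_b, k):
    """Count gene pairs that appear in the top k of both methods."""
    return len(top_k_pairs(scores_a, k) & top_k_pairs(scores_b, k))

# Hypothetical toy scores for three gene pairs under two methods
event_scores = {("g1", "g2"): 9.0, ("g1", "g3"): 5.0, ("g2", "g3"): 1.0}
corr_scores = {("g1", "g2"): 0.9, ("g2", "g3"): 0.8, ("g1", "g3"): 0.1}

shared_top2 = overlap(event_scores, corr_scores, k=2)
```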
35Analysis Results (2)
- Overlapping results among 3 methods (true
positive results only)
Methods               Alpha   CDC28
Event & Correlation   11      9
Event & Edge          0       0
Correlation & Edge    0       0
- α = 0.7, -β = 0.3 used for scoring matrix
- Top-10,000 rankings
36Analysis Results (3)
- At most about 1/3 of the results by any 2 methods
overlap
- Event method finds significantly different pairs
from the other methods
- Very little overlap between true positives
- Consistent with the fact that the 3 methods employ
different search strategies
- Local vs. global similarity
37True Positive Distribution for Top-k Results
[Figure: two panels, CDC28 data set and Alpha data set]
38Effects of Time Delay (1)
- Perform time-shifting experiments and see how
score changes
Gene 1    Gene 2      Correlation   Edge    Event
YDR225W   YDR224C      0.94          0.30   13.41
YDR225W   YDR224C-1    0.46          0.05   12.92
YDR225W   YDR224C-2   -0.24         -0.46   11.98
YMR199W   YPL256C      0.82          0.78    8.92
YMR199W   YPL256C-1    0.40          0.39    8.64
YMR199W   YPL256C-2   -0.19         -0.06    9.24
39Effects of Time Delay (2)
- Correlation coefficients drop rapidly as time
delay is introduced
- Supports assertion that correlation cannot handle
time delay gracefully
- Unexpected drop in edge detection scores
- Probably due to problem in finding significant
edges to compare
40Effects of Scoring Matrix Parameters
α     -β    Alpha Act.   Alpha Inh.   CDC28 Act.   CDC28 Inh.
0.7   0.7   62           20           72           20
0.7   0.5   62           20           72           20
0.7   0.3   71           20           93           24
0.5   0.7   62           21           73           26
0.5   0.5   62           21           73           25
0.5   0.3   72           22           92           24
0.3   0.7   62           16           72           24
0.3   0.5   62           16           72           24
0.3   0.3   71           20           87           21
41Problems with Results
- Many genes shared identical expression curves,
incl. unrelated genes
- Poor resolution of data
- Edge detection method
- Too many scores of 0
- Simply cannot find enough edges
- Significance of scores doubtful
42More Notes on Edge
- Cumulative Distribution Function for Edge
- Zero scores make up the vertical column
43Synthetic Data Sets (1)
- Spellman's data sets not enough to test the
algorithms properly
- 4 different data sets
- Constant time delay
- Irregular time delay
- Partial matching
- Differential weighting of events
44Synthetic Data Sets (2)
- Each data set consists of an equal number of gene
profiles and random profiles
- Gene profiles: genei
- Random profiles: randomi
- genei and geneix related
- Better match if x is smaller
45Synthetic Data Sets (3)
Data Set                 Correlation   Event
Constant Time Delay      31.6          39.8
Irregular Time Delay     27.2          33.8
Partial Matching         44.6          40.6
Differential Weighting   36.2          45.0
- Event method superior except in partial matching
- Could not test edge detection method
- Could not produce non-zero scores
46Summary
- Event method finds potential regulatory pairs
from gene expression data
- Based on key features of gene expression
- Computationally efficient
- Performs comparably to correlation and edge
detection methods in finding true positives from
Spellman's data sets
- Outperforms correlation on 3 of the 4 synthetic
data sets
47Future Work (1)
- Limitation of real-world data
- Obtain data with better resolution
- Integrate data with other a priori knowledge
- Narrow down focus to transcription factors
- More realistic synthetic data
- Realistic modeling of artificial regulatory
network
48Future Work (2)
- Transitive Closure
- It would make sense to remove E13 from the pair
rankings in order to accommodate other potential
pairs
- If E12 and E23 have higher scores than E13, Node
3 would be only conditionally dependent on Node 1
[Figure: three-node graph with nodes 1, 2 and 3]
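
One way the proposed pruning might look, as a sketch only: an edge from node 1 to node 3 is dropped when some intermediate node 2 exists such that both E12 and E23 score higher than E13. The dictionary representation and the name `prune_transitive` are assumptions; the slides describe this as future work and give no algorithm.

```python
def prune_transitive(edges):
    """Remove edge (a, c) when edges (a, b) and (b, c) both outscore it.

    edges: dict mapping (source, target) to a score.
    """
    pruned = dict(edges)
    for (a, c), score_ac in edges.items():
        for (src, b), score_ab in edges.items():
            if src != a or b == c:
                continue                     # need a different intermediate node b
            score_bc = edges.get((b, c))
            if score_bc is not None and score_ab > score_ac and score_bc > score_ac:
                pruned.pop((a, c), None)     # (a, c) is explained by the path via b
                break
    return pruned

# Hypothetical scores: E12 = 5, E23 = 4, E13 = 3
result = prune_transitive({(1, 2): 5.0, (2, 3): 4.0, (1, 3): 3.0})
```

Decisions are made against the original edge set rather than the partially pruned one, so the outcome does not depend on the order in which edges are visited.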
49Future Work (3)
- Improvement of event method
- Different number of event types
- Global regulatory network
- Combine pairings by event method to form
potential networks
- Other uses for event method
- Different types of data, such as proteins
- Adaptation to other fields may be possible