Title: Genomic Signal Processing: Issues in Engineering Molecular Medicine
1Genomic Signal Processing: Issues in Engineering Molecular Medicine
- Edward R. Dougherty
- Department of Electrical and Computer Engineering, Texas A&M University
- Division of Computational Biology, Translational Genomics Research Institute
- Department of Pathology, University of Texas M. D. Anderson Cancer Center
2Systems Medicine
- Systems Biology: Understanding the manner in which the parts of an organism interact in complex networks.
- Systems Medicine: Translation of systems biology into medicine.
- Translational Genomics: The part of systems medicine that deals with genome-based systems engineering.
3Goals of Translational Genomics
- Screen for key genes and gene families that explain specific cellular phenotypes (disease).
- Use genomic signals to classify disease at the molecular level.
- Build model networks to study dynamical genome behavior and derive intervention strategies to alter undesirable behavior.
4Translational Genomics Tools
- Signal Processing
- Pattern Recognition
- Information Theory
- Control Theory
- Network Theory
- Communication Theory
5Genomic Signal Processing
- GSP: The analysis, processing, and use of genomic signals for gaining biological knowledge and the translation of that knowledge into systems-based applications.
- Signals generated by the genome must be processed to characterize their regulatory effects and their relationship to changes at both the genotypic and phenotypic levels.
6Central Dogma of Molecular Biology
DNA → (transcription) → RNA → (translation) → Protein
7Transcription Factors
8Gene Regulation
[Figure: gene regulatory controls involving E1A, Rb, Myc, E2F, p53, and MDM2, with inputs such as DNA damage and hypoxia acting on transcription.]
- Gene expression: the process by which gene products (proteins) are made, via transcription and translation.
9Gene Expression
- Central Dogma of Molecular Biology: Information flows from DNA to RNA to protein.
- Transcription: DNA → RNA
- Translation: RNA → protein
- It is not possible to fully separate the three levels.
- But the high level of interaction ensures that a significant amount of system information is present at each level.
- Measure gene expression by mRNA abundance.
10Microarrays
- Expression microarrays result from a complex biochemical-optical system incorporating robotic spotting and computer image formation and analysis.
- They facilitate large-scale surveys of gene expression in which transcript levels can be determined for thousands of genes simultaneously.
- cDNA arrays: Expressed Sequence Tags (ESTs).
- Oligo arrays: synthetic oligonucleotides.
- Both involve image processing and signal extraction.
11Microarray Process
12Classification of Diseases
- Find a feature set of expression profiles to classify disease.
- Diagnose cancer:
- Type
- Stage
- Prognosis
13BRCA Classification
14Small-Sample Issues
- Imprecise classifier design: the designed classifier can be a poor estimate of the optimal classifier.
- Poor error estimation owing to a lack of test data.
- Poor feature selection.
- Dougherty, E. R., "Small Sample Issues for Microarray-Based Classification," Comparative and Functional Genomics, 2, 28-34, 2001.
- Dougherty, E. R., Datta, A., and C. Sima, "Research Issues in Genomic Signal Processing," IEEE Signal Processing Magazine, 22 (6), 46-68, 2005.
15Classifier Design
- From a sample, form an estimate ψn of the optimal classifier ψopt.
- Design cost: Δn = εn − εopt, the difference between the designed classifier's error and the optimal error.
- Key issue: good design often requires large samples, and it is often impossible to get samples large enough to sufficiently reduce E[Δn].
16Overfitting
- If we apply a complex classification rule to a small sample, the rule is likely to conform to the data too closely.
- We constrain classifier complexity to avoid overfitting, thereby restricting ourselves to easy problems.
17Constraint
- To lower design cost, optimization is constrained to a subclass C.
- Constraint cost: ΔC = εC − εd, where εC is the best error achievable in C and εd = εopt is the optimal (unconstrained) error.
- The savings in design error must exceed the cost of constraint (see the decomposition after this list).
- Key problem: find appropriate constraints.
- A constraint may be defined in accordance with a model, or experience may have shown that a certain constraint works well in a given setting.
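A hedged restatement of the decomposition implied by the two preceding slides, under the notation assumed above (εd = εopt the optimal error, εC the best error in the constraint class C, εn the error of the designed classifier); the pointwise equality is a telescoping identity:

```latex
% Decomposition of the designed classifier's error (assumed notation)
\varepsilon_n
  = \varepsilon_d
  + \underbrace{(\varepsilon_C - \varepsilon_d)}_{\Delta_C\ \text{(constraint cost)}}
  + \underbrace{(\varepsilon_n - \varepsilon_C)}_{\text{design cost within } C},
\qquad
\mathrm{E}[\varepsilon_n]
  = \varepsilon_d + \Delta_C + \mathrm{E}[\varepsilon_n - \varepsilon_C].
```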
18Classifier Design Error
19Small-Sample Error Estimation
- Train and test the classifier on the same data.
- Basic approaches:
- Resubstitution: count errors on the training data (usually biased low).
- Resampling: design on sub-samples and test on the left-out data.
- Regularization: enhance the data or estimate the distribution.
20Cross-Validation
- The error rate is estimated by iteratively leaving out data points, testing on the deleted points, and averaging.
- Cross-validation is unbiased in the following sense: E[CV estimate − true error] ≈ 0.
- This says little about the quantity we care about, E[|CV estimate − true error|], unless the CV variance is small, which it is not for small samples. (A minimal comparison sketch follows.)
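A minimal sketch (not the talk's code) comparing resubstitution and leave-one-out cross-validation estimates on a small synthetic sample; a simple nearest-mean classifier stands in for the classification rule, and all data and names below are invented for illustration.

```python
# Compare resubstitution and leave-one-out error estimates for a nearest-mean
# classifier on synthetic two-class Gaussian data (small sample).
import numpy as np

rng = np.random.default_rng(0)

def nearest_mean_classifier(X_train, y_train):
    """Return a classifier that assigns the label of the closest class mean."""
    m0 = X_train[y_train == 0].mean(axis=0)
    m1 = X_train[y_train == 1].mean(axis=0)
    def predict(X):
        d0 = np.linalg.norm(X - m0, axis=1)
        d1 = np.linalg.norm(X - m1, axis=1)
        return (d1 < d0).astype(int)
    return predict

# Small sample: n = 20 points, 2 features, class means offset by one unit per feature.
n, p = 20, 2
X = np.vstack([rng.normal(0, 1, (n // 2, p)), rng.normal(1, 1, (n // 2, p))])
y = np.array([0] * (n // 2) + [1] * (n // 2))

# Resubstitution: train and count errors on the same data (optimistically biased).
resub_err = np.mean(nearest_mean_classifier(X, y)(X) != y)

# Leave-one-out cross-validation: hold out each point in turn.
loo_errors = []
for i in range(n):
    mask = np.arange(n) != i
    clf = nearest_mean_classifier(X[mask], y[mask])
    loo_errors.append(clf(X[i:i + 1])[0] != y[i])
loo_err = np.mean(loo_errors)

print(f"resubstitution estimate: {resub_err:.3f}")
print(f"leave-one-out estimate:  {loo_err:.3f}")
```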
21Cross-validation Mythology
- Myth: Cross-validation is good for small samples.
- CV is good for moderate to large samples because it allows all the data to be used for design.
- Myth: CV always outperforms resubstitution.
- Resubstitution performs as well or better for estimating predictor error in low-connectivity Boolean networks.
- Resubstitution can outperform CV for feature-set ranking.
- Resubstitution is much faster to compute.
22Deviation Distributions
[Figure: deviation distributions of several error estimators (resubstitution, leave-one-out, cv5, cv10, cv10r, bbc, b632) for Experiment 1 (LDA), Experiment 3 (3NN), and Experiment 5 (CART).]
- Braga-Neto, U. M., and E. R. Dougherty, "Is Cross-Validation Valid for Small-Sample Microarray Classification?," Bioinformatics, 20 (3), 374-380, 2004.
23Bolstered Error Estimation
- Estimate classifier error by spreading the data via bolstering kernels.
- The error estimate results from integrating each kernel over the part of the decision region in which its data point would be misclassified.
- Braga-Neto, U. M., and E. R. Dougherty, "Bolstered Error Estimation," Pattern Recognition, 37 (6), 1267-1281, 2004.
24Bolstering Properties
- The error can be computed via integration, in closed form for LDA and by Monte Carlo integration otherwise (see the sketch below).
- Choosing the variance of the bolstering kernel is key because it affects both the bias and the variance of the bolstered estimator.
- A method for choosing the variance has been proposed.
- Resubstitution results from zero bolstering variance.
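A minimal sketch of Monte Carlo bolstered resubstitution under stated assumptions: spherical Gaussian bolstering kernels with a hand-picked sigma (not the variance-selection method of the paper), and a nearest-mean classifier standing in for LDA; the data and names are invented for illustration.

```python
# Monte Carlo bolstered resubstitution: spread each training point by a Gaussian
# kernel and measure the kernel mass falling on the wrong side of the decision boundary.
import numpy as np

rng = np.random.default_rng(1)

def design_nearest_mean(X, y):
    m0, m1 = X[y == 0].mean(axis=0), X[y == 1].mean(axis=0)
    return lambda Z: (np.linalg.norm(Z - m1, axis=1)
                      < np.linalg.norm(Z - m0, axis=1)).astype(int)

def bolstered_resub(X, y, sigma, n_mc=2000):
    """Average, over training points, the misclassified mass of each bolstering kernel."""
    clf = design_nearest_mean(X, y)
    err = 0.0
    for xi, yi in zip(X, y):
        samples = xi + sigma * rng.standard_normal((n_mc, X.shape[1]))
        err += np.mean(clf(samples) != yi)   # misclassified kernel mass for this point
    return err / len(X)

# Synthetic two-class data, small sample.
n, p = 20, 2
X = np.vstack([rng.normal(0, 1, (n // 2, p)), rng.normal(1, 1, (n // 2, p))])
y = np.array([0] * (n // 2) + [1] * (n // 2))

print("plain resubstitution :", np.mean(design_nearest_mean(X, y)(X) != y))
print("bolstered (sigma=0.5):", round(bolstered_resub(X, y, sigma=0.5), 3))
# sigma -> 0 recovers plain resubstitution, as noted on the slide.
```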
25Deviation Distributions CART, 5 Genes
26Feature Selection Impacts Cross-Validation
- Feature selection increases the already-large deviation variance of cross-validation.
- Coefficient of Relative Increase in Deviation Dispersion:
- εopt: true error using the best features.
- εcv: true error using the selected features.
- Xiao, Y., Hua, J., and E. R. Dougherty, "Quantification of the Impact of Feature Selection on Cross-validation Error Estimation," EURASIP J. Bioinformatics and Systems Biology, 2007.
27How Many Features?
- Peaking phenomenon: for a fixed sample size, adding features eventually increases the error, a form of overfitting.
28Feature-Selection Problem
- Select a subset of k features from a set of n features with minimum error among all subsets of size k.
- Cover and Van Campenhout theorem: all k-element subsets must be checked to guarantee optimality.
- Heuristic suboptimal algorithms (e.g., sequential forward selection) have been proposed to circumvent the full combinatorial search; see the sketch after this list.
- Issues:
- Mathematical analysis of algorithms
- Impact of error estimation
- Impact of sample size
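A minimal sketch of sequential forward selection, one common heuristic that avoids the exhaustive search; the error criterion here is the resubstitution error of a nearest-mean classifier, chosen purely for illustration, and all data and names are invented.

```python
# Greedy sequential forward selection (SFS) over synthetic data.
import numpy as np

rng = np.random.default_rng(2)

def resub_error(X, y):
    """Resubstitution error of a nearest-mean classifier on feature matrix X."""
    m0, m1 = X[y == 0].mean(axis=0), X[y == 1].mean(axis=0)
    pred = (np.linalg.norm(X - m1, axis=1) < np.linalg.norm(X - m0, axis=1)).astype(int)
    return np.mean(pred != y)

def sequential_forward_selection(X, y, k):
    """Greedily add the feature that most reduces the (estimated) error."""
    selected, remaining = [], list(range(X.shape[1]))
    while len(selected) < k:
        best_feat, best_err = None, np.inf
        for f in remaining:
            err = resub_error(X[:, selected + [f]], y)
            if err < best_err:
                best_feat, best_err = f, err
        selected.append(best_feat)
        remaining.remove(best_feat)
    return selected

# Synthetic data: 20 features, only the first two carry class information.
n = 40
X = rng.normal(0, 1, (n, 20))
y = np.array([0] * (n // 2) + [1] * (n // 2))
X[y == 1, :2] += 1.5
print("selected features:", sequential_forward_selection(X, y, k=3))
```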
29Optimal Number of Features
- The optimal number of features depends on the sample size, the classification rule, and the feature-label distribution.
- Top: LDA, linear model, slightly correlated features.
- Bottom: LDA, linear model, highly correlated features.
- Hua, J., Xiong, Z., Lowey, J., Suh, E., and E. R. Dougherty, "Optimal Number of Features as a Function of Sample Size for Various Classification Rules," Bioinformatics, 21 (8), 1509-1515, 2005.
30Peaking Phenomenon is Nontrivial
- Peaking can occur later (at a larger number of features) for smaller samples.
- Top: 3NN, nonlinear model, modestly correlated features.
- Bottom: Linear SVM, nonlinear model, modestly correlated features.
- Hua, J., Xiong, Z., Lowey, J., Suh, E., and E. R. Dougherty, "Optimal Number of Features as a Function of Sample Size for Various Classification Rules," Bioinformatics, 21 (8), 1509-1515, 2005.
31Impact of Error Estimation on Feature Selection
- The choice of error estimator can be more important than the choice of feature-selection algorithm.
- LDA, Gaussian model, n = 50, 5 features selected from 20.
- Sima, C., Attoor, S., Braga-Neto, U., Lowey, J., Suh, E., and E. R. Dougherty, "Impact of Error Estimation on Feature-Selection Algorithms," Pattern Recognition, 38 (12), 2472-2482, 2005.
32What Can We Expect from Feature Selection?
- Top: Regression of the error of the selected feature set on the error of the best feature set.
- Bottom: Regression of the error of the best feature set on the error of the selected feature set.
- Sima, C., and E. R. Dougherty, "What Should One Expect from Feature Selection in Small-Sample Settings?," Bioinformatics, 22 (19), 2430-2436, 2006.
33Decorrelation of True and Estimated Errors
- With feature selection, the problem is decorrelation of the error estimate from the true error, not increased estimator variance.
- Selecting 5 features from 200 with sample size 50.
- With feature selection
- Without feature selection
- Hanczar, B., Hua, J., and E. R. Dougherty, "Is There Correlation between the Estimated and True Classification Errors in Small-Sample Settings?," IEEE Statistical Signal Processing Workshop, Madison, August 2007.
34Error Bounds
- Distribution-free bounds exist on the RMS difference between the true error and the error estimate.
- Typically, they are useless for small samples.
- For n = 100, the bound only guarantees RMS ≤ 0.435.
35Salient Points for Small Samples
- Beware of complex classifiers.
- Keep feature sets small.
- Avoid cross-validation where possible.
- Recognize the heavy influence of the feature-label distribution and the classification rule.
- Report a list of classifiers and feature sets for analysis.
- Issues: analysis of classifier and feature-selection performance
- Better error estimation
- Mathematical analysis of error estimators
- Braga-Neto, U., and E. R. Dougherty, "Exact Performance of Error Estimators for Discrete Classifiers," Pattern Recognition, 38 (11), 1799-1814, 2005.
36Is Knowledge Possible?
- The scientific meaning of a classifier and its error estimate rests on the properties of the error estimator.
- Choice 1: Estimate the population density (impossible).
- Choice 2: Distribution-free error bounds (useless).
- Answer: model-based analysis.
- Pattern recognition vs. data mining.
- Knowledge is possible with a proper epistemology.
37Apparent Clusters in Microarray Data
[Figure: apparent patterns in microarray data. Is there a relationship?]
38What Are Good Clusters?
- Example
- 2 or 3 clusters?
- What is the best separation?
39The Clustering Problem
- Apply a clustering algorithm to the data and form clusters, as every clustering algorithm does.
- Say, "Gee whiz! There are known related genes in a cluster."
- Where is the possibility of verification by prediction? Indeed, what is to be verified?
40Clustering and Scientific Knowledge
- A scientific theory requires a model and a predictive methodology to test the model's validity.
- Classification:
- Model: classifier and error
- Validity rests on the accuracy of error estimation
- Model is inferred (learned from data)
- Clustering as historically used:
- Model (algorithm)
- No framework for predictive model testing
- No learning
41Probabilistic Theory of Clustering
- Clustering theory in the context of random sets
- Probabilistic error measure based on points being clustered correctly (see the sketch after this list)
- Bayes clusterer (optimal clustering algorithm)
- Learning theory for clustering algorithms
- Dougherty, E. R., and M. Brun, "A Probabilistic Theory of Clustering," Pattern Recognition, 37 (5), 917-925, 2004.
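A minimal sketch of one simple reading of a point-based clustering error: the fraction of points assigned inconsistently with the true partition, minimized over relabelings of the clusters. This is an illustration only, not the paper's random-set formulation; all names and data below are invented.

```python
# Point-based clustering error: best-match label disagreement rate.
from itertools import permutations
import numpy as np

def clustering_error(true_labels, cluster_labels):
    true_labels = np.asarray(true_labels)
    cluster_labels = np.asarray(cluster_labels)
    ids = np.unique(cluster_labels)
    best = 1.0
    # Try every assignment of cluster ids to true class ids.
    for perm in permutations(np.unique(true_labels), len(ids)):
        mapping = dict(zip(ids, perm))
        relabeled = np.array([mapping[c] for c in cluster_labels])
        best = min(best, np.mean(relabeled != true_labels))
    return best

# Example: 10 points, two true groups; a clustering that misplaces 2 points.
truth    = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
clusters = [1, 1, 1, 0, 1, 0, 0, 1, 0, 0]
print(clustering_error(truth, clusters))  # 0.2
```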
42Example of Clustering Error
- Left: Realization of a point process
- Right: Output of hierarchical clustering
- Error: 40%
43Validation Indices
- Validation indices are meant to judge the validity of a clustering output.
- They can be based on a number of heuristic considerations and methodologies.
- Do they correspond to scientific validity?
- Do validation indices correlate with clustering error?
- Brun, M., Sima, C., Hua, J., Lowey, J., Carroll, B., Suh, E., and E. R. Dougherty, "Model-Based Evaluation of Clustering Validation Measures," Pattern Recognition, 40 (3), 807-824, 2007.
44Kendall's Correlation for Indices
- Top: Realization of a point process
- Bottom: Kendall's correlation for different indices across different clustering algorithms
45Regulatory Modeling
- Find analytical tools for genomic data that can detect multivariate influences on decision-making produced by complex genetic networks.
- Construct the minimal-complexity network that can model sufficient information transfer to achieve the goal.
- Less computation
- Less data required for inference
- Given a model, discover ways to intervene in its dynamics to obtain desired behavior.
46Gene Interaction
- Genes interact via multi-protein complexes, feedback regulation, and pathway networks.
- Complex molecular networks underlie biological function.
- Most diseases do not result from a single gene product.
- These interrelationships among genes constitute gene regulatory networks.
47Muscle Network (Drosophila)
- A gene network shows regulatory interactions.
- msp-300 is a hub gene that regulates genes encoding motor proteins responsible for muscle contraction.
- Zhao, W., Serpedin, E., and E. R. Dougherty, "Inferring Gene Regulatory Networks from Time Series Data Using the Minimum Description Length Principle," Bioinformatics, 22 (17), 2129-2135, 2006.
48Desirable Model Properties
- Incorporate rule-based dependencies between genes.
- Rule-based dependencies may constitute important biological information.
- Allow systematic study of global network dynamics.
- In particular, individual gene effects on long-run network behavior.
- Cope with uncertainty: small sample size, noisy measurements, robustness.
- The system must be open to external latent variables.
49Infer Regulatory Genetic Function?
If gene X1 is active and gene X2 is suppressed, gene Y would be activated.
Can we infer regulatory genetic function from cDNA microarray data, for both known and unknown functions?
50Inference From Data
- Key issues:
- Complex model
- Limited data
- Lack of appropriate time-course data for dynamics
- Fundamental principle: Use the simplest model that provides sufficient information to accomplish the task at hand and that is compatible with the data.
- Formalize inference by postulating criteria that constitute a solution space for the inverse problem.
- Constraint criteria are composed of restrictions on the form of the network (biological, complexity).
- Operational criteria are composed of relations that must be satisfied between the model and the data.
51Regulatory Logic
- Jacques Monod: "The logic of biological regulatory systems abides, like the workings of computers, by the propositional algebra of George Boole."
- Shmulevich, I., and E. R. Dougherty, Genomic Signal Processing, Princeton University Press, Princeton, 2007.
52Boolean Predictive Relationships
- Boolean relationships in the NCI 60 ACDS (Anti-Cancer Drug Screen).
- MRC1 VSNL1 → HTR2C
- SCYA7 CASR → MU5SAC
- Capture switch-like (ON/OFF) behavior.
- Pal, R., Datta, A., Fornace, A. J., Bittner, M. L., and E. R. Dougherty, "Boolean Relationships Among Genes Responsive to Ionizing Radiation in the NCI 60 ACDS," Bioinformatics, 21 (8), 1542-1549, 2005.
53Basic Structure of Boolean Networks
1 means active/expressed; 0 means inactive/unexpressed.
Boolean function for gene X with inputs A and B:
A B | X
0 0 | 1
0 1 | 1
1 0 | 0
1 1 | 1
In this example, two genes (A and B) regulate gene X. In principle, any number of input genes is possible. Positive/negative feedback is also common (and necessary for homeostasis). (A small simulation sketch follows.)
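A minimal sketch of a Boolean network: gene X is updated by the truth table on this slide (X_next = (not A) or B), while the rules for A and B are invented for illustration; the code enumerates all states and follows each trajectory to its attractor.

```python
# A 3-gene Boolean network with synchronous updates; find the attractor
# reached from every initial state.
from itertools import product

def step(state):
    """One synchronous update of the state (A, B, X)."""
    a, b, x = state
    a_next = x                      # assumed rule: A follows X
    b_next = a and x                # assumed rule: B = A AND X
    x_next = (not a) or b           # rule from the slide's truth table
    return (int(a_next), int(b_next), int(x_next))

# Enumerate the 2^3 = 8 states and follow each trajectory to its attractor.
for start in product((0, 1), repeat=3):
    seen, state = [], start
    while state not in seen:
        seen.append(state)
        state = step(state)
    cycle = seen[seen.index(state):]          # the attractor reached from `start`
    print(f"{start} -> attractor {cycle}")
```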
54Network Dynamics
[Figure: genes A through F, each with a binary activity value (e.g., 0, 1, 1, 0, 0, 1) at a given time point.]
At a given time point, all the genes together form a genome-wide gene activity pattern (GAP). Consider the state space formed by all possible GAPs.
55State Space of Boolean Networks
- Similar GAPs lie close together.
- There is an inherent directionality in the state space.
- Some states are attractors (or limit-cycle attractors). The system may alternate between several attractors.
- Other states are transient.
Picture generated using the program DDLab.
56Probabilistic Boolean Networks
- A PBN is composed of a collection of BNs.
- At any time point, state transitions are governed by one of the constituent BNs. With some probability, the PBN can switch to a different BN at a time point (see the simulation sketch below).
- So long as there is no switch, the PBN acts like a BN.
- Allows for random gene perturbations.
- Shmulevich, I., Dougherty, E. R., Kim, S., and W. Zhang, "Probabilistic Boolean Networks: A Rule-based Uncertainty Model for Gene Regulatory Networks," Bioinformatics, 18, 261-274, 2002.
- Shmulevich, I., Dougherty, E. R., and W. Zhang, "From Boolean to Probabilistic Boolean Networks as Models of Genetic Regulatory Networks," Proceedings of the IEEE, 90 (11), 1778-1792, 2002.
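A minimal sketch of the PBN mechanism with a toy 2-gene example; the two constituent BNs, the switching probability q, and the perturbation probability p are all assumptions chosen for illustration.

```python
# Simulate a PBN: a collection of constituent BNs, a switching probability q
# between them, and an independent gene-perturbation probability p per step.
import random

random.seed(0)

# Two constituent BNs over genes (g1, g2); each maps the current state to the next.
BN1 = lambda s: (s[1], s[0] and s[1])          # arbitrary illustrative rules
BN2 = lambda s: (1 - s[0], s[0] or s[1])
networks = [BN1, BN2]

q = 0.1    # probability of switching the governing BN at a time point
p = 0.01   # probability that each gene is independently flipped (perturbation)

def simulate(steps=10, state=(0, 1)):
    current = 0                                # start under BN1
    for _ in range(steps):
        if random.random() < q:                # possible switch of constituent BN
            current = random.randrange(len(networks))
        state = tuple(int(v) for v in networks[current](state))
        state = tuple(v ^ (random.random() < p) for v in state)  # random perturbation
        print(f"BN{current + 1}: state = {state}")
    return state

simulate()
```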
57PBN State Space
[Figure: PBN state space showing the attractors of the constituent networks BN 1 and BN 2.]
58Properties of PBNs
- Share the rule-based properties of Boolean networks.
- Model uncertainty.
- Dynamic behavior is studied via Markov chains.
- Close relationship to Bayesian networks.
- The attractors of a PBN are the attractors of its constituent BNs.
- The network can leave a BN attractor cycle when the BN switches.
- Brun, M., Dougherty, E. R., and I. Shmulevich, "Steady-State Probabilities for Attractors in Probabilistic Boolean Networks," Signal Processing, in press, 2005.
- Lahdesmaki, H., Hautaniemi, S., Shmulevich, I., and O. Yli-Harja, "Relationships Between Probabilistic Boolean Networks and Dynamic Bayesian Networks as Models of Gene Regulatory Networks," Signal Processing, in press, 2005.
59Various Design Methods Proposed
- Find genes with predictive capability for a target gene (coefficient of determination, CoD).
- Use mutual information to find related genes.
- Use the MDL principle.
- Optimize connectivity in a Bayesian framework relative to the gene profiles in the data.
- Find networks satisfying biologically related constraints such as limited attractor structure, transient time, and connectivity.
- Assuming steady-state data, require the data states to be attractors.
- Assuming biological determinism within a given cellular context, design a PBN under the assumption that the constituent BNs produce consistent data subsets in the sample data.
60Network Reduction
- Network reduction is often desirable for computational reasons: the state space is too large.
- Delete genes, then reconstruct connectivity and rules.
- Preserve the probability structure:
- Dougherty, E. R., and I. Shmulevich, "Mappings Between Probabilistic Boolean Networks," EURASIP J. Signal Processing, 83 (4), 799-809, 2003.
- Preserve the steady-state distribution:
- Ivanov, I., and E. R. Dougherty, "Reduction Mappings Between Probabilistic Boolean Networks," EURASIP J. Applied Signal Processing, 4 (1), 125-131, 2004.
- Preserve the dynamical structure:
- Ivanov, I., Pal, R., and E. R. Dougherty, "Reduction Mappings between Probabilistic Boolean Networks that Preserve Dynamical Structure," IEEE Trans. Signal Processing, to appear.
61Intervention
- A key goal of network modeling is to determine intervention targets (genes) such that the network can be persuaded to transition into desired states.
- We desire genes that are the best potential "lever points" in the sense of having the greatest possible impact on desired network behavior.
- Shmulevich, I., Dougherty, E. R., and W. Zhang, "Gene Perturbation and Intervention in Probabilistic Boolean Networks," Bioinformatics, 18, 1319-1331, 2002.
62Dynamics
- The dynamics of PBNs can be studied using Markov chain theory.
- We can ask: in the long run, what is the probability that a given gene (or set of genes) will be ON/OFF? (See the sketch below.)
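A minimal sketch of that long-run question for a toy 2-gene chain: the transition matrix below is invented, and the probability that a gene is ON is read off the steady-state distribution over the 2^n GAPs.

```python
# Steady-state distribution of a small Markov chain over GAPs, and the
# long-run probability that each gene is ON.
import numpy as np

# States ordered as (g1, g2) = 00, 01, 10, 11; rows sum to 1.
P = np.array([
    [0.7, 0.1, 0.1, 0.1],
    [0.2, 0.5, 0.1, 0.2],
    [0.1, 0.1, 0.6, 0.2],
    [0.0, 0.2, 0.2, 0.6],
])

# Steady-state distribution pi satisfying pi = pi P, found by iterating the chain.
pi = np.full(4, 0.25)
for _ in range(10_000):
    pi = pi @ P
pi /= pi.sum()

# P(gene 1 is ON) = mass on states 10 and 11; P(gene 2 is ON) = mass on 01 and 11.
print("steady state:", np.round(pi, 3))
print("P(g1 = ON) =", round(pi[2] + pi[3], 3))
print("P(g2 = ON) =", round(pi[1] + pi[3], 3))
```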
63Medical Benefits of Network Intervention
- Prediction of new targets based on pathway context.
- Stress and toxic-response mechanisms.
- Off-target effects of therapeutic compounds.
- Characterization of disease states by dynamic behavior.
- Gene- and protein-expression signatures for diagnostics.
- Regulatory analysis for therapeutic intervention.
64Possible Intervention Goals
- Minimize the mean first-passage time to a desirable state.
- Maximize the probability of reaching a desirable state before a certain fixed time.
- Minimize the time needed to reach a desirable state with a given fixed probability.
- Shmulevich, I., Dougherty, E. R., and W. Zhang, "Gene Perturbation and Intervention in Probabilistic Boolean Networks," Bioinformatics, 18, 1319-1331, 2002.
- Shmulevich, I., Dougherty, E. R., and W. Zhang, "Control of Stationary Behavior in Probabilistic Boolean Networks by Means of Structural Intervention," Biological Systems, 10, 431-446, 2002.
65Where and How to Intervene?
[Figure: a gene network containing MBP-1, IAP-1, FRA-1, p21, ATF3, SSAT, REL-B, MDM2, PC-1, BCL3, p53, and RCH1. Which gene should be suppressed or activated?]
66External Control
- Consider an external control variable and a cost function depending on state desirability and the cost of action.
- Minimize the cost function by a sequence of control actions over time (a control policy).
- Application: design an optimal treatment regime to drive the system away from undesirable states.
- Datta, A., Choudhary, A., Bittner, M. L., and E. R. Dougherty, "External Control in Markovian Genetic Regulatory Networks," Machine Learning, 52, 169-181, 2003.
- Pal, R., Datta, A., and E. R. Dougherty, "Optimal Infinite Horizon Control for Probabilistic Boolean Networks," IEEE Transactions on Signal Processing, 54 (6), 2375-2387, 2006.
67Optimal Control
- Key objective: optimally manipulate the external controls to move the GAP from an undesirable pattern to a desirable one.
- Use available information, e.g., phenotypic responses, tumor size, etc.
- Requires a paradigm for modeling the evolution of the GAP under different controls.
- The PBN is one such paradigm.
- Use the associated Markov chain.
68Control in PBNs
- Transition probabilities depend on external control inputs, e.g., chemotherapy, radiation, etc.
- Assume m control inputs u1, u2, ..., um.
- Each input can take the value 0 (not applied) or 1 (applied).
- The values of the control inputs can be changed over time.
69Control Setting
- Control input vector at time k: (u1(k), ..., um(k)).
- Encode both the GAP and the control vector as integers, z(k) and v(k).
- Then z(k) can take 2^n values (for n genes) and v(k) can take 2^m values.
- We have a system: w(k + 1) = w(k) A(v(k)).
- A(v(k)) is a stochastic matrix dependent on the control input.
- We have a controlled homogeneous Markov chain.
70Costs of Applying Control
- Choose v(0), v(1), ... to minimize a particular cost function.
- Choice of cost function?
- Consider a finite treatment horizon k = 0, 1, ..., M − 1.
- Let Ck(z(k), v(k)) denote the cost of applying control v(k) in state z(k) (input from biologists).
- Cost of control over the horizon: the sum of Ck(z(k), v(k)) for k = 0, ..., M − 1.
71Terminal Costs
- The net result of the control actions is the terminal state z(M).
- Penalize z(M) in the cost to reduce the chance of ending up in an undesirable state.
- Define CM(z(M)) to be the terminal cost of ending up in state z(M).
- Partition the states into equivalence classes.
- Assign higher penalties to states associated with rapid cell proliferation or reduced apoptosis, and lower penalties to states associated with the normal cell cycle (input from biologists).
72Total Cost
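The formula on this slide did not survive extraction; the following is a hedged reconstruction of the total expected cost in the notation of the two preceding slides (per-step costs Ck, terminal cost CM, control sequence v(0), ..., v(M − 1)):

```latex
% Total expected cost over the finite horizon, given initial state z(0)
J\bigl(z(0)\bigr) \;=\;
\mathrm{E}\!\left[\,\sum_{k=0}^{M-1} C_k\bigl(z(k), v(k)\bigr) \;+\; C_M\bigl(z(M)\bigr) \;\middle|\; z(0)\right]
```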
73Optimal Control Problem
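A minimal sketch of how an optimal control problem of this form can be solved by backward dynamic programming over the controlled Markov chain; the sizes, transition matrices A[v], per-step costs C, and terminal costs C_M below are all invented for illustration, not taken from the talk.

```python
# Finite-horizon dynamic programming for a controlled Markov chain over GAPs.
import numpy as np

rng = np.random.default_rng(3)
n_states, n_controls, M = 4, 2, 5

# A[v] is the state-transition matrix under control v (rows sum to 1).
A = rng.random((n_controls, n_states, n_states))
A /= A.sum(axis=2, keepdims=True)

# C[v, z]: per-step cost of applying control v in state z; applying control 1 costs extra.
C = np.tile(np.array([0.0, 1.0])[:, None], (1, n_states))
C_M = np.array([0.0, 0.0, 5.0, 5.0])        # terminal penalty on undesirable states 2, 3

# Backward dynamic programming: J_M = C_M, then
#   J_k(z) = min_v [ C(v, z) + sum_z' A[v][z, z'] * J_{k+1}(z') ].
J = C_M.copy()
policy = np.zeros((M, n_states), dtype=int)
for k in range(M - 1, -1, -1):
    Q = C + A @ J                            # Q[v, z]: cost-to-go of control v in state z
    policy[k] = Q.argmin(axis=0)
    J = Q.min(axis=0)

print("optimal expected cost from each initial state:", np.round(J, 2))
print("optimal control at time 0 for each state:     ", policy[0])
```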
74WNT5A Network
- Up-regulated WNT5A is associated with increased metastasis.
- The cost function penalizes WNT5A being up-regulated.
- Optimal control policy with Pirin as the control gene.
75Shift of Steady-State Distribution
- Optimal (infinite-horizon) control with Pirin as the control gene has shifted the steady-state distribution toward states with WNT5A down-regulated: (a) with control; (b) without control.
76Robust Control
- Owing to model uncertainty, we desire control policies that perform well under perturbations from the nominal model.
- Performance bounds: design the policy on the estimated transition matrix P̂ and bound the difference between the true and estimated controlled steady-state distributions:
- ‖πc − π̂c‖1 ≤ k ‖P − P̂‖row-max
- Pal, R., Datta, A., and E. R. Dougherty, "Robust Intervention in Probabilistic Boolean Networks," IEEE Transactions on Signal Processing, to appear.
77Robust Control Policies
- Minimax robust policies: find the policy with optimal worst-case performance over an uncertainty class of networks.
- Pal, R., Datta, A., and E. R. Dougherty, "Robustness of Intervention Strategies for Probabilistic Boolean Networks," IEEE GENSIPS, Tuusula, June 2007.
- Bayesian robust policies: find the optimal policy relative to a prior distribution on the uncertainty class.
- Pal, R., Datta, A., and E. R. Dougherty, "Bayesian Robustness in the Control of Gene Regulatory Networks," IEEE Statistical Signal Processing Workshop, Madison, August 2007.
78Collaborators
- Translational Genomics Research Institute (Arizona)
- University of Texas M. D. Anderson Cancer Center
- Tampere University of Technology (Finland)
- University of Sao Paulo (Brazil)
- Strathclyde University (Scotland)
- Columbia University
- Acknowledgements
- National Science Foundation
- National Cancer Institute
- National Human Genome Research Institute
- W. M. Keck Foundation
79References
- Genomic Signal Processing and Statistics, eds. E. R. Dougherty, I. Shmulevich, J. Chen, and J. Wang, Hindawi Press, 2005.
- Genomic Signal Processing, I. Shmulevich and E. R. Dougherty, Princeton University Press, 2007.
- Introduction to Genomic Signal Processing and Control, A. Datta and E. R. Dougherty, CRC Press, 2007.
- Dougherty, E. R., and A. Datta, "Genomic Signal Processing: Diagnosis and Therapy," IEEE Signal Processing Magazine, 22 (1), 107-112, 2005.
- Dougherty, E. R., Datta, A., and C. Sima, "Research Issues in Genomic Signal Processing," IEEE Signal Processing Magazine, 22 (6), 46-68, 2005.
- Dougherty, E. R., Shmulevich, I., and M. L. Bittner, "Genomic Signal Processing: The Salient Issues," EURASIP J. Applied Signal Processing, 4 (1), 146-153, 2004.