
Genomic Signal Processing: Issues in Engineering Molecular Medicine

- Edward R. Dougherty
- Department of Electrical and Computer Engineering, Texas A&M University
- Division of Computational Biology, Translational Genomics Research Institute
- Department of Pathology, University of Texas M. D. Anderson Cancer Center

Systems Medicine

- Systems Biology: Understanding the manner in which the parts of an organism interact in complex networks.
- Systems Medicine: Translation of systems biology into medicine.
- Translational Genomics: The part of systems medicine that deals with genome-based systems engineering.

Goals of Translational Genomics

- Screen for key genes and gene families that explain specific cellular phenotypes (disease).
- Use genomic signals to classify disease on a molecular level.
- Build model networks to study dynamical genome behavior and derive intervention strategies to alter undesirable behavior.

Translational Genomics Tools

- Signal Processing
- Pattern Recognition
- Information Theory
- Control Theory
- Network Theory
- Communication Theory

Genomic Signal Processing

- GSP: The analysis, processing, and use of genomic signals for gaining biological knowledge and translation of that knowledge into systems-based applications.
- Signals generated by the genome must be processed to characterize their regulatory effects and their relationship to changes at both the genotypic and phenotypic levels.

Central Dogma of Molecular Biology

[Figure: flow diagram with labels DNA, Transcription, RNA, Translation, Protein, Transcription Factors.]

Gene Regulation

[Figure: gene regulatory controls, with labels E1A, Rb, DNA damage, Myc, E2F, p53, MDM2, Hypoxia, transcription, translation, protein.]

- Gene expression: the process by which gene products (proteins) are made.

Gene Expression

- Central Dogma of Molecular Biology: Information flows from DNA to RNA to protein.
- Transcription: DNA → RNA
- Translation: RNA → protein
- It is not possible to fully separate the three levels.
- But the high level of interaction ensures that a significant amount of system information is present at each level.
- Measure gene expression by mRNA abundance.

Microarrays

- Expression microarrays result from a complex biochemical-optical system incorporating robotic spotting and computer image formation and analysis.
- They facilitate large-scale surveys of gene expression in which transcript levels can be determined for thousands of genes simultaneously.
- cDNA arrays: Expressed Sequence Tags (ESTs).
- Oligo arrays: Synthetic oligonucleotides.
- Involve image processing and signal extraction.

Microarray Process

Classification of Diseases

- Find a feature set of expression profiles to classify disease.
- Diagnose cancer
- Type
- Stage
- Prognosis

BRCA Classification

Small-Sample Issues

- Imprecise classifier design: the designed classifier can be a poor estimate of the optimal classifier.
- Poor error estimation owing to no test data.
- Poor feature selection.
- Dougherty, E. R., "Small Sample Issues for Microarray-Based Classification," Comparative and Functional Genomics, Vol. 2, 28-34, 2001.
- Dougherty, E. R., Datta, A., and C. Sima, "Research Issues in Genomic Signal Processing," IEEE Signal Processing Magazine, 22 (6), 46-68, 2005.

Classifier Design

- From a sample, form an estimate $\psi_n$ of the optimal classifier $\psi_{opt}$.
- Design cost: $\Delta_n = \varepsilon_n - \varepsilon_{opt}$
- Key issue: good design often requires large samples, and it is often impossible to get samples large enough to sufficiently reduce $E[\Delta_n]$.

Overfitting

- If we apply a complex classification rule to a small sample, the rule is likely to conform to the data too closely.
- We constrain classifier complexity to avoid overfitting, thereby restricting ourselves to easy problems.

Constraint

- To lower design cost, optimization is constrained to a subclass C.
- Constraint cost: $\Delta_C = \varepsilon_C - \varepsilon_d$
- The savings in design error must exceed the cost of constraint.
- Key problem: find appropriate constraints.
- A constraint may be defined in accordance with a model, or experience may have shown that a certain constraint works well in a given setting.

Classifier Design Error

Small-Sample Error Estimation

- Train and test the classifier on the same data.
- Basic approaches:
- Resubstitution: Count errors on the training data (usually low-biased).
- Re-sampling: Design on sub-samples and test on left-out data.
- Regularization: Enhance the data or estimate the distribution.
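
As a rough illustration of the first two approaches, the sketch below computes a resubstitution estimate and a leave-one-out (re-sampling) estimate for the same small sample. The nearest-mean classifier, the function names, and the toy data are all assumptions chosen for brevity; they are not taken from the slides.

```python
def nearest_mean_classifier(train):
    """Design a 1-D nearest-mean classifier from (x, label) pairs with labels 0/1."""
    means = {}
    for label in (0, 1):
        values = [x for x, y in train if y == label]
        means[label] = sum(values) / len(values)
    return lambda x: 0 if abs(x - means[0]) <= abs(x - means[1]) else 1


def resubstitution_error(sample):
    """Fraction of training points misclassified by the classifier designed on them."""
    classify = nearest_mean_classifier(sample)
    return sum(classify(x) != y for x, y in sample) / len(sample)


def leave_one_out_error(sample):
    """Design on n-1 points, test on the held-out point, average over all n points."""
    errors = 0
    for i, (x, y) in enumerate(sample):
        classify = nearest_mean_classifier(sample[:i] + sample[i + 1:])
        errors += classify(x) != y
    return errors / len(sample)


# Hypothetical toy sample: (expression value, class label).
sample = [(0.1, 0), (0.4, 0), (0.5, 0), (1.6, 0),
          (1.2, 1), (1.5, 1), (1.9, 1), (0.6, 1)]
resub = resubstitution_error(sample)
loo = leave_one_out_error(sample)
```

Both estimators reuse the same eight points; the only difference is whether the tested point took part in the design.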

Cross-Validation

- Error rate is estimated by iteratively leaving out data points, testing on the deleted points, and averaging.
- Cross-validation is unbiased in the following sense:
- $E[\text{CV estimate} - \text{error}] \approx 0$
- This says little about the number we care about,
- $E[\,|\text{CV estimate} - \text{error}|\,]$
- unless the CV variance is small, which is not the case for small samples.

Cross-validation Mythology

- Myth: Cross-validation is good for small samples.
- CV is good for moderate to large samples because it allows all data to be used for design.
- Myth: CV always outperforms resubstitution.
- Resub performs as well or better for estimating predictor error in low-connectivity Boolean networks.
- Resub can outperform CV for feature-set ranking.
- Resubstitution is much faster to compute.

Deviation Distributions

[Figure: deviation distributions for Experiment 1 (LDA, p = 2), Experiment 3 (3NN, p = 2), and Experiment 5 (CART, p = 2). Estimators compared: resub, leave-one-out, cv10, cv5, cv10r, bbc, b632.]

- Braga-Neto, U. M., and E. R. Dougherty, "Is Cross-Validation Valid for Small-Sample Microarray Classification?" Bioinformatics, 20 (3), 374-380, 2004.

Bolstered Error Estimation

- Estimate classifier error by spreading the data via bolstering kernels.
- The error estimate results from integrating each kernel over the decision region to which its point should not belong.
- Braga-Neto, U. M., and E. R. Dougherty, "Bolstered Error Estimation," Pattern Recognition, 37 (6), 1267-1281, 2004.

Bolstering Properties

- Error can be computed via integration, in closed form for LDA and by Monte Carlo integration otherwise.
- Choosing the variance of the bolstering kernel is key because it affects both the bias and the variance of the bolstered estimator.
- A method for choosing the variance has been proposed.
- Resubstitution results from zero bolstering variance.
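
A minimal sketch of the Monte Carlo route, assuming 1-D data, Gaussian bolstering kernels with a common standard deviation `sigma`, and a fixed threshold classifier. The function names, the toy sample, and the value `sigma=0.3` are illustrative assumptions, not the variance-selection method the slide refers to.

```python
import random


def bolstered_resubstitution(sample, classify, sigma, draws=200, seed=0):
    """Monte Carlo bolstered resubstitution for 1-D data.

    Each training point is spread into a Gaussian kernel of standard deviation
    sigma centred on it. The estimate is the average fraction of kernel draws
    that the classifier assigns to the wrong class.
    """
    rng = random.Random(seed)
    total = 0.0
    for x, y in sample:
        wrong = sum(classify(rng.gauss(x, sigma)) != y for _ in range(draws))
        total += wrong / draws
    return total / len(sample)


def threshold_classifier(x):
    """Hypothetical designed classifier: class 1 above 1.0, class 0 otherwise."""
    return int(x > 1.0)


sample = [(0.1, 0), (0.4, 0), (0.5, 0), (1.6, 0),
          (1.2, 1), (1.5, 1), (1.9, 1), (0.6, 1)]
# With sigma = 0 every draw sits on its data point, so the estimate
# collapses to plain resubstitution.
plain_resub = bolstered_resubstitution(sample, threshold_classifier, sigma=0.0)
bolstered = bolstered_resubstitution(sample, threshold_classifier, sigma=0.3)
```

Setting `sigma=0.0` reproduces the last bullet: zero bolstering variance gives the resubstitution count.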

Deviation Distributions CART, 5 Genes

Feature Selection Impacts Cross-Validation

- Feature selection increases the already large deviation variance of cross-validation.
- Coefficient of Relative Increase in Deviation Dispersion
- $\varepsilon_{opt}$: true error using best features.
- $\varepsilon_{cv}$: true error using selected features.
- Xiao, Y., Hua, J., and E. R. Dougherty, "Quantification of the Impact of Feature Selection on Cross-Validation Error Estimation," EURASIP J. Bioinformatics and Systems Biology, 2007.

How Many Features?

- Peaking phenomenon: overfitting.

Feature-Selection Problem

- Select a subset of k features from a set of n features with minimum error among all subsets of size k.
- Cover and van Campenhout Theorem: All k-element subsets must be checked.
- Heuristic suboptimal algorithms have been proposed to circumvent the full combinatorial search.
- Issues:
- Mathematical analysis of algorithms
- Impact of error estimation
- Impact of sample size
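
To make the combinatorial cost concrete, here is a small sketch of the full search over all k-element subsets. The error function is a made-up stand-in for a real error estimator, and the two subset counts use problem sizes that appear on later slides (5 features from 20, and 5 from 200).

```python
from itertools import combinations
from math import comb


def exhaustive_feature_selection(n_features, k, subset_error):
    """Return (best_subset, its_error) over all k-element subsets of the features."""
    best = min(combinations(range(n_features), k), key=subset_error)
    return best, subset_error(best)


def toy_error(subset):
    """Hypothetical error table; a real study would plug in an error estimator."""
    return abs(sum(subset) - 7) / 10


best, err = exhaustive_feature_selection(6, 2, toy_error)

# Number of subsets the full search must visit.
searches_20 = comb(20, 5)    # 15504
searches_200 = comb(200, 5)  # 2535650040
```

The jump from about fifteen thousand subsets to about 2.5 billion is what motivates the heuristic algorithms mentioned above.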

Optimal Number of Features

- Optimal number of features depends on sample size, classification rule, and feature-label distribution.
- Top: LDA, linear model, slightly correlated features.
- Bottom: LDA, linear model, highly correlated features.
- Hua, J., Xiong, Z., Lowey, J., Suh, E., and E. R. Dougherty, "Optimal Number of Features as a Function of Sample Size for Various Classification Rules," Bioinformatics, 21 (8), 1509-1515, 2005.

Peaking Phenomenon is Nontrivial

- Peaking can be later for smaller samples.
- Top: 3NN, nonlinear model, modestly correlated features.
- Bottom: Linear SVM, nonlinear model, modestly correlated features.
- Hua, J., Xiong, Z., Lowey, J., Suh, E., and E. R. Dougherty, "Optimal Number of Features as a Function of Sample Size for Various Classification Rules," Bioinformatics, 21 (8), 1509-1515, 2005.

Impact of Error Estimation on Feature Selection

- Choice of error estimator can be more important than choice of algorithm.
- LDA, Gaussian model, n = 50, 5 features from 20.
- Sima, C., Attoor, S., Braga-Neto, U., Lowey, J., Suh, E., and E. R. Dougherty, "Impact of Error Estimation on Feature-Selection Algorithms," Pattern Recognition, 38 (12), 2472-2482, 2005.

What Can We Expect from Feature Selection?

- Top: Regression of selected FS error on best FS error.
- Bottom: Regression of best FS error on selected FS error.
- Sima, C., and E. R. Dougherty, "What Should One Expect from Feature Selection in Small-Sample Settings?" Bioinformatics, 22 (19), 2430-2436, 2006.

Decorrelation of True and Estimated Errors

- With feature selection, the problem is decorrelation of the error estimate from the true error, not increased estimator variance.
- Selecting 5 features from 200 with sample size 50.
- With feature selection
- Without feature selection
- Hanczar, B., Hua, J., and E. R. Dougherty, "Is There Correlation between the Estimated and True Classification Errors in Small-Sample Settings?" IEEE Statistical Signal Processing Workshop, Madison, August, 2007.

Error Bounds

- Distribution-free bounds exist on the RMS between the error and the error estimate.
- Typically, they are useless for small samples.
- For n = 100, RMS ≤ 0.435.
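
The slide does not say which bound produces 0.435. One distribution-free bound that reproduces the figure is the Devroye-Wagner bound for leave-one-out estimation with the kNN rule, RMS ≤ sqrt((6k + 1)/n), evaluated at k = 3 and n = 100. Treat that attribution as an assumption; the sketch below only checks the arithmetic.

```python
from math import sqrt


def loo_knn_rms_bound(k, n):
    """Distribution-free RMS bound sqrt((6k + 1) / n) for leave-one-out with kNN.

    Assumed here to be the bound behind the slide's number; the slide itself
    does not name it.
    """
    return sqrt((6 * k + 1) / n)


bound_n100 = loo_knn_rms_bound(3, 100)    # about 0.436
bound_n1000 = loo_knn_rms_bound(3, 1000)  # about 0.138
```

An RMS bound near 0.44 on a quantity that lies in [0, 1] carries almost no information, which is the point of the "useless for small samples" bullet.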

Salient Points for Small Samples

- Beware of complex classifiers.
- Keep feature sets small.
- Avoid cross-validation where possible.
- Recognize the heavy influence of the feature-label distribution and classification rule.
- Report a list of classifiers and feature sets for analysis.
- Issues:
- Analysis of classifier and feature-selection performance
- Better error estimation
- Mathematical analysis of error estimators
- Braga-Neto, U., and E. R. Dougherty, "Exact Performance of Error Estimators for Discrete Classifiers," Pattern Recognition, 38 (11), 1799-1814, 2005.

Is Knowledge Possible?

- The scientific meaning of a classifier and its error estimate relates to the properties of the error estimator.
- Choice 1: Estimate the population density (impossible).
- Choice 2: Distribution-free error bounds (useless).
- Answer: Model-based analysis
- Pattern Recognition ≠ Data Mining
- Knowledge is possible with proper epistemology.

Apparent Clusters in Microarray Data

Relationship?

patterns

What Are Good Clusters?

- Example
- 2 or 3 clusters?
- What is the best separation?

The Clustering Problem

- Apply a clustering algorithm to data and form clusters, as every clustering algorithm does.
- Say, "Gee whiz! There are known related genes in a cluster."
- Where is the possibility for verification by prediction? Indeed, what is to be verified?

Clustering and Scientific Knowledge

- A scientific theory requires a model and a predictive methodology to test model validity.
- Classification
- Model: classifier and error
- Validity rests on the accuracy of error estimation
- Model inferred (learned from data)
- Clustering as historically used
- Model (algorithm)
- No framework for predictive model testing
- No learning

Probabilistic Theory of Clustering

- Clustering theory in the context of random sets
- Probabilistic error measure based on points being clustered correctly
- Bayes clusterer (optimal clustering algorithm)
- Learning theory for clustering algorithms
- Dougherty, E. R., and M. Brun, "A Probabilistic Theory of Clustering," Pattern Recognition, 37 (5), 917-925, 2004.

Example of Clustering Error

- Left: Realization of point process
- Right: Output of hierarchical clustering
- Error: 40

Validation Indices

- Validation indices are meant to judge the validity of a clustering output.
- They can be based on a number of heuristic considerations and methodologies.
- Do they correspond to scientific validity?
- Do validation indices correlate with clustering error?
- Brun, M., Sima, C., Hua, J., Lowey, J., Carroll, B., Suh, E., and E. R. Dougherty, "Model-Based Evaluation of Clustering Validation Measures," Pattern Recognition, 40 (3), 807-824, 2007.

Kendall's Correlation for Indices

- Top: Realization of point process
- Bottom: Kendall's correlation for different indices across different clustering algorithms

Regulatory Modeling

- Find analytical tools for genomic data that can detect multivariate influences on decision-making produced by complex genetic networks.
- Construct the minimal-complexity network that can model sufficient information transfer to achieve the goal.
- Less computation
- Less data required for inference
- Given a model, discover ways to intervene in its dynamics to obtain desired behavior.

Gene Interaction

- Genes interact via multi-protein complexes, feedback regulation, and pathway networks.
- Complex molecular networks underlie biological function.
- Most diseases do not result from a single gene product.
- These interrelationships among genes constitute gene regulatory networks.

Muscle Network (Drosophila)

- A gene network shows regulatory interaction.
- msp-300 is a hub gene that regulates genes encoding motor proteins responsible for muscle contraction.
- Zhao, W., Serpedin, E., and E. R. Dougherty, "Inferring Gene Regulatory Networks from Time Series Data Using the Minimum Description Length Principle," Bioinformatics, 22 (17), 2129-2135, 2006.

Desirable Model Properties

- Incorporate rule-based dependencies between genes.
- Rule-based dependencies may constitute important biological information.
- Allow systematic study of global network dynamics.
- In particular, individual gene effects on long-run network behavior.
- Cope with uncertainty.
- Small sample size, noisy measurements, robustness
- System must be open to external latent variables

Infer Regulatory Genetic Function?

If gene X1 is active and gene X2 is suppressed, gene Y would be activated.

Can we infer regulatory genetic function from the cDNA microarray data, for both known and unknown functions?

Inference From Data

- Key issues:
- Complex model
- Limited data
- Lack of appropriate time-course data for dynamics
- Fundamental principle: Use the simplest model that provides sufficient information to accomplish the task at hand and that is compatible with the data.
- Formalize inference by postulating criteria that constitute a solution space for the inverse problem.
- Constraint criteria are composed of restrictions on the form of the network (biological, complexity).
- Operational criteria are composed of relations that must be satisfied between the model and the data.

Regulatory Logic

- Jacques Monod: "The logic of biological regulatory systems abides ... like the workings of computers, by the propositional algebra of George Boole."
- Shmulevich, I., and E. R. Dougherty, Genomic Signal Processing, Princeton University Press, Princeton, 2007.

Boolean Predictive Relationships

- Boolean relationships in the NCI 60 ACDS (Anti-Cancer Drug Screen).
- MRC1 VSNL1 ? HTR2C
- SCYA7 CASR ? MU5SAC
- Capture switch-like (ON/OFF) behavior.
- Pal, R., Datta, A., Fornace, A. J., Bittner, M. L., and E. R. Dougherty, "Boolean Relationships Among Genes Responsive to Ionizing Radiation in the NCI 60 ACDS," Bioinformatics, 21 (8), 1542-1549, 2005.

Basic Structure of Boolean Networks

1 means active/expressed
0 means inactive/unexpressed

[Figure: network diagram with genes A and B as inputs to gene X.]

Boolean function:

A  B  X
0  0  1
0  1  1
1  0  0
1  1  1

In this example, two genes (A and B) regulate gene X. In principle, any number of input genes is possible. Positive/negative feedback is also common (and necessary for homeostasis).
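
The truth table on this slide (inputs 00, 01, 10, 11 map to X = 1, 1, 0, 1) can be written as a lookup table, which is a convenient way to store any Boolean rule. The sketch below also checks that this particular table equals the formula X = (NOT A) OR B; the function names are illustrative only.

```python
# Truth table from the slide: (A, B) -> next value of X.
TRUTH_TABLE = {(0, 0): 1, (0, 1): 1, (1, 0): 0, (1, 1): 1}


def next_x(a, b):
    """Look up the next state of gene X from the current states of A and B."""
    return TRUTH_TABLE[(a, b)]


def next_x_formula(a, b):
    """Same rule written as a Boolean expression: X = (NOT A) OR B."""
    return int((not a) or b)


table_matches_formula = all(
    next_x(a, b) == next_x_formula(a, b) for a in (0, 1) for b in (0, 1)
)
```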

Network Dynamics

[Figure: genes A, B, C, D, E, F plotted against time, with example values 0, 1, 1, 0, 0, 1.]

At a given time point, all the genes form a genome-wide gene activity pattern (GAP). Consider the state space formed by all possible GAPs.

State Space of Boolean Networks

- Similar GAPs lie close together.
- There is an inherent directionality in the state space.
- Some states are attractors (or limit-cycle attractors). The system may alternate between several attractors.
- Other states are transient.

Picture generated using the program DDLab.

Probabilistic Boolean Networks

- A PBN is composed of a collection of BNs.
- At any time point, state transitions are controlled according to one of the BNs. With some probability, the PBN can switch to a different BN at a time point.
- So long as there is no switch, the PBN acts like a BN.
- Allows for random gene perturbations.
- Shmulevich, I., Dougherty, E. R., Kim, S., and W. Zhang, "Probabilistic Boolean Networks: A Rule-based Uncertainty Model for Gene Regulatory Networks," Bioinformatics, 18, 261-274, 2002.
- Shmulevich, I., Dougherty, E. R., and W. Zhang, "From Boolean to Probabilistic Boolean Networks as Models of Genetic Regulatory Networks," Proceedings of the IEEE, 90 (11), 1778-1792, 2002.
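
A toy simulation may help fix the mechanics described on this slide. It is a sketch under stated assumptions: the two constituent networks, the parameter names `switch_prob` and `perturb_prob`, and the rule that a perturbed step flips genes instead of applying the network are illustrative choices, not details given in the slides.

```python
import random


def bn_a(state):
    """Hypothetical constituent BN: the two genes copy each other."""
    x1, x2 = state
    return (x2, x1)


def bn_b(state):
    """Hypothetical constituent BN: AND rule for gene 1, OR rule for gene 2."""
    x1, x2 = state
    return (x1 & x2, x1 | x2)


def simulate_pbn(networks, state, steps, switch_prob, perturb_prob, seed=0):
    """Simulate a PBN and return the list of visited states.

    At each step the active network may be re-drawn with probability
    switch_prob. Each gene flips independently with probability perturb_prob;
    if no gene flips, the active network's rules produce the next state.
    """
    rng = random.Random(seed)
    current = 0
    trajectory = [state]
    for _ in range(steps):
        if rng.random() < switch_prob:
            current = rng.randrange(len(networks))
        flips = [rng.random() < perturb_prob for _ in state]
        if any(flips):
            state = tuple(x ^ int(f) for x, f in zip(state, flips))
        else:
            state = networks[current](state)
        trajectory.append(state)
    return trajectory


# No switching and no perturbation: the PBN behaves exactly like bn_a.
deterministic = simulate_pbn([bn_a, bn_b], (1, 0), 3, 0.0, 0.0)
noisy = simulate_pbn([bn_a, bn_b], (1, 0), 50, 0.1, 0.01)
```

With both probabilities set to zero the trajectory is the limit cycle of `bn_a`, matching the bullet that a PBN acts like a BN so long as there is no switch.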

PBN State Space

[Figure: PBN state space showing constituent networks BN 1 and BN 2 and the attractors of each.]

Properties of PBNs

- Share the rule-based properties of Boolean networks.
- Model uncertainty.
- Dynamic behavior studied via Markov chains.
- Close relationship to Bayesian networks.
- Attractors of a PBN are the attractors of the constituent BNs.
- Can leave a BN attractor cycle when the BN switches.
- Brun, M., Dougherty, E. R., and I. Shmulevich, "Steady-State Probabilities for Attractors in Probabilistic Boolean Networks," Signal Processing, in press, 2005.
- Lahdesmaki, H., Hautaniemi, S., Shmulevich, I., and O. Yli-Harja, "Relationships Between Probabilistic Boolean Networks and Dynamic Bayesian Networks as Models of Gene Regulatory Networks," Signal Processing, in press, 2005.

Various Design Methods Proposed

- Find genes with predictive capability for target gene (CoD).
- Use mutual information to find related genes.
- Use MDL principle.
- Optimize connectivity in a Bayesian framework relative to the gene profiles in the data.
- Find networks satisfying biologically related constraints such as limited attractor structure, transient time, and connectivity.
- Assuming steady-state data, require data states to be attractors.
- Assuming biological determinism within a given cellular context, design a PBN under the assumption that constituent BNs produce consistent data subsets in the sample data.

Network Reduction

- Network reduction is often desirable for computational reasons: the state space is too large.
- Delete genes; reconstruct connectivity and rules.
- Preserve probability structure.
- Dougherty, E. R., and I. Shmulevich, "Mappings Between Probabilistic Boolean Networks," EURASIP J. Signal Processing, 83 (4), 799-809, 2003.
- Preserve steady-state distribution.
- Ivanov, I., and E. R. Dougherty, "Reduction Mappings Between Probabilistic Boolean Networks," EURASIP J. Applied Signal Processing, 4 (1), 125-31, 2004.
- Preserve dynamical structure.
- Ivanov, I., Pal, R., and E. R. Dougherty, "Reduction Mappings between Probabilistic Boolean Networks that Preserve Dynamical Structure," IEEE Trans. Signal Processing, to appear.

Intervention

- A key goal of network modeling is to determine intervention targets (genes) such that the network can be persuaded to transition into desired states.
- We desire genes that are the best potential "lever points" in the sense of having the greatest possible impact on desired network behavior.
- Shmulevich, I., Dougherty, E. R., and W. Zhang, "Gene Perturbation and Intervention in Probabilistic Boolean Networks," Bioinformatics, 18, 1319-1331, 2002.

Dynamics

- Dynamics of PBNs can be studied using Markov chain theory.
- We can ask the question: "In the long run, what is the probability that some given gene(s) will be ON/OFF?"
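
One way to answer the long-run question numerically is to iterate the state distribution of the associated Markov chain until it settles, then add up the probability of every state in which the gene is ON. The 4-state transition matrix below is a made-up example for a two-gene network, not one derived from data, and power iteration is just one convenient method.

```python
def stationary_distribution(P, iterations=500):
    """Approximate the steady-state distribution of a row-stochastic matrix P
    by repeatedly applying w <- w P, starting from the uniform distribution."""
    n = len(P)
    w = [1.0 / n] * n
    for _ in range(iterations):
        w = [sum(w[i] * P[i][j] for i in range(n)) for j in range(n)]
    return w


# States of a hypothetical two-gene network, in the same order as the rows of P.
STATES = [(0, 0), (0, 1), (1, 0), (1, 1)]
P = [
    [0.50, 0.50, 0.00, 0.00],
    [0.25, 0.50, 0.25, 0.00],
    [0.00, 0.25, 0.50, 0.25],
    [0.00, 0.00, 0.50, 0.50],
]

pi = stationary_distribution(P)
# Long-run probability that gene 1 (first coordinate) is ON.
prob_gene1_on = sum(p for p, s in zip(pi, STATES) if s[0] == 1)
```

For this matrix the steady state is (1/6, 1/3, 1/3, 1/6), so gene 1 is ON half of the time in the long run.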

Medical Benefits of Network Intervention

- Prediction of new targets based on pathway context.
- Stress and toxic response mechanisms.
- Off-target effects of therapeutic compounds.
- Characterization of disease states by dynamic behavior.
- Gene- and protein-expression signatures for diagnostics.
- Regulatory analysis for therapeutic intervention.

Possible Intervention Goals

- Minimize the mean first passage time to a desirable state.
- Maximize the probability of reaching a desirable state before a certain fixed time.
- Minimize the time needed to reach a desirable state with a given fixed probability.
- Shmulevich, I., Dougherty, E. R., and W. Zhang, "Gene Perturbation and Intervention in Probabilistic Boolean Networks," Bioinformatics, Vol. 18, 1319-1331, 2002.
- Shmulevich, I., Dougherty, E. R., and W. Zhang, "Control of Stationary Behavior in Probabilistic Boolean Networks by Means of Structural Intervention," Biological Systems, Vol. 10, 431-446, 2002.

Where and How to Intervene?

[Figure: gene network with nodes MBP-1, IAP-1, FRA-1, p21, ATF3, SSAT, REL-B, MDM2, PC-1, BCL3, p53, RCH1, annotated "suppress or activate?".]

External Control

- Consider an external control variable and a cost function depending on state desirability and cost of action.
- Minimize the cost function by a sequence of control actions over time (a control policy).
- Application: Design an optimal treatment regime to drive the system away from undesirable states.
- Datta, A., Choudhary, A., Bittner, M. L., and E. R. Dougherty, "External Control in Markovian Genetic Regulatory Networks," Machine Learning, 52, 169-181, 2003.
- Pal, R., Datta, A., and E. R. Dougherty, "Optimal Infinite Horizon Control for Probabilistic Boolean Networks," IEEE Transactions on Signal Processing, 54 (6), 2375-2387, 2006.

Optimal Control

- Key objective: Optimally manipulate the external controls to move the GAP from an undesirable pattern to a desirable pattern.
- Use available information, e.g., phenotypic responses, tumor size, etc.
- Require a paradigm for modeling the evolution of the GAP under different controls.
- PBN is one such paradigm.
- Use the associated Markov chain.

Control in PBNs

- Transition probabilities depend on external control inputs, e.g., chemotherapy, radiation, etc.
- Assume m control inputs $u_1, u_2, \ldots, u_m$.
- Each input can take on the values 0 (not applied) or 1 (applied).
- The values of the control inputs can be changed with time.

Control Setting

- Control input vector at time k: $[u_1(k), \ldots, u_m(k)]$
- Change both GAP and control vector to integers, z(k) and v(k).
- Then z(k) and v(k) can take on $2^n$ and $2^m$ values, respectively, for n genes and m control inputs.
- We have a system
- $w(k+1) = w(k)\,A(v(k))$
- A is a stochastic matrix dependent on the control input.
- We have a controlled homogeneous Markov chain.
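
The controlled update on this slide can be sketched in a few lines, reading it as w(k+1) = w(k) A(v(k)) with w(k) taken to be the row vector of state probabilities at time k (an assumption, since the slide does not define w). The two-state chain and its matrices are invented for illustration: state 0 stands for an undesirable pattern, state 1 for a desirable one.

```python
def step(w, A_v):
    """One step of the controlled chain: w(k+1) = w(k) A(v(k))."""
    n = len(w)
    return [sum(w[i] * A_v[i][j] for i in range(n)) for j in range(n)]


# Hypothetical transition matrices, one per value of the control v.
A = {
    0: [[0.9, 0.1], [0.2, 0.8]],  # control not applied
    1: [[0.5, 0.5], [0.1, 0.9]],  # control applied
}


def propagate(w0, controls):
    """Apply a sequence of control values v(0), v(1), ... to the distribution w0."""
    w = list(w0)
    for v in controls:
        w = step(w, A[v])
    return w


# Start in the undesirable state, apply control once, then withdraw it.
w_final = propagate([1.0, 0.0], [1, 0])
```

Here the distribution moves from (1, 0) to (0.5, 0.5) under control and then to (0.55, 0.45) once control is withdrawn.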

Costs of Applying Control

- Choose v(0), v(1), ... to minimize a particular cost function.
- Choice of cost function?
- Consider a finite treatment horizon k = 0, 1, ..., M - 1.
- Let $C_k(z(k), v(k))$ denote the cost of applying control v(k) at state z(k). (Input from biologists)
- Cost of control over M - 1 time steps

Terminal Costs

- Net result of control action ends up in z(M).
- Penalize z(M) in the cost to reduce chances of ending up in an undesirable state.
- Define $C_M(z(M))$ to be the terminal cost of ending up in state z(M).
- Partition states into equivalence classes.
- Assign higher penalties to states associated with rapid cell proliferation or reduced apoptosis and lower penalties for states associated with normal cell cycle. (Input from biologists)

Total Cost

Optimal Control Problem
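
The formulas for these two slides did not survive extraction. As a hedged sketch, assume the total cost is the expected sum of the per-step control costs plus the terminal cost, and that the optimal policy is found by backward dynamic programming over the horizon. The two-state chain, the cost values, and the time-invariant stage cost below are invented for illustration and are not the presentation's numbers.

```python
def finite_horizon_policy(A, stage_cost, terminal_cost, horizon):
    """Backward dynamic programming for a controlled Markov chain.

    A[v][z][z2] is the probability of moving from state z to z2 under control v.
    Returns (cost_to_go at time 0, policy) where policy[k][z] is the control
    chosen at time k in state z.
    """
    n = len(terminal_cost)
    cost_to_go = list(terminal_cost)
    policy = []
    for _ in range(horizon):
        new_cost, decisions = [], []
        for z in range(n):
            best_v, best = None, float("inf")
            for v in sorted(A):
                expected = sum(A[v][z][z2] * cost_to_go[z2] for z2 in range(n))
                total = stage_cost(z, v) + expected
                if total < best:
                    best_v, best = v, total
            new_cost.append(best)
            decisions.append(best_v)
        cost_to_go = new_cost
        policy.insert(0, decisions)
    return cost_to_go, policy


# Hypothetical model: state 0 is undesirable, state 1 is desirable.
A = {
    0: [[0.9, 0.1], [0.2, 0.8]],  # control not applied
    1: [[0.5, 0.5], [0.1, 0.9]],  # control applied
}


def stage_cost(z, v):
    """Assumed cost of treatment: 1 unit whenever control is applied."""
    return 1.0 * v


terminal_cost = [6.0, 0.0]  # assumed penalty for ending in the undesirable state
J0, policy = finite_horizon_policy(A, stage_cost, terminal_cost, horizon=2)
```

In this toy setting the computed policy applies control only in the undesirable state: treatment is worth its cost when it raises the chance of leaving that state, and not otherwise.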

WNT5A Network

- Up-regulated WNT5A is associated with increased metastasis.
- Cost function penalizes WNT5A being up-regulated.
- Optimal control policy with pirin as control gene.

Shift of Steady-State Distribution

- Optimal (infinite horizon) control with pirin has shifted the steady-state distribution to states with WNT5A down-regulated: (a) with control, (b) without control.

Robust Control

- Owing to model uncertainty, we desire control policies that perform well under perturbations from the model.
- Performance bounds: design the policy on the estimated transition matrix, $\hat{P}$, and bound the difference between the true and estimated controlled steady-state distributions.
- $\|\pi_c - \hat{\pi}_c\|_1 \le k\,\|P - \hat{P}\|_{\text{row-max}}$
- Pal, R., Datta, A., and E. R. Dougherty, "Robust Intervention in Probabilistic Boolean Networks," IEEE Transactions on Signal Processing, to appear.

Robust Control Policies

- Mini-max robust policies: Find optimal worst-case performance over an uncertainty class of networks.
- Pal, R., Datta, A., and E. R. Dougherty, "Robustness of Intervention Strategies for Probabilistic Boolean Networks," IEEE GENSIPS, Tuusula, June, 2007.
- Bayesian robust policies: Find the optimal policy relative to a prior distribution on the uncertainty class.
- Pal, R., Datta, A., and E. R. Dougherty, "Bayesian Robustness in the Control of Gene Regulatory Networks," IEEE Statistical Signal Processing Workshop, Madison, August, 2007.

Collaborators

- Translational Genomics Research Institute (Arizona)
- University of Texas M. D. Anderson Cancer Center
- Tampere University of Technology (Finland)
- University of Sao Paulo (Brazil)
- Strathclyde University (Scotland)
- Columbia University

Acknowledgements

- National Science Foundation
- National Cancer Institute
- National Human Genome Research Institute
- W. M. Keck Foundation

References

- Genomic Signal Processing and Statistics, eds. E. R. Dougherty, I. Shmulevich, J. Chen, J. Wang, Hindawi Press, 2005.
- Genomic Signal Processing, I. Shmulevich and E. R. Dougherty, Princeton University Press, 2007.
- Introduction to Genomic Signal Processing and Control, A. Datta and E. R. Dougherty, CRC Press, 2007.
- Dougherty, E. R., and A. Datta, "Genomic Signal Processing: Diagnosis and Therapy," IEEE Signal Processing Magazine, 22 (1), 107-112, 2005.
- Dougherty, E. R., Datta, A., and C. Sima, "Research Issues in Genomic Signal Processing," IEEE Signal Processing Magazine, 22 (6), 46-68, 2005.
- Dougherty, E. R., Shmulevich, I., and M. L. Bittner, "Genomic Signal Processing: The Salient Issues," EURASIP J. Applied Signal Processing, 4 (1), 146-153, 2004.