Molecular Signaling Drug Development Course

Development of Molecular Signatures from High-Throughput Assay Data

- Alexander Statnikov, Ph.D.
- Director, Computational Causal Discovery Laboratory
- Benchmarking Director, Best Practices Integrative Informatics Consultation Service
- Assistant Professor, Department of Medicine, Division of Clinical Pharmacology
- Center for Health Informatics and Bioinformatics, NYU School of Medicine
- 5/16/2011

Outline

- Part 1 Introduction to molecular signatures
- Part 2 Key principles for developing accurate molecular signatures
- Part 3 Comprehensive evaluation of algorithms to develop molecular signatures for cancer classification
- Part 4 Analysis and computational dissection of molecular signature multiplicity
- Conclusion
- Homework assignment

Part 1 Introduction to molecular signatures

Definition of a molecular signature

- A molecular signature is a computational or mathematical model that links high-dimensional molecular information to a phenotype or other response variable of interest.

FDA view on molecular signatures

- The FDA calls them "in vitro diagnostic multivariate index assays" (IVDMIAs)
- 1. Class II Special Controls Guidance Document: Gene Expression Profiling Test System for Breast Cancer Prognosis
- - Addresses device classification.
- 2. The Critical Path to New Medical Products
- - Identifies pharmacogenomics as crucial to advancing medical product development and personalized medicine.
- 3. Draft Guidance on Pharmacogenetic Tests and Genetic Tests for Heritable Markers; Guidance for Industry: Pharmacogenomic Data Submissions
- - Identifies 3 main goals (dose, ADEs, responders),
- - Defines IVDMIA,
- - Encourages fault-free sharing of pharmacogenomic data,
- - Separates probable from valid biomarkers,
- - Focuses on genomics (and not other omics).

Main uses of molecular signatures

- Direct benefits: models of disease phenotype/clinical outcome
- - Diagnosis
- - Prognosis, long-term disease management
- - Personalized treatment (drug selection, titration)
- Ancillary benefits 1: biomarkers for diagnosis or outcome prediction
- - Make the above tasks resource-efficient and easy to use in clinical practice
- - Help next-generation molecular imaging
- - Leads for potential new drug candidates
- Ancillary benefits 2: discovery of structure and mechanisms (regulatory/interaction networks, pathways, sub-types)
- - Leads for potential new drug candidates

Less conventional uses of molecular signatures

- Increase clinical trial sample efficiency, decrease costs, or both, using placebo responder signatures
- In silico signature-based candidate drug screening
- Drug resurrection
- Establishing existence of biological signal in very small sample situations where univariate signals are too weak
- Assess importance of markers and of mechanisms involving those
- Choosing the right animal model
- ?

Recent molecular signatures available for patient care

Agendia

Clarient

Prediction Sciences

LabCorp

Veridex

University Genomics

Genomic Health

BioTheranostics

Applied Genomics

Power3

Correlogic Systems

Molecular signatures in the market

Company | Product | Disease | Purpose
Agendia | MammaPrint | Breast cancer | Risk assessment for the recurrence of distant metastasis in a breast cancer patient.
Agendia | TargetPrint | Breast cancer | Quantitative determination of the expression level of estrogen receptor, progesterone receptor and HER2 genes. This product is supplemental to MammaPrint.
Agendia | CupPrint | Cancer | Determination of the origin of the primary tumor.
University Genomics | Breast Bioclassifier | Breast cancer | Classification of ER-positive and ER-negative breast cancers into expression-based subtypes that more accurately predict patient outcome.
Clarient | Insight Dx Breast Cancer Profile | Breast cancer | Prediction of disease recurrence risk.
Clarient | Prostate Gene Expression Profile | Prostate cancer | Diagnosis of grade 3 or higher prostate cancer.
Prediction Sciences | RapidResponse c-Fn Test | Stroke | Identification of the patients that are safe to receive tPA and those at high risk for HT, to help guide the physician's treatment decision.
Genomic Health | OncotypeDx | Breast cancer | Individualized prediction of chemotherapy benefit and 10-year distant recurrence to inform adjuvant treatment decisions in certain women with early-stage breast cancer.
bioTheranostics | CancerTYPE ID | Cancer | Classification of 39 types of cancer.
bioTheranostics | Breast Cancer Index | Breast cancer | Risk assessment and identification of patients likely to benefit from endocrine therapy, and whose tumors are likely to be sensitive or resistant to chemotherapy.
Applied Genomics | MammaStrat | Breast cancer | Risk assessment of cancer recurrence.
Applied Genomics | PulmoType | Lung cancer | Classification of non-small cell lung cancer into adenocarcinoma versus squamous cell carcinoma subtypes.
Applied Genomics | PulmoStrat | Lung cancer | Assessment of an individual's risk of lung cancer recurrence following surgery, to help with adjuvant therapy decisions.
Correlogic | OvaCheck | Ovarian cancer | Early detection of epithelial ovarian cancer.
LabCorp | OvaSure | Ovarian cancer | Assessment of the presence of early stage ovarian cancer in high-risk women.
Veridex | GeneSearch BLN Assay | Breast cancer | Determination of whether breast cancer has spread to the lymph nodes.
Power3 | BC-SeraPro | Breast cancer | Differentiation between breast cancer patients and control subjects.

MammaPrint

- Developed by Agendia (www.agendia.com)
- 70-gene signature to stratify women with breast cancer that hasn't spread into "low risk" and "high risk" for recurrence of the disease
- Independently validated in >1,000 patients
- So far performed >10,000 tests
- Cost of the test is $3,000
- In February 2007 the FDA cleared the MammaPrint test for marketing in the U.S. for node-negative women under 61 years of age with tumors of less than 5 cm.
- TIME Magazine's 2007 medical invention of the year.

Oncotype DX

- Developed by Genomic Health (www.genomichealth.com)
- 21-gene signature to predict whether a woman with localized, ER+ breast cancer is at risk of relapse
- Independently validated in >1,000 patients
- So far performed >50,000 tests
- Cost of the test is $3,000
- The following paper shows the health benefits and cost-effectiveness benefits of using Oncotype DX: http://www3.interscience.wiley.com/cgi-bin/abstract/114124513/ABSTRACT

Part 2 Key principles for developing accurate molecular signatures

Main ingredients for developing a molecular signature

[Diagram: well-defined clinical problem and access to patients/samples + high-throughput assays + computational and biostatistical analysis → molecular signature]

Challenges in computational analysis of omics data

- Relatively easy to develop a predictive model and even easier to believe that a model is good when it is not → false sense of security
- Several problems exist, some theoretical and some practical
- Omics data has many special characteristics and is tricky to analyze!

Example: OvaCheck (1/2)

- Developed by Correlogic (www.correlogic.com)
- Blood test for the early detection of epithelial ovarian cancer
- Failed to obtain FDA approval
- Looks for subtle changes in patterns among the tens of thousands of proteins, protein fragments and metabolites in the blood
- Signature developed by genetic algorithm
- Significant artifacts in data collection and analysis called the validity of the signature into question
- - Results are not reproducible
- - Data collected differently for different groups of patients
- http://www.nature.com/nature/journal/v429/n6991/full/429496a.html

Example: OvaCheck (2/2)

[Figure with panels A-F, from Baggerly et al. (Bioinformatics, 2004)]

An early kind of analysis: learning disease sub-types by clustering patient profiles

[Figure: patient profiles; axes p53 and Rb]

Clustering: seeking "natural" groupings, hoping that they will be useful

[Figure: groupings of patient profiles; axes p53 and Rb]

E.g., for classification (predict response to treatment)

[Figure: axes p53 and Rb; groups "Respond to treatment Tx1" and "Do not respond to treatment Tx1"]

Another use of clustering

- Cluster genes (instead of patients)
- Genes that cluster together may belong to the same pathways
- Genes that cluster apart may be unrelated

Unfortunately, clustering is a non-specific method and falls into the "one-solution-fits-all" trap when used for classification

[Figure: axes p53 and Rb; classes "Squamous carcinoma" and "Adenocarcinoma"]

Clustering is also non-specific when used to discover pathways or other mechanistic relationships

It is entirely possible in this simple illustrative counter-example for G3 (a gene causally unrelated to the phenotype) to be more strongly associated, and thus cluster, with the phenotype (or its surrogate genes) than the true causal oncogenes G1 and G2.

[Figure: network over genes G1, G2, G3 and the phenotype Ph]

Two improved classes of methods

- Supervised learning → classification/molecular signatures and markers
- Regulatory network reverse engineering → pathways

Supervised learning: use the known phenotypes (a.k.a. class labels) in training data to build signatures or find markers highly specific for that phenotype

[Diagram: training samples with variables A, B, C, D and class label T (rows A1, B1, C1, D1, T1; A2, B2, C2, D2, T2; ...; An, Bn, Cn, Dn, Tn) → classifier/regression algorithm → molecular signature; testing/validation samples → classification performance]

Input data for supervised learning methods

[Table: one row per sample; a class label column (Primary or Metastatic) followed by the variables/features]

Principles and geometric representation for supervised learning (1/7)

- Want to classify objects as boats and houses.

Principles and geometric representation for supervised learning (2/7)

- All objects before the coast line are boats and all objects after the coast line are houses.
- The coast line serves as a decision surface that separates the two classes.

Principles and geometric representation for supervised learning (3/7)

These boats will be misclassified as houses

This house will be misclassified as a boat

Principles and geometric representation for supervised learning (4/7)

[Figure: boats and houses plotted by longitude and latitude]

- The methods that build classification models (i.e., classification algorithms) operate very similarly to the previous example.
- First, all objects are represented geometrically.

Principles and geometric representation for supervised learning (5/7)

[Figure: boats and houses plotted by longitude and latitude, with a decision surface between them]

Then the algorithm seeks to find a decision surface that separates classes of objects

Principles and geometric representation for supervised learning (6/7)

[Figure: decision surface in the longitude/latitude plane; unseen objects marked "?" on one side are classified as houses, those on the other side as boats]

Unseen (new) objects are classified as boats if they fall below the decision surface and as houses if they fall above it
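The boats-and-houses picture can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the coordinates are invented, and a simple perceptron stands in for whichever classification algorithm is actually used to find the decision surface.

```python
# Hypothetical (longitude, latitude) coordinates; label +1 = house, -1 = boat.
training = [
    ((1.0, 4.0), 1), ((2.0, 5.0), 1), ((3.0, 4.5), 1), ((4.0, 5.5), 1),
    ((1.0, 1.0), -1), ((2.0, 0.5), -1), ((3.0, 1.5), -1), ((4.0, 1.0), -1),
]


def train_perceptron(samples, epochs=1000):
    """Find weights w and bias b of a separating line w[0]*x + w[1]*y + b = 0."""
    w, b = [0.0, 0.0], 0.0
    for _ in range(epochs):
        mistakes = 0
        for (x, y), label in samples:
            # A sample on the wrong side (or on the line) triggers an update.
            if label * (w[0] * x + w[1] * y + b) <= 0:
                w[0] += label * x
                w[1] += label * y
                b += label
                mistakes += 1
        if mistakes == 0:  # every training object is on its correct side
            break
    return w, b


def classify(point, w, b):
    """Objects above the decision surface are houses, objects below are boats."""
    score = w[0] * point[0] + w[1] * point[1] + b
    return "house" if score > 0 else "boat"


w, b = train_perceptron(training)
print(classify((2.5, 5.0), w, b))  # unseen object among the houses
print(classify((2.5, 0.8), w, b))  # unseen object among the boats
```

The learned line is one of many that separate the training objects; which one is preferable is exactly the question that margin-based methods such as SVMs address.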

Principles and geometric representation for supervised learning (7/7)

[Figure: Object 1, Object 2 and Object 3 plotted by longitude and latitude]

In 2-D this looks simple, but what happens in higher-dimensional data?

- 10,000-50,000 (gene expression microarrays, aCGH, and early SNP arrays)
- >500,000 (tiled microarrays, SNP arrays)
- 10,000-300,000 (regular MS proteomics)
- >10,000,000 (LC-MS proteomics)
- >100,000,000 (next-generation sequencing)
- This is the curse of dimensionality problem

High dimensionality (especially with small samples) causes:

- Some methods do not run at all (classical regression)
- Some methods give bad results (KNN, decision trees)
- Very slow analysis
- Very expensive/cumbersome clinical application
- Tendency to overfit

Two problems: over-fitting and under-fitting

- Over-fitting (a model to your data): building a model that is good in original data but fails to generalize well to new/unseen data
- Under-fitting (a model to your data): building a model that is poor in both original data and new/unseen data

Over/under-fitting are related to the complexity of the decision surface and how well the training data is fit

[Figure: outcome of interest Y vs. predictor X, with training data and future data; annotations "This line is good!" and "This line overfits!"]

[Figure: outcome of interest Y vs. predictor X, with training data and future data; annotations "This line is good!" and "This line underfits!"]

Very important concept

- Successful data analysis methods balance training data fit with complexity.
- - Too complex a signature (to fit training data well) → overfitting (i.e., the signature does not generalize)
- - Too simplistic a signature (to avoid overfitting) → underfitting (it will generalize, but the fit to both the training and future data will be low and predictive performance small).
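The trade-off can be reproduced with a toy regression in Python. The data below are invented for illustration (a linear trend with alternating errors of +1 and -1), and the three models are stand-ins for "too simple", "balanced" and "too complex"; none of this comes from the course datasets.

```python
# Hypothetical training data: trend y = 2x + 1 with alternating errors of +1/-1.
train_x = [float(i) for i in range(8)]
train_y = [2 * x + 1 + (1 if i % 2 == 0 else -1) for i, x in enumerate(train_x)]
# Hypothetical future data: the same trend observed at new x positions.
future_x = [i + 0.5 for i in range(7)]
future_y = [2 * x + 1 for x in future_x]


def fit_constant(xs, ys):
    """Too simple: always predict the mean outcome."""
    mean = sum(ys) / len(ys)
    return lambda x: mean


def fit_line(xs, ys):
    """Balanced: ordinary least-squares straight line."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return lambda x: my + slope * (x - mx)


def fit_interpolant(xs, ys):
    """Too complex: the polynomial passing through every training point."""
    def predict(x):
        total = 0.0
        for i, (xi, yi) in enumerate(zip(xs, ys)):
            term = yi
            for j, xj in enumerate(xs):
                if j != i:
                    term *= (x - xj) / (xi - xj)  # Lagrange basis factor
            total += term
        return total
    return predict


def mse(model, xs, ys):
    """Mean squared error of a fitted model on the given data."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


results = {}
for name, fit in [("constant", fit_constant), ("line", fit_line),
                  ("interpolant", fit_interpolant)]:
    model = fit(train_x, train_y)
    results[name] = (mse(model, train_x, train_y), mse(model, future_x, future_y))
    print(name, "train MSE:", results[name][0], "future MSE:", results[name][1])
```

The interpolating polynomial fits the training data perfectly yet does worst on the future data, the constant is poor on both, and the straight line sits in between on training fit while doing best on future data.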

The Support Vector Machine (SVM) approach for building molecular signatures

- The support vector machine (SVM) is a binary classification algorithm.
- SVMs are important because of (a) theoretical reasons:
- - Robust to a very large number of variables and small samples
- - Can learn both simple and highly complex classification models
- - Employ sophisticated mathematical principles to avoid overfitting
- and (b) superior empirical results.

Main ideas of SVMs (1/3)

[Figure: cancer patients and normal patients plotted by Gene X and Gene Y]

- Consider an example dataset described by 2 genes, gene X and gene Y
- Represent patients geometrically (by vectors)

Main ideas of SVMs (2/3)

[Figure: cancer patients and normal patients plotted by Gene X and Gene Y, separated by a hyperplane]

- Find a linear decision surface ("hyperplane") that can separate patient classes and has the largest distance (i.e., largest "gap" or "margin") between border-line patients (i.e., "support vectors")

Main ideas of SVMs (3/3)

- If such a linear decision surface does not exist, the data is mapped into a much higher dimensional space ("feature space") where the separating decision surface is found
- The feature space is constructed via a very clever mathematical projection ("kernel trick").
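A small Python sketch can make the feature-space idea concrete. It assumes an invented XOR-like dataset in two genes and a degree-2 polynomial kernel; both are illustrative choices, not taken from the slides. The code checks that the kernel value equals a dot product in an explicit feature space and that the classes become linearly separable there.

```python
import math

# Hypothetical XOR-like data in (gene X, gene Y): no straight line in the
# original 2-D space separates the +1 class from the -1 class.
samples = [((1.0, 1.0), +1), ((-1.0, -1.0), +1),
           ((1.0, -1.0), -1), ((-1.0, 1.0), -1)]


def feature_map(p):
    """Explicit 3-D feature space of the degree-2 polynomial kernel."""
    x, y = p
    return (x * x, math.sqrt(2) * x * y, y * y)


def poly_kernel(p, q):
    """K(p, q) = (p . q)^2, computed without building the feature space."""
    return (p[0] * q[0] + p[1] * q[1]) ** 2


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


# The kernel returns the same number as the dot product in feature space.
a, b = (0.5, -2.0), (3.0, 1.5)
print(poly_kernel(a, b), dot(feature_map(a), feature_map(b)))

# In feature space, the hyperplane with normal w and zero offset separates
# the classes: the sign of the score matches the label for every sample.
w = (0.0, 1.0, 0.0)
scores = [(label, dot(w, feature_map(point))) for point, label in samples]
for label, score in scores:
    print(label, round(score, 3))
```

An SVM never needs `feature_map` itself; it only evaluates `poly_kernel` on pairs of patients, which is what keeps very high dimensional feature spaces tractable.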

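On estimation of signature accuracy

Large sample case: use hold-out validation

[Figure: data split once into a train portion and a test portion]

Small sample case: use N-fold cross-validation

[Figure: data split into N parts; each part serves as the test set in turn while the remaining parts are used for training]

Nested N-fold cross-validation

Recall the main idea of cross-validation

[Figure: data split into train and test portions]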
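As a rough sketch of N-fold and nested N-fold cross-validation, the Python below splits sample indices into folds, estimates accuracy on the outer folds, and uses inner folds (built from training samples only) to choose a model parameter. The toy 1-D nearest-neighbour classifier, the parameter `k` and the data are all hypothetical stand-ins.

```python
def kfold_splits(n_samples, n_folds):
    """Yield (train, test) index lists; every sample is tested exactly once."""
    indices = list(range(n_samples))
    for fold in range(n_folds):
        test = indices[fold::n_folds]
        held_out = set(test)
        train = [i for i in indices if i not in held_out]
        yield train, test


def knn_predict(train_x, train_y, x, k):
    """Toy 1-D k-nearest-neighbours classifier (a hypothetical stand-in model)."""
    nearest = sorted(range(len(train_x)), key=lambda i: abs(train_x[i] - x))[:k]
    votes = sum(train_y[i] for i in nearest)
    return 1 if 2 * votes > len(nearest) else 0


def accuracy(xs, ys, train, test, k):
    """Fraction of test samples predicted correctly from the training samples."""
    train_x = [xs[i] for i in train]
    train_y = [ys[i] for i in train]
    hits = sum(knn_predict(train_x, train_y, xs[i], k) == ys[i] for i in test)
    return hits / len(test)


def cross_validate(xs, ys, sample_ids, n_folds, k):
    """Mean accuracy over the folds of the samples listed in sample_ids."""
    scores = []
    for train, test in kfold_splits(len(sample_ids), n_folds):
        scores.append(accuracy(xs, ys,
                               [sample_ids[i] for i in train],
                               [sample_ids[i] for i in test], k))
    return sum(scores) / len(scores)


def nested_cross_validate(xs, ys, n_folds, candidate_ks):
    """Outer folds estimate accuracy; inner folds choose k on training data only."""
    outer_scores = []
    for train, test in kfold_splits(len(xs), n_folds):
        best_k = max(candidate_ks,
                     key=lambda k: cross_validate(xs, ys, train, n_folds, k))
        outer_scores.append(accuracy(xs, ys, train, test, best_k))
    return sum(outer_scores) / len(outer_scores)


# Hypothetical expression values of one gene; label 1 = cancer, 0 = normal.
xs = [0.1, 0.3, 0.2, 0.4, 0.5, 0.35, 2.1, 2.3, 2.2, 2.4, 2.5, 2.35]
ys = [0] * 6 + [1] * 6
estimate = nested_cross_validate(xs, ys, n_folds=3, candidate_ks=[1, 3])
print("nested cross-validation accuracy:", estimate)
```

The point of the nesting is that the outer test samples never influence the choice of `k`, so the outer estimate is not optimistically biased by parameter tuning.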

Overview of challenges in computational analysis of omics data for development of molecular signatures

[Diagram: challenges in data analytics of molecular signatures]

- Rashomon effect/marker multiplicity
- Assay validity/reproducibility
- Efficiency: statistical/computational
- Research designs
- Is there predictive signal?
- Causality vs. predictiveness/biological significance
- Methods development: re-inventing the wheel and specialization
- Epistasis
- Many variables, small sample, noise, artifacts
- Instability
- Performance: predictivity, compactness
- Protocols/guidelines
- Editorializing/over-simplifying/sensationalism

Part 3 Comprehensive evaluation of algorithms to develop molecular signatures for cancer classification

Comprehensive evaluation of algorithms for classification of cancer microarray data

- Main goals
- - Find the best performing algorithms for building molecular signatures for cancer diagnosis from microarray gene expression data
- - Investigate benefits of using gene selection and ensemble classification methods.

Classification algorithms

- K-Nearest Neighbors (KNN) [instance-based]
- Backpropagation Neural Networks (NN) [neural networks]
- Probabilistic Neural Networks (PNN) [neural networks]
- Multi-Class SVM: One-Versus-Rest (OVR) [kernel-based]
- Multi-Class SVM: One-Versus-One (OVO) [kernel-based]
- Multi-Class SVM: DAGSVM [kernel-based]
- Multi-Class SVM by Weston and Watkins (WW) [kernel-based]
- Multi-Class SVM by Crammer and Singer (CS) [kernel-based]
- Weighted Voting: One-Versus-Rest [voting]
- Weighted Voting: One-Versus-One [voting]
- Decision Trees: CART [decision trees]

Ensemble classification methods

Gene selection methods

- Signal-to-noise (S2N) ratio in one-versus-rest (OVR) fashion
- Signal-to-noise (S2N) ratio in one-versus-one (OVO) fashion
- Kruskal-Wallis nonparametric one-way ANOVA (KW)
- Ratio of genes' between-categories to within-category sum of squares (BW).
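As an illustration of the first method, the Python sketch below ranks genes by signal-to-noise ratio in one-versus-rest fashion. The slides do not spell out the formula, so the commonly used form S2N = (mean_a - mean_b) / (sd_a + sd_b) is assumed here; the gene names, class labels and expression values are invented.

```python
from statistics import mean, stdev


def signal_to_noise(values_a, values_b):
    """Assumed form: S2N = (mean_a - mean_b) / (sd_a + sd_b) for one gene."""
    return (mean(values_a) - mean(values_b)) / (stdev(values_a) + stdev(values_b))


def rank_genes_ovr(expression, labels, target_class):
    """Rank genes by |S2N| of target_class versus all other classes (OVR)."""
    scores = {}
    for gene, values in expression.items():
        inside = [v for v, lab in zip(values, labels) if lab == target_class]
        outside = [v for v, lab in zip(values, labels) if lab != target_class]
        scores[gene] = abs(signal_to_noise(inside, outside))
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical data: three diagnostic categories, three samples each.
labels = ["AML", "AML", "AML", "ALL", "ALL", "ALL", "CML", "CML", "CML"]
expression = {
    "gene_1": [9.0, 9.5, 10.0, 2.0, 2.5, 3.0, 2.2, 2.8, 3.1],  # marks AML
    "gene_2": [5.0, 6.0, 4.0, 5.5, 4.5, 6.5, 5.2, 4.8, 5.9],   # uninformative
    "gene_3": [1.0, 1.2, 0.8, 1.1, 0.9, 1.0, 7.0, 7.5, 8.0],   # marks CML
}

print(rank_genes_ovr(expression, labels, "AML"))
print(rank_genes_ovr(expression, labels, "CML"))
```

In the one-versus-one variant the same score would be computed for each pair of classes rather than for each class against the rest.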

Performance metrics and statistical comparison

- 1. Accuracy
- - can compare to previous studies
- - easy to interpret and simplifies statistical comparison
- 2. Relative classifier information (RCI)
- - easy to interpret and simplifies statistical comparison
- - not sensitive to distribution of classes
- - accounts for difficulty of a decision problem
- Randomized permutation testing to compare accuracies of the classifiers (α = 0.05)
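One common way to set up such a test is sketched below in Python: for each test sample, the outcomes of the two classifiers are swapped at random, and the observed accuracy difference is compared with the resulting null distribution. The exact procedure used in the study is not given in the slides, so this paired scheme, the function name and the example outcomes are assumptions.

```python
import random


def permutation_test(correct_a, correct_b, n_permutations=10000, seed=0):
    """Two-sided p-value for the accuracy difference of two classifiers.

    correct_a[i] and correct_b[i] are 1 if the classifier got test sample i
    right and 0 otherwise. Under the null hypothesis the classifiers are
    interchangeable, so their outcomes are swapped at random per sample.
    """
    rng = random.Random(seed)
    observed = abs(sum(correct_a) - sum(correct_b))
    at_least_as_extreme = 0
    for _ in range(n_permutations):
        diff = 0
        for a, b in zip(correct_a, correct_b):
            if rng.random() < 0.5:
                a, b = b, a
            diff += a - b
        if abs(diff) >= observed:
            at_least_as_extreme += 1
    # Add-one correction keeps the estimated p-value strictly positive.
    return (at_least_as_extreme + 1) / (n_permutations + 1)


# Hypothetical per-sample outcomes on 20 test samples.
svm_correct = [1] * 19 + [0]
knn_correct = [1] * 12 + [0] * 8
p_value = permutation_test(svm_correct, knn_correct)
print("p-value:", p_value, "significant at 0.05:", p_value < 0.05)
```

Dividing the difference in correct counts by the number of samples gives the difference in accuracy; the count is used directly here because the scaling does not change the p-value.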

Microarray datasets

- Total
- 1300 samples
- 74 diagnostic categories
- 41 cancer types and
- 12 normal tissue types

Summary of methods and datasets

Results without gene selection

Results with gene selection

Improvement of diagnostic performance by gene

selection (averages for the four datasets)

Diagnostic performance before and after gene

selection

Average reduction of genes is 10-30 times

Comparison with previously published results

Summary of results

- Multi-class SVMs are the best family among the tested algorithms, outperforming KNN, NN, PNN, DT, and WV.
- Gene selection in some cases improves classification performance of all classifiers, especially of non-SVM algorithms
- Ensemble classification does not improve performance of SVM and other classifiers
- Results obtained by SVMs compare favorably with the literature.

Random Forest (RF) classifiers

- Appealing properties
- - Work when # of predictors > # of samples
- - Embedded gene selection
- - Incorporate interactions
- - Based on theory of ensemble learning
- - Can work with binary and multiclass tasks
- - Do not require much fine-tuning of parameters
- Strong theoretical claims
- Empirical evidence: Diaz-Uriarte and Alvarez de Andres (BMC Bioinformatics, 2006) reported superior classification performance of RFs compared to SVMs and other methods

Key principles of RF classifiers

[Diagram: training data and testing data pass through the following steps]

1) Generate bootstrap samples

2) Random gene selection

3) Fit unpruned decision trees

4) Apply to testing data and combine predictions
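The four steps can be sketched in Python as follows. To keep the example short, a one-split "stump" on a single gene stands in for a full unpruned decision tree and the task is binary, so this is a simplified bagged ensemble in the spirit of a random forest rather than a faithful implementation; the data, function names and parameters are hypothetical.

```python
import random
from collections import Counter


def fit_stump(rows, labels, gene):
    """One-split tree on a single gene (stand-in for an unpruned tree)."""
    by_class = {}
    for row, lab in zip(rows, labels):
        by_class.setdefault(lab, []).append(row[gene])
    if len(by_class) == 1:  # bootstrap sample happened to contain one class
        only = next(iter(by_class))
        return lambda row: only
    (lab_a, vals_a), (lab_b, vals_b) = sorted(by_class.items())
    mean_a, mean_b = sum(vals_a) / len(vals_a), sum(vals_b) / len(vals_b)
    threshold = (mean_a + mean_b) / 2
    low, high = (lab_a, lab_b) if mean_a < mean_b else (lab_b, lab_a)
    return lambda row: low if row[gene] < threshold else high


def fit_forest(rows, labels, n_trees=25, genes_per_tree=2, seed=0):
    rng = random.Random(seed)
    n, n_genes = len(rows), len(rows[0])
    forest = []
    for _ in range(n_trees):
        # 1) Generate a bootstrap sample (draw n samples with replacement).
        picks = [rng.randrange(n) for _ in range(n)]
        b_rows = [rows[i] for i in picks]
        b_labels = [labels[i] for i in picks]
        # 2) Random gene selection: only a random subset is considered.
        candidates = rng.sample(range(n_genes), genes_per_tree)
        # 3) Fit a tree per candidate gene; keep the best on the bootstrap sample.
        stumps = [fit_stump(b_rows, b_labels, g) for g in candidates]
        forest.append(max(stumps, key=lambda s: sum(
            s(r) == lab for r, lab in zip(b_rows, b_labels))))
    return forest


def predict(forest, row):
    # 4) Apply every tree to the new sample and combine by majority vote.
    return Counter(tree(row) for tree in forest).most_common(1)[0][0]


# Hypothetical data: genes 0 and 1 separate the classes, gene 2 is noise.
rows = [(0.2, 0.9, 1), (0.5, 0.1, 2), (0.8, 0.4, 3), (0.1, 0.6, 4), (0.9, 0.3, 5),
        (5.2, 5.9, 1), (5.5, 5.1, 2), (5.8, 5.4, 3), (5.1, 5.6, 4), (5.9, 5.3, 5)]
labels = ["normal"] * 5 + ["cancer"] * 5
forest = fit_forest(rows, labels)
print(predict(forest, (0.5, 0.4, 3)), predict(forest, (5.5, 5.6, 3)))
```

In a full random forest the random gene subset is redrawn at every split of every tree, and each tree is grown until its leaves are pure.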

Results without gene selection

- SVMs nominally outperform RFs in 15 datasets, RFs outperform SVMs in 4 datasets, and the algorithms are exactly the same in 3 datasets.
- In 7 datasets SVMs outperform RFs statistically significantly.
- On average, the performance advantage of SVMs is 0.033 AUC and 0.057 RCI.

Results with gene selection

- SVMs nominally outperform RFs in 17 datasets, RFs outperform SVMs in 3 datasets, and the algorithms are exactly the same in 2 datasets.
- In 1 dataset SVMs outperform RFs statistically significantly.
- On average, the performance advantage of SVMs is 0.028 AUC and 0.047 RCI.

Part 4 Analysis and computational dissection of molecular signature multiplicity

Molecular signature multiplicity

- Different methods or samples from the same population lead to different but apparently maximally predictive signatures
- Far-reaching implications for biological discovery and development of next generation patient diagnostics and personalized treatments
- - Generation of biological hypotheses is very hard even when signatures are maximally predictive of the phenotype, since thousands of completely different signatures are equally consistent with the data
- - Produced signatures are not statistically generalizable to new cases, and thus not reliable enough for translation to clinical practice.

Molecular signature multiplicity

- Causes of this phenomenon are unknown; several contradictory conjectures exist in the field
- - Signature multiplicity is due to small samples (Michiels et al., 2005)
- - Signature multiplicity leads to predictively non-reproducible signatures (Ein-Dor et al., 2006); building reproducible signatures requires thousands of samples (Ioannidis, 2005)
- - Signature multiplicity is a by-product of the complex regulatory connectivity of the genome (Dougherty and Brun, 2006)
- - Artifacts of data pre-processing, e.g. normalization (Gold et al., 2005; Qiu et al., 2005; Ploner et al., 2005)

Major goals

- Develop a Markov boundary characterization of the molecular signature multiplicity phenomenon
- Design and study algorithms that can correctly identify the set of maximally predictive and non-redundant molecular signatures
- Conduct an empirical evaluation of the novel algorithms and compare to the existing state-of-the-art methods
- Test and refine previously stated hypotheses about the causes of the signature multiplicity phenomenon.

Optimality criteria of signatures

- Signatures that are the focus of this research satisfy the following two optimality criteria:
- - maximally predictive of the phenotype (they achieve the best predictivity of the phenotype in the given dataset over all signatures based on different gene sets)
- - do not contain predictively redundant genes (i.e., genes that can be removed from the signature without adversely affecting its predictivity).

Why do we need algorithms to extract as many optimal signatures as possible?

- A deeper understanding of the signature multiplicity phenomenon and how it affects reproducibility of signatures
- Improving discovery of the underlying biological mechanisms by not missing genes that are implicated biologically in disease processes
- Catalyzing regulatory approval by establishing in-silico equivalence to previously validated signatures

Existing algorithms for multiple signature extraction: resampling-based methods

[Diagram: training data → 1) generate resampled datasets (e.g., by bootstrapping) → 2) apply a standard signature extraction algorithm (e.g., SVM-RFE) → signatures X1, X2, X3, ..., XN]

- Based on the assumption that multiplicity is strictly a small-sample phenomenon
- An infinite number of resamplings is required to extract all optimal signatures
- May stop producing multiple signatures in large sample sizes.

Existing algorithms for multiple signature extraction: iterative removal

[Diagram: original data (all genes) → signature X1 → remove the corresponding genes from the dataset → reduced data (excluding X1 genes) → signature X2 → remove the corresponding genes from the dataset → reduced data (excluding X1 and X2 genes) → signature X3 → ... until a signature has statistically significantly reduced predictivity]

- Agnostic to what causes molecular signature multiplicity
- Cannot discover signatures that have genes in common.

Existing algorithms for multiple signature extraction: stochastic gene selection

Genetic Algorithms (e.g., GA/KNN or GA/SVM)

- Can output all signatures that are discoverable by a genetic algorithm when it is allowed to evolve for an infinite number of generations.

KIAMB

- Stochastic Markov boundary method based on the IAMB algorithm
- In a specific class of distributions, every optimal signature will be output by this method with nonzero probability
- Requires an infinite number of iterations to discover all optimal signatures; will discover the same signature over and over again
- Sample requirements are exponential in the number of genes in a signature.

Existing algorithms for multiple signature extraction: brute-force exhaustive search

LIKNON

- Examines predictivity of all individual genes in the dataset, all pairs of genes, all triples of genes, and so on
- It is infeasible when a signature has more than 2-3 genes
- Agnostic to what causes signature multiplicity.
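A quick count shows why exhaustive search stops being feasible so early. The Python snippet below assumes a hypothetical dataset of 10,000 genes, which is at the low end of the microarray sizes quoted earlier in the course, and counts how many candidate gene sets of each size would have to be evaluated.

```python
from math import comb

N_GENES = 10_000  # hypothetical number of genes on the array


def candidate_signatures(n_genes, size):
    """Number of distinct gene sets of the given size."""
    return comb(n_genes, size)


for size in range(1, 6):
    print(f"signatures with {size} gene(s): {candidate_signatures(N_GENES, size):,}")
```

Every candidate requires training and validating a classifier, so the jump from tens of millions of pairs to hundreds of billions of triples is already prohibitive.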

In summary, no current algorithm provides a systematic and efficient approach for identification of the set of maximally predictive and non-redundant molecular signatures that exist in the underlying distribution.

I. Markov boundary characterization of molecular signature multiplicity

Key definitions (1/2)

- Definition of maximally predictive molecular signature: A maximally predictive molecular signature is a molecular signature that maximizes predictivity of the phenotype relative to all other signatures that can be constructed from the same dataset.
- Definition of maximally predictive and non-redundant molecular signature: A maximally predictive and non-redundant molecular signature based on variables X is a maximally predictive signature such that any signature based on a proper subset of variables in X is not maximally predictive.

Key definitions (2/2)

- Definition of Markov blanket: A Markov blanket $M$ of the response variable $T \in V$ in the joint probability distribution $P$ over variables $V$ is a set of variables conditioned on which all other variables are independent of $T$, i.e. for every $Y \in V \setminus (M \cup \{T\})$, $T \perp Y \mid M$.
- Definition of Markov boundary (or non-redundant Markov blanket): If $M$ is a Markov blanket of $T$ and no proper subset of $M$ satisfies the definition of Markov blanket of $T$, then $M$ is called a Markov boundary (or non-redundant Markov blanket) of $T$.
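These definitions can be checked by brute force on a tiny distribution. The Python sketch below builds an invented joint distribution over A, B, C and T in which B is an exact copy of A (a shortcut chosen only to keep the table small; multiplicity does not require a deterministic relation), tests the conditional independences exactly with fractions, and lists the minimal Markov blankets. All names are hypothetical.

```python
from fractions import Fraction
from itertools import combinations, product

NAMES = ("A", "B", "C", "T")


def build_joint():
    """Hypothetical distribution: B copies A, C is irrelevant, T follows A."""
    joint = {}
    for a, c, t in product((0, 1), repeat=3):
        p_t = Fraction(3, 4) if t == a else Fraction(1, 4)
        joint[(a, a, c, t)] = Fraction(1, 2) * Fraction(1, 2) * p_t
    return joint


def marginal(joint, names):
    """Marginal distribution over the given variable names."""
    positions = [NAMES.index(n) for n in names]
    out = {}
    for values, p in joint.items():
        key = tuple(values[i] for i in positions)
        out[key] = out.get(key, 0) + p
    return out


def independent_given(joint, y, target, cond):
    """Exact check of target independent of y given cond:
    P(t, y, m) * P(m) == P(t, m) * P(y, m) for all values."""
    cond = list(cond)
    p_tym = marginal(joint, [target, y] + cond)
    p_tm = marginal(joint, [target] + cond)
    p_ym = marginal(joint, [y] + cond)
    p_m = marginal(joint, cond)
    for t, y_val in product((0, 1), repeat=2):
        for m in p_m:
            lhs = p_tym.get((t, y_val) + m, 0) * p_m[m]
            rhs = p_tm.get((t,) + m, 0) * p_ym.get((y_val,) + m, 0)
            if lhs != rhs:
                return False
    return True


def is_markov_blanket(joint, subset, target="T"):
    """Every variable outside subset is independent of target given subset."""
    others = [n for n in NAMES if n != target and n not in subset]
    return all(independent_given(joint, y, target, subset) for y in others)


def markov_boundaries(joint, target="T"):
    """Markov blankets that have no proper subset that is also a blanket."""
    candidates = [n for n in NAMES if n != target]
    blankets = [set(s) for r in range(len(candidates) + 1)
                for s in combinations(candidates, r)
                if is_markov_blanket(joint, s, target)]
    return [m for m in blankets if not any(other < m for other in blankets)]


boundaries = markov_boundaries(build_joint())
print(boundaries)
```

The enumeration over all subsets is only workable for a handful of variables, which is the same combinatorial obstacle that makes brute-force signature search infeasible on real data.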

Theoretical results

- Variable sets that participate in the maximally predictive signatures of T are precisely the Markov blankets of T and vice-versa
- Similarly, variable sets that participate in the maximally predictive and non-redundant signatures of T are precisely the Markov boundaries of T and vice-versa
- If a joint probability distribution P over variables V satisfies the intersection property, then there exists a unique Markov boundary of T (Pearl, 1988).

A fundamental reduction used in this research for the analysis of signatures

[Figure: cases and controls plotted by Gene X and Gene Y, with decision surfaces S1-S5; annotations "Signatures that have maximal predictivity of the phenotype relative to their genes" and "Signatures with worse predictivity"]

- Since there is an infinite number of signatures with maximal predictivity, when I refer to a signature, I mean one of the predictively equivalent classifiers (e.g., S3 or S4 or S5)
- Can study signature classes by reference only to their genes
- This reduction is justified whenever the classifiers used can learn the minimum error decision function given sufficient sample.

Example of Markov boundary multiplicity

Network structure

Distributional constraints

- Many optimal signatures exist, e.g., {A, C} and {B, C} are maximally predictive and non-redundant signatures of T. Furthermore, {A, C} and {B, C} remain maximally predictive even in infinite samples
- The network has very low connectivity
- Genes in optimal signatures do not have to be deterministically related, e.g., A and B are not deterministically related, yet individually convey the same information about T
- If an algorithm selects only one optimal signature, then there is a danger of missing biologically important causative genes
- The union of all optimal signatures includes all genes located in the local pathway around T
- In this example the intersection of all optimal signatures contains only genes in the local pathway around T.

II. A novel algorithm to correctly identify the set of maximally predictive and non-redundant signatures

TIE generative algorithm

TIE algorithm for gene expression data analysis

Trace of the TIE algorithm

- M = {A, B, F}
- G = {F}: Mnew = {A, B}. Not a Markov boundary. Do not consider any G that is a superset of {F}
- G = {A}: Mnew = {C, B, F}. Markov boundary
- G = {B}: Mnew = {A, D, E, F}. Markov boundary
- G = {A, B}: Mnew = {C, D, E, F}. Markov boundary

Theoretical results (1/2)

- TIE returns all and only Markov boundaries of T (i.e., maximally predictive and non-redundant signatures) if its input components X, Y, Z are admissible
- IAMB is an admissible Markov boundary algorithm (input component X) under assumptions
- - IAMB correctly outputs a Markov boundary if only the composition property holds
- HITON-PC is an admissible Markov boundary algorithm (input component X) under assumptions
- - HITON-PC correctly outputs a Markov boundary if the adjacency faithfulness assumption holds except for violations of the intersection axiom, the global Markov condition holds, and there are no spouses in the Markov boundary

Theoretical results (2/2)

- Stated three strategies (IncLex, IncMinAssoc, and IncMaxAssoc) to generate subsets of variables that have to be removed from V to identify new Markov boundaries of T and proved their admissibility (input component Y)
- Stated two criteria (Independence and Predictivity) to verify Markov boundaries and proved their admissibility (input component Z).

III. Empirical evaluation of the novel algorithms and comparison with existing state-of-the-art methods

A. Experiments with artificial simulated data

- Generative model is available, and the set of Markov boundaries (and thus the set of maximally predictive and non-redundant signatures) is known.
- Generate samples of systematically varied sizes
- Compare to the gold standard
- Test whether the TIE algorithm behaves according to theoretical expectations and study its empirical properties
- Obtain clues about behavior of TIE and baseline comparison algorithms in experiments with real gene expression data.

Experiments with discrete networks TIED1 and TIED2

- Two artificial discrete networks were created
- TIED1 consists of 30 variables (including a response variable T) and contains 72 Markov boundaries of T
- TIED2 consists of 1,000 variables (including a response variable T) and contains the same 72 Markov boundaries of T as TIED1.

Experiments

- Goal: Compare TIE to state-of-the-art algorithms (Resampling-based methods, KIAMB, and Iterative Removal) and examine sensitivity of the tested methods to high dimensionality.
- Findings
- - TIE correctly identifies the set of true Markov boundaries (maximally predictive and non-redundant signatures) in the datasets with 30 or 1,000 variables
- - Iterative Removal identifies only 1 signature
- - KIAMB fails to identify any true signature, and its output signatures have poor predictivity
- - Resampling-based methods miss true signatures, output many redundant variables in the signatures, or both.

Experiments with linear continuous network LIND

LIND consists of 41 variables (including a response variable T) and contains 12 Markov boundaries of T.

Experiments

- Goals
- - Analyze behavior of TIE as a function of sample size using data generated from a continuous network
- - Compare criteria Independence and Predictivity for verification of Markov boundaries in the TIE algorithm.
- Findings
- - As sample size increases, the performance of both instantiations of TIE generally improves and the algorithms discover the set of true Markov boundaries
- - The α-level in the criterion Predictivity significantly affects the number of Markov boundaries output by the TIE algorithm
- - TIE with criterion Predictivity typically leads to a larger number of output Markov boundaries and on average superior performance compared to criterion Independence.

Experiments with discrete network XORD

XORD consists of 41 variables (including a response variable T) and contains 25 Markov boundaries of T.

Experiments

- Goal: Evaluate TIE when popular Markov boundary algorithms such as IAMB and HITON-PC are not applicable due to violations of their fundamental assumptions.
- Findings
- - TIE discovers the set of true Markov boundaries when the sample is 2,000
- - There is 1 false positive variable in each discovered Markov boundary for large sample sizes.

B. Experiments with resimulated microarray gene expression data

- Resimulated data by design closely resembles real human lung cancer microarray gene expression data
- Knowledge of the generative model makes it possible to generate arbitrarily large samples and study the behavior of TIE as a function of sample size
- Unlike prior experiments with artificial simulated datasets, the set of maximally predictive and non-redundant signatures is not known a priori.

Experiment

Goal: Examine whether the signature multiplicity phenomenon vanishes as the sample size grows.

Results

Findings of other experiments

- TIE is not sensitive to the choice of the initial signature discovered by the algorithm
- Post-processing TIE signatures with wrapping results in more signatures with a smaller number of genes
- Signatures output by tested non-TIE methods are either redundant or have inferior predictivity compared to signatures output by TIE techniques.

C. Experiments with real human microarray gene

expression data

- Independent-dataset experiments: Using pairs of microarray datasets either from different laboratories or different platforms
- Single-dataset experiments: Additional experiments with relatively large sample size microarray datasets
- The primary goal of both experiments is to compare TIE and baseline algorithms for multiple signature extraction in terms of maximal predictivity of induced signatures and reproducibility in independent data.
- Operational definition of maximal predictivity: Empirically best classification performance (AUC) achievable in each dataset over all tested methods under consideration.
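The operational definition above can be sketched as follows. This is an illustration under stated assumptions: `auc` uses the standard pairwise (Mann-Whitney) formulation, `maximal_predictivity` is a hypothetical helper name, and the method names and scores are made up. It does not include the statistical comparison that decides whether a method is distinguishable from the best one.

```python
def auc(scores, labels):
    """Area under the ROC curve via the Mann-Whitney statistic.

    Counts, over all (positive, negative) pairs, how often the positive
    sample gets the higher score; ties count as one half.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one sample of each class")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def maximal_predictivity(auc_by_method):
    """Empirically best AUC in one dataset over all tested methods."""
    best_method = max(auc_by_method, key=auc_by_method.get)
    return best_method, auc_by_method[best_method]

# Hypothetical scores from two made-up methods on the same six samples.
labels = [1, 1, 1, 0, 0, 0]
scores = {
    "method_a": [0.9, 0.8, 0.4, 0.5, 0.2, 0.1],
    "method_b": [0.9, 0.3, 0.2, 0.8, 0.7, 0.1],
}
auc_by_method = {name: auc(s, labels) for name, s in scores.items()}
print(maximal_predictivity(auc_by_method))
```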

Independent-dataset experiments Datasets

| Task | Discovery: sample size | Discovery: samples per class | Discovery: number of genes | Discovery: microarray platform | Validation: sample size | Validation: samples per class | Validation: number of genes | Validation: microarray platform | Number of common genes |
|---|---|---|---|---|---|---|---|---|---|
| Lung Cancer Diagnosis: lung tumors vs. normals (non-tumor lung samples) | 203 | lung tumors (186), normals (17) | 12600 | Affymetrix U95A | 96 | lung tumors (86), normals (10) | 7129 | Affymetrix HuGeneFL | 7094 |
| Lung Cancer Subtype Classification: adenocarcinoma vs. squamous cell carcinoma lung tumors | 160 | adenocarcinoma (139), squamous (21) | 12600 | Affymetrix U95A | 28 | adenocarcinoma (14), squamous (14) | 12533 | Affymetrix U95A | 12533 |
| Breast Cancer Subtype Classification: estrogen receptor positive (ER+) vs. ER- breast tumors, untreated patients | 286 | ER+ (209), ER- (77) | 22283 | Affymetrix U133A | 119 | ER+ (85), ER- (34) | 22283 | Affymetrix U133A | 22283 |
| Breast Cancer 5 Yr. Prognosis: ER+ patients who developed distant metastases within 5 years (poor prognosis) vs. ones who did not (good prognosis) | 204 | poor prognosis (66), good prognosis (138) | 22283 | Affymetrix U133A | 72 | poor prognosis (13), good prognosis (59) | 22283 | Affymetrix U133A | 22283 |
| Glioma Subtype Classification: grade III vs. grade IV glioma tumors | 100 | grade III (24), grade IV (76) | 22283 | Affymetrix U133A | 85 | grade III (26), grade IV (59) | 22283 | Affymetrix U133A | 22283 |
| Leukemia 5 Yr. Prognosis: patients with disease-free survival < 5 years (ones who had relapse or competing events within 5 years) vs. > 5 years | 164 | survival < 5 yr. (29), survival > 5 yr. (135) | 12625 | Affymetrix U95A | 79 | survival < 5 yr. (18), survival > 5 yr. (61) | 22283 | Affymetrix U133A | 10507 |

Detailed results (1/3)

Detailed results (2/3)

Detailed results (3/3)

TIE signatures have maximal predictivity

- TIE achieves maximal predictivity in 5 out of 6 validation datasets
- Non-TIE methods achieve maximal predictivity in 0 to 2 datasets, depending on the method
- In the one dataset where the predictivity of TIE is statistically distinguishable from the empirically maximal one (Lung Cancer Subtype Classification), the magnitude of this difference is only 0.009 AUC on average over all discovered signatures.

TIE signatures are reproducible, other

signatures may be overfitted

- On average over all signatures and datasets, TIE shows no overfitting
- Other methods achieve predictivity in the validation data that is lower than in the discovery data (by 0.02-0.03 AUC), in addition to having inferior predictivity
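One simple way to quantify the overfitting described above is the difference between discovery and validation AUC, averaged over signatures. The sketch below is an assumption about how such a gap could be computed, with hypothetical helper names and made-up AUC values; it is not the exact procedure or data from the experiments.

```python
from statistics import mean

def overfitting_gap(discovery_auc, validation_auc):
    """Discovery AUC minus validation AUC; positive values suggest overfitting."""
    return discovery_auc - validation_auc

def mean_gap(pairs):
    """Average gap over (discovery_auc, validation_auc) pairs, one per signature."""
    return mean(overfitting_gap(d, v) for d, v in pairs)

# Made-up AUC pairs for illustration only; not results from the experiments.
pairs = [(0.90, 0.88), (0.85, 0.86), (0.92, 0.89)]
print(round(mean_gap(pairs), 4))
```

A mean gap near zero corresponds to "no overfitting on average", while a gap of 0.02-0.03 corresponds to the drop reported for the other methods.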

TIE signatures in comparison with other

signatures

Predictivity results for Leukemia 5 Yr. Prognosis task

Figure axes: classification performance (AUC) in discovery dataset and classification performance (AUC) in validation dataset.

Each dot in the plot corresponds to a signature (computational model) of the outcome, e.g., $\mathrm{Outcome}(x) = \mathrm{sign}(w \cdot x + b)$, where $x, w \in \mathbb{R}^m$, $b \in \mathbb{R}$, and $m$ is the number of genes in the signature.
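A minimal sketch of a signature of the linear form sign(w · x + b) mentioned above. The function name `linear_signature`, the weights, the offset, and the convention that a score of exactly 0 maps to +1 are all assumptions made for illustration.

```python
def linear_signature(w, b):
    """Return a classifier x -> sign(w . x + b) with outputs in {-1, +1}."""
    def predict(x):
        if len(x) != len(w):
            raise ValueError("x must have one value per gene in the signature")
        score = sum(wi * xi for wi, xi in zip(w, x)) + b
        return 1 if score >= 0 else -1  # convention: a score of 0 maps to +1
    return predict

# Hypothetical 3-gene signature; weights and offset are made up.
predict = linear_signature(w=[0.5, -1.0, 2.0], b=-0.25)
print(predict([1.0, 0.5, 0.2]))  # score = 0.5 - 0.5 + 0.4 - 0.25 = 0.15
print(predict([0.0, 1.0, 0.0]))  # score = -1.0 - 0.25 = -1.25
```

Here m = 3, so both w and x have three entries, one per gene in the signature.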

Single-dataset experiments Datasets

| Task | Sample size | Samples per class | Number of genes | Microarray platform |
|---|---|---|---|---|
| Lymphoma Subtype Classification I: diffuse large B-cell lymphoma (DLBCL) vs. Burkitt's lymphoma (BL) patients | 303 | DLBCL (258), BL (45) | 2745 | Human LymphDx 2.7k GeneChip |
| Lymphoma Subtype Classification II: diffuse large B-cell lymphoma (DLBCL) vs. mediastinal large B-cell lymphoma (MLBCL) patients | 210 | DLBCL (176), MLBCL (34) | 32403 (44928) | Affymetrix U133A and U133B |
| Breast Cancer Subtype Classification I: p53 mutant vs. wild-type breast tumors | 251 | p53 mutant (58), p53 wild-type (193) | 22283 | Affymetrix U133A |
| Breast Cancer Subtype Classification II: estrogen receptor positive (ER+) vs. ER- breast tumors | 247 | ER+ (213), ER- (34) | 22283 | Affymetrix U133A |
| Breast Cancer Subtype Classification III: progesterone receptor positive (PgR+) vs. PgR- breast tumors | 251 | PgR+ (190), PgR- (61) | 22283 | Affymetrix U133A |
| Breast Cancer 5 Yr. Prognosis: ER+ patients who developed distant metastases within 5 years (poor prognosis) vs. ones who did not (good prognosis) | 215 | poor prognosis (51), good prognosis (164) | 24496 | Agilent Hu25K |
| Bladder Cancer Stage Classification: stage Ta vs. other stages (T1, T2, T3, T4) of bladder tumors | 404 | stage Ta (189), other stages (215) | 1381 (3072) | MDL Human 3k |

- Validation dataset = subset of 100 samples/patients
- Discovery dataset = all remaining samples/patients
- Repeat splits into discovery and validation datasets 10 times to reduce variance
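The splitting protocol above can be sketched as follows. The slide does not say whether the splits are stratified by class, so plain random sampling without stratification is an assumption here, as are the function name `repeated_splits` and the fixed seed.

```python
import random

def repeated_splits(n_samples, n_validation=100, n_repeats=10, seed=0):
    """Yield (discovery_idx, validation_idx) pairs of sample indices.

    Each repeat draws `n_validation` samples at random for validation and
    uses all remaining samples for discovery.
    """
    if n_validation >= n_samples:
        raise ValueError("need more samples than the validation size")
    rng = random.Random(seed)
    indices = list(range(n_samples))
    for _ in range(n_repeats):
        validation = sorted(rng.sample(indices, n_validation))
        held_out = set(validation)
        discovery = [i for i in indices if i not in held_out]
        yield discovery, validation

# Example: a dataset with 303 samples, as in Lymphoma Subtype Classification I.
splits = list(repeated_splits(303))
print(len(splits), len(splits[0][0]), len(splits[0][1]))
```

Performance estimates would then be averaged over the 10 splits, which is what reduces their variance relative to a single split.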

Single-dataset experiments Summary results

- Results are similar to the ones from the independent-dataset experiments
- TIE achieves maximal predictivity in 6 out of 7 validation datasets
- Non-TIE methods achieve maximal predictivity in 0 to 1 datasets, depending on the method
- In the one dataset where TIE has predictivity that is statistically distinguishable from the empirically maximal one (Breast Cancer Subtype Classification II), the magnitude of this difference is less than 0.01 AUC on average over all discovered signatures.

IV. Discussion and interpretation of results

Revisiting previously published hypotheses about

signature multiplicity

- Signature reproducibility neither precludes multiplicity nor requires sample sizes with thousands of subjects
- Multiplicity of signatures does not require dense connectivity
- Noisy measurements or normalization are not necessary conditions for signature multiplicity
- Multiplicity can be produced by a combination of small sample size-related variance and intrinsic multiplicity in the underlying network
- Multiple signatures output by TIE are reproducible even though they are derived from small-sample, noisy, and heavily processed data.

A more complete picture is emerging regarding

causes of multiplicity...

- Intrinsic information redundancy in the underlying biological system
- Variability in the output of gene selection and classifier algorithms, especially in small sample sizes
- Small-sample statistical indistinguishability of signatures with different large-sample predictivity and/or redundancy characteristics
- Presence of hidden variables
- Correlated measurement noise
- RNA amplification techniques that systematically distort measurements of transcript ratios
- Cellular aggregation and sampling from mixtures of distributions that affect inference of conditional independence relations
- Normalization and other data pre-processing methods that artificially increase correlations among genes
- Engineered redundancy in the assay technology platforms.
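To make the first cause in the list concrete, here is a toy sketch of intrinsic information redundancy. The distribution is an assumption chosen for illustration (B is an exact copy of A, and T equals A with probability 0.9), and `conditional` is a hypothetical helper; real biological redundancy is rarely this clean.

```python
from collections import defaultdict

def conditional(joint, target_idx, given_idx):
    """P(target | given) as {given_values: {target_value: prob}}."""
    num, den = defaultdict(float), defaultdict(float)
    for assignment, p in joint.items():
        g = tuple(assignment[i] for i in given_idx)
        num[(g, assignment[target_idx])] += p
        den[g] += p
    table = defaultdict(dict)
    for (g, t), p in num.items():
        table[g][t] = p / den[g]
    return dict(table)

# Toy distribution over (A, B, T), an assumption for illustration:
# B always equals A, and T equals A with probability 0.9.
joint = {}
for a in (0, 1):
    for t in (0, 1):
        joint[(a, a, t)] = 0.5 * (0.9 if t == a else 0.1)

p_t_given_a = conditional(joint, 2, (0,))
p_t_given_b = conditional(joint, 2, (1,))
p_t_given_ab = conditional(joint, 2, (0, 1))

# {A} and {B} predict T identically, and adding the other variable changes nothing.
for a in (0, 1):
    print(p_t_given_a[(a,)], p_t_given_b[(a,)], p_t_given_ab[(a, a)])
```

In this toy case {A} and {B} are two distinct, equally predictive, non-redundant signatures of T, which is the kind of multiplicity that does not disappear with more samples.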

Summary of results

- Developed a Markov boundary characterization of molecular signature multiplicity
- Designed a generative algorithm that can correctly identify the set of maximally predictive and non-redundant molecular signatures, in principle independently of data distribution
- Conducted an empirical evaluation of the novel algorithm and compared it to existing state-of-the-art methods using artificial simulated, resimulated microarray gene expression, and real human microarray gene expression data
- Tested and refined several hypotheses about the causes of the molecular signature multiplicity phenomenon.

General conclusions

- Molecular signatures play a crucial role in personalized medicine and translational bioinformatics.
- Molecular signatures are being used to treat patients today, not in the future.
- Development of accurate molecular signatures should rely on the use of supervised methods.
- In general, there are many challenges for computational analysis of omics data for development of molecular signatures.
- One of these challenges is molecular signature multiplicity.
- There exists an algorithm that can extract the set of maximally predictive and non-redundant molecular signatures from high-throughput data.

Homework (Due next Monday)

- Read the paper "Analysis and Computational Dissection of Molecular Signature Multiplicity".
- Describe a novel and interesting application area for the TIE algorithm. Feel free to use an example from your research where there exist many molecular signatures of some response variable (1/2 page max).
- Come up with another cause of molecular signature multiplicity that was not mentioned in the paper (1/2 page max).
- Email your work to Alexander.Statnikov_at_med.nyu.edu

Computational Causal Discovery Laboratory at NYU

Center for Health Informatics and Bioinformatics

(CHIBI)

- The purpose of our lab is to develop, test, and apply computational causal discovery methods suitable for molecular, clinical, imaging, and multi-modal data of high dimensionality.
- We are interested in methods to address the following questions:
  - What is causing disease/phenotype?
  - What are the effects of disease/phenotype?
  - What biological pathways are involved?
  - How to design drugs/treatments?
  - How does genotype cause differences in response to treatment?
  - How does the environment modify or even supersede the normal causal function of genes and other molecular variables?
  - How are genes and proteins organized in complex causal regulatory networks?
- Questions? Email Alexander.Statnikov_at_med.nyu.edu