Molecular Signaling

About This Presentation

Description:

Title: Computational advances in reverse-engineering regulatory networks and pathways Author: Alexander Statnikov Last modified by: Alexander Statnikov – PowerPoint PPT presentation

Slides: 113

Provided by: AlexanderS153

Transcript and Presenter's Notes

1
Molecular Signaling Drug Development Course
Development of Molecular Signatures from
High-Throughput Assay Data

Alexander Statnikov, Ph.D.
Director, Computational Causal Discovery
Laboratory
Benchmarking Director, Best Practices Integrative
Informatics Consultation Service
Assistant Professor, Department of Medicine,
Division of Clinical Pharmacology
Center for Health Informatics and Bioinformatics,
NYU School of Medicine
5/16/2011

2
Outline

Part 1: Introduction to molecular signatures
Part 2: Key principles for developing accurate
molecular signatures
Part 3: Comprehensive evaluation of algorithms to
develop molecular signatures for cancer
classification
Part 4: Analysis and computational dissection of
molecular signature multiplicity
Conclusion
Homework assignment

3
Part 1: Introduction to molecular signatures
4
Definition of a molecular signature

A molecular signature is a computational or
mathematical model that links high-dimensional
molecular information to phenotype or other
response variable of interest.

5
FDA view on molecular signatures

The FDA calls them "in vitro diagnostic
multivariate index assays" (IVDMIAs)
1. Class II Special Controls Guidance Document:
Gene Expression Profiling Test System for Breast
Cancer Prognosis
- Addresses device classification.
2. The Critical Path to New Medical Products
- Identifies pharmacogenomics as crucial to
advancing medical product development and
personalized medicine.
3. Draft Guidance on Pharmacogenetic Tests and
Genetic Tests for Heritable Markers; Guidance
for Industry: Pharmacogenomic Data Submissions
- Identifies 3 main goals (dose, ADEs, responders),
- Defines IVDMIA,
- Encourages fault-free sharing of
pharmacogenomic data,
- Separates probable from valid biomarkers,
- Focuses on genomics (and not other omics).

6
Main uses of molecular signatures

Direct benefits: Models of disease
phenotype/clinical outcome
- Diagnosis
- Prognosis, long-term disease management
- Personalized treatment (drug selection,
titration)
Ancillary benefits 1: Biomarkers for diagnosis
or outcome prediction
- Make the above tasks resource efficient and easy
to use in clinical practice
- Help next-generation molecular imaging
- Leads for potential new drug candidates
Ancillary benefits 2: Discovery of structure and
mechanisms (regulatory/interaction networks,
pathways, sub-types)
- Leads for potential new drug candidates

7
Less conventional uses of molecular signatures

Increase clinical trial sample efficiency,
decrease costs, or both, using placebo responder
signatures
In silico signature-based candidate drug
screening
Drug resurrection
Establishing existence of biological signal in
very small sample situations where univariate
signals are too weak
Assess importance of markers and of mechanisms
involving those
Choosing the right animal model
?

8
Recent molecular signatures available for
patient care
Agendia
Clarient
Prediction Sciences
LabCorp
Veridex
University Genomics
Genomic Health
BioTheranostics
Applied Genomics
Power3
Correlogic Systems
9
Molecular signatures in the market
Company | Product | Disease | Purpose
Agendia | MammaPrint | Breast cancer | Risk assessment for the recurrence of distant metastasis in a breast cancer patient.
Agendia | TargetPrint | Breast cancer | Quantitative determination of the expression level of estrogen receptor, progesterone receptor and HER2 genes. This product is supplemental to MammaPrint.
Agendia | CupPrint | Cancer | Determination of the origin of the primary tumor.
University Genomics | Breast Bioclassifier | Breast cancer | Classification of ER-positive and ER-negative breast cancers into expression-based subtypes that more accurately predict patient outcome.
Clarient | Insight Dx Breast Cancer Profile | Breast cancer | Prediction of disease recurrence risk.
Clarient | Prostate Gene Expression Profile | Prostate cancer | Diagnosis of grade 3 or higher prostate cancer.
Prediction Sciences | RapidResponse c-Fn Test | Stroke | Identification of the patients that are safe to receive tPA and those at high risk for HT, to help guide the physician's treatment decision.
Genomic Health | OncotypeDx | Breast cancer | Individualized prediction of chemotherapy benefit and 10-year distant recurrence to inform adjuvant treatment decisions in certain women with early-stage breast cancer.
bioTheranostics | CancerTYPE ID | Cancer | Classification of 39 types of cancer.
bioTheranostics | Breast Cancer Index | Breast cancer | Risk assessment and identification of patients likely to benefit from endocrine therapy, and whose tumors are likely to be sensitive or resistant to chemotherapy.
Applied Genomics | MammaStrat | Breast cancer | Risk assessment of cancer recurrence.
Applied Genomics | PulmoType | Lung cancer | Classification of non-small cell lung cancer into adenocarcinoma versus squamous cell carcinoma subtypes.
Applied Genomics | PulmoStrat | Lung cancer | Assessment of an individual's risk of lung cancer recurrence following surgery, to help with adjuvant therapy decisions.
Correlogic | OvaCheck | Ovarian cancer | Early detection of epithelial ovarian cancer.
LabCorp | OvaSure | Ovarian cancer | Assessment of the presence of early stage ovarian cancer in high-risk women.
Veridex | GeneSearch BLN Assay | Breast cancer | Determination of whether breast cancer has spread to the lymph nodes.
Power3 | BC-SeraPro | Breast cancer | Differentiation between breast cancer patients and control subjects.
10
MammaPrint

Developed by Agendia (www.agendia.com)
70-gene signature to stratify women with breast
cancer that hasn't spread into low risk and
high risk for recurrence of the disease
Independently validated in >1,000 patients
So far performed >10,000 tests
Cost of the test is $3,000
In February 2007 the FDA cleared the
MammaPrint test for marketing in the U.S. for
node-negative women under 61 years of age with
tumors of less than 5 cm.
TIME Magazine's 2007 medical invention of the
year.

11
Oncotype DX

Developed by Genomic Health (www.genomichealth.com)
21-gene signature to predict whether a woman
with localized, ER+ breast cancer is at risk of
relapse
Independently validated in >1,000 patients
So far performed >50,000 tests
Cost of the test is $3,000
The following paper shows the health benefits
and cost-effectiveness benefits of using Oncotype
DX: http://www3.interscience.wiley.com/cgi-bin/abstract/114124513/ABSTRACT

12
Part 2: Key principles for developing accurate
molecular signatures
13
Main ingredients for developing a molecular
signature
[Diagram: well-defined clinical problem and access to
patients/samples + high-throughput assays +
computational and biostatistical analysis → molecular signature]
14
Challenges in computational analysis of omics
data

Relatively easy to develop a predictive model and
even easier to believe that a model is good when
it is not → false sense of security
Several problems exist, some theoretical and some
practical
Omics data has many special characteristics and
is tricky to analyze!

15
Example: OvaCheck (1/2)

Developed by Correlogic (www.correlogic.com)
Blood test for the early detection of epithelial
ovarian cancer
Failed to obtain FDA approval
Looks for subtle changes in patterns among the
tens of thousands of proteins, protein fragments
and metabolites in the blood
Signature developed by genetic algorithm
Significant artifacts in data collection and
analysis called the validity of the signature into question
Results are not reproducible
Data collected differently for different groups
of patients
http://www.nature.com/nature/journal/v429/n6991/full/429496a.html

16
Example: OvaCheck (2/2)
[Figure, panels A-F: from Baggerly et al. (Bioinformatics,
2004)]
17
An early kind of analysis: learning disease
sub-types by clustering patient profiles
[Figure: patient profiles; axes p53 and Rb]
Clustering: seeking natural groupings, hoping
that they will be useful
[Figure: patient profiles; axes p53 and Rb]
19
E.g., for classification (predict response to
treatment)
[Figure: axes p53 and Rb; groups: "Respond to treatment Tx1" and
"Do not respond to treatment Tx1"]
20
Another use of clustering

Cluster genes (instead of patients)
Genes that cluster together may belong to the
same pathways
Genes that cluster apart may be unrelated

21
Unfortunately, clustering is a non-specific method
and falls into the "one solution fits all" trap
when used for classification
[Figure: axes p53 and Rb; groups: squamous carcinoma and
adenocarcinoma]
22
Clustering is also non-specific when used to
discover pathways, or other mechanistic
relationships
It is entirely possible in this simple
illustrative counter-example for G3 (a gene
causally unrelated to the phenotype) to be more
strongly associated, and thus to cluster, with the
phenotype (or its surrogate genes) than the true
causal oncogenes G1, G2
[Figure: network over genes G1, G2, G3 and the phenotype Ph]
23
Two improved classes of methods

Supervised learning → classification/molecular
signatures and markers
Regulatory network reverse engineering → pathways

24
Supervised learning: use the known phenotypes
(a.k.a. class labels) in training data to build
signatures or find markers highly specific for
that phenotype
[Figure: training samples (A1, B1, C1, D1, T1; A2, B2, C2, D2, T2; ...;
An, Bn, Cn, Dn, Tn) → classifier/regression algorithm →
molecular signature over variables A, B, C, D for the response T →
testing/validation samples → classification performance]
25
Input data for supervised learning methods

[Table: one row per sample; columns: class label, variables/features.
Class labels shown: Primary, Metastatic, Primary, Metastatic,
Metastatic, Primary, Metastatic, Metastatic, Metastatic, Primary,
Metastatic, Primary]
26
Principles and geometric representation for
supervised learning (1/7)

Want to classify objects as boats and houses.

27
Principles and geometric representation for
supervised learning (2/7)

All objects before the coast line are boats and
all objects after the coast line are houses.
Coast line serves as a decision surface that
separates two classes.

28
Principles and geometric representation for
supervised learning (3/7)
These boats will be misclassified as houses
This house will be misclassified as boat
29
Principles and geometric representation for
supervised learning (4/7)
[Figure: boats and houses plotted by longitude and latitude]

The methods that build classification models
(i.e., classification algorithms) operate very
similarly to the previous example.
First all objects are represented geometrically.

30
Principles and geometric representation for
supervised learning (5/7)
[Figure: boats and houses plotted by longitude and latitude]
Then the algorithm seeks to find a decision
surface that separates classes of objects
31
Principles and geometric representation for
supervised learning (6/7)
[Figure: longitude vs. latitude; unseen objects, marked "?", on one side
of the decision surface are classified as houses and those on the
other side as boats]
Unseen (new) objects are classified as boats if
they fall below the decision surface and as
houses if they fall above it
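
The boats-and-houses picture can be sketched in a few lines of code. This is only an illustration, not a method taken from the slides: it assumes a nearest-centroid rule (which produces a linear decision surface between two classes), the coordinates are made up, and the names `fit_centroids` and `classify` are hypothetical.

```python
import math

def fit_centroids(samples):
    """Compute one centroid (mean point) per class from labeled training samples."""
    sums, counts = {}, {}
    for (x, y), label in samples:
        sx, sy = sums.get(label, (0.0, 0.0))
        sums[label] = (sx + x, sy + y)
        counts[label] = counts.get(label, 0) + 1
    return {label: (sx / counts[label], sy / counts[label])
            for label, (sx, sy) in sums.items()}

def classify(centroids, point):
    """Assign an unseen object to the class whose centroid is closest."""
    return min(centroids, key=lambda label: math.dist(centroids[label], point))

# Made-up (longitude, latitude) training data:
# boats sit at low latitude, houses at high latitude.
training = [
    ((1.0, 1.0), "boat"), ((2.0, 1.5), "boat"), ((3.0, 0.5), "boat"),
    ((1.0, 4.0), "house"), ((2.0, 4.5), "house"), ((3.0, 5.0), "house"),
]
centroids = fit_centroids(training)

# Unseen objects are classified by the side of the decision surface they fall on.
new_objects = [(2.0, 0.8), (2.5, 4.2)]
predictions = [classify(centroids, p) for p in new_objects]
print(predictions)
```

The decision surface here is the set of points equally distant from the two centroids; other classifiers would draw a different surface through the same data.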
32
Principles and geometric representation for
supervised learning (7/7)
[Figure: Object 1, Object 2, and Object 3 plotted by longitude and
latitude]
33
In 2-D this looks simple, but what happens in
higher dimensional data?

10,000-50,000 (gene expression microarrays, aCGH,
and early SNP arrays)
>500,000 (tiled microarrays, SNP arrays)
10,000-300,000 (regular MS proteomics)
>10,000,000 (LC-MS proteomics)
>100,000,000 (next-generation sequencing)
This is the "curse of dimensionality" problem

34
High-dimensionality (especially with small
samples) causes

Some methods do not run at all (classical
regression)
Some methods give bad results (KNN, Decision
trees)
Very slow analysis
Very expensive/cumbersome clinical application
Tends to overfit

35
Two problems: over-fitting and under-fitting

Over-fitting (a model to your data): building a
model that is good in original data but fails to
generalize well to new/unseen data
Under-fitting (a model to your data): building a
model that is poor in both original data and
new/unseen data

36
Over/under-fitting are related to complexity of
the decision surface and how well the training
data is fit
37
Over/under-fitting are related to complexity of
the decision surface and how well the training
data is fit
[Figure: outcome of interest Y vs. predictor X, showing training data
and future data; one fitted line is marked "This line is good!" and
another "This line overfits!"]
38
Over/under-fitting are related to complexity of
the decision surface and how well the training
data is fit
[Figure: outcome of interest Y vs. predictor X, showing training data
and future data; one fitted line is marked "This line is good!" and
another "This line underfits!"]
39
Very important concept

Successful data analysis methods balance training
data fit with complexity.
Too complex a signature (to fit training data well)
→ overfitting (i.e., the signature does not
generalize)
Too simplistic a signature (to avoid overfitting) →
underfitting (it will generalize, but the fit to both
the training and future data will be low and
predictive performance poor).
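
The balance between fit and complexity can be illustrated with a toy experiment. The sketch below is an assumption-laden example, not part of the original slides: it uses a k-nearest-neighbor classifier on made-up one-dimensional data, where small k gives a very complex decision surface and large k a very simple one. The function names are hypothetical.

```python
def knn_predict(train, x, k):
    """Majority vote among the k training points closest to x (labels are 0 or 1)."""
    nearest = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    votes = sum(label for _, label in nearest)
    return 1 if 2 * votes > k else 0

def accuracy(train, data, k):
    """Fraction of points in data whose label is predicted correctly."""
    hits = sum(knn_predict(train, x, k) == label for x, label in data)
    return hits / len(data)

# Made-up data. The underlying rule is "label 1 when x > 5";
# the training points at x=3 and x=8 carry noisy (flipped) labels.
train = [(1.0, 0), (2.0, 0), (3.0, 1), (4.0, 0), (6.0, 1),
         (7.0, 1), (8.0, 0), (9.0, 1), (10.0, 1)]
test = [(0.5, 0), (1.5, 0), (2.9, 0), (3.2, 0),
        (5.5, 1), (6.5, 1), (7.9, 1), (8.3, 1)]

# k=1 memorizes the training data, k=9 always predicts the majority class,
# k=3 sits in between.
for k in (1, 3, 9):
    print(k, accuracy(train, train, k), accuracy(train, test, k))
```

With these particular numbers, k=1 fits the training data perfectly but gets half of the test points wrong (overfitting), k=9 is mediocre on both sets (underfitting), and k=3 misses the two noisy training points yet classifies every test point correctly.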

40
The Support Vector Machine (SVM) approach for
building molecular signatures

The support vector machine (SVM) is a binary
classification algorithm.
SVMs are important because of (a) theoretical
reasons:
- Robust to very large number of variables and
small samples
- Can learn both simple and highly complex
classification models
- Employ sophisticated mathematical principles to
avoid overfitting
and (b) superior empirical results.

41
Main ideas of SVMs (1/3)
[Figure: cancer patients and normal patients plotted by gene X and
gene Y]

Consider an example dataset described by 2 genes,
gene X and gene Y
Represent patients geometrically (by vectors)

42
Main ideas of SVMs (2/3)
[Figure: cancer patients and normal patients plotted by gene X and
gene Y]

Find a linear decision surface (hyperplane)
that can separate patient classes and has the
largest distance (i.e., largest gap or
margin) between border-line patients (i.e.,
support vectors)

43
Main ideas of SVMs (3/3)

If such a linear decision surface does not exist,
the data is mapped into a much higher dimensional
space (feature space) where the separating
decision surface is found
The feature space is constructed via a very clever
mathematical projection (the kernel trick).
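
A small numerical check may make the kernel trick more concrete. The sketch below is not from the slides: it assumes the degree-2 homogeneous polynomial kernel k(u, v) = (u . v)^2 in two dimensions, whose explicit feature map is phi(x1, x2) = (x1^2, sqrt(2) x1 x2, x2^2), and shows that the kernel returns the feature-space inner product without building the mapped vectors. The function names and the XOR-like points are made up for illustration.

```python
import math

def phi(v):
    """Explicit feature map of the degree-2 homogeneous polynomial kernel in 2-D."""
    x1, x2 = v
    return (x1 * x1, math.sqrt(2.0) * x1 * x2, x2 * x2)

def dot(a, b):
    """Ordinary inner product of two equal-length vectors."""
    return sum(ai * bi for ai, bi in zip(a, b))

def poly2_kernel(u, v):
    """Kernel value computed in the original 2-D space: k(u, v) = (u . v)^2."""
    return dot(u, v) ** 2

u, v = (1.0, 2.0), (3.0, 4.0)
in_feature_space = dot(phi(u), phi(v))   # inner product after explicit mapping
via_kernel = poly2_kernel(u, v)          # same value without ever building phi
print(in_feature_space, via_kernel)

# XOR-like data: no straight line in (x1, x2) separates the two classes,
# but the mapped coordinate sqrt(2)*x1*x2 does (its sign matches the label).
xor_points = [((1.0, 1.0), 1), ((-1.0, -1.0), 1),
              ((1.0, -1.0), -1), ((-1.0, 1.0), -1)]
separable = all((phi(p)[1] > 0) == (label == 1) for p, label in xor_points)
print(separable)
```

An SVM only ever needs inner products between samples, which is why replacing them with a kernel function is enough to work in the higher dimensional feature space.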

44
On estimation of signature accuracy
Large sample case: use hold-out validation
[Figure: data split once into a train part and a test part]
Small sample case: use N-fold cross-validation
[Figure: data split into folds; each fold serves once as test while
the remaining folds serve as train]
Nested N-fold cross-validation
Recall the main idea of cross-validation
[Figure: data split for cross-validation]
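
The splitting schemes named above can be sketched as index bookkeeping. This is a minimal illustration under stated assumptions: samples are assigned to folds in a fixed round-robin order (no shuffling or class stratification), and the function names are hypothetical. In nested cross-validation the inner splits are built only from the outer training part, so that choices made in the inner loop (commonly parameter or gene selection) never see the outer test samples.

```python
def kfold_splits(indices, n_folds):
    """Partition indices into n_folds test folds, each paired with the rest as training set."""
    folds = [indices[i::n_folds] for i in range(n_folds)]
    splits = []
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        splits.append((train, test))
    return splits

def nested_splits(indices, n_outer, n_inner):
    """For every outer split, build inner splits from the outer training part only."""
    result = []
    for outer_train, outer_test in kfold_splits(indices, n_outer):
        inner = kfold_splits(outer_train, n_inner)
        result.append((outer_train, outer_test, inner))
    return result

samples = list(range(10))            # ten hypothetical sample indices
outer = kfold_splits(samples, 5)
nested = nested_splits(samples, 5, 4)

for train, test in outer:
    print("train", train, "test", test)
```

Every sample lands in exactly one test fold, so the averaged test performance uses each sample once.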
46
Overview of challenges in computational analysis
of omics data for development of molecular
signatures
Rashomon effect/Marker multiplicity
Assay validity/reproducibility
Efficiency: statistical/computational
Research designs
Data analytics of molecular signatures
Is there predictive signal?
Causality vs. predictiveness/Biological
significance
Methods development: re-inventing the wheel and
specialization
Epistasis
Many variables, small sample, noise, artifacts
Instability
Performance: predictivity, compactness
Protocols/Guidelines
Editorializing/Over-simplifying/Sensationalism
47
Part 3: Comprehensive evaluation of algorithms to
develop molecular signatures for cancer
classification
48
Comprehensive evaluation of algorithms for
classification of cancer microarray data

Main goals
Find the best performing algorithms for building
molecular signatures for cancer diagnosis from
microarray gene expression data
Investigate benefits of using gene selection and
ensemble classification methods.

49
Classification algorithms

Instance-based:
K-Nearest Neighbors (KNN)
Neural networks:
Backpropagation Neural Networks (NN)
Probabilistic Neural Networks (PNN)
Kernel-based:
Multi-Class SVM: One-Versus-Rest (OVR)
Multi-Class SVM: One-Versus-One (OVO)
Multi-Class SVM: DAGSVM
Multi-Class SVM by Weston and Watkins (WW)
Multi-Class SVM by Crammer and Singer (CS)
Voting:
Weighted Voting: One-Versus-Rest
Weighted Voting: One-Versus-One
Decision trees:
Decision Trees: CART
50
Ensemble classification methods
51
Gene selection methods

Signal-to-noise (S2N) ratio in one-versus-rest
(OVR) fashion
Signal-to-noise (S2N) ratio in one-versus-one
(OVO) fashion
Kruskal-Wallis nonparametric one-way ANOVA (KW)
Ratio of genes between-categories to
within-category sum of squares (BW).
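
As an illustration of univariate gene ranking, the sketch below computes a signal-to-noise score in one-versus-rest fashion. The slide does not spell out the formula; the code assumes the commonly used definition (difference of class means divided by the sum of class standard deviations). Gene names, expression values, and function names are made up.

```python
from statistics import mean, stdev

def signal_to_noise(values_a, values_b):
    """S2N of one gene: difference of class means over the sum of class standard deviations.

    Assumes each class has at least two samples and non-constant values.
    """
    return (mean(values_a) - mean(values_b)) / (stdev(values_a) + stdev(values_b))

def rank_genes(expression, labels, positive):
    """Rank genes by |S2N| for the positive class versus all other samples (one-versus-rest)."""
    scores = {}
    for gene, values in expression.items():
        a = [v for v, lab in zip(values, labels) if lab == positive]
        b = [v for v, lab in zip(values, labels) if lab != positive]
        scores[gene] = signal_to_noise(a, b)
    ranking = sorted(scores, key=lambda g: abs(scores[g]), reverse=True)
    return ranking, scores

# Made-up expression values for three hypothetical genes across six samples.
labels = ["tumor", "tumor", "tumor", "normal", "normal", "normal"]
expression = {
    "geneA": [10.0, 11.0, 12.0, 1.0, 2.0, 3.0],   # strongly differential
    "geneB": [5.0, 6.0, 7.0, 5.0, 6.0, 7.0],      # no class difference
    "geneC": [4.0, 6.0, 8.0, 3.0, 5.0, 7.0],      # weak difference
}
ranking, scores = rank_genes(expression, labels, positive="tumor")
print(ranking, scores)
```

In a multi-class problem the same score would be computed once per class (one-versus-rest) or once per pair of classes (one-versus-one), and the top-ranked genes kept.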

52
Performance metrics and statistical comparison

1. Accuracy
can compare to previous studies
easy to interpret and simplifies statistical
comparison

2. Relative classifier information (RCI)
easy to interpret and simplifies statistical
comparison
not sensitive to distribution of classes
accounts for difficulty of a decision problem

Randomized permutation testing to compare
accuracies
of the classifiers (α = 0.05)
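
One way such a comparison can be carried out is a paired permutation test on per-sample outcomes. The sketch below is an assumption about the general technique, not the exact procedure used in the study: under the null hypothesis the two classifiers are exchangeable, so their outcomes are swapped at random within each test sample. The outcome vectors and the function name are made up.

```python
import random

def paired_permutation_test(correct_a, correct_b, n_permutations=2000, seed=0):
    """Two-sided p-value for the accuracy difference of two classifiers on the same samples.

    correct_a / correct_b hold 1 where a classifier was right on a sample, 0 where it was wrong.
    """
    rng = random.Random(seed)
    n = len(correct_a)
    observed = abs(sum(correct_a) - sum(correct_b)) / n
    at_least_as_extreme = 0
    for _ in range(n_permutations):
        diff = 0
        for a, b in zip(correct_a, correct_b):
            if rng.random() < 0.5:     # swap the two outcomes for this sample
                a, b = b, a
            diff += a - b
        if abs(diff) / n >= observed:
            at_least_as_extreme += 1
    # Add-one correction keeps the estimated p-value strictly positive.
    return (at_least_as_extreme + 1) / (n_permutations + 1)

# Made-up per-sample outcomes for two hypothetical classifiers on 30 test samples.
clf_a = [1] * 30
clf_b = [1] * 12 + [0] * 18
p_value = paired_permutation_test(clf_a, clf_b)
print(p_value, p_value < 0.05)
```

A p-value below the chosen level (0.05 on the slide) would lead to rejecting the hypothesis that the two classifiers have equal accuracy.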

53
Microarray datasets

Total
1300 samples
74 diagnostic categories
41 cancer types and
12 normal tissue types

54
Summary of methods and datasets
55
Results without gene selection
56
Results with gene selection
Improvement of diagnostic performance by gene
selection (averages for the four datasets)
Diagnostic performance before and after gene
selection
Average reduction of genes is 10-30 times
57
Comparison with previously published results
58
Summary of results

Multi-class SVMs are the best family among the
tested algorithms outperforming KNN, NN, PNN, DT,
and WV.
Gene selection in some cases improves
classification performance of all classifiers,
especially of non-SVM algorithms
Ensemble classification does not improve
performance of SVM and other classifiers
Results obtained by SVMs favorably compare with
the literature.

59
Random Forest (RF) classifiers

Appealing properties
Work when # of predictors > # of samples
Embedded gene selection
Incorporate interactions
Based on theory of ensemble learning
Can work with binary and multiclass tasks
Do not require much fine-tuning of parameters
Strong theoretical claims
Empirical evidence (Diaz-Uriarte and Alvarez de
Andres, BMC Bioinformatics, 2006) reported
superior classification performance of RFs
compared to SVMs and other methods

60
Key principles of RF classifiers
Training data
1) Generate bootstrap samples
2) Random gene selection
3) Fit unpruned decision trees
Testing data
4) Apply to testing data and combine predictions
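
The four steps can be sketched as follows. This is a deliberately simplified illustration, not a full random forest: each "tree" is a single-split decision stump instead of an unpruned tree, so the random gene subset is drawn once per stump. The data, parameter values, and function names are made up.

```python
import random
from collections import Counter

def fit_stump(rows, labels, features):
    """Best single-threshold split over the given features (stand-in for a full tree)."""
    majority = Counter(labels).most_common(1)[0][0]
    best = (len(labels) + 1, None, None, majority, majority)  # errors, feature, threshold, left, right
    for f in features:
        values = sorted(set(r[f] for r in rows))
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2.0
            left = [y for r, y in zip(rows, labels) if r[f] <= t]
            right = [y for r, y in zip(rows, labels) if r[f] > t]
            l_lab = Counter(left).most_common(1)[0][0]
            r_lab = Counter(right).most_common(1)[0][0]
            errors = sum(y != l_lab for y in left) + sum(y != r_lab for y in right)
            if errors < best[0]:
                best = (errors, f, t, l_lab, r_lab)
    return best[1:]

def predict_stump(stump, row):
    f, t, l_lab, r_lab = stump
    if f is None:                      # no usable split was found: predict the majority class
        return l_lab
    return l_lab if row[f] <= t else r_lab

def fit_forest(rows, labels, n_trees=25, n_features=2, seed=0):
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(rows)) for _ in rows]         # 1) bootstrap sample
        feats = rng.sample(range(len(rows[0])), n_features)    # 2) random gene selection
        forest.append(fit_stump([rows[i] for i in idx],        # 3) fit a tree (here: a stump)
                                [labels[i] for i in idx], feats))
    return forest

def predict_forest(forest, row):
    votes = Counter(predict_stump(s, row) for s in forest)     # 4) combine predictions
    return votes.most_common(1)[0][0]

# Made-up data: three genes, each lower in "normal" than in "tumor" samples.
rows = [(0.0, 1.0, 2.0), (1.0, 2.0, 3.0), (2.0, 3.0, 0.0), (3.0, 0.0, 1.0),
        (7.0, 8.0, 9.0), (8.0, 9.0, 10.0), (9.0, 10.0, 7.0), (10.0, 7.0, 8.0)]
labels = ["normal"] * 4 + ["tumor"] * 4
forest = fit_forest(rows, labels)
print(predict_forest(forest, (1.0, 1.0, 1.0)), predict_forest(forest, (9.0, 9.0, 9.0)))
```

A real random forest would grow each tree to full depth and draw a fresh random gene subset at every node.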
61
Results without gene selection

SVMs nominally outperform RFs in 15 datasets, RFs
outperform SVMs in 4 datasets, and the algorithms
perform exactly the same in 3 datasets.
In 7 datasets SVMs outperform RFs statistically
significantly.
On average, the performance advantage of SVMs is
0.033 AUC and 0.057 RCI.

62
Results with gene selection

SVMs nominally outperform RFs in 17 datasets, RFs
outperform SVMs in 3 datasets, and the algorithms
perform exactly the same in 2 datasets.
In 1 dataset SVMs outperform RFs statistically
significantly.
On average, the performance advantage of SVMs is
0.028 AUC and 0.047 RCI.

63
Part 4: Analysis and computational dissection of
molecular signature multiplicity
64
Molecular signature multiplicity

Different methods or samples from the same
population lead to different but apparently
maximally predictive signatures
Far-reaching implications for biological
discovery and development of next generation
patient diagnostics and personalized treatments
Generation of biological hypotheses is very hard
even when signatures are maximally predictive of
the phenotype since thousands of completely
different signatures are equally consistent with
the data
Produced signatures are not statistically
generalizable to new cases, and thus not reliable
enough for translation to clinical practice.

65
Molecular signature multiplicity

Causes of this phenomenon are unknown; several
contradictory conjectures exist in the field:
Signature multiplicity is due to small samples
[Michiels et al., 2005]
Signature multiplicity leads to predictively
non-reproducible signatures [Ein-Dor et al.,
2006]; building reproducible signatures requires
thousands of samples [Ioannidis, 2005]
Signature multiplicity is a by-product of the
complex regulatory connectivity of the genome
[Dougherty and Brun, 2006]
Artifacts of data pre-processing, e.g.
normalization [Gold et al., 2005; Qiu et al.,
2005; Ploner et al., 2005]

66
Major goals

Develop a Markov boundary characterization of
molecular signature multiplicity phenomenon
Design and study algorithms that can correctly
identify the set of maximally predictive and
non-redundant molecular signatures
Conduct an empirical evaluation of the novel
algorithms and compare to the existing
state-of-the-art methods
Test and refine previously stated hypotheses
about the causes of signature multiplicity
phenomenon.

67
Optimality criteria of signatures

Signatures that are the focus of this research
satisfy the following two optimality criteria:
1. they are maximally predictive of the phenotype (they
achieve best predictivity of the phenotype in the
given dataset over all signatures based on
different gene sets)
2. they do not contain predictively redundant genes
(i.e., genes that can be removed from the
signature without adversely affecting its
predictivity).

68
Why do we need algorithms to extract as many
optimal signatures as possible?

A deeper understanding of the signature
multiplicity phenomenon and how it affects
reproducibility of signatures
Improving discovery of the underlying biological
mechanisms by not missing genes that are
implicated biologically in disease processes
Catalyzing regulatory approval by establishing
in-silico equivalence to previously validated
signatures

69
Existing algorithms for multiple signature
extraction: Resampling-based methods
Training data
1) Generate resampled datasets (e.g., by
bootstrapping)

2) Apply a standard signature extraction
algorithm (e.g., SVM-RFE) to each resampled
dataset, giving signatures X1, X2, X3, ..., XN

Based on assumption that multiplicity is strictly
a small-sample phenomenon
An infinite number of resamplings is required to
extract all optimal signatures
May stop producing multiple signatures in large
sample sizes.

70
Existing algorithms for multiple signature
extraction: Iterative removal
Original data (for all genes) → signature X1
Remove corresponding genes from the dataset
Reduced data (excluding X1 genes) → signature X2
Remove corresponding genes from the dataset
Reduced data (excluding X1 and X2 genes) → signature X3
Remove corresponding genes from the dataset
... until a signature has statistically
significantly reduced predictivity

Agnostic to what causes molecular signature
multiplicity
Cannot discover signatures that have genes in
common.
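
The iterative removal loop can be sketched generically. The code below is an illustration under stated assumptions: the signature extraction method and the predictivity estimate are passed in as callables, a fixed `tolerance` stands in for the statistical significance test, and the toy genes and scores are made up.

```python
def iterative_removal(genes, find_signature, predictivity, tolerance):
    """Repeatedly extract a signature, remove its genes, and search again.

    find_signature(genes) -> set of genes; predictivity(signature) -> float.
    Stops when a new signature scores more than `tolerance` below the first one
    (a crude stand-in for a statistical test of reduced predictivity).
    """
    remaining = set(genes)
    signatures = []
    reference = None
    while remaining:
        signature = find_signature(remaining)
        if not signature:
            break
        score = predictivity(signature)
        if reference is None:
            reference = score
        elif reference - score > tolerance:
            break
        signatures.append(signature)
        remaining -= signature
    return signatures

# Toy setup (entirely made up): every gene has a fixed score and a
# "signature" is simply the best remaining gene.
gene_score = {"g1": 0.95, "g2": 0.94, "g3": 0.93, "g4": 0.70, "g5": 0.55}

def pick_best(genes):
    return {max(genes, key=gene_score.get)}

def score_of(signature):
    return max(gene_score[g] for g in signature)

found = iterative_removal(gene_score, pick_best, score_of, tolerance=0.05)
print(found)
```

Because the genes of each signature are removed before the next search, the returned signatures are disjoint by construction, which is the limitation noted on the slide.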

71
Existing algorithms for multiple signature
extraction: Stochastic gene selection
Genetic Algorithms (e.g., GA/KNN or GA/SVM)

Can output all signatures that are discoverable
by a genetic algorithm when it is allowed to
evolve an infinite number of generations.

KIAMB

Stochastic Markov boundary method based on IAMB
algorithm
In a specific class of distributions, every
optimal signature will be output by this method
with nonzero probability
Requires an infinite number of iterations to
discover all optimal signatures; it will discover
the same signature over and over again
Sample requirements are of exponential order in
the number of genes in a signature.

72
Existing algorithms for multiple signature
extraction: Brute-force exhaustive search
LIKNON

Examines predictivity of all individual genes in
the dataset, all pairs of genes, all triples of
genes, and so on
It is infeasible when a signature has more than
2-3 genes
Agnostic to what causes signature multiplicity.
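
A quick count shows why exhaustive search breaks down so early. The sketch below simply counts the gene subsets that would have to be evaluated; the figure of 10,000 genes is an assumed order of magnitude for a gene expression microarray, and the function name is hypothetical.

```python
from math import comb

def n_candidate_signatures(n_genes, max_size):
    """Number of gene subsets of size 1..max_size that an exhaustive search must evaluate."""
    return sum(comb(n_genes, k) for k in range(1, max_size + 1))

n_genes = 10_000  # assumed order of magnitude for a gene expression microarray
for max_size in (1, 2, 3, 5):
    print(max_size, n_candidate_signatures(n_genes, max_size))
```

Each candidate subset also requires training and evaluating a classifier, so the cost per subset multiplies these counts.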

In summary, no current algorithm provides a
systematic and efficient approach for
identification of the set of maximally predictive
and non-redundant molecular signatures that exist
in the underlying distribution.
73
I. Markov boundary characterization of molecular
signature multiplicity
74
Key definitions (1/2)

Definition of maximally predictive molecular
signature: A maximally predictive molecular
signature is a molecular signature that maximizes
predictivity of the phenotype relative to all
other signatures that can be constructed from the
same dataset.
Definition of maximally predictive and
non-redundant molecular signature: A maximally
predictive and non-redundant molecular signature
based on variables X is a maximally predictive
signature such that any signature based on a
proper subset of variables in X is not maximally
predictive.

75
Key definitions (2/2)

Definition of Markov blanket: A Markov blanket M
of the response variable $T \in V$ in the joint
probability distribution P over variables V is a
set of variables conditioned on which all other
variables are independent of T, i.e. for every
$Y \in V \setminus (M \cup \{T\})$,
$T \perp Y \mid M$.
Definition of Markov boundary (or non-redundant
Markov blanket): If M is a Markov blanket of T
and no proper subset of M satisfies the
definition of Markov blanket of T, then M is
called a Markov boundary (or non-redundant Markov
blanket) of T.

76
Theoretical results

Variable sets that participate in the maximally
predictive signatures of T are precisely the
Markov blankets of T and vice-versa
Similarly, variable sets that participate in the
maximally predictive and non-redundant signatures
of T are precisely the Markov boundaries of T and
vice-versa
If a joint probability distribution P over
variables V satisfies the intersection property,
then there exists a unique Markov boundary of T
[Pearl, 1988].

77
A fundamental reduction used in this research for
the analysis of signatures
[Figure: cases and controls plotted by gene X and gene Y, with
signatures S1 to S5. Legend: signatures that have maximal predictivity
of the phenotype relative to their genes; signatures with worse
predictivity]

Since there is an infinite number of signatures
with maximal predictivity, when I refer to a
signature, I mean one of the predictively
equivalent classifiers (e.g., S3 or S4 or S5)
Can study signature classes by reference only to
their genes
This reduction is justified whenever the
classifiers used can learn the minimum error
decision function given sufficient sample.

78
Example of Markov boundary multiplicity
[Figure: network structure and distributional constraints of the
example]

Many optimal signatures exist; e.g., {A, C} and
{B, C} are maximally predictive and non-redundant
signatures of T. Furthermore, {A, C} and {B, C}
remain maximally predictive even in infinite
samples
The network has very low connectivity
Genes in optimal signatures do not have to be
deterministically related; e.g., A and B are not
deterministically related, yet individually convey
the same information about T
If an algorithm selects only one optimal
signature, then there is a danger of missing
biologically important causative genes
The union of all optimal signatures includes all
genes located in the local pathway around T
In this example the intersection of all optimal
signatures contains only genes in the local
pathway around T.

79
II. A novel algorithm to correctly identify the
set of maximally predictive and non-redundant
signatures
80
TIE generative algorithm
81
TIE algorithm for gene expression data analysis
82
Trace of the TIE algorithm
M = {A, B, F}
G = {F}: Mnew = {A, B}. Not a Markov boundary. Do not consider any G
that is a superset of {F}
G = {A}: Mnew = {C, B, F}. Markov boundary
G = {B}: Mnew = {A, D, E, F}. Markov boundary
G = {A, B}: Mnew = {C, D, E, F}. Markov boundary

83
Theoretical results (1/2)

TIE returns all and only Markov boundaries of T
(i.e., maximally predictive and non-redundant
signatures) if its input components X, Y, Z are
admissible
IAMB is an admissible Markov boundary algorithm
(input component X) under assumptions
IAMB correctly outputs a Markov boundary if only
the composition property holds
HITON-PC is an admissible Markov boundary
algorithm (input component X) under assumptions
HITON-PC correctly outputs a Markov boundary if
the adjacency faithfulness assumption holds
except for violations of the intersection axiom,
global Markov condition holds, and there are no
spouses in the Markov boundary

84
Theoretical results (2/2)

Stated three strategies (IncLex, IncMinAssoc, and
IncMaxAssoc) to generate subsets of variables
that have to be removed from V to identify new
Markov boundaries of T and proved their
admissibility (input component Y)
Stated two criteria (Independence and
Predictivity) to verify Markov boundaries and
proved their admissibility (input component Z).

85
III. Empirical evaluation of the novel algorithms
and comparison with existing state-of-the-art
methods
86
A. Experiments with artificial simulated data

Generative model is available, and the set of
Markov boundaries (and thus the set of maximally
predictive and non-redundant signatures) is
known.
Generate samples of systematically varied sizes
Compare to the gold standard
Test whether the TIE algorithm behaves according
to theoretical expectations and study its
empirical properties
Obtain clues about behavior of TIE and baseline
comparison algorithms in experiments with real
gene expression data.

87
Experiments with discrete networks TIED1 and
TIED2

Two artificial discrete networks were created
TIED1 consists of 30 variables (including a
response variable T) and contains 72 Markov
boundaries of T
TIED2 consists of 1,000 variables (including a
response variable T) and contains the same 72
Markov boundaries of T as TIED1.

88
Experiments

Goal: Compare TIE to state-of-the-art algorithms
(Resampling-based methods, KIAMB, and Iterative
Removal) and examine sensitivity of the tested
methods to high dimensionality.
Findings
TIE correctly identifies the set of true Markov
boundaries (maximally predictive and
non-redundant signatures) in the datasets with 30
or 1,000 variables
Iterative Removal identifies only 1 signature
KIAMB fails to identify any true signature, and
its output signatures have poor predictivity
Resampling-based methods either miss true
signatures and/or output many redundant variables
in the signatures.

89
Experiments with linear continuous network LIND
LIND consists of 41 variables (including a
response variable T) and contains 12 Markov
boundaries of T.
90
Experiments

Goals
Analyze behavior of TIE as a function of sample
size using data generated from a continuous
network
Compare criteria Independence and Predictivity
for verification of Markov boundaries in the TIE
algorithm.
Findings
As sample size increases, the performance of both
instantiations of TIE generally improves and the
algorithms discover the set of true Markov
boundaries
α-level in the criterion Predictivity
significantly affects the number of Markov
boundaries output by the TIE algorithm
TIE with criterion Predictivity typically leads
to a larger number of output Markov boundaries
and on average superior performance compared to
criterion Independence.

91
Experiments with discrete network XORD
XORD consists of 41 variables (including a
response variable T) and contains 25 Markov
boundaries of T.
92
Experiments

Goal: Evaluate TIE when the popular Markov
boundary algorithms such as IAMB and HITON-PC are
not applicable due to violations of their
fundamental assumptions.
Findings
TIE discovers the set of true Markov boundaries
when the sample is 2,000
There is 1 false positive variable in each
discovered Markov boundary for large sample sizes.

93
B. Experiments with resimulated microarray gene
expression data

Resimulated data by design closely resembles real
human lung cancer microarray gene expression
data
Knowledge of the generative model makes it possible to
generate arbitrarily large samples and study the
behavior of TIE as a function of sample size
Unlike prior experiments with artificial
simulated datasets, the set of maximally
predictive and non-redundant signatures is not
known a priori.

94
Experiment
Goal: Examine whether the signature multiplicity
phenomenon vanishes as the sample size grows.
Results
95
Findings of other experiments

TIE is not sensitive to the choice of the
initial signature discovered by the algorithm
Post-processing TIE signatures with wrapping
results in more signatures with smaller number of
genes
Signatures output by tested non-TIE methods are
either redundant or have inferior predictivity
compared to signatures output by TIE techniques.

96
C. Experiments with real human microarray gene
expression data

Independent-Dataset Experiments: Using pairs of
microarray datasets either from different
laboratories or different platforms
Single-Dataset Experiments: Additional
experiments with relatively large sample size
microarray datasets
The primary goal of both experiments is to
compare TIE and baseline algorithms for multiple
signature extraction in terms of maximal
predictivity of induced signatures and
reproducibility in independent data.
Operational definition of maximal predictivity:
Empirically best classification performance (AUC)
achievable in each dataset over all tested
methods under consideration.
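
The operational definition above can be sketched in a few lines of Python. This is only an illustration, not the evaluation code used in the study: the method names and scores are made up, `auc` and `maximal_predictivity` are hypothetical helpers, and the sketch ignores the statistical testing needed to decide whether a method is distinguishable from the best one.

```python
def auc(labels, scores):
    """Area under the ROC curve, computed as the fraction of
    (positive, negative) pairs ranked correctly; ties count as 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


def maximal_predictivity(labels, scores_by_method):
    """Return the empirically best AUC over all tested methods,
    together with the per-method AUC values."""
    aucs = {name: auc(labels, s) for name, s in scores_by_method.items()}
    return max(aucs.values()), aucs


# Made-up example: two hypothetical methods scored on six samples.
labels = [1, 1, 1, 0, 0, 0]
scores_by_method = {
    "method_A": [0.9, 0.8, 0.4, 0.5, 0.2, 0.1],
    "method_B": [0.6, 0.3, 0.7, 0.5, 0.4, 0.2],
}
best_auc, per_method = maximal_predictivity(labels, scores_by_method)
```

Here the maximal predictivity is simply the larger of the two AUC values; every method is then judged against that reference.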

97
Independent-dataset experiments Datasets
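Task: Lung Cancer Diagnosis (lung tumors vs. normals, i.e., non-tumor lung samples)
Discovery dataset: sample size 203; samples per class: lung tumors (186), normals (17); number of genes 12600; microarray platform Affymetrix U95A
Validation dataset: sample size 96; samples per class: lung tumors (86), normals (10); number of genes 7129; microarray platform Affymetrix HuGeneFL
Number of common genes: 7094
Task: Lung Cancer Subtype Classification (adenocarcinoma vs. squamous cell carcinoma lung tumors)
Discovery dataset: sample size 160; samples per class: adenocarcinoma (139), squamous (21); number of genes 12600; microarray platform Affymetrix U95A
Validation dataset: sample size 28; samples per class: adenocarcinoma (14), squamous (14); number of genes 12533; microarray platform Affymetrix U95A
Number of common genes: 12533
Task: Breast Cancer Subtype Classification (estrogen receptor positive (ER+) vs. ER- breast tumors, untreated patients)
Discovery dataset: sample size 286; samples per class: ER+ (209), ER- (77); number of genes 22283; microarray platform Affymetrix U133A
Validation dataset: sample size 119; samples per class: ER+ (85), ER- (34); number of genes 22283; microarray platform Affymetrix U133A
Number of common genes: 22283
Task: Breast Cancer 5 Yr. Prognosis (ER+ patients who developed distant metastases within 5 years (poor prognosis) vs. ones who did not (good prognosis))
Discovery dataset: sample size 204; samples per class: poor prognosis (66), good prognosis (138); number of genes 22283; microarray platform Affymetrix U133A
Validation dataset: sample size 72; samples per class: poor prognosis (13), good prognosis (59); number of genes 22283; microarray platform Affymetrix U133A
Number of common genes: 22283
Task: Glioma Subtype Classification (grade III vs. grade IV glioma tumors)
Discovery dataset: sample size 100; samples per class: grade III (24), grade IV (76); number of genes 22283; microarray platform Affymetrix U133A
Validation dataset: sample size 85; samples per class: grade III (26), grade IV (59); number of genes 22283; microarray platform Affymetrix U133A
Number of common genes: 22283
Task: Leukemia 5 Yr. Prognosis (patients with disease-free survival < 5 years, i.e., ones who had relapse or competing events within 5 years, vs. > 5 years)
Discovery dataset: sample size 164; samples per class: survival < 5 yr. (29), survival > 5 yr. (135); number of genes 12625; microarray platform Affymetrix U95A
Validation dataset: sample size 79; samples per class: survival < 5 yr. (18), survival > 5 yr. (61); number of genes 22283; microarray platform Affymetrix U133A
Number of common genes: 10507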
98
Detailed results (1/3)
99
Detailed results (2/3)
100
Detailed results (3/3)
101
TIE signatures have maximal predictivity

TIE achieves maximal predictivity in 5 out of 6
validation datasets
Non-TIE methods achieve maximal predictivity in
0 to 2 datasets depending on the method
In the dataset where the predictivity of TIE is
statistically distinguishable from the
empirically maximal one (Lung Cancer Subtype
Classification), the magnitude of this difference
is only 0.009 AUC on average over all discovered
signatures.

102
TIE signatures are reproducible, other
signatures may be overfitted

TIE shows no overfitting on average over all
signatures and datasets
Other methods achieve predictivity in the
validation data that is lower than that in the
discovery data (by 0.02-0.03 AUC), in addition to
having inferior predictivity
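
One simple way to quantify overfitting of the kind described on this slide is the average gap between discovery and validation AUC over a set of signatures. The sketch below assumes that definition; the AUC pairs are made up and `mean_overfitting` is a hypothetical helper, not a function from the study.

```python
from statistics import mean


def mean_overfitting(auc_pairs):
    """Average of (discovery AUC - validation AUC) over signatures.
    A positive value suggests overfitting; a value near zero suggests
    that performance reproduces in independent data."""
    if not auc_pairs:
        raise ValueError("need at least one (discovery, validation) pair")
    return mean(discovery - validation for discovery, validation in auc_pairs)


# Made-up (discovery AUC, validation AUC) pairs for three signatures.
pairs = [(0.95, 0.92), (0.90, 0.88), (0.85, 0.86)]
gap = mean_overfitting(pairs)  # roughly 0.013 for these made-up numbers
```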

103
TIE signatures in comparison with other
signatures
Figure: Predictivity results for the Leukemia 5 Yr.
Prognosis task.
Axes: classification performance (AUC) in the
discovery dataset and classification performance
(AUC) in the validation dataset.
Each dot in the plot corresponds to a signature
(computational model) of the outcome, e.g.,
$\mathrm{Outcome}(x) = \mathrm{sign}(w \cdot x + b)$,
where $x, w \in \mathbb{R}^m$, $b \in \mathbb{R}$, and
$m$ is the number of genes in the signature.
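
A signature of the linear form mentioned on this slide (the sign of a weighted sum of gene expression values plus an offset) can be sketched as follows. The weights, offset, and expression values are made up, `linear_signature` is a hypothetical helper, and mapping a score of exactly zero to +1 is an assumption of this sketch.

```python
def linear_signature(w, b):
    """Build a predictor Outcome(x) = sign(w . x + b) for a signature
    with len(w) genes. Returns +1 or -1; a score of exactly 0 maps to +1
    (an arbitrary convention chosen for this sketch)."""
    def predict(x):
        if len(x) != len(w):
            raise ValueError("x must have one value per gene in the signature")
        score = sum(wi * xi for wi, xi in zip(w, x)) + b
        return 1 if score >= 0 else -1
    return predict


# Made-up 3-gene signature and two made-up expression profiles.
signature = linear_signature(w=[0.8, -1.2, 0.5], b=-0.1)
pred_a = signature([1.0, 0.2, 0.3])  # score = 0.8 - 0.24 + 0.15 - 0.1 = 0.61
pred_b = signature([0.1, 1.0, 0.0])  # score = 0.08 - 1.2 + 0.0 - 0.1 = -1.22
```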
104
Single-dataset experiments Datasets
Task: Lymphoma Subtype Classification I (diffuse large-B-cell lymphoma (DLBCL) vs. Burkitt's lymphoma (BL) patients)
Sample size 303; samples per class: DLBCL (258), BL (45); number of genes 2745; microarray platform Human LymphDx 2.7k GeneChip
Task: Lymphoma Subtype Classification II (diffuse large-B-cell lymphoma (DLBCL) vs. mediastinal large B-cell (MLBCL) patients)
Sample size 210; samples per class: DLBCL (176), MLBCL (34); number of genes 32403 (44928); microarray platform Affymetrix U133A and U133B
Task: Breast Cancer Subtype Classification I (p53 mutant vs. wild-type breast tumors)
Sample size 251; samples per class: p53 mutant (58), p53 wild-type (193); number of genes 22283; microarray platform Affymetrix U133A
Task: Breast Cancer Subtype Classification II (estrogen receptor positive (ER+) vs. ER- breast tumors)
Sample size 247; samples per class: ER+ (213), ER- (34); number of genes 22283; microarray platform Affymetrix U133A
Task: Breast Cancer Subtype Classification III (progesterone receptor positive (PgR+) vs. PgR- breast tumors)
Sample size 251; samples per class: PgR+ (190), PgR- (61); number of genes 22283; microarray platform Affymetrix U133A
Task: Breast Cancer 5 Yr. Prognosis (ER+ patients who developed distant metastases within 5 years (poor prognosis) vs. ones who did not (good prognosis))
Sample size 215; samples per class: poor prognosis (51), good prognosis (164); number of genes 24496; microarray platform Agilent Hu25K
Task: Bladder Cancer Stage Classification (stage Ta vs. other stages (T1, T2, T3, T4) of bladder tumors)
Sample size 404; samples per class: stage Ta (189), other stages (215); number of genes 1381 (3072); microarray platform MDL Human 3k

Validation dataset = subset of 100
samples/patients
Discovery dataset = all remaining
samples/patients
Repeat splits into discovery and validation
datasets 10 times to minimize variance
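
The splitting protocol described on this slide can be sketched as below. This is a minimal illustration under stated assumptions: the slide does not say whether splits were stratified by class, so plain random sampling is used here, and `repeated_splits`, its parameters, and the seed are hypothetical.

```python
import random


def repeated_splits(n_samples, n_validation=100, n_repeats=10, seed=0):
    """Return n_repeats (discovery, validation) index splits.
    Each validation set is a random subset of n_validation samples;
    the discovery set is everything else. Sampling is not stratified
    by class (an assumption of this sketch)."""
    if n_validation >= n_samples:
        raise ValueError("validation set must be smaller than the dataset")
    rng = random.Random(seed)
    indices = list(range(n_samples))
    splits = []
    for _ in range(n_repeats):
        validation = sorted(rng.sample(indices, n_validation))
        held_out = set(validation)
        discovery = [i for i in indices if i not in held_out]
        splits.append((discovery, validation))
    return splits


# Example with a dataset of 303 samples: each split holds out 100
# samples for validation and keeps the remaining 203 for discovery.
splits = repeated_splits(n_samples=303)
```

Signature discovery would be run on each discovery set and performance averaged over the 10 validation sets.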

105
Single-dataset experiments Summary results

Results are similar to the ones from
independent-dataset experiments
TIE achieves maximal predictivity in 6 out of 7
validation datasets
Non-TIE methods achieve maximal predictivity in
0 to 1 datasets depending on the method
In the dataset where TIE has predictivity that
is statistically distinguishable from the
empirically maximal one (Breast Cancer Subtype
Classification II), the magnitude of this
difference is only <0.01 AUC on average over all
discovered signatures.

106
IV. Discussion and interpretation of results
107
Revisiting previously published hypotheses about
signature multiplicity

Signature reproducibility neither precludes
multiplicity nor requires sample sizes with
thousands of subjects
Multiplicity of signatures does not require dense
connectivity
Noisy measurements or normalization are not
necessary conditions for signature multiplicity
Multiplicity can be produced by a combination of
small sample size-related variance and intrinsic
multiplicity in the underlying network
Multiple signatures output by TIE are
reproducible even though they are derived from
small-sample, noisy, and heavily processed data.

108
A more complete picture is emerging regarding
causes of multiplicity...

Intrinsic information redundancy in the
underlying biological system
Variability in the output of gene selection and
classifier algorithms especially in small sample
sizes
Small sample statistical indistinguishability of
signatures with different large sample
predictivity and/or redundancy characteristics
Presence of hidden variables
Correlated measurement noise
RNA amplification techniques that systematically
distort measurements of transcript ratios
Cellular aggregation and sampling from mixtures
of distributions that affect inference of
conditional independence relations
Normalization and other data pre-processing
methods that artificially increase correlations
among genes
Engineered redundancy in the assay technology
platforms.

109
Summary of results

Developed a Markov boundary characterization of
molecular signature multiplicity
Designed a generative algorithm that can
correctly identify the set of maximally
predictive and non-redundant molecular signatures
in principle independently of data distribution
Conducted an empirical evaluation of the novel
algorithm and compared it to existing
state-of-the-art methods using artificial
simulated, resimulated microarray gene
expression, and real human microarray gene
expression data
Tested and refined several hypotheses about the
causes of the molecular signature multiplicity
phenomenon.

110
General conclusions

Molecular signatures play a crucial role in
personalized medicine and translational
bioinformatics.
Molecular signatures are being used to treat
patients today, not in the future.
Development of accurate molecular signatures
should rely on the use of supervised methods.
In general, there are many challenges for
computational analysis of omics data for
development of molecular signatures.
One of these challenges is molecular signature
multiplicity.
There exists an algorithm that can extract the set
of maximally predictive and non-redundant
molecular signatures from high-throughput data.

111
Homework (Due next Monday)

Read the paper "Analysis and Computational
Dissection of Molecular Signature Multiplicity".
Describe a novel and interesting application area
for the TIE algorithm. Feel free to use an example
from your research where there exist many
molecular signatures of some response variable
(1/2 page max).
Come up with another cause of molecular signature
multiplicity that was not mentioned in the paper
(1/2 page max).
Email your work to Alexander.Statnikov_at_med.nyu.edu

112
Computational Causal Discovery Laboratory at NYU
Center for Health Informatics and Bioinformatics
(CHIBI)

The purpose of our lab is to develop, test and
apply computational causal discovery methods
suitable for molecular, clinical, imaging and
multi-modal data of high-dimensionality.
We are interested in methods to address the
following questions:
What is causing disease/phenotype?
What are the effects of disease/phenotype?
What biological pathways are involved?
How to design drugs/treatments?
How does genotype cause differences in response
to treatment?
How does the environment modify or even supersede
the normal causal function of genes and other
molecular variables?
How are genes and proteins organized in complex
causal regulatory networks?
Questions? Email to Alexander.Statnikov_at_med.nyu.edu