Statistical Aspects of the Development and Validation of Predictive Classifiers for High Dimensional Data

About This Presentation

Slides: 71

Provided by: Fuji203

Learn more at: https://brb.nci.nih.gov

Transcript and Presenter's Notes

1
Statistical Aspects of the Development and
Validation of Predictive Classifiers for High
Dimensional Data

Richard Simon, D.Sc.
Chief, Biometric Research Branch
National Cancer Institute
Linus.nci.nih.gov/brb

2
BRB Website: http://linus.nci.nih.gov/brb

Powerpoint presentations and audio files
Reprints and Technical Reports
BRB-ArrayTools software
BRB-ArrayTools Data Archive
Sample Size Planning for Targeted Clinical Trials

3
DNA Microarray Gene Expression Assay

Extract mRNA from cells of interest
Each mRNA molecule was transcribed from a single
gene and it has a linear structure complementary
to that gene
1 mRNA molecule is translated into one protein
molecule
Reverse transcribe mRNA to cDNA introducing a
fluorescently labeled dye to each molecule
Distribute the cDNA sample to a solid surface
containing probes of DNA representing all
genes
the probes are arranged in a grid on the surface
with each gene having a known address
Let the molecules from the sample hybridize with
the probes for the corresponding genes
Wash off excess sample and illuminate surface
with laser with frequency corresponding to the
dye
Measure intensity of fluorescence over each probe

4
Resulting Data

Intensity over a probe is approximately
proportional to abundance of mRNA molecules in
the sample for the gene corresponding to the
probe
40,000 variables measured for each specimen

5
Good Microarray Studies Have Clear Objectives

Class Comparison (Gene Finding)
Find genes whose expression differs among
predetermined classes
Tumor versus normal
After infection of cells by virus versus before
infection
Class Prediction
Prediction of predetermined class using gene
expression profile
Predicting whether a patient will or will not
respond to a given treatment
Class Discovery
Discover clusters of specimens having similar
expression profiles
Find genes that are in the same biochemical
pathways

6
Class Comparison and Class Prediction

Not clustering problems
Supervised methods

7
Components of Class Prediction

Feature (gene) selection
Which genes will be included in the model
Select model type
E.g. Diagonal linear discriminant analysis,
Nearest-Neighbor,
Fitting parameters (regression coefficients) for
model
Selecting value of tuning parameters

8
Simple Gene Selection

Select genes that are differentially expressed
among the classes at a significance level α (e.g.
0.01)
The α level is a tuning parameter
Number of false discoveries is not of direct
relevance for prediction
For prediction it is usually more serious to
exclude an informative variable than to include
some noise variables

9
Optimal significance level cutoffs for gene
selection. 50 differentially expressed genes out
of 22,000 genes on the microarrays

2δ/σ    n=10     n=30      n=50
1       0.167    0.003     0.00068
1.25    0.085    0.0011    0.00035
1.5     0.045    0.00063   0.00016
1.75    0.026    0.00036   0.00006
2       0.015    0.0002    0.00002

10
Complex Gene Selection

Select a set of genes which together give most
accurate predictions
Genetic algorithms
Little evidence that complex feature selection is
useful in microarray problems

11
Linear Classifiers for Two Classes
12
Linear Classifiers for Two Classes

Fisher linear discriminant analysis
Diagonal linear discriminant analysis (DLDA)
Compound covariate predictor
Golub's weighted voting method
Support vector machines with inner product kernel
Perceptrons

13
Fisher LDA
14
The Compound Covariate Predictor (CCP)

Motivated by J. Tukey, Controlled Clinical
Trials, 1993
A compound covariate is built from the basic
covariates (log-ratios): $c_i = \sum_j t_j x_{ij}$
$t_j$ is the two-sample t-statistic for gene j.
$x_{ij}$ is the log-expression measure of sample i
for gene j.
Sum is over selected genes.
Threshold of classification: midpoint of the CCP
means for the two classes.
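
The compound covariate rule on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the BRB-ArrayTools implementation: the function names (`t_statistic`, `fit_ccp`, `predict_ccp`) are hypothetical, the t-statistic is assumed to be the pooled-variance version, and genes are chosen by taking the largest |t| values as a stand-in for a significance-level cutoff.

```python
import math
from statistics import mean, variance

def t_statistic(a, b):
    """Pooled-variance two-sample t-statistic for one gene (assumed form)."""
    na, nb = len(a), len(b)
    pooled = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / math.sqrt(pooled * (1 / na + 1 / nb))

def fit_ccp(X, y, n_genes):
    """X: list of samples (lists of log-expression values); y: labels 1 or 2."""
    p = len(X[0])
    t = []
    for j in range(p):
        a = [x[j] for x, c in zip(X, y) if c == 1]
        b = [x[j] for x, c in zip(X, y) if c == 2]
        t.append(t_statistic(a, b))
    # Assumption: keep the n_genes genes with the largest |t|.
    genes = sorted(range(p), key=lambda j: -abs(t[j]))[:n_genes]

    def score(x):
        return sum(t[j] * x[j] for j in genes)

    m1 = mean(score(x) for x, c in zip(X, y) if c == 1)
    m2 = mean(score(x) for x, c in zip(X, y) if c == 2)
    # Threshold is the midpoint of the two class means of the compound covariate.
    return {"genes": genes, "t": t, "threshold": (m1 + m2) / 2}

def predict_ccp(model, x):
    s = sum(model["t"][j] * x[j] for j in model["genes"])
    # t_j > 0 when class 1 is higher, so class 1 has the larger mean score.
    return 1 if s > model["threshold"] else 2

# Toy data: only gene 0 separates the classes.
X = [[2.0, 0.1, 5.0], [2.2, -0.1, 4.0], [1.8, 0.0, 6.0],
     [0.0, 0.1, 5.5], [0.2, -0.2, 4.5], [-0.1, 0.05, 5.2]]
y = [1, 1, 1, 2, 2, 2]
model = fit_ccp(X, y, n_genes=1)
print(model["genes"], predict_ccp(model, [1.9, 0.0, 5.0]), predict_ccp(model, [0.1, 0.0, 5.0]))
```

Because each weight carries the sign of its gene's mean difference, the class 1 mean of the compound covariate is never below the class 2 mean, which is why a single one-sided comparison with the midpoint suffices.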

15
Linear Classifiers for Two Classes

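Compound covariate predictor
Weights $w_j \propto (\bar{x}_j^{(1)} - \bar{x}_j^{(2)}) / s_j$
instead of $(\bar{x}_j^{(1)} - \bar{x}_j^{(2)}) / s_j^2$ for DLDA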

16
Support Vector Machine
17
Perceptrons

Perceptrons are neural networks with no hidden
layer and linear transfer functions between input
and output
Number of input nodes equals number of genes
selected
Number of output nodes equals number of classes
minus 1
Inputs may be major principal components of
genes or major principal components of
informative genes
Perceptrons are linear classifiers

18
When p >> n

It is always possible to find a set of features
and a weight vector for which the classification
error on the training set is zero.
There is generally not sufficient information in
p >> n training sets to effectively use more
complex methods
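
The zero-training-error claim can be illustrated with a small experiment. The sketch below is an assumption-laden toy, not part of the original talk: it draws pure-noise features and random labels with p much larger than n, then trains a classic perceptron (a linear classifier without a bias term) until it makes no mistakes on the training set. All names are hypothetical.

```python
import random

def train_perceptron(X, y, max_epochs=1000):
    """Classic perceptron updates; y holds labels +1 / -1. Returns the weights."""
    w = [0.0] * len(X[0])
    for _ in range(max_epochs):
        mistakes = 0
        for x, label in zip(X, y):
            activation = sum(wj * xj for wj, xj in zip(w, x))
            if label * activation <= 0:  # misclassified or on the boundary
                w = [wj + label * xj for wj, xj in zip(w, x)]
                mistakes += 1
        if mistakes == 0:  # every training sample is now classified correctly
            break
    return w

def training_errors(w, X, y):
    """Number of training samples the weight vector misclassifies."""
    return sum(1 for x, label in zip(X, y)
               if label * sum(wj * xj for wj, xj in zip(w, x)) <= 0)

rng = random.Random(0)
n, p = 10, 200  # far more features than samples
X = [[rng.gauss(0, 1) for _ in range(p)] for _ in range(n)]
y = [rng.choice([-1, 1]) for _ in range(n)]  # labels carry no information
w = train_perceptron(X, y)
print("training errors:", training_errors(w, X, y))
```

With 10 samples in 200 dimensions the samples are linearly independent, so any labeling is linearly separable and the perceptron reaches zero training errors even though there is nothing to learn.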

19
Myth

Complex classification algorithms such as neural
networks perform better than simpler methods for
class prediction.

Comparative studies have indicated that simpler
methods usually work as well or better for
microarray problems because they avoid
overfitting the data.

21
Other Simple Methods

Nearest neighbor classification
k-nearest neighbors
Nearest centroid classification
Shrunken centroid classification

22
Nearest Neighbor Classifier

To classify a sample in the validation set as
being in outcome class 1 or outcome class 2,
determine which sample in the training set its
gene expression profile is most similar to.
Similarity measure used is based on genes
selected as being univariately differentially
expressed between the classes
Correlation similarity or Euclidean distance
generally used
Classify the sample as being in the same class as
its nearest neighbor in the training set
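
A minimal sketch of the nearest neighbor rule described above, using Euclidean distance restricted to the selected genes. The function names and the toy data are illustrative assumptions; the gene list is taken as given (in practice it would come from the univariate selection step).

```python
import math

def euclidean(a, b, genes):
    """Euclidean distance between two profiles over the selected genes only."""
    return math.sqrt(sum((a[j] - b[j]) ** 2 for j in genes))

def nearest_neighbor_predict(train_X, train_y, x, genes):
    """Return the class of the training sample most similar to x."""
    best = min(range(len(train_X)),
               key=lambda i: euclidean(train_X[i], x, genes))
    return train_y[best]

# Toy data: genes 0 and 2 are the selected (informative) genes; gene 1 is noise.
train_X = [[1.0, 9.0, 0.2], [1.2, 1.0, 0.1], [3.0, 5.0, 2.1], [3.3, 0.0, 1.9]]
train_y = [1, 1, 2, 2]
genes = [0, 2]
print(nearest_neighbor_predict(train_X, train_y, [1.1, 100.0, 0.0], genes))
print(nearest_neighbor_predict(train_X, train_y, [3.1, -50.0, 2.0], genes))
```

Restricting the distance to the selected genes keeps the noise gene (index 1) from dominating the similarity measure.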

23
Evaluating a Classifier

Most statistical methods were not developed for
p >> n prediction problems
Fit of a model to the same data used to develop
it is no evidence of prediction accuracy for
independent data
Demonstrating statistical significance of
prognostic factors is not the same as
demonstrating predictive accuracy
Testing whether analysis of independent data
results in selection of the same set of genes is
not an appropriate test of predictive accuracy of
a classifier

24
Internal Validation of a Classifier

Re-substitution estimate
Develop classifier on dataset, test predictions
on same data
Very biased for p >> n
Split-sample validation
Cross-validation

25
Split-Sample Evaluation

Training-set
Used to select features, select model type,
determine parameters and cut-off thresholds
Test-set
Withheld until a single model is fully specified
using the training-set.
Fully specified model is applied to the
expression profiles in the test-set to predict
class labels.
Number of errors is counted

26
Leave-one-out Cross Validation

Omit sample 1
Develop multivariate classifier from scratch on
training set with sample 1 omitted
Predict class for sample 1 and record whether
prediction is correct

27
Leave-one-out Cross Validation

Repeat analysis for training sets with each
single sample omitted one at a time
e = number of misclassifications determined by
cross-validation
Subdivide e for estimation of sensitivity and
specificity

With proper cross-validation, the model must be
developed from scratch for each leave-one-out
training set. This means that feature selection
must be repeated for each leave-one-out training
set.
Simon R, Radmacher MD, Dobbin K, McShane LM.
Pitfalls in the analysis of DNA microarray data.
Journal of the National Cancer Institute
95:14-18, 2003.
The cross-validated estimate of misclassification
error is an estimate of the prediction error for
the model fit by applying the specified algorithm
to the full dataset
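
The leave-one-out procedure can be sketched as a loop that takes the whole model-development step as a callback, so that feature selection is necessarily repeated for every training set. This is a hedged illustration: `loocv` and `develop_nearest_centroid` are hypothetical names, the nearest-centroid rule with mean-difference gene selection is only a stand-in classifier, and class 1 is arbitrarily treated as the "positive" class when e is subdivided.

```python
from statistics import mean

def loocv(X, y, develop):
    """Leave-one-out CV. develop(train_X, train_y) must rebuild the classifier
    from scratch (feature selection included) and return a predict(x) function."""
    e = 0
    errors_by_class = {1: 0, 2: 0}
    for i in range(len(X)):
        train_X = X[:i] + X[i + 1:]
        train_y = y[:i] + y[i + 1:]
        predict = develop(train_X, train_y)  # sample i is never seen here
        if predict(X[i]) != y[i]:
            e += 1
            errors_by_class[y[i]] += 1  # subdivide e by true class
    return e, errors_by_class

def develop_nearest_centroid(train_X, train_y, n_genes=1):
    """Stand-in classifier: select genes by mean difference, then nearest centroid."""
    p = len(train_X[0])
    c1 = [mean(x[j] for x, c in zip(train_X, train_y) if c == 1) for j in range(p)]
    c2 = [mean(x[j] for x, c in zip(train_X, train_y) if c == 2) for j in range(p)]
    genes = sorted(range(p), key=lambda j: -abs(c1[j] - c2[j]))[:n_genes]

    def predict(x):
        d1 = sum((x[j] - c1[j]) ** 2 for j in genes)
        d2 = sum((x[j] - c2[j]) ** 2 for j in genes)
        return 1 if d1 <= d2 else 2

    return predict

X = [[5.0, 0.0, 1.0], [6.0, 1.0, 0.0], [5.5, 0.5, 0.5],
     [0.0, 0.0, 1.0], [1.0, 1.0, 0.0], [0.5, 0.5, 0.4]]
y = [1, 1, 1, 2, 2, 2]
e, by_class = loocv(X, y, develop_nearest_centroid)
sensitivity = 1 - by_class[1] / y.count(1)  # class 1 treated as "positive"
specificity = 1 - by_class[2] / y.count(2)
print(e, sensitivity, specificity)
```

Passing `develop` as a function makes it hard to select genes on the full dataset by accident, because the loop never hands the omitted sample to the development step.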

29
Prediction on Simulated Null Data

Generation of Gene Expression Profiles
14 specimens (Pi is the expression profile for
specimen i)
Log-ratio measurements on 6000 genes
$P_i \sim \mathrm{MVN}(0, I_{6000})$
Can we distinguish between the first 7 specimens
(Class 1) and the last 7 (Class 2)?
Prediction Method
Compound covariate prediction
Compound covariate built from the log-ratios of
the 10 most differentially expressed genes.
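
The null-data experiment can be approximated with the sketch below. It is not the original simulation: to keep the run short it uses 1000 genes rather than 6000 and a single simulated dataset, the t-statistic is assumed to be the pooled-variance form, and all function names are hypothetical. It compares LOOCV with gene selection done once on all 14 specimens (incomplete cross-validation) against LOOCV with gene selection repeated inside every loop.

```python
import math
import random

def t_stats(X, y):
    """Pooled two-sample t-statistic for every gene; y holds labels 1 / 2."""
    g1 = [x for x, c in zip(X, y) if c == 1]
    g2 = [x for x, c in zip(X, y) if c == 2]
    n1, n2 = len(g1), len(g2)
    out = []
    for j in range(len(X[0])):
        a = [x[j] for x in g1]
        b = [x[j] for x in g2]
        m1, m2 = sum(a) / n1, sum(b) / n2
        ss = sum((v - m1) ** 2 for v in a) + sum((v - m2) ** 2 for v in b)
        out.append((m1 - m2) / math.sqrt(ss / (n1 + n2 - 2) * (1 / n1 + 1 / n2)))
    return out

def top_genes(t, k):
    """Indices of the k genes with the largest |t|."""
    return sorted(range(len(t)), key=lambda j: -abs(t[j]))[:k]

def ccp_classify(train_X, train_y, x, k, genes=None):
    """Fit a compound covariate predictor on the training data and classify x.
    If genes is None, gene selection is redone on the training data."""
    cols = genes if genes is not None else range(len(x))
    sub = [[row[j] for j in cols] for row in train_X]
    t = t_stats(sub, train_y)
    keep = top_genes(t, k)  # positions within cols
    xs = [x[j] for j in cols]

    def score(row):
        return sum(t[j] * row[j] for j in keep)

    s1 = [score(r) for r, c in zip(sub, train_y) if c == 1]
    s2 = [score(r) for r, c in zip(sub, train_y) if c == 2]
    threshold = (sum(s1) / len(s1) + sum(s2) / len(s2)) / 2
    return 1 if score(xs) > threshold else 2

def loocv_errors(X, y, k, preselect):
    """preselect=True picks genes once on ALL samples (the flawed shortcut)."""
    genes = top_genes(t_stats(X, y), k) if preselect else None
    errors = 0
    for i in range(len(X)):
        tr_X, tr_y = X[:i] + X[i + 1:], y[:i] + y[i + 1:]
        errors += ccp_classify(tr_X, tr_y, X[i], k, genes) != y[i]
    return errors

rng = random.Random(1)
n, p, k = 14, 1000, 10  # the slide uses 6000 genes; 1000 keeps the demo fast
X = [[rng.gauss(0, 1) for _ in range(p)] for _ in range(n)]
y = [1] * 7 + [2] * 7  # arbitrary split: no real class difference exists
print("gene selection outside the loop:", loocv_errors(X, y, k, preselect=True), "of", n)
print("gene selection inside the loop: ", loocv_errors(X, y, k, preselect=False), "of", n)
```

Since the data contain no class signal, an honest estimate should hover around half the specimens misclassified; the shortcut version typically reports far fewer errors because the omitted specimen already influenced which genes were chosen.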

32
Major Flaws Found in 40 Studies Published in 2004

Inadequate control of multiple comparisons in
gene finding
9/23 studies had unclear or inadequate methods to
deal with false positives
10,000 genes x .05 significance level = 500 false
positives
Misleading report of prediction accuracy
12/28 reports based on incomplete
cross-validation
Misleading use of cluster analysis
13/28 studies invalidly claimed that expression
clusters based on differentially expressed genes
could help distinguish clinical outcomes
50% of studies contained one or more major flaws
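
The false-positive arithmetic on this slide is simply the number of genes tested times the per-gene significance level, assuming none of the genes is truly differentially expressed. A tiny sketch (the function name is hypothetical):

```python
def expected_false_positives(n_null_genes, alpha):
    """Expected number of null genes called significant at per-gene level alpha."""
    return n_null_genes * alpha

# 10,000 genes tested at the .05 level, all assumed null
print(round(expected_false_positives(10_000, 0.05)))
# The same genes at a stricter per-gene level
print(round(expected_false_positives(10_000, 0.001)))
```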

33
Myth

Split sample validation is superior to LOOCV or
10-fold CV for estimating prediction error

35
Comparison of Internal Validation Methods:
Molinaro, Pfeiffer & Simon

For small sample sizes, LOOCV is much more
accurate than split-sample validation
Split sample validation over-estimates prediction
error
For small sample sizes, LOOCV is preferable to
10-fold, 5-fold cross-validation or repeated
k-fold versions
For moderate sample sizes, 10-fold is preferable
to LOOCV
Some claims for bootstrap resampling for
estimating prediction error are not valid for
p >> n problems

37
Simulated Data: 40 cases, 10 genes selected from
5000

Method             Estimate   Std Deviation
True               .078
Resubstitution     .007       .016
LOOCV              .092       .115
10-fold CV         .118       .120
5-fold CV          .161       .127
Split sample 1-1   .345       .185
Split sample 2-1   .205       .184
.632 bootstrap     .274       .084

38
Simulated Data: 40 cases

Method               Estimate   Std Deviation
True                 .078
10-fold              .118       .120
Repeated 10-fold     .116       .109
5-fold               .161       .127
Repeated 5-fold      .159       .114
Split 1-1            .345       .185
Repeated split 1-1   .371       .065

39
DLBCL Data

Method           Bias    Std Deviation   MSE
LOOCV            -.019   .072            .008
10-fold CV       -.007   .063            .006
5-fold CV        .004    .07             .007
Split 1-1        .037    .117            .018
Split 2-1        .001    .119            .017
.632 bootstrap   -.006   .049            .004

40
Permutation Distribution of Cross-validated
Misclassification Rate of a Multivariate
Classifier

Randomly permute class labels and repeat the
entire cross-validation
Re-do for all (or 1000) random permutations of
class labels
Permutation p value is fraction of random
permutations that gave as few misclassifications
as e in the real data
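
The permutation procedure can be sketched as follows. This is an illustration under stated assumptions: `permutation_p_value` and `loocv_error_count` are hypothetical names, and the one-feature nearest-centroid LOOCV is only a toy stand-in for "the entire cross-validation", which in a real analysis would include gene selection inside each loop.

```python
import random

def permutation_p_value(X, y, cv_errors, n_perm=1000, seed=0):
    """cv_errors(X, y) must rerun the ENTIRE cross-validation and return e."""
    rng = random.Random(seed)
    observed = cv_errors(X, y)
    as_few = 0
    for _ in range(n_perm):
        permuted = y[:]
        rng.shuffle(permuted)  # break any link between profiles and labels
        if cv_errors(X, permuted) <= observed:
            as_few += 1
    # Fraction of permutations with as few misclassifications as the real data
    return as_few / n_perm

def loocv_error_count(X, y):
    """Toy stand-in: LOOCV of a nearest-centroid rule on a single feature."""
    errors = 0
    for i in range(len(X)):
        train = [(x, c) for k, (x, c) in enumerate(zip(X, y)) if k != i]
        c1 = [x for x, c in train if c == 1]
        c2 = [x for x, c in train if c == 2]
        m1, m2 = sum(c1) / len(c1), sum(c2) / len(c2)
        predicted = 1 if abs(X[i] - m1) <= abs(X[i] - m2) else 2
        errors += predicted != y[i]
    return errors

X = [0.1, 0.3, 0.2, 0.4, 0.0, 2.1, 2.3, 1.9, 2.2, 2.0]
y = [1] * 5 + [2] * 5
p_value = permutation_p_value(X, y, loocv_error_count, n_perm=200)
print("observed errors:", loocv_error_count(X, y), "permutation p:", p_value)
```

Shuffling the labels keeps the class sizes fixed, so every permuted dataset goes through exactly the same cross-validation as the real one.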

41
Gene-Expression Profiles in Hereditary Breast
Cancer

Breast tumors studied
7 BRCA1 tumors
8 BRCA2 tumors
7 sporadic tumors
Log-ratio measurements of 3226 genes for each
tumor after initial data filtering

RESEARCH QUESTION: Can we distinguish BRCA1+ from
BRCA1- cancers and BRCA2+ from BRCA2- cancers
based solely on their gene expression profiles?
42
BRCA1
43
BRCA2
44
Classification of BRCA2 Germline Mutations

Classification Method                     LOOCV Prediction Error (%)
Compound Covariate Predictor              14
Fisher LDA                                36
Diagonal LDA                              14
1-Nearest Neighbor                        9
3-Nearest Neighbor                        23
Support Vector Machine (linear kernel)    18
Classification Tree                       45

45
Does an Expression Profile Classifier Predict
More Accurately Than Standard Prognostic
Variables?

Some publications fit logistic model to standard
covariates and the cross-validated predictions of
expression profile classifiers
This is valid only with split-sample analysis
because the cross-validated predictions are not
independent

46
Does an Expression Profile Classifier Predict
More Accurately Than Standard Prognostic
Variables?

Not an issue of which variables are significant
after adjusting for which others or which are
independent predictors
Predictive accuracy and inference are different
The predictiveness of the expression profile
classifier can be evaluated within levels of the
classifier based on standard prognostic variables

47
Survival Risk Group Prediction

LOOCV loop
Create training set by omitting ith case
Develop PH model for training set
Compute predictive index for ith case using PH
model developed for training set
Compute percentile of predictive index for ith
case among predictive indices for cases in the
training set

48
Survival Risk Group Prediction

Plot Kaplan-Meier survival curves for cases with
predictive index percentiles above 50% and for
cases with cross-validated risk percentiles below
50%
Or for however many risk groups and thresholds are
desired
Compute log-rank statistic comparing the
cross-validated Kaplan Meier curves
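
The risk-group assignment inside the LOOCV loop can be sketched without committing to a particular survival model. In the hypothetical code below, `fit_index` stands for whatever develops the proportional hazards model on a training set and returns a function computing the predictive index; it is supplied by the caller and is not a real library call. The percentile is assumed to be the percentage of training-set indices at or below the omitted case's index.

```python
def cross_validated_risk_groups(cases, fit_index, cutoff=50.0):
    """cases: list of (x, time, event) tuples.
    fit_index(train_cases) -> function mapping x to a predictive index."""
    groups = []
    for i, (x, _, _) in enumerate(cases):
        train = cases[:i] + cases[i + 1:]  # omit the ith case
        index = fit_index(train)           # model developed without case i
        train_scores = [index(tx) for tx, _, _ in train]
        own = index(x)
        # Percentile of case i among the training-set predictive indices
        pct = 100.0 * sum(s <= own for s in train_scores) / len(train_scores)
        groups.append("high" if pct > cutoff else "low")
    return groups

# Toy stand-in for model fitting: the "index" is just the first covariate.
def toy_fit_index(train_cases):
    return lambda x: x[0]

cases = [([float(v)], 10.0 - v, 1) for v in range(6)]
print(cross_validated_risk_groups(cases, toy_fit_index))
```

The resulting group labels are what would be passed on to the Kaplan-Meier and log-rank steps; because each label comes from a model that never saw its own case, the curves are cross-validated.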

49
Survival Risk Group Prediction

Evaluate individual genes by fitting single
variable proportional hazards regression models
to log expression for each gene
Select genes based on p-value threshold for
single gene PH regressions
Compute first k principal components of the
selected genes
Fit PH regression model with the k pcs as
predictors. Let $b_1, \ldots, b_k$ denote the
estimated regression coefficients
To predict for case with expression profile
vector x, compute the k supervised pcs
$y_1, \ldots, y_k$ and the predictive index
$b_1 y_1 + \cdots + b_k y_k$

50
Survival Risk Group Prediction

Repeat the entire procedure for permutations of
survival times and censoring indicators to
generate the null distribution of the log-rank
statistic
The usual chi-square null distribution is not
valid because the cross-validated risk
percentiles are correlated among cases
Evaluate statistical significance of the
association of survival and expression profiles
by referring the log-rank statistic for the
unpermuted data to the permutation null
distribution

51
Sample Size Planning References

K Dobbin, R Simon. Sample size determination in
microarray experiments for class comparison and
prognostic classification. Biostatistics 6:27-38,
2005
K Dobbin, R Simon. Sample size planning for
developing classifiers using high dimensional DNA
microarray data. Biostatistics 2007

52
Sample Size Planning for Classifier Development

The expected value (over training sets) of the
probability of correct classification PCC(n)
should be within γ of the maximum achievable
PCC(∞)

53
Probability Model

Two classes
Log expression or log ratio MVN in each class
with common covariance matrix
m differentially expressed genes
p-m noise genes
Expression of differentially expressed genes is
independent of expression for noise genes
All differentially expressed genes have same
inter-class mean difference 2δ
Common variance for differentially expressed
genes and for noise genes

54
Classifier

Feature selection based on univariate t-tests for
differential expression at significance level α
Simple linear classifier with equal weights
(except for sign) for all selected genes. Power
for selecting each of the informative genes that
are differentially expressed by mean difference
2δ is 1-β(n)

For 2 classes of equal prevalence, let λ1 denote
the largest eigenvalue of the covariance matrix
of informative genes. Then

58
Optimal significance level cutoffs for gene
selection. 50 differentially expressed genes out
of 22,000 genes on the microarrays

2δ/σ    n=10     n=30      n=50
1       0.167    0.003     0.00068
1.25    0.085    0.0011    0.00035
1.5     0.045    0.00063   0.00016
1.75    0.026    0.00036   0.00006
2       0.015    0.0002    0.00002

61
BRB-ArrayTools

Contains analysis tools that I have selected as
valid and useful
Analysis wizard and multiple help screens for
biomedical scientists
Imports data from all platforms and major
databases
Automated import of data from NCBI Gene Expression
Omnibus

62
Predictive Classifiers in BRB-ArrayTools

Classifiers
Diagonal linear discriminant
Compound covariate
Bayesian compound covariate
Support vector machine with inner product kernel
K-nearest neighbor
Nearest centroid
Shrunken centroid (PAM)
Random forest
Tree of binary classifiers for k-classes
Survival risk-group
Supervised pcs

Feature selection options
Univariate t/F statistic
Hierarchical variance option
Restricted by fold effect
Univariate classification power
Recursive feature elimination
Top-scoring pairs
Validation methods
Split-sample
LOOCV
Repeated k-fold CV
.632 bootstrap

63
Selected Features of BRB-ArrayTools

Multivariate permutation tests for class
comparison to control number and proportion of
false discoveries with specified confidence level
Permits blocking by another variable, pairing of
data, averaging of technical replicates
SAM
Fortran implementation 7X faster than R versions
Extensive annotation for identified genes
Internal annotation of NetAffx, Source, Gene
Ontology, Pathway information
Links to annotations in genomic databases
Find genes correlated with quantitative factor
while controlling number or proportion of false
discoveries
Find genes correlated with censored survival
while controlling number or proportion of false
discoveries
Analysis of variance

64
Selected Features of BRB-ArrayTools

Gene set enrichment analysis.
Gene Ontology groups, signaling pathways,
transcription factor targets, micro-RNA putative
targets
Automatic data download from Broad Institute
KS and LS test statistics for null hypothesis that
gene set is not enriched
Hotelling's and Goeman's global test of null
hypothesis that no genes in set are
differentially expressed
Goeman's global test for survival data
Class prediction
Multiple classifiers
Complete LOOCV, k-fold CV, repeated k-fold, .632
bootstrap
permutation significance of cross-validated error
rate

65
Selected Features of BRB-ArrayTools

Survival risk-group prediction
Supervised principal components with and without
clinical covariates
Cross-validated Kaplan Meier Curves
Permutation test of cross-validated KM curves
Clustering tools for class discovery with
reproducibility statistics on clusters
Internal access to Eisen's Cluster and TreeView
Visualization tools including rotating 3D
principal components plot exportable to
Powerpoint with rotation controls
Extensible via R plug-in feature
Tutorials and datasets

66
BRB-ArrayTools

Extensive built-in gene annotation and linkage to
gene annotation websites
Publicly available for non-commercial use
http://linus.nci.nih.gov/brb

67
BRB-ArrayTools, May 2007

7188 Registered users
1962 Distinct institutions
68 Countries
365 Citations
Registered users
3951 in US
565 at NIH
275 at NCI
2014 US EDU
754 US Govt (non NIH)
3237 Foreign

68
Countries With Most BRB ArrayTools Registered
Users

France 270
Canada 269
UK 244
Germany 239
Italy 216
Taiwan 196
Netherlands 177
Korea 168
Japan 153
China 150

Spain 146
Australia 130
India 107
Belgium 83
New Zealand 61
Sweden 50
Singapore 46
Brazil 48
Israel 41
Denmark 40

70
Acknowledgements

Kevin Dobbin
Alain Dupuy
Wenyu Jiang
Annette Molinaro
Ruth Pfeiffer
Michael Radmacher
Joanna Shih
Yingdong Zhao
BRB-ArrayTools Development Team
