Statistical Aspects of the Development and Validation of Predictive Classifiers for High Dimensional Data


Transcript and Presenter's Notes

Title: Statistical Aspects of the Development and Validation of Predictive Classifiers for High Dimensional Data


1
Statistical Aspects of the Development and
Validation of Predictive Classifiers for High
Dimensional Data
  • Richard Simon, D.Sc.
  • Chief, Biometric Research Branch
  • National Cancer Institute
  • linus.nci.nih.gov/brb

2
BRB Website: http://linus.nci.nih.gov/brb
  • Powerpoint presentations and audio files
  • Reprints and Technical Reports
  • BRB-ArrayTools software
  • BRB-ArrayTools Data Archive
  • Sample Size Planning for Targeted Clinical Trials

3
DNA Microarray Gene Expression Assay
  • Extract mRNA from cells of interest
  • Each mRNA molecule is transcribed from a single
    gene and has a linear sequence complementary to
    that gene
  • One mRNA molecule is translated into one protein
    molecule
  • Reverse transcribe the mRNA to cDNA, introducing a
    fluorescently labeled dye into each molecule
  • Distribute the cDNA sample over a solid surface
    containing DNA probes representing all genes
  • The probes are arranged in a grid on the surface,
    with each gene having a known address
  • Let the molecules from the sample hybridize with
    the probes for the corresponding genes
  • Wash off excess sample and illuminate the surface
    with a laser at the frequency corresponding to the
    dye
  • Measure the intensity of fluorescence over each probe

4
Resulting Data
  • Intensity over a probe is approximately
    proportional to abundance of mRNA molecules in
    the sample for the gene corresponding to the
    probe
  • 40,000 variables measured for each specimen

5
Good Microarray Studies Have Clear Objectives
  • Class Comparison (Gene Finding)
  • Find genes whose expression differs among
    predetermined classes
  • Tumor versus normal
  • After infection of cells by virus versus before
    infection
  • Class Prediction
  • Prediction of predetermined class using gene
    expression profile
  • Predicting whether a patient will or will not
    respond to a given treatment
  • Class Discovery
  • Discover clusters of specimens having similar
    expression profiles
  • Find genes that are in the same biochemical
    pathways

6
Class Comparison and Class Prediction
  • Not clustering problems
  • Supervised methods

7
Components of Class Prediction
  • Feature (gene) selection
  • Which genes will be included in the model
  • Select model type
  • E.g. diagonal linear discriminant analysis,
    nearest neighbor, etc.
  • Fitting parameters (regression coefficients) for
    the model
  • Selecting values of tuning parameters

8
Simple Gene Selection
  • Select genes that are differentially expressed
    among the classes at a significance level α (e.g.
    0.01)
  • The α level is a tuning parameter
  • Number of false discoveries is not of direct
    relevance for prediction
  • For prediction it is usually more serious to
    exclude an informative variable than to include
    some noise variables (see the sketch below)
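A minimal sketch of this selection step, assuming log-expression values in an n_samples x n_genes array X and 0/1 class labels y (names are illustrative, not from the slides); alpha is the tuning parameter referred to above:

```python
import numpy as np
from scipy import stats

def select_genes(X, y, alpha=0.01):
    """Indices of genes differentially expressed between the two
    classes at significance level alpha (two-sample t-tests).

    X : (n_samples, n_genes) array of log-expression values
    y : (n_samples,) array of 0/1 class labels
    """
    _, p = stats.ttest_ind(X[y == 0], X[y == 1], axis=0)
    return np.where(p < alpha)[0]
```

Lowering alpha admits fewer noise genes but increases the chance of excluding informative ones, which is the more serious error for prediction.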

9
Optimal significance level cutoffs for gene
selection. 50 differentially expressed genes out
of 22,000 genes on the microarrays
2δ/σ    n=10    n=30      n=50
1       0.167   0.003     0.00068
1.25    0.085   0.0011    0.00035
1.5     0.045   0.00063   0.00016
1.75    0.026   0.00036   0.00006
2       0.015   0.0002    0.00002
10
Complex Gene Selection
  • Select a set of genes which together give most
    accurate predictions
  • Genetic algorithms
  • Little evidence that complex feature selection is
    useful in microarray problems

11
Linear Classifiers for Two Classes
12
Linear Classifiers for Two Classes
  • Fisher linear discriminant analysis
  • Diagonal linear discriminant analysis (DLDA)
  • Compound covariate predictor
  • Golub's weighted voting method
  • Support vector machines with inner product kernel
  • Perceptrons

13
Fisher LDA
14
The Compound Covariate Predictor (CCP)
  • Motivated by J. Tukey, Controlled Clinical
    Trials, 1993
  • A compound covariate is built from the basic
    covariates (log-ratios)
  • t_j is the two-sample t-statistic for gene j
  • x_ij is the log-expression measure of sample i for
    gene j
  • The compound covariate for sample i is the sum of
    t_j x_ij over the selected genes
  • Threshold of classification: midpoint of the CCP
    means for the two classes (see the sketch below)
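A hedged sketch (not the authors' code) of a compound covariate predictor as described above: genes passing a univariate t-test are weighted by their t-statistics t_j, and the classification threshold is the midpoint of the two class means of the compound covariate. Function names and the alpha default are assumptions.

```python
import numpy as np
from scipy import stats

def fit_ccp(X, y, alpha=0.01):
    """Fit a compound covariate predictor.
    X: (n_samples, n_genes) log-ratios; y: 0/1 class labels."""
    t, p = stats.ttest_ind(X[y == 0], X[y == 1], axis=0)
    genes = np.where(p < alpha)[0]       # selected genes
    w = t[genes]                         # weights t_j
    cc = X[:, genes] @ w                 # compound covariate sum_j t_j x_ij
    threshold = (cc[y == 0].mean() + cc[y == 1].mean()) / 2
    return genes, w, threshold

def predict_ccp(X_new, genes, w, threshold):
    """With t computed as (class 0 minus class 1), larger compound
    covariates fall on the class-0 side of the midpoint."""
    cc = X_new[:, genes] @ w
    return np.where(cc > threshold, 0, 1)
```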

15
Linear Classifiers for Two Classes
  • The compound covariate predictor weights the
    selected genes by their t-statistics
  • Instead of the standardized mean-difference
    weights used by DLDA

16
Support Vector Machine
17
Perceptrons
  • Perceptrons are neural networks with no hidden
    layer and linear transfer functions between input
    and output
  • Number of input nodes equals number of genes
    selected
  • Number of output nodes equals number of classes
    minus 1
  • Number of inputs may be major principal
    components of genes or major principal components
    of informative genes
  • Perceptrons are linear classifiers

18
When p >> n
  • It is always possible to find a set of features
    and a weight vector for which the classification
    error on the training set is zero.
  • There is generally not sufficient information in
    p >> n training sets to effectively use more
    complex methods

19
Myth
  • Complex classification algorithms such as neural
    networks perform better than simpler methods for
    class prediction.

20
  • Comparative studies have indicated that simpler
    methods usually work as well or better for
    microarray problems because they avoid
    overfitting the data.

21
Other Simple Methods
  • Nearest neighbor classification
  • Nearest k-neighbors
  • Nearest centroid classification
  • Shrunken centroid classification

22
Nearest Neighbor Classifier
  • To classify a sample in the validation set as
    being in outcome class 1 or outcome class 2,
    determine which sample in the training set its
    gene expression profile is most similar to.
  • Similarity measure used is based on genes
    selected as being univariately differentially
    expressed between the classes
  • Correlation similarity or Euclidean distance
    generally used
  • Classify the sample as being in the same class as
    its nearest neighbor in the training set (see the
    sketch below)
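A minimal sketch of 1-nearest-neighbor prediction with correlation similarity, assuming the columns of the NumPy arrays are the genes already selected on the training set (the sketch is illustrative, not BRB-ArrayTools code):

```python
import numpy as np

def nn_predict(X_train, y_train, X_new):
    """Classify each row of X_new by the class of its most similar
    training profile, using Pearson correlation as the similarity."""
    A = X_train - X_train.mean(axis=1, keepdims=True)
    B = X_new - X_new.mean(axis=1, keepdims=True)
    A = A / np.linalg.norm(A, axis=1, keepdims=True)
    B = B / np.linalg.norm(B, axis=1, keepdims=True)
    sim = B @ A.T                  # (n_new, n_train) correlation matrix
    return y_train[np.argmax(sim, axis=1)]
```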

23
Evaluating a Classifier
  • Most statistical methods were not developed for
    p >> n prediction problems
  • Fit of a model to the same data used to develop
    it is no evidence of prediction accuracy for
    independent data
  • Demonstrating statistical significance of
    prognostic factors is not the same as
    demonstrating predictive accuracy
  • Testing whether analysis of independent data
    results in selection of the same set of genes is
    not an appropriate test of predictive accuracy of
    a classifier

24
Internal Validation of a Classifier
  • Re-substitution estimate
  • Develop classifier on dataset, test predictions
    on same data
  • Very biased for p >> n
  • Split-sample validation
  • Cross-validation

25
Split-Sample Evaluation
  • Training-set
  • Used to select features, select model type,
    determine parameters and cut-off thresholds
  • Test-set
  • Withheld until a single model is fully specified
    using the training-set.
  • Fully specified model is applied to the
    expression profiles in the test-set to predict
    class labels.
  • Number of errors is counted

26
Leave-one-out Cross Validation
  • Omit sample 1
  • Develop multivariate classifier from scratch on
    training set with sample 1 omitted
  • Predict class for sample 1 and record whether
    prediction is correct

27
Leave-one-out Cross Validation
  • Repeat analysis for training sets with each
    single sample omitted one at a time
  • e = number of misclassifications determined by
    cross-validation
  • Subdivide e for estimation of sensitivity and
    specificity

28
  • With proper cross-validation, the model must be
    developed from scratch for each leave-one-out
    training set. This means that feature selection
    must be repeated for each leave-one-out training
    set.
  • Simon R, Radmacher MD, Dobbin K, McShane LM.
    Pitfalls in the analysis of DNA microarray data.
    Journal of the National Cancer Institute
    95:14-18, 2003.
  • The cross-validated estimate of misclassification
    error is an estimate of the prediction error for
    the model obtained by applying the specified
    algorithm to the full dataset (see the sketch
    below)
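A sketch of a complete leave-one-out cross-validation of the compound covariate predictor, illustrating the point above: gene selection is redone inside every fold using only the n-1 training samples. Selecting a fixed number of top-ranked genes per fold is an assumption made to keep the example short.

```python
import numpy as np
from scipy import stats

def ccp_loocv_error(X, y, n_genes=10):
    """LOOCV error of a compound covariate predictor, with gene
    selection repeated from scratch within each fold."""
    n = len(y)
    errors = 0
    for i in range(n):
        keep = np.arange(n) != i
        Xtr, ytr = X[keep], y[keep]
        # re-select genes using only this fold's training samples
        t, _ = stats.ttest_ind(Xtr[ytr == 0], Xtr[ytr == 1], axis=0)
        genes = np.argsort(-np.abs(t))[:n_genes]
        w = t[genes]
        cc = Xtr[:, genes] @ w
        cut = (cc[ytr == 0].mean() + cc[ytr == 1].mean()) / 2
        pred = 0 if X[i, genes] @ w > cut else 1
        errors += int(pred != y[i])
    return errors / n
```

Selecting the genes once on the full dataset and cross-validating only the final classifier gives the incomplete cross-validation criticized in the paper cited above.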

29
Prediction on Simulated Null Data
  • Generation of Gene Expression Profiles
  • 14 specimens (Pi is the expression profile for
    specimen i)
  • Log-ratio measurements on 6000 genes
  • Pi ~ MVN(0, I_6000)
  • Can we distinguish between the first 7 specimens
    (Class 1) and the last 7 (Class 2)?
  • Prediction Method
  • Compound covariate prediction
  • Compound covariate built from the log-ratios of
    the 10 most differentially expressed genes (see
    the sketch below)
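A sketch reproducing the spirit of this null simulation, reusing the ccp_loocv_error helper sketched earlier (an assumption of this write-up, not code from the slides). On such data the resubstitution error is typically zero even though the class labels are unrelated to the expression data, while the properly cross-validated error hovers near 50%:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
X = rng.standard_normal((14, 6000))   # Pi ~ MVN(0, I_6000)
y = np.array([0] * 7 + [1] * 7)       # arbitrary class labels

# Resubstitution: select genes and fit the CCP on all 14 samples,
# then evaluate on those same samples
t, _ = stats.ttest_ind(X[y == 0], X[y == 1], axis=0)
genes = np.argsort(-np.abs(t))[:10]
cc = X[:, genes] @ t[genes]
cut = (cc[y == 0].mean() + cc[y == 1].mean()) / 2
print("Resubstitution error:", np.mean(np.where(cc > cut, 0, 1) != y))

# Proper LOOCV with gene selection repeated inside each fold
print("LOOCV error:", ccp_loocv_error(X, y, n_genes=10))
```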

30
(No Transcript)
31
(No Transcript)
32
Major Flaws Found in 40 Studies Published in 2004
  • Inadequate control of multiple comparisons in
    gene finding
  • 9/23 studies had unclear or inadequate methods to
    deal with false positives
  • 10,000 genes x .05 significance level = 500 false
    positives
  • Misleading report of prediction accuracy
  • 12/28 reports based on incomplete
    cross-validation
  • Misleading use of cluster analysis
  • 13/28 studies invalidly claimed that expression
    clusters based on differentially expressed genes
    could help distinguish clinical outcomes
  • 50% of studies contained one or more major flaws

33
Myth
  • Split sample validation is superior to LOOCV or
    10-fold CV for estimating prediction error

34
(No Transcript)
35
Comparison of Internal Validation
Methods (Molinaro, Pfeiffer and Simon)
  • For small sample sizes, LOOCV is much more
    accurate than split-sample validation
  • Split sample validation over-estimates prediction
    error
  • For small sample sizes, LOOCV is preferable to
    10-fold, 5-fold cross-validation or repeated
    k-fold versions
  • For moderate sample sizes, 10-fold is preferable
    to LOOCV
  • Some claims for bootstrap resampling for
    estimating prediction error are not valid for
    p >> n problems

36
(No Transcript)
37
Simulated Data: 40 cases, 10 genes selected from 5000
Method               Estimate   Std Deviation
True                 .078
Resubstitution       .007       .016
LOOCV                .092       .115
10-fold CV           .118       .120
5-fold CV            .161       .127
Split sample 1-1     .345       .185
Split sample 2-1     .205       .184
.632 bootstrap       .274       .084
38
Simulated Data: 40 cases
Method               Estimate   Std Deviation
True                 .078
10-fold              .118       .120
Repeated 10-fold     .116       .109
5-fold               .161       .127
Repeated 5-fold      .159       .114
Split 1-1            .345       .185
Repeated split 1-1   .371       .065
39
DLBCL Data
Method           Bias    Std Deviation   MSE
LOOCV            -.019   .072            .008
10-fold CV       -.007   .063            .006
5-fold CV         .004   .07             .007
Split 1-1         .037   .117            .018
Split 2-1         .001   .119            .017
.632 bootstrap   -.006   .049            .004
40
Permutation Distribution of Cross-validated
Misclassification Rate of a Multivariate
Classifier
  • Randomly permute class labels and repeat the
    entire cross-validation
  • Re-do for all (or 1000) random permutations of
    class labels
  • Permutation p-value is the fraction of random
    permutations that gave as few misclassifications
    as e in the real data (see the sketch below)
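A sketch of the permutation procedure just described, again assuming the ccp_loocv_error helper from the earlier sketch; the entire cross-validation, including gene selection, is re-run for every permuted label vector.

```python
import numpy as np

def cv_permutation_p(X, y, n_perm=1000, seed=0):
    """Permutation p-value: the fraction of label permutations whose
    fully re-run LOOCV error is as small as that of the real labels."""
    rng = np.random.default_rng(seed)
    observed = ccp_loocv_error(X, y)
    hits = sum(ccp_loocv_error(X, rng.permutation(y)) <= observed
               for _ in range(n_perm))
    return hits / n_perm
```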

41
Gene-Expression Profiles in Hereditary Breast
Cancer
  • Breast tumors studied
  • 7 BRCA1 tumors
  • 8 BRCA2 tumors
  • 7 sporadic tumors
  • Log-ratio measurements of 3226 genes for each
    tumor after initial data filtering

RESEARCH QUESTION: Can we distinguish BRCA1+ from
BRCA1- cancers and BRCA2+ from BRCA2- cancers
based solely on their gene expression profiles?
42
BRCA1
43
BRCA2
44
Classification of BRCA2 Germline Mutations
Classification Method                    LOOCV Prediction Error
Compound Covariate Predictor             14%
Fisher LDA                               36%
Diagonal LDA                             14%
1-Nearest Neighbor                        9%
3-Nearest Neighbor                       23%
Support Vector Machine (linear kernel)   18%
Classification Tree                      45%
45
Does an Expression Profile Classifier Predict
More Accurately Than Standard Prognostic
Variables?
  • Some publications fit logistic model to standard
    covariates and the cross-validated predictions of
    expression profile classifiers
  • This is valid only with split-sample analysis
    because the cross-validated predictions are not
    independent

46
Does an Expression Profile Classifier Predict
More Accurately Than Standard Prognostic
Variables?
  • Not an issue of which variables are significant
    after adjusting for which others or which are
    independent predictors
  • Predictive accuracy and inference are different
  • The predictiveness of the expression profile
    classifier can be evaluated within levels of the
    classifier based on standard prognostic variables

47
Survival Risk Group Prediction
  • LOOCV loop
  • Create training set by omitting ith case
  • Develop PH model for training set
  • Compute predictive index for ith case using PH
    model developed for training set
  • Compute percentile of predictive index for ith
    case among predictive indices for cases in the
    training set

48
Survival Risk Group Prediction
  • Plot Kaplan-Meier survival curves for cases with
    predictive index percentiles above 50% and for
    cases with cross-validated risk percentiles below
    50%
  • Or for however many risk groups and thresholds are
    desired
  • Compute log-rank statistic comparing the
    cross-validated Kaplan Meier curves

49
Survival Risk Group Prediction
  • Evaluate individual genes by fitting single
    variable proportional hazards regression models
    to log expression for each gene
  • Select genes based on p-value threshold for
    single gene PH regressions
  • Compute first k principal components of the
    selected genes
  • Fit a PH regression model with the k PCs as
    predictors. Let b1, ..., bk denote the estimated
    regression coefficients
  • To predict for a case with expression profile
    vector x, compute the k supervised PCs y1, ...,
    yk and the predictive index = b1 y1 + ... + bk yk
    (see the sketch below)
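A hedged sketch of the per-training-set model described above: univariate proportional hazards screening, the first k principal components of the selected genes, a Cox model on those components, and the resulting predictive index. The use of the lifelines package and all names here are assumptions of this sketch, not part of BRB-ArrayTools.

```python
import numpy as np
import pandas as pd
from lifelines import CoxPHFitter  # assumed third-party Cox implementation

def fit_survival_risk_model(X, time, event, alpha=0.01, k=2):
    """Supervised principal components survival model for one training set."""
    selected = []
    for j in range(X.shape[1]):
        df = pd.DataFrame({"x": X[:, j], "time": time, "event": event})
        cph = CoxPHFitter().fit(df, duration_col="time", event_col="event")
        if cph.summary.loc["x", "p"] < alpha:   # single-gene PH regression
            selected.append(j)
    Xs = X[:, selected]
    center = Xs.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xs - center, full_matrices=False)
    pcs = (Xs - center) @ Vt[:k].T              # first k supervised PCs
    df = pd.DataFrame(pcs, columns=[f"pc{i}" for i in range(k)])
    df["time"], df["event"] = time, event
    cox = CoxPHFitter().fit(df, duration_col="time", event_col="event")
    b = cox.params_.loc[[f"pc{i}" for i in range(k)]].to_numpy()
    return selected, center, Vt[:k], b

def predictive_index(x, selected, center, V, b):
    """Predictive index b1*y1 + ... + bk*yk for a new profile x."""
    y = V @ (x[selected] - center)
    return float(b @ y)
```

Within the LOOCV loop of the preceding slides this model is refit from scratch on each reduced training set, and the left-out case's predictive index is ranked against the training-set indices.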

50
Survival Risk Group Prediction
  • Repeat the entire procedure for permutations of
    survival times and censoring indicators to
    generate the null distribution of the log-rank
    statistic
  • The usual chi-square null distribution is not
    valid because the cross-validated risk
    percentiles are correlated among cases
  • Evaluate statistical significance of the
    association of survival and expression profiles
    by referring the log-rank statistic for the
    unpermuted data to the permutation null
    distribution

51
Sample Size Planning References
  • K Dobbin, R Simon. Sample size determination in
    microarray experiments for class comparison and
    prognostic classification. Biostatistics 6:27-38,
    2005
  • K Dobbin, R Simon. Sample size planning for
    developing classifiers using high dimensional DNA
    microarray data. Biostatistics 2007

52
Sample Size Planning for Classifier Development
  • The expected value (over training sets) of the
    probability of correct classification PCC(n)
    should be within a specified tolerance of the
    maximum achievable PCC(∞)

53
Probability Model
  • Two classes
  • Log expression or log ratio MVN in each class
    with common covariance matrix
  • m differentially expressed genes
  • p-m noise genes
  • Expression of differentially expressed genes are
    independent of expression for noise genes
  • All differentially expressed genes have the same
    inter-class mean difference 2δ
  • Common variance for differentially expressed
    genes and for noise genes

54
Classifier
  • Feature selection based on univariate t-tests for
    differential expression at significance level α
  • Simple linear classifier with equal weights
    (except for sign) for all selected genes. Power
    for selecting each of the informative genes that
    are differentially expressed by mean difference
    2δ is 1 - β(n) (see the sketch below)
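A Monte Carlo sketch that estimates PCC(n) under the probability model and classifier of the last two slides; the parameter values (m, p, delta, sigma, alpha) are placeholders, and the function is an illustration rather than the authors' analytic approach.

```python
import numpy as np
from scipy import stats

def estimate_pcc(n, p=5000, m=50, delta=0.5, sigma=1.0, alpha=0.001,
                 n_train_sets=20, n_test=200, seed=0):
    """Average, over simulated training sets of size n (n/2 per class),
    of the probability that the equal-weight classifier built from
    genes passing a level-alpha t-test classifies a new case correctly."""
    rng = np.random.default_rng(seed)
    mu = np.zeros(p)
    mu[:m] = delta                  # informative genes differ in mean by 2*delta
    pcc = 0.0
    for _ in range(n_train_sets):
        X1 = rng.normal(+mu, sigma, size=(n // 2, p))
        X0 = rng.normal(-mu, sigma, size=(n // 2, p))
        t, pval = stats.ttest_ind(X1, X0, axis=0)
        genes = np.where(pval < alpha)[0]
        if genes.size == 0:         # nothing selected: chance-level accuracy
            pcc += 0.5
            continue
        w = np.sign(t[genes])       # equal weights except for sign
        cut = ((X1[:, genes] @ w).mean() + (X0[:, genes] @ w).mean()) / 2
        T1 = rng.normal(+mu[genes], sigma, size=(n_test, genes.size)) @ w
        T0 = rng.normal(-mu[genes], sigma, size=(n_test, genes.size)) @ w
        pcc += ((T1 > cut).mean() + (T0 < cut).mean()) / 2
    return pcc / n_train_sets
```

The sample size criterion above then asks for the smallest n whose estimated PCC(n) is within the chosen tolerance of its large-sample limit.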

55
  • For 2 classes of equal prevalence, let λ1 denote
    the largest eigenvalue of the covariance matrix
    of informative genes. Then

56
(No Transcript)
57
(No Transcript)
58
Optimal significance level cutoffs for gene
selection. 50 differentially expressed genes out
of 22,000 genes on the microarrays
2δ/σ    n=10    n=30      n=50
1       0.167   0.003     0.00068
1.25    0.085   0.0011    0.00035
1.5     0.045   0.00063   0.00016
1.75    0.026   0.00036   0.00006
2       0.015   0.0002    0.00002
59
(No Transcript)
60
(No Transcript)
61
BRB-ArrayTools
  • Contains analysis tools that I have selected as
    valid and useful
  • Analysis wizard and multiple help screens for
    biomedical scientists
  • Imports data from all platforms and major
    databases
  • Automated import of data from NCBI Gene Expression
    Omnibus

62
Predictive Classifiers in BRB-ArrayTools
  • Classifiers
  • Diagonal linear discriminant
  • Compound covariate
  • Bayesian compound covariate
  • Support vector machine with inner product kernel
  • K-nearest neighbor
  • Nearest centroid
  • Shrunken centroid (PAM)
  • Random forest
  • Tree of binary classifiers for k-classes
  • Survival risk-group
  • Supervised principal components
  • Feature selection options
  • Univariate t/F statistic
  • Hierarchical variance option
  • Restricted by fold effect
  • Univariate classification power
  • Recursive feature elimination
  • Top-scoring pairs
  • Validation methods
  • Split-sample
  • LOOCV
  • Repeated k-fold CV
  • .632 bootstrap

63
Selected Features of BRB-ArrayTools
  • Multivariate permutation tests for class
    comparison to control number and proportion of
    false discoveries with specified confidence level
  • Permits blocking by another variable, pairing of
    data, averaging of technical replicates
  • SAM
  • Fortran implementation 7X faster than R versions
  • Extensive annotation for identified genes
  • Internal annotation of NetAffx, Source, Gene
    Ontology, Pathway information
  • Links to annotations in genomic databases
  • Find genes correlated with a quantitative factor
    while controlling the number or proportion of false
    discoveries
  • Find genes correlated with censored survival
    while controlling number or proportion of false
    discoveries
  • Analysis of variance

64
Selected Features of BRB-ArrayTools
  • Gene set enrichment analysis.
  • Gene Ontology groups, signaling pathways,
    transcription factor targets, micro-RNA putative
    targets
  • Automatic data download from Broad Institute
  • LS and KS test statistics for the null hypothesis
    that the gene set is not enriched
  • Hotelling's and Goeman's global tests of the null
    hypothesis that no genes in the set are
    differentially expressed
  • Goeman's global test for survival data
  • Class prediction
  • Multiple classifiers
  • Complete LOOCV, k-fold CV, repeated k-fold, .632
    bootstrap
  • permutation significance of cross-validated error
    rate

65
Selected Features of BRB-ArrayTools
  • Survival risk-group prediction
  • Supervised principal components with and without
    clinical covariates
  • Cross-validated Kaplan Meier Curves
  • Permutation test of cross-validated KM curves
  • Clustering tools for class discovery with
    reproducibility statistics on clusters
  • Internal access to Eisen's Cluster and TreeView
  • Visualization tools including rotating 3D
    principal components plot exportable to
    PowerPoint with rotation controls
  • Extensible via R plug-in feature
  • Tutorials and datasets

66
BRB-ArrayTools
  • Extensive built-in gene annotation and linkage to
    gene annotation websites
  • Publicly available for non-commercial use
  • http://linus.nci.nih.gov/brb

67
BRB-ArrayTools: May 2007
  • 7188 Registered users
  • 1962 Distinct institutions
  • 68 Countries
  • 365 Citations
  • Registered users
  • 3951 in US
  • 565 at NIH
  • 275 at NCI
  • 2014 US EDU
  • 754 US Govt (non NIH)
  • 3237 Foreign

68
Countries With Most BRB ArrayTools Registered
Users
  • France 270
  • Canada 269
  • UK 244
  • Germany 239
  • Italy 216
  • Taiwan 196
  • Netherlands 177
  • Korea 168
  • Japan 153
  • China 150
  • Spain 146
  • Australia 130
  • India 107
  • Belgium 83
  • New Zealand 61
  • Sweden 50
  • Singapore 46
  • Brazil 48
  • Israel 41
  • Denmark 40

69
(No Transcript)
70
Acknowledgements
  • Kevin Dobbin
  • Alain Dupuy
  • Wenyu Jiang
  • Annette Molinaro
  • Ruth Pfeiffer
  • Michael Radmacher
  • Joanna Shih
  • Yingdong Zhao
  • BRB-ArrayTools Development Team