1
Design & Analysis of Microarray Studies for
Diagnostic & Prognostic Classification
  • Richard Simon, D.Sc.
  • Chief, Biometric Research Branch
  • National Cancer Institute
  • http://linus.nci.nih.gov/brb

2
http://linus.nci.nih.gov/brb
  • http://linus.nci.nih.gov/brb
  • PowerPoint presentations
  • Reprints & Technical Reports
  • BRB-ArrayTools software
  • BRB-ArrayTools Data Archive
  • Sample Size Planning for Targeted Clinical Trials

3
  • Simon R, Korn E, McShane L, Radmacher M, Wright G, Zhao Y.
    Design and Analysis of DNA Microarray Investigations.
    Springer-Verlag, 2003.
  • Radmacher MD, McShane LM, Simon R. A paradigm for class
    prediction using gene expression profiles. Journal of
    Computational Biology 9:505-511, 2002.
  • Simon R, Radmacher MD, Dobbin K, McShane LM. Pitfalls in the
    analysis of DNA microarray data. Journal of the National
    Cancer Institute 95:14-18, 2003.
  • Dobbin K, Simon R. Comparison of microarray designs for class
    comparison and class discovery. Bioinformatics 18:1462-69,
    2002; 19:803-810, 2003; 21:2430-37, 2005; 21:2803-4, 2005.
  • Dobbin K, Simon R. Sample size determination in microarray
    experiments for class comparison and prognostic
    classification. Biostatistics 6:27-38, 2005.
  • Dobbin K, Shih J, Simon R. Questions and answers on design of
    dual-label microarrays for identifying differentially
    expressed genes. Journal of the National Cancer Institute
    95:1362-69, 2003.
  • Wright G, Simon R. A random variance model for detection of
    differential gene expression in small microarray experiments.
    Bioinformatics 19:2448-55, 2003.
  • Korn EL, Troendle JF, McShane LM, Simon R. Controlling the
    number of false discoveries. Journal of Statistical Planning
    and Inference 124:379-408, 2004.
  • Molinaro A, Simon R, Pfeiffer R. Prediction error estimation:
    a comparison of resampling methods. Bioinformatics 21:3301-7,
    2005.
4
  • Simon R. Using DNA microarrays for diagnostic and prognostic
    prediction. Expert Review of Molecular Diagnostics
    3(5):587-595, 2003.
  • Simon R. Diagnostic and prognostic prediction using gene
    expression profiles in high dimensional microarray data.
    British Journal of Cancer 89:1599-1604, 2003.
  • Simon R, Maitournam A. Evaluating the efficiency of targeted
    designs for randomized clinical trials. Clinical Cancer
    Research 10:6759-63, 2004.
  • Maitournam A, Simon R. On the efficiency of targeted clinical
    trials. Statistics in Medicine 24:329-339, 2005.
  • Simon R. When is a genomic classifier ready for prime time?
    Nature Clinical Practice Oncology 1:4-5, 2004.
  • Simon R. An agenda for Clinical Trials: clinical trials in
    the genomic era. Clinical Trials 1:468-470, 2004.
  • Simon R. Development and validation of therapeutically
    relevant multi-gene biomarker classifiers. Journal of the
    National Cancer Institute 97:866-867, 2005.
  • Simon R. A roadmap for developing and validating
    therapeutically relevant genomic classifiers. Journal of
    Clinical Oncology (In Press).
  • Freidlin B, Simon R. Adaptive signature design. Clinical
    Cancer Research (In Press).
  • Simon R. Validation of pharmacogenomic biomarker classifiers
    for treatment selection. Disease Markers (In Press).
  • Simon R. Guidelines for the design of clinical studies for
    development and validation of therapeutically relevant
    biomarkers and biomarker classification systems. In:
    Biomarkers in Breast Cancer (Hayes DF, Gasparini G, eds.),
    Humana Press (In Press).
5
Myth
  • That microarray investigations should be
    unstructured data-mining adventures without clear
    objectives

6
  • Good microarray studies have clear objectives,
    but not generally gene specific mechanistic
    hypotheses
  • Design and analysis methods should be tailored to
    study objectives

7
Good Microarray Studies Have Clear Objectives
  • Class Comparison
  • Find genes whose expression differs among
    predetermined classes
  • Class Prediction
  • Prediction of predetermined class (phenotype)
    using information from gene expression profile
  • Class Discovery
  • Discover clusters of specimens having similar
    expression profiles
  • Discover clusters of genes having similar
    expression profiles

8
Class Comparison and Class Prediction
  • Not clustering problems
  • Global similarity measures generally used for
    clustering arrays may not distinguish classes
  • Don't control multiplicity or provide for
    distinguishing data used for classifier development
    from data used for classifier evaluation
  • Supervised methods
  • Require multiple biological samples from each
    class

9
Levels of Replication
  • Technical replicates
  • RNA sample divided into multiple aliquots and
    re-arrayed
  • Biological replicates
  • Multiple subjects
  • Replication of the tissue culture experiment

10
  • Biological conclusions generally require
    independent biological replicates. The power of
    statistical methods for microarray data depends
    on the number of biological replicates.
  • Technical replicates are useful insurance to
    ensure that at least one good quality array of
    each specimen will be obtained.

11
Class Prediction
  • Predict which tumors will respond to a particular
    treatment
  • Predict which patients will relapse after a
    particular treatment

12
  • Class prediction methods usually have gene
    selection as a component
  • The criteria for gene selection for class
    prediction and for class comparison are
    different
  • For class comparison false discovery rate is
    important
  • For class prediction, predictive accuracy is
    important

13
Clarity of Objectives is Important
  • Patient selection
  • Many microarray studies developing classifiers
    are not therapeutically relevant
  • Analysis methods
  • Many microarray studies use cluster analysis
    inappropriately or misleadingly

14
Microarray Platforms for Developing Predictive
Classifiers
  • Single label arrays
  • Affymetrix GeneChips
  • Dual label arrays using common reference design
  • Dye swaps are unnecessary

15
Common Reference Design
          Array 1   Array 2   Array 3   Array 4
RED       A1        A2        B1        B2
GREEN     R         R         R         R

Ai = ith specimen from class A
Bi = ith specimen from class B
R  = aliquot from reference pool
16
  • The reference generally serves to control
    variation in the size of corresponding spots on
    different arrays and variation in sample
    distribution over the slide.
  • The reference provides a relative measure of
    expression for a given gene in a given sample
    that is less variable than an absolute measure.
  • The reference is not the object of comparison.
  • The relative measure of expression will be
    compared among biologically independent samples
    from different classes.

17
(No Transcript)
18
Myth
  • For two color microarrays, each sample of
    interest should be labeled once with Cy3 and once
    with Cy5 in dye-swap pairs of arrays.

19
Dye Bias
  • Average differences among dyes in label
    concentration, labeling efficiency, photon
    emission efficiency and photon detection are
    corrected by normalization procedures
  • Gene specific dye bias may not be corrected by
    normalization

20
  • Dye swap technical replicates of the same two RNA
    samples are rarely necessary.
  • Using a common reference design, dye swap arrays
    are not necessary for valid comparisons of
    classes since specimens labeled with different
    dyes are never compared.
  • For two-label direct comparison designs for
    comparing two classes, it is more efficient to
    balance the dye-class assignments for independent
    biological specimens than to do dye swap
    technical replicates

21
Balanced Block Design
          Array 1   Array 2   Array 3   Array 4
RED       A1        B2        A3        B4
GREEN     A2        B1        B3        A4

Ai = ith specimen from class A
Bi = ith specimen from class B
22
  • Detailed comparisons of the effectiveness of
    designs
  • Dobbin K, Simon R. Comparison of microarray
    designs for class comparison and class discovery.
    Bioinformatics 18:1462-9, 2002
  • Dobbin K, Shih J, Simon R. Statistical design of
    reverse dye microarrays. Bioinformatics
    19:803-10, 2003
  • Dobbin K, Simon R. Questions and answers on the
    design of dual-label microarrays for identifying
    differentially expressed genes, JNCI
    95:1362-1369, 2003

23
  • Common reference designs are very effective for
    many microarray studies. They are robust, permit
    comparisons among separate experiments, and
    permit many types of comparisons and analyses to
    be performed.
  • For simple two class comparison problems,
    balanced block designs require many fewer arrays
    than common reference designs.
  • Balanced block designs are less efficient for more
    than two classes
  • They are more difficult to apply to more complicated
    class comparison problems.
  • They are not appropriate for class discovery or
    class prediction.
  • Loop designs are less robust, and dominated by
    either common reference designs or balanced block
    designs, and are not suitable for class
    prediction or class discovery.

24
What We Will Not Discuss Today
  • Image analysis
  • Normalization
  • Clustering methods
  • Class comparison
  • SAM and other methods of gene finding
  • FDR (false discovery rate) and methods for
    controlling the number of false positive genes

25
Simple Procedures
  • If each gene is tested for significance at level
    α and there are k genes, then the expected number
    of false discoveries is kα (e.g., testing k = 10,000
    genes at α = 0.001 gives 10 expected false discoveries)
  • To control E(FD) ≤ u
  • Conduct each of the k tests at level α = u/k
  • Bonferroni control of familywise error (FWE) rate
    at level 0.05
  • Conduct each of the k tests at level 0.05/k
  • At least 95% confident that FD = 0

26
False Discovery Rate (FDR)
  • FDR = expected proportion of false discoveries
    among the tests declared significant
  • Studied by Benjamini and Hochberg (1995)

27
(No Transcript)
28
Controlling the Expected False Discovery Rate
  • Compare classes separately by gene and compute
    significance levels pi
  • Rank the p values in order of significance
  • p(1) ≤ p(2) ≤ … ≤ p(k)
  • Find the largest index i for which
  • p(i) k / i ≤ FDR
  • Consider the genes with the i smallest p values
    as statistically significant (see the sketch below)
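
For concreteness, a minimal R sketch of this step-up procedure (an illustration, not part of the original slides), assuming p is a vector of per-gene p values:

```r
# Benjamini-Hochberg step-up procedure: a minimal sketch.
# p: per-gene p values; fdr: target false discovery rate.
bh_select <- function(p, fdr = 0.10) {
  k <- length(p)
  ord <- order(p)                          # rank genes by significance
  ok <- which(p[ord] * k / seq_len(k) <= fdr)
  if (length(ok) == 0) return(integer(0))
  ord[seq_len(max(ok))]                    # genes declared significant
}

# Equivalent selection using base R's built-in adjustment:
# which(p.adjust(p, method = "BH") <= 0.10)
```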

29
Problems With Simple Procedures
  • Bonferroni control of FWE is very conservative
  • p values based on normal theory are not accurate
    at extreme quantiles
  • Difficult to achieve extreme quantiles for
    permutation p values of individual genes
  • Controlling expected number or proportion of
    false discoveries may not provide adequate
    control because distributions of FD and FDP may
    have large variances
  • Multiple comparisons are controlled by adjustment
    of univariate (single gene) p values and so may
    not take advantage of correlation among genes

30
Additional Procedures
  • SAM - Significance Analysis of Microarrays
  • Tusher et al., PNAS, 2001
  • Estimate FDR
  • Statistical properties unclear
  • Empirical Bayes
  • Efron et al., JASA, 2001
  • Related to FDR
  • Ad hoc aspects
  • Multivariate permutation tests
  • Korn et al., 2001 (http://linus.nci.nih.gov/brb)
  • Control number or proportion of false
    discoveries
  • Can specify confidence level of control

31
Multivariate Permutation Procedures (Korn et al.,
2001)
  • Allow statements like
  • FD Procedure: We are 95% confident that the
    (actual) number of false discoveries is no
    greater than 5.
  • FDP Procedure: We are 95% confident that the
    (actual) proportion of false discoveries does not
    exceed 0.10.

32
Class Prediction Model
  • Given a sample with an expression profile vector
    x of log-ratios or log signals and unknown class.

  • Predict which class the sample belongs to
  • The class prediction model is a function f which
    maps from the set of vectors x to the set of
    class labels {1, 2} (if there are two classes).
  • f generally utilizes only some of the components
    of x (i.e. only some of the genes)
  • Specifying the model f involves specifying some
    parameters (e.g. regression coefficients) by
    fitting the model to the data (learning the data).

33
Problems With Many Diagnostic/Prognostic Marker
Studies
  • Are not reproducible
  • Retrospective non-focused analysis
  • Multiplicity problems
  • Inter-laboratory assay variation
  • Have no impact
  • Not therapeutically relevant questions
  • Not therapeutically relevant group of patients
  • Black box predictors

34
Components of Class Prediction
  • Feature (gene) selection
  • Which genes will be included in the model
  • Select model type
  • E.g. Diagonal linear discriminant analysis,
    Nearest-Neighbor,
  • Fitting parameters (regression coefficients) for
    model
  • Selecting value of tuning parameters

35
Do Not Confuse Statistical Methods Appropriate
for Class Comparison with Those Appropriate for
Class Prediction
  • Demonstrating statistical significance of
    prognostic factors is not the same as
    demonstrating predictive accuracy.
  • Demonstrating goodness of fit of a model to the
    data used to develop it is not a demonstration
    of predictive accuracy.
  • Statisticians are used to inference, not
    prediction
  • Most statistical methods were not developed for
    p > n prediction problems

36
Feature Selection
  • Genes that are differentially expressed among the
    classes at a significance level α (e.g. 0.01)
  • The α level is selected only to control the
    number of genes in the model

37
t-test Comparisons of Gene Expression
  • xj ~ N(μj1, σj²) for class 1
  • xj ~ N(μj2, σj²) for class 2
  • H0j: μj1 = μj2

38
Estimation of Within-Class Variance
  • Estimate separately for each gene
  • Limited degrees of freedom
  • Gene list dominated by genes with small fold
    changes and small variances
  • Assume all genes have same variance
  • Poor assumption
  • Random (hierarchical) variance model
  • Wright GW, Simon R. Bioinformatics 19:2448-2455,
    2003
  • Inverse gamma distribution of residual variances
  • Results in exact F (or t) distribution of test
    statistics with increased degrees of freedom for
    error variance
  • For any normal linear model

39
Feature Selection
  • Small subset of genes which together give most
    accurate predictions
  • Combinatorial optimization algorithms
  • Genetic algorithms
  • Little evidence that complex feature selection is
    useful in microarray problems
  • Failure to compare to simpler methods
  • Some published complex methods for selecting
    combinations of features do not appear to have
    been properly evaluated

40
Linear Classifiers for Two Classes
41
Linear Classifiers for Two Classes
  • Fisher linear discriminant analysis
  • Requires estimating correlations among all genes
    selected for model
  • y = vector of class labels
  • Diagonal linear discriminant analysis (DLDA)
    assumes features are uncorrelated
  • Naïve Bayes classifier
  • Compound covariate predictor (Radmacher) and
    Golub's method are similar to DLDA in that they
    can be viewed as weighted voting of univariate
    classifiers (see the sketch below)
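
To make the weighted-voting view concrete, a minimal R sketch of DLDA (an illustration, not BRB-ArrayTools code); x1 and x2 are assumed training matrices for classes 1 and 2, rows = samples and columns = the selected genes:

```r
# Diagonal linear discriminant analysis: a minimal sketch.
dlda_predict <- function(x1, x2, xnew) {
  m1 <- colMeans(x1); m2 <- colMeans(x2)
  # pooled within-class variance, gene by gene (correlations ignored)
  v <- (apply(x1, 2, var) * (nrow(x1) - 1) +
        apply(x2, 2, var) * (nrow(x2) - 1)) / (nrow(x1) + nrow(x2) - 2)
  w <- (m1 - m2) / v                       # gene-specific weights
  # vote: compare the weighted sum to its midpoint between class means
  if (sum(w * xnew) > sum(w * (m1 + m2) / 2)) 1 else 2
}
```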

42
Linear Classifiers for Two Classes
  • Compound covariate predictor
  • Uses the univariate t statistics ti as the gene
    weights, instead of the (mean difference)/variance
    weights used for DLDA

43
Linear Classifiers for Two Classes
  • Support vector machines with inner product kernel
    are linear classifiers with weights determined to
    separate the classes with a hyperplane that
    minimizes the length of the weight vector

44
Support Vector Machine
45
Perceptrons
  • Perceptrons are neural networks with no hidden
    layer and linear transfer functions between input
    and output
  • Number of input nodes equals number of genes
    selected
  • Number of output nodes equals number of classes
    minus 1
  • Number of inputs may be major principal
    components of genes or major principal components
    of informative genes
  • Perceptrons are linear classifiers

46
Naïve Bayes Classifier
  • Expression profiles for class j assumed normal
    with mean vector mj and diagonal covariance
    matrix D
  • Likelihood of expression profile vector x is
    l(x; mj, D)
  • Posterior probability of class j for a case with
    expression profile vector x is proportional to
    pj l(x; mj, D)

47
Compound Covariate Bayes Classifier
  • Compound covariate y = Σ ti xi
  • Sum over the genes selected as differentially
    expressed
  • xi = expression level of the ith selected gene
    for the case whose class is to be predicted
  • ti = t statistic for testing differential
    expression for the ith gene
  • Proceed as for the naïve Bayes classifier but
    using the single compound covariate as predictive
    variable
  • GW Wright et al. PNAS 2005.
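
A minimal R sketch of this predictor (illustrative; same assumed data layout as the DLDA sketch above). Classifying to the class whose mean compound covariate is nearer stands in for the Bayes step; the two agree under equal priors and equal within-class variances of y:

```r
# Compound covariate predictor: a minimal sketch.
ccp_predict <- function(x1, x2, xnew) {
  # t statistics of the selected genes supply the weights
  tstat <- sapply(seq_len(ncol(x1)), function(j)
    t.test(x1[, j], x2[, j])$statistic)
  y1 <- x1 %*% tstat                       # compound covariate, class 1
  y2 <- x2 %*% tstat                       # compound covariate, class 2
  ynew <- sum(tstat * xnew)
  if (abs(ynew - mean(y1)) < abs(ynew - mean(y2))) 1 else 2
}
```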

48
When p > n, the Linear Model is Too Complex
  • It is always possible to find a set of features
    and a weight vector for which the classification
    error on the training set is zero.
  • Why consider more complex models?

49
Myth
  • Complex classification algorithms such as neural
    networks perform better than simpler methods for
    class prediction.

50
  • Artificial intelligence sells to journal
    reviewers and peers who cannot distinguish hype
    from substance when it comes to microarray data
    analysis.
  • Comparative studies have shown that simpler
    methods work as well or better for microarray
    problems because they avoid overfitting the data.

51
Other Simple Methods
  • Nearest neighbor classification
  • Nearest k-neighbors
  • Nearest centroid classification
  • Shrunken centroid classification

52
Nearest Neighbor Classifier
  • To classify a sample in the validation set as
    being in outcome class 1 or outcome class 2,
    determine which sample in the training set its
    gene expression profile is most similar to.
  • Similarity measure used is based on genes
    selected as being univariately differentially
    expressed between the classes
  • Correlation similarity or Euclidean distance
    generally used
  • Classify the sample as being in the same class as
    its nearest neighbor in the training set
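
A minimal R sketch of the rule (illustrative); train is assumed to hold only the genes selected as univariately differentially expressed, one row per training sample:

```r
# Nearest neighbor classification with correlation similarity.
nn_predict <- function(train, labels, xnew) {
  d <- apply(train, 1, function(row) 1 - cor(row, xnew))
  labels[which.min(d)]                     # class of the nearest neighbor
}
```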

53
Other Methods
  • Neural networks
  • Top-scoring pairs
  • CART
  • Random Forests
  • Genetic algorithm based classification

54
Apparent Dimension Reduction Based Methods
  • Principal component regression
  • Supervised principal component regression
  • Partial least squares
  • Stepwise logistic regression

55
When There Are More Than 2 Classes
  • Nearest neighbor type methods
  • Decision tree of binary classifiers

56
Decision Tree of Binary Classifiers
  • Partition the set of classes {1, 2, …, K} into two
    disjoint subsets S1 and S2
  • Develop a binary classifier for distinguishing
    the composite classes S1 and S2
  • Compute the cross-validated classification error
    for distinguishing S1 and S2
  • Repeat the above steps for all possible
    partitions in order to find the partition S1 and
    S2 for which the cross-validated classification
    error is minimized
  • If S1 and S2 are not singleton sets, then repeat
    all of the above steps separately for the classes
    in S1 and S2 to optimally partition each of them

57
(No Transcript)
58
Evaluating a Classifier
  • "Prediction is difficult, especially the
    future."
  • Niels Bohr
  • Fit of a model to the same data used to develop
    it is no evidence of prediction accuracy for
    independent data.

59
Evaluating a Classifier
  • Fit of a model to the same data used to develop
    it is no evidence of prediction accuracy for
    independent data
  • Goodness of fit vs prediction accuracy
  • Demonstrating statistical significance of
    prognostic factors is not the same as
    demonstrating predictive accuracy
  • Demonstrating stability of identification of gene
    predictors is not necessary for demonstrating
    predictive accuracy

60
Evaluating a Classifier
  • The classification algorithm includes the
    following parts
  • Determining what type of classifier to use
  • Gene selection
  • Fitting parameters
  • Optimizing with regard to tuning parameters
  • If a re-sampling method such as cross-validation
    is to be used to estimate predictive error of a
    classifier, all aspects of the classification
    algorithm must be repeated for each training set
    and the accuracy of the resulting classifier
    scored on the corresponding validation set

61
Split-Sample Evaluation
  • Training-set
  • Used to select features, select model type,
    determine parameters and cut-off thresholds
  • Test-set
  • Withheld until a single model is fully specified
    using the training-set.
  • Fully specified model is applied to the
    expression profiles in the test-set to predict
    class labels.
  • Number of errors is counted
  • Ideally test set data is from different centers
    than the training data and assayed at a different
    time

62
Leave-one-out Cross Validation
  • Omit sample 1
  • Develop multivariate classifier from scratch on
    training set with sample 1 omitted
  • Predict class for sample 1 and record whether
    prediction is correct

63
Leave-one-out Cross Validation
  • Repeat analysis for training sets with each
    single sample omitted one at a time
  • e = number of misclassifications determined by
    cross-validation
  • Subdivide e for estimation of sensitivity and
    specificity

64
  • Cross validation is only valid if the test set is
    not used in any way in the development of the
    model. Using the complete set of samples to
    select genes violates this assumption and
    invalidates cross-validation.
  • With proper cross-validation, the model must be
    developed from scratch for each leave-one-out
    training set. This means that feature selection
    must be repeated for each leave-one-out training
    set (see the sketch below).
  • The cross-validated estimate of misclassification
    error is an estimate of the prediction error of
    the model fit by applying the specified algorithm
    to the full dataset
  • If you use cross-validation estimates of
    prediction error for a set of algorithms indexed
    by a tuning parameter and select the algorithm
    with the smallest cv error estimate, you do not
    have a valid estimate of the prediction error for
    the selected model
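
A minimal R sketch of proper leave-one-out cross-validation (illustrative; reuses the dlda_predict sketch above and repeats gene selection from scratch inside every loop):

```r
# x: matrix (rows = samples, columns = all genes); y: labels 1 or 2.
loocv_error <- function(x, y, ngenes = 10) {
  errors <- 0
  for (i in seq_len(nrow(x))) {
    xtr <- x[-i, , drop = FALSE]; ytr <- y[-i]
    # gene selection uses the leave-one-out training set only
    p <- apply(xtr, 2, function(g)
      t.test(g[ytr == 1], g[ytr == 2])$p.value)
    sel <- order(p)[seq_len(ngenes)]
    pred <- dlda_predict(xtr[ytr == 1, sel, drop = FALSE],
                         xtr[ytr == 2, sel, drop = FALSE],
                         x[i, sel])
    if (pred != y[i]) errors <- errors + 1
  }
  errors / nrow(x)                         # cross-validated error rate
}
```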

65
Prediction on Simulated Null Data
  • Generation of Gene Expression Profiles
  • 14 specimens (Pi is the expression profile for
    specimen i)
  • Log-ratio measurements on 6000 genes
  • Pi ~ MVN(0, I6000)
  • Can we distinguish between the first 7 specimens
    (Class 1) and the last 7 (Class 2)?
  • Prediction Method
  • Compound covariate prediction (discussed later)
  • Compound covariate built from the log-ratios of
    the 10 most differentially expressed genes.

66
(No Transcript)
67
Invalid Criticisms of Cross-Validation
  • You can always find a set of features that will
    provide perfect prediction for the training and
    test sets.
  • For complex models, there may be many sets of
    features that provide zero training errors.
  • A modeling strategy that either selects among
    those sets or aggregates among those models will
    have a generalization error that is validly
    estimated by cross-validation.

68
(No Transcript)
69
(No Transcript)
70
Simulated Data: 40 cases, 10 genes selected from
5000
71
DLBCL Data
72
Simulated Data: 40 cases
73
Permutation Distribution of Cross-validated
Misclassification Rate of a Multivariate
Classifier
  • Randomly permute class labels and repeat the
    entire cross-validation
  • Re-do for all (or 1000) random permutations of
    class labels
  • Permutation p value is fraction of random
    permutations that gave as few misclassifications
    as e in the real data
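
A minimal R sketch of this test (illustrative; reuses the loocv_error sketch above):

```r
# Permutation p value for the cross-validated error rate.
cv_permutation_p <- function(x, y, nperm = 1000) {
  e_obs <- loocv_error(x, y)
  e_perm <- replicate(nperm, loocv_error(x, sample(y)))  # relabel, redo CV
  mean(e_perm <= e_obs)    # fraction with as few misclassifications as e
}
```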

74
Gene-Expression Profiles in Hereditary Breast
Cancer
  • Breast tumors studied
  • 7 BRCA1 tumors
  • 8 BRCA2 tumors
  • 7 sporadic tumors
  • Log-ratio measurements of 3226 genes for each
    tumor after initial data filtering

RESEARCH QUESTION: Can we distinguish BRCA1+ from
BRCA1− cancers and BRCA2+ from BRCA2− cancers
based solely on their gene expression profiles?

75
BRCA1
76
BRCA2
77
Classification of BRCA2 Germline Mutations
78
Common Problems With Internal Classifier
Validation
  • Pre-selection of genes using entire dataset
  • Failure to consider optimization of tuning
    parameters as part of the classification algorithm
  • Varma & Simon, BMC Bioinformatics 2006
  • Erroneous use of predicted class in regression
    model

79
Incomplete (incorrect) Cross-Validation
  • Publications are using all the data to select
    genes and then cross-validating only the
    parameter estimation component of model
    development
  • Highly biased
  • Many published complex methods make strong
    claims based on incorrect cross-validation.
  • Frequently seen in complex feature set selection
    algorithms
  • Some software encourages inappropriate
    cross-validation

80
Incomplete (incorrect) Cross-Validation
  • Let M(b,D) denote a classification model
    developed on a set of data D, where the model is
    of a particular type that is parameterized by a
    scalar b.
  • Use cross-validation to estimate the
    classification error of M(b,D) for a grid of
    values of b: Err(b).
  • Select the value b* of b that minimizes Err(b).
  • Caution: Err(b*) is a biased estimate of the
    prediction error of M(b*,D).
  • This error is made in some commonly used methods

81
Complete (correct) Cross-Validation
  • Construct a learning set D as a subset of the
    full set S of cases.
  • Use cross-validation restricted to D in order to
    estimate the classification error of M(b,D) for a
    grid of values of b: Err(b).
  • Select the value b* of b that minimizes Err(b).
  • Use the model M(b*,D) to predict for the cases in
    S but not in D (S−D) and compute the error rate
    in S−D
  • Repeat this full procedure for different learning
    sets D1, D2, … and average the error rates of the
    models M(bi*,Di) over the corresponding
    validation sets S−Di (see the sketch below)
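
A minimal R sketch of the complete procedure (illustrative; fit_predict and inner_cv_err are hypothetical user-supplied helpers, not BRB-ArrayTools functions):

```r
# fit_predict(xtr, ytr, xte, b): labels predicted for xte by a type-b
# classifier built on (xtr, ytr); inner_cv_err(x, y, b): its CV error.
nested_cv_error <- function(x, y, b_grid, fit_predict, inner_cv_err,
                            nfold = 5) {
  fold <- sample(rep(seq_len(nfold), length.out = nrow(x)))
  errs <- numeric(nfold)
  for (f in seq_len(nfold)) {
    tr <- fold != f
    # choose the tuning parameter using the learning set D only
    b_star <- b_grid[which.min(sapply(b_grid, function(b)
      inner_cv_err(x[tr, , drop = FALSE], y[tr], b)))]
    # score the tuned model on the held-out cases S - D
    pred <- fit_predict(x[tr, , drop = FALSE], y[tr],
                        x[!tr, , drop = FALSE], b_star)
    errs[f] <- mean(pred != y[!tr])
  }
  mean(errs)                               # estimate for the full procedure
}
```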

82
Does an Expression Profile Classifier Predict
More Accurately Than Standard Prognostic
Variables?
  • Not an issue of which variables are significant
    after adjusting for which others or which are
    independent predictors
  • Predictive accuracy and inference are different
  • The two classifiers can be compared by ROC
    analysis as functions of the threshold for
    classification
  • The predictiveness of the expression profile
    classifier can be evaluated within levels of the
    classifier based on standard prognostic variables

83
Does an Expression Profile Classifier Predict
More Accurately Than Standard Prognostic
Variables?
  • Some publications fit a logistic model to standard
    covariates and the cross-validated predictions of
    expression profile classifiers
  • This is valid only with split-sample analysis
    because the cross-validated predictions are not
    independent

84
External Validation
  • Should address clinical utility, not just
    predictive accuracy
  • Therapeutic relevance
  • Should incorporate all sources of variability
    likely to be seen in broad clinical application
  • Expression profile assay distributed over time
    and space
  • Real world tissue handling
  • Patients selected from different centers than
    those used for developing the classifier

85
Survival Risk Group Prediction
  • Evaluate individual genes by fitting single-variable
    proportional hazards regression models to the log
    signal or log ratio for each gene
  • Select genes based on a p-value threshold for the
    single-gene PH regressions
  • Compute first k principal components of the
    selected genes
  • Fit a PH regression model with the k PCs as
    predictors. Let b1, …, bk denote the estimated
    regression coefficients
  • To predict for a case with expression profile
    vector x, compute the k supervised PCs y1, …,
    yk and the predictive index l = b1y1 + … + bkyk
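
A minimal R sketch of the supervised principal components index (illustrative; assumes the survival package, with p_cut and k as example settings):

```r
library(survival)

# x: matrix (rows = cases, columns = genes); time, status: survival data.
spc_index <- function(x, time, status, p_cut = 0.001, k = 2) {
  # single-gene proportional hazards regressions
  p <- apply(x, 2, function(g)
    summary(coxph(Surv(time, status) ~ g))$coefficients[, "Pr(>|z|)"])
  sel <- which(p <= p_cut)
  pc <- prcomp(x[, sel])$x[, seq_len(k)]   # first k PCs of selected genes
  fit <- coxph(Surv(time, status) ~ pc)    # PH model on the k PCs
  as.vector(pc %*% coef(fit))              # predictive index b1y1 + ... + bkyk
}
```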

86
Survival Risk Group Prediction
  • LOOCV loop
  • Create training set by omitting ith case
  • Develop supervised pc PH model for training set
  • Compute cross-validated predictive index for ith
    case using PH model developed for training set
  • Compute predictive risk percentile of predictive
    index for ith case among predictive indices for
    cases in the training set

87
Survival Risk Group Prediction
  • Plot Kaplan-Meier survival curves for cases with
    cross-validated risk percentiles above 50% and
    for cases with cross-validated risk percentiles
    below 50%
  • Or for however many risk groups and thresholds are
    desired
  • Compute the log-rank statistic comparing the
    cross-validated Kaplan-Meier curves

88
Survival Risk Group Prediction
  • Repeat the entire procedure for all (or large
    number) of permutations of survival times and
    censoring indicators to generate the null
    distribution of the log-rank statistic
  • The usual chi-square null distribution is not
    valid because the cross-validated risk
    percentiles are correlated among cases
  • Evaluate statistical significance of the
    association of survival and expression profiles
    by referring the log-rank statistic for the
    unpermuted data to the permutation null
    distribution

89
Survival Risk Group Prediction
  • Other approaches to survival risk group
    prediction have been published
  • The supervised pc method is implemented in
    BRB-ArrayTools
  • BRB-ArrayTools also provides for comparing the
    risk group classifier based on expression
    profiles to one based on standard covariates and
    one based on a combination of both types of
    variables

90
Sample Size Planning References
  • K Dobbin, R Simon. Sample size determination in
    microarray experiments for class comparison and
    prognostic classification. Biostatistics 6:27-38,
    2005
  • K Dobbin, R Simon. Sample size planning for
    developing classifiers using high dimensional DNA
    microarray data. Biostatistics (In Press)

91
Sample Size Planning
  • GOAL: Identify genes differentially expressed in
    a comparison of two pre-defined classes of
    specimens on dual-label arrays using reference
    design or single label arrays
  • Compare classes separately by gene with
    adjustment for multiple comparisons
  • Approximate expression levels (log ratio or log
    signal) as normally distributed
  • Determine number of samples n/2 per class to give
    power 1−β for detecting mean difference δ at
    level α

92
Comparing 2 equal-size classes
  • n = 4σ²(zα/2 + zβ)² / δ²
  • where δ = mean log-ratio difference between
    classes
  • σ = within-class standard deviation of
    biological replicates
  • zα/2, zβ = standard normal percentiles
  • Choose α small, e.g. α = 0.001
  • Use percentiles of the t distribution for improved
    accuracy
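
A minimal R sketch of this formula (illustrative; normal approximation, so slightly optimistic relative to the t-based version):

```r
# delta: mean log-ratio difference; sigma: within-class SD.
total_n <- function(delta, sigma, alpha = 0.001, beta = 0.05) {
  4 * sigma^2 * (qnorm(1 - alpha / 2) + qnorm(1 - beta))^2 / delta^2
}

ceiling(total_n(delta = 1, sigma = 0.5) / 2)  # samples per class; here 13
```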

93
Total Number of Samples for Two Class Comparison
94
Dual Label Arrays With Reference Design: Pools of
k Biological Samples

95
  • m = number of technical replicates per sample
  • k = number of samples per pool
  • n = total number of arrays
  • δ = mean difference between classes in log
    signal
  • σ² = biological variance within class
  • τ² = technical variance
  • α = significance level, e.g. 0.001
  • 1−β = power
  • z = normal percentiles (use t percentiles for
    better accuracy)

96
α = 0.001, β = 0.05, δ = 1, σ² + 2τ² = 0.25, σ²/τ² = 4
97
α = 0.001, β = 0.05, δ = 1, σ² + 2τ² = 0.25, σ²/τ² = 4
98
Number of Events Needed to Detect Gene Specific
Effects on Survival
  • σ = standard deviation in log2 ratios for each
    gene
  • δ = hazard ratio (>1) corresponding to a 2-fold
    change in gene expression
  • α = 1/N for 1 expected false positive gene
    identified per N genes examined
  • β = 0.05 for a 5% false negative rate
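
The slide's own formula was lost with its image; as an illustration, a sketch of the standard proportional-hazards approximation D = (zα/2 + zβ)² / (σ ln δ)², which is an assumption here rather than text from the slide:

```r
# sigma: SD of log2 ratios; delta: hazard ratio per 2-fold change.
# The formula below is an assumed standard PH approximation.
events_needed <- function(sigma, delta, alpha = 0.001, beta = 0.05) {
  ceiling((qnorm(1 - alpha / 2) + qnorm(1 - beta))^2 /
          (sigma * log(delta))^2)
}

events_needed(sigma = 0.75, delta = 2)     # about 91 events
```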

99
Number of Events Required to Detect Gene Specific
Effects on Survival (α = 0.001, β = 0.05)
100
Sample Size Planning for Classifier Development
  • The expected value (over training sets) of the
    probability of correct classification PCC(n)
    should be within a specified tolerance γ of the
    maximum achievable PCC(∞)

101
Probability Model
  • Two classes
  • Log expression or log ratio MVN in each class
    with common covariance matrix
  • m differentially expressed genes
  • p-m noise genes
  • Expression of differentially expressed genes are
    independent of expression for noise genes
  • All differentially expressed genes have same
    inter-class mean difference 2δ
  • Common variance for differentially expressed
    genes and for noise genes

102
Classifier
  • Feature selection based on univariate t-tests for
    differential expression at significance level α
  • Simple linear classifier with equal weights
    (except for sign) for all selected genes. Power
    for selecting each of the informative genes that
    are differentially expressed by mean difference
    2δ is 1−β(n)

103
  • For 2 classes of equal prevalence, let λ1 denote
    the largest eigenvalue of the covariance matrix
    of informative genes. Then

104
(No Transcript)
105
Sample size as a function of effect size (log-base-2
fold-change between classes divided by standard
deviation). Two different tolerances shown. Each
class is equally represented in the population;
22,000 genes on an array.
106
(No Transcript)
107
Optimal significance level cutoffs for gene
selection: 50 differentially expressed genes out
of 22,000 genes on the microarrays.
108
γ = 0.05, p = 22,000 genes, gene standard deviation
s = 0.75.
109
a) Power example with 60 samples, effect size
2d/s = 1/0.71 for differentially expressed genes,
α = 0.001 cutoff for gene selection. As the
proportion in the under-represented class gets
smaller, the power to identify differentially
expressed genes decreases.
110
b) PCC(60) as a function of the proportion in the
under-represented class. Parameter settings same
as a), with 10 differentially expressed genes
among 22,000 total genes. If the proportion in
the under-represented class is small (e.g.,
111
(No Transcript)
112
(No Transcript)
113
(No Transcript)
114
(No Transcript)
115
(No Transcript)
116
BRB-ArrayTools
  • Integrated software package using Excel-based
    user interface but state-of-the-art analysis
    methods programmed in R, Java & Fortran
  • Publicly available for non-commercial use

http://linus.nci.nih.gov/brb
117
Selected Features of BRB-ArrayTools
  • Multivariate permutation tests for class
    comparison to control number and proportion of
    false discoveries with specified confidence
    level
  • Permits blocking by another variable, pairing of
    data, averaging of technical replicates
  • SAM
  • Fortran implementation 7X faster than R versions
  • Extensive annotation for identified genes
  • Internal annotation of NetAffx, Source, Gene
    Ontology, Pathway information
  • Links to annotations in genomic databases
  • Find genes correlated with quantitative factor
    while controlling number or proportion of false
    discoveries
  • Find genes correlated with censored survival
    while controlling number or proportion of false
    discoveries
  • Analysis of variance
  • Time course analysis
  • Log intensities for non-reference designs
  • Mixed models

118
Selected Features of BRB-ArrayTools
  • Gene set enrichment analysis
  • Find Gene Ontology groups and signaling pathways
    that are differentially expressed among classes
  • Class prediction
  • DLDA, CCP, Nearest Neighbor, Nearest Centroid,
    Shrunken Centroids, SVM, Random Forests
  • Complete LOOCV, k-fold CV, repeated k-fold,
    .632 bootstrap
  • Permutation significance of cross-validated
    error rate

119
Selected Features of BRB-ArrayTools
  • Clustering tools for class discovery with
    reproducibility statistics on clusters
  • Internal access to Eisen's Cluster and TreeView
  • Visualization tools including rotating 3D
    principal components plot exportable to
    PowerPoint with rotation controls
  • Extensible via R plug-in feature
  • Tutorials and datasets

120
Acknowledgements
  • Kevin Dobbin
  • Michael Radmacher
  • Sudhir Varma
  • Yingdong Zhao
  • BRB-ArrayTools Development Team
  • Amy Lam
  • Ming-Chung Li
  • Supriya Menezes