Title: Statistical, Computational, and Informatics Tools for Biomarker Analysis
1Statistical, Computational, and Informatics Tools
for Biomarker Analysis
Methodology Development at the Data Management and Coordinating Center of the Early Detection Research Network
2Early Detection Research Network
EDRN ORGANIZATIONAL STRUCTURE
An infrastructure for supporting
collaborative research on molecular, genetic and
other biomarkers in human cancer detection and
risk assessment.
3Early Detection Research Network
INFRASTRUCTURE
BIOREPOSITORY
- Specimens with matching controls and epidemiological data
- Infrastructure to provide preneoplastic tissues
  - Prostate
  - Lung
  - Ovarian
  - Colon
  - Breast
4Early Detection Research Network
INFRASTRUCTURE
LABORATORY CAPACITY
- Capability in high-throughput molecular and biochemical assays
- Ability to respond to evolving technologies for EDRN needs
- Extensive experience and scale-up ability in proteomics and molecular assays
- Outstanding infrastructure for handling multiple assays and validation requests
5Early Detection Research Network
INFRASTRUCTURE
DATA STORAGE AND MINING
- Outstanding track record in biomarker research
- Statistical and data mining technology
- Statistical and predictive models for multiple biomarkers
- Novel statistical methods to interpret high-throughput data
6Early Detection Research Network
INFRASTRUCTURE
DATA EXCHANGE AND SHARING
- Improving informatics and information flow
- Network web sites
- public web site
- secure web site
- Early Detection Research Network Exchange (ERNE)
- Standardization of data reporting: Common Data Elements (CDEs) developed
7Early Detection Research Network
(EDRN) INFORMATICS AND INFORMATION FLOW
8EARLY DETECTION RESEARCH NETWORK
COLLABORATION
How To Become an Associate Member
- Contact one of the EDRN Principal Investigators to serve as a sponsor for an application. Three types of collaborative opportunities are available:
- Type A: Novel research ideas complementing EDRN ongoing efforts; one year of funding at $100,000
- Type B: Share tools, technology and resources; no time limit
- Type C: Participate in the EDRN meetings and workshops
- For details on how to apply, see http://www.cancer.gov/edrn
9DMCC Statisticians
- Margaret Pepe, Lead of Methodology Group
- Ziding Feng, Principal Investigator
- Yinsheng Qu
- Mary Lou Thompson
- Mark Thornquist
- Yutaka Yasui
10Biomarker Lab Collaborators at Eastern Virginia
Medical School
- Bao-Ling Adam
- John Semmes
- George Wright
11Focus of Presentation
- Design: Phase Structure for Biomarker Research
- Analysis: Statistical Methods for Biomarker Discovery from High-Dimensional Data Sets
12Design: Phase Structure for Biomarker Research
- The three-phase structure for therapeutic trials is well established
- This structure promotes coherent, thorough, efficient development
- A similar structure needs to be developed for biomarker research
13Biomarker Development
- Categorize the process into 5 phases
- Define objectives for each phase
- Define ideal study designs, evaluation, and criteria for proceeding further
- Standardize the process to promote efficiency and rigor
15The Details of Study Design
- Specific Aims
- Subject/Specimen Selection
- Outcome measures
- Evaluation of Results
- Sample Size Calculations
- Limitations / Pitfalls
16Specific Aims
- Phase 1
- Identify leads for potentially useful biomarkers
- Prioritize these leads
- Phase 2
- Determine the sensitivity and specificity or ROC
curve for the clinical biomarker assay in
discriminating clinical cancer from controls
17Specimen Selection -- Cases
- Phase 1
- Cancers that are ultimately serious if not treated early, but treatable in early stage
- Spectrum of sub-types
- Collected at diagnosis
- Phase 2 (same criteria as for Phase 1)
- Wide spectrum of cases
- Clinical specimen at diagnosis
- From target screening population
18Specimen Selection -- Controls
- Phase 1
- Non-cancer tissue: same organ, same patient
- Normal tissue: non-cancer patient
- Benign growth tissue: non-cancer patient
- Phase 2
- From potential target population for screening
19Outcome Measures
- Phase 1
- True positive and false positive rates (binary result)
- True positive rate at the threshold yielding an acceptable false positive rate
- ROC curve
- Phase 2
- Results of clinical biomarker assay
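The Phase 1 outcome measure "true positive rate at a threshold yielding an acceptable false positive rate" can be illustrated with a minimal Python sketch. The function name `tpr_at_fpr`, the convention that values above the threshold count as positive, and the 10% default are assumptions made for this illustration; they do not come from the slides.

```python
import numpy as np

def tpr_at_fpr(cases, controls, max_fpr=0.10):
    """Pick a threshold from the controls so that the empirical false
    positive rate is at most max_fpr, then report the true positive rate.

    Assumption: a marker value strictly above the threshold is a positive test.
    """
    cases = np.asarray(cases, dtype=float)
    controls = np.sort(np.asarray(controls, dtype=float))
    n = controls.size
    # Number of controls allowed to fall above the threshold.
    k = int(np.floor(max_fpr * n + 1e-9))
    if k >= n:
        raise ValueError("max_fpr must be below 1")
    threshold = controls[n - k - 1]
    tpr = float(np.mean(cases > threshold))
    fpr = float(np.mean(controls > threshold))
    return threshold, tpr, fpr

# Hypothetical marker values, for illustration only.
controls = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
cases = [5.0, 8.5, 9.5, 12.0]
threshold, tpr, fpr = tpr_at_fpr(cases, controls, max_fpr=0.10)
```

With ties among the control values the achieved false positive rate can fall below the target, which keeps the rule conservative.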
20Evaluation of Results
- Phase 1
- Algorithms select and prioritize markers that best distinguish tumor from non-tumor tissue
- Initial exploratory studies need confirmation with new validation specimens
- Phase 2
- ROC curves
- ROC regression to determine whether characteristics of cases and/or characteristics of controls affect the biomarker's discriminatory capacity
21Sample Size
- Phase 1
- Should be large enough so that very promising
biomarkers are likely to be selected for phase 2
development
- Phase 2
- Based on confidence intervals for the TPR or FPR, or confidence intervals for the ROC curve at selected critical points
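As a rough illustration of a confidence-interval-based Phase 2 calculation, the sketch below finds the number of subjects needed so that a normal-approximation (Wald) interval for a rate has a chosen half-width. The Wald approximation, the function name, and the example rates are assumptions for illustration; the slides do not say which interval the DMCC tables use.

```python
import math
from statistics import NormalDist

def n_for_ci_half_width(expected_rate, half_width, confidence=0.95):
    """Subjects needed so that a Wald interval for a proportion
    (e.g. TPR among cases, FPR among controls) has the given half-width."""
    if not 0.0 < expected_rate < 1.0:
        raise ValueError("expected_rate must be strictly between 0 and 1")
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)
    return math.ceil(z ** 2 * expected_rate * (1.0 - expected_rate) / half_width ** 2)

# Hypothetical design targets, for illustration only.
n_cases = n_for_ci_half_width(expected_rate=0.80, half_width=0.10)     # TPR
n_controls = n_for_ci_half_width(expected_rate=0.10, half_width=0.05)  # FPR
```

Under these assumed targets the calculation asks for 62 cases and 139 controls, so the two groups need not be the same size.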
22Findings: Sample Size Estimation
- For phase 1 microarray experiments, use of ROC curves is more efficient than comparing means
- For phase 2 studies, equal numbers of cases and controls are often not optimally efficient
- Sample size calculations and look-up tables are now on the EDRN website
23- Pepe et al. Phases of biomarker development for early detection of cancer. Journal of the National Cancer Institute 93(14):1054-61, 2001.
- Pepe et al. Elements of Study Design for Biomarker Development. In: Tumor Markers, Diamandis, Fritsche, Lilja, Chan, and Schwartz, eds. AACC Press, Washington, DC. 2002.
- Pepe. Statistical Evaluation of Diagnostic Tests and Biomarkers. Oxford U. Press, 2003.
24Selecting Differentially Expressed Genes from Microarray Experiments. Lead: Margaret Pepe
- Context
- Gene expression arrays for $n_D$ tumor tissues and $n_C$ normal tissues
- $Y_{ig}$ = logarithm of relative intensity at gene $g$ for tissue $i$
- For which genes are the $Y_{ig}$ different in some/most cases from the normals?
- How many tissues, $n_D$ and $n_C$, should be evaluated in these experiments?
- Illustrated with ovarian cancer data
25Statistical Measures for Gene Selection
- Typically, a two-sample t-test is used for each gene
- We argue that sensitivity and specificity are more directly relevant for cancer biomarker research
- Focus attention on high specificity (or high sensitivity)
- Use the partial area under the ROC curve to rank genes, instead of the t-test
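Ranking genes by partial AUC can be sketched as follows. The code integrates the empirical ROC curve over false positive rates from 0 to `max_fpr`, which restricts attention to high specificity. The function names, the 10% cutoff, and the matrix layout (tissues in rows, genes in columns) are assumptions for this illustration, not the authors' implementation.

```python
import numpy as np

def partial_auc(cases, controls, max_fpr=0.1):
    """Empirical partial area under the ROC curve for FPR in [0, max_fpr].

    The empirical ROC is a step function, so the integral is a sum of TPRs
    evaluated at the k largest control values, each step having width 1/n_C.
    """
    cases = np.asarray(cases, dtype=float)
    controls = np.sort(np.asarray(controls, dtype=float))[::-1]  # descending
    k = int(np.floor(max_fpr * controls.size + 1e-9))
    top_controls = controls[:k]
    tprs = (cases[:, None] > top_controls[None, :]).mean(axis=0)
    return float(tprs.sum() / controls.size)

def rank_genes_by_pauc(tumor, normal, max_fpr=0.1):
    """tumor: (n_D, G) array, normal: (n_C, G) array of log intensities.
    Returns gene indices ordered best first, plus the pAUC scores."""
    scores = np.array([
        partial_auc(tumor[:, g], normal[:, g], max_fpr)
        for g in range(tumor.shape[1])
    ])
    return np.argsort(-scores, kind="stable"), scores

# Hypothetical data: gene 1 separates the groups perfectly, gene 0 only partly.
normal = np.column_stack([np.arange(10.0), np.arange(10.0)])
tumor = np.array([[9.5, 20.0], [10.0, 20.0], [3.0, 20.0], [4.0, 20.0]])
order, scores = rank_genes_by_pauc(tumor, normal, max_fpr=0.2)
```

The maximum attainable score equals `max_fpr`, so a gene with perfect separation scores 0.2 in this example.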
26Example
Gene rank (among 100 genes)
Method        gene 5    gene 97
t-test        10        4
partial AUC   3         31
28Sample Sizes for Gene Discovery Studies
- Traditional calculations are based on statistical hypothesis testing
- These are exploratory studies and need new methods
- We propose to base calculations on the probability that a differentially expressed gene will rank high among all genes
- Use computer simulation for sample size calculations
29- With 50 tumor and 50 normal tissues, we can be 83.6% sure that the top 30 genes will rank in the top 100 in the experiment.
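A simulation of this kind can be sketched in a few lines: generate expression data, rank all genes, and record how often a truly differentially expressed gene lands in the top k. Everything specific here is an assumption for illustration: independent normal data, a one standard deviation shift, a t-type ranking statistic chosen for brevity (the slides recommend partial AUC), and the default numbers of genes and simulations. It will not reproduce the 83.6% figure.

```python
import numpy as np

def prob_gene_in_top_k(n_tumor, n_normal, n_genes=200, n_de=10,
                       effect=1.0, top_k=20, n_sim=200, seed=0):
    """Estimate the probability that one truly differentially expressed
    gene (gene 0) ranks within the top_k genes of the experiment."""
    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(n_sim):
        tumor = rng.standard_normal((n_tumor, n_genes))
        normal = rng.standard_normal((n_normal, n_genes))
        tumor[:, :n_de] += effect  # genes 0..n_de-1 are truly different
        diff = tumor.mean(axis=0) - normal.mean(axis=0)
        se = np.sqrt(tumor.var(axis=0, ddof=1) / n_tumor
                     + normal.var(axis=0, ddof=1) / n_normal)
        stat = diff / se
        rank_of_gene0 = int(np.sum(stat > stat[0])) + 1
        hits += rank_of_gene0 <= top_k
    return hits / n_sim

p_small = prob_gene_in_top_k(n_tumor=5, n_normal=5)
p_large = prob_gene_in_top_k(n_tumor=40, n_normal=40)
```

Repeating the call over a grid of group sizes gives the smallest design whose probability meets a chosen target.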
30- Pepe et al. Selecting differentially expressed
genes from microarray experiments. Biometrics (in
press)
31Summary
- The methods we developed for selecting genes and calculating sample sizes are more appropriate for the purposes of diagnosis and early detection
32Analysis: Statistical Methods for Biomarker Discovery from High-Dimensional Data Sets
- Method development motivated by SELDI data from John Semmes/George Wright at Eastern Virginia Medical School
- Data consist of protein intensities at tens of thousands of mass/charge points on each of 297 individuals
- Developed three approaches to biomarker discovery: wavelets, boosting decision tree, and automated peak identification
33The EVMS prostate cancer biomarker project
- Prostate cancer patients: N = 99 early-stage, N = 98 late-stage
- Normal controls: N = 96
- Serum samples for proteomic analysis by Surface Enhanced Laser Desorption/Ionization (SELDI)
- Goal: To discover protein signals that distinguish cancers from normals
34An example of SELDI output
48,000 mass/charge points (?200K Da)
35The design of the biomarker analysis
- Normal: N = 96
- PCa-early: N = 99
- PCa-late: N = 98
36Wavelet Analysis. Lead: Yinsheng Qu
- Steps in the wavelet analysis
- Represent original data plot with a set of wavelets (dimension reduction)
- Determine those wavelets that distinguish between subgroups (information criterion)
- Define discriminating functions based on the distinguishing wavelets (Fisher discrimination)
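The first step, representing a spectrum by wavelet coefficients, can be illustrated with a minimal orthonormal Haar transform in Python. This sketch covers only the transform; the coefficient selection and Fisher discrimination steps are not shown. The function name and the requirement that the input length be a power of two are assumptions for this illustration (a real spectrum would need padding or truncation first).

```python
import numpy as np

def haar_dwt(signal):
    """Full orthonormal Haar wavelet decomposition of a 1-D signal.

    Returns the coarsest approximation coefficient followed by detail
    coefficients from coarse to fine. Input length must be a power of two.
    """
    x = np.asarray(signal, dtype=float)
    n = x.size
    if n == 0 or n & (n - 1):
        raise ValueError("signal length must be a power of two")
    details = []
    while x.size > 1:
        pairs = x.reshape(-1, 2)
        approx = (pairs[:, 0] + pairs[:, 1]) / np.sqrt(2.0)
        detail = (pairs[:, 0] - pairs[:, 1]) / np.sqrt(2.0)
        details.append(detail)
        x = approx
    return np.concatenate([x] + details[::-1])

# A piecewise-constant toy "spectrum": most detail coefficients vanish.
coeffs = haar_dwt([4.0, 4.0, 2.0, 2.0])
```

For this toy input only two of the four coefficients are non-zero, which shows how the transform concentrates the signal into a few terms.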
39Three Group Classification: Normal, Cancer, BPH
- 12,352 mass spectrum data points, reduced to
- 3,420 Haar wavelet coefficients, of which
- 17 coefficients distinguish between the three groups.
- 2 classification functions generated.
- Classification results (columns: truth; rows: predicted)
Predicted   Normal   Cancer   BPH
Normal      14       0        0
Cancer      1        27       7
BPH         0        3        8
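The classification table can be summarized with a short Python sketch. It assumes that rows are predicted classes and columns are true classes, which is how the slide layout reads; the variable and function names are illustrative.

```python
import numpy as np

labels = ["Normal", "Cancer", "BPH"]
# Rows: predicted class, columns: true class (assumed orientation).
confusion = np.array([
    [14, 0, 0],
    [1, 27, 7],
    [0, 3, 8],
])

def summarize(confusion):
    """Return (correct, total, per-class fraction correctly classified)."""
    correct = int(np.trace(confusion))
    total = int(confusion.sum())
    per_class = np.diag(confusion) / confusion.sum(axis=0)
    return correct, total, per_class

correct, total, per_class = summarize(confusion)
```

Under that reading, 49 of 60 samples are classified correctly, and most errors involve BPH samples predicted as cancer.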
41- Qu Y et al. Data reduction using discrete wavelet
transform in discriminant analysis with very high
dimension. Biometrics, in press.
42Boosted Decision Tree Method. Lead: Yinsheng Qu/Yutaka Yasui
- This method combines multiple weak learners into a very accurate classifier
- It can be used in cancer detection
- It can also be used in identification of tumor markers
- Using this method we can separate controls, BPH, and PCa without error in the test set
43Outline of boosting decision tree
- The combined classifier is a committee with the decision stumps, the base classifiers, as its members. It makes decisions by weighted majority vote.
- The base classifiers are constructed on weighted examples: misclassified examples receive increased weights in the next round.
- The 2nd stump's specialty is to correct the 1st stump's mistakes, the 3rd stump's specialty is to correct the 2nd stump's mistakes, and so on.
- The combined classifier, with dozens or even hundreds of decision stumps, can be highly accurate.
- The boosting technique is resistant to overfitting.
44Classifier 2: A boosted decision stump classifier with 21 peaks (potential markers)
45The Boosting procedure
- $y_i \in \{1, -1\}$ for {cancer, normal}; $f_m(x_i) \in \{1, -1\}$
- Initial weights ($m = 1$): $w_i = 1$ ($i = 1, \ldots, N$).
- Choose first peak and threshold $c$.
- For $m = 1$ to $M$: $w_i \leftarrow w_i \exp\{a_m I(\text{incorrect})\}$
  - where $a_m = \ln((1 - \text{err})/\text{err})$ and err is the classification error rate at the current stage
  - normalize the weights so they sum to $N$.
  - choose a peak and $c$ (i-th subject with weight $w_i$)
- Final classifier: $f(x) = \sum_{m=1}^{M} a_m f_m(x)$. $f(x_i) > 0 \Rightarrow$ i-th subject classified as cancer.
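The procedure can be sketched in Python as a small boosted decision stump classifier. This is an illustration under stated assumptions, not the authors' code: the function names are hypothetical, err is taken to be the weighted error rate (as in standard discrete AdaBoost), each stump is found by exhaustive search over features and midpoint thresholds, and err is clipped away from 0 and 1 to keep the logarithm finite.

```python
import numpy as np

def fit_stump(X, y, w):
    """Best one-feature threshold rule under weights w.
    Returns (feature index, threshold c, sign, weighted error)."""
    best = None
    for j in range(X.shape[1]):
        values = np.unique(X[:, j])
        thresholds = np.concatenate(
            [[values[0] - 1.0], (values[:-1] + values[1:]) / 2.0])
        for c in thresholds:
            for sign in (1, -1):
                pred = np.where(X[:, j] > c, sign, -sign)
                err = w[pred != y].sum() / w.sum()
                if best is None or err < best[3]:
                    best = (j, c, sign, err)
    return best

def stump_predict(X, stump):
    j, c, sign, _ = stump
    return np.where(X[:, j] > c, sign, -sign)

def boost(X, y, n_rounds=10):
    """y must be coded +1 (cancer) / -1 (normal)."""
    n = len(y)
    w = np.ones(n)                        # initial weights w_i = 1
    stumps, alphas = [], []
    for _ in range(n_rounds):
        stump = fit_stump(X, y, w)        # choose a peak and threshold c
        err = min(max(stump[3], 1e-10), 1.0 - 1e-10)  # avoid log(0)
        a = np.log((1.0 - err) / err)     # a_m = ln((1 - err) / err)
        incorrect = stump_predict(X, stump) != y
        w = w * np.exp(a * incorrect)     # up-weight misclassified subjects
        w = w * n / w.sum()               # normalize so weights sum to N
        stumps.append(stump)
        alphas.append(a)
    return stumps, np.array(alphas)

def decision_function(X, stumps, alphas):
    """f(x) = sum over m of a_m * f_m(x); f(x) > 0 means cancer."""
    return sum(a * stump_predict(X, s) for s, a in zip(stumps, alphas))

# Toy 1-D data that no single stump can classify correctly.
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]])
y = np.array([1, 1, -1, -1, 1, 1])
stumps, alphas = boost(X, y, n_rounds=3)
scores = decision_function(X, stumps, alphas)
```

On this toy data each stump alone misclassifies at least two subjects, but the weighted vote of three stumps classifies all six correctly.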
46When to stop iteration?
- Minimal margin: the minimum of $y_i f(x_i)$ over all $N$ subjects
- The minimal margin in the training sample measures how well the two classes are separated by the classifier.
- Even after the classifier reaches zero error on the training sample, further iterations that still increase the minimal margin can improve prediction in future samples.
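The minimal margin is simple to compute from the labels and the classifier scores. In the sketch below, the optional division by the sum of the stump weights is an assumption borrowed from common practice, so that margins from different numbers of rounds are comparable; the slide defines only the unnormalized quantity.

```python
import numpy as np

def minimal_margin(y, scores, alphas=None):
    """Minimum of y_i * f(x_i) over all subjects.

    y: labels coded +1 / -1; scores: f(x_i) from the combined classifier.
    If alphas is given, margins are divided by sum(|a_m|) (assumed
    normalization, not stated on the slide).
    """
    margins = np.asarray(y, dtype=float) * np.asarray(scores, dtype=float)
    if alphas is not None:
        margins = margins / np.sum(np.abs(alphas))
    return float(margins.min())

# Hypothetical labels and scores, for illustration only.
raw = minimal_margin([1, -1, 1], [2.0, -0.5, 1.0])
normalized = minimal_margin([1, -1, 1], [2.0, -0.5, 1.0], alphas=[1.0, 1.0])
```

A positive minimal margin means every training subject is classified correctly, so tracking it across rounds gives a stopping signal beyond the training error.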
48- Qu et al. 2002. Boosted Decision Tree Analysis of SELDI Mass Spectral Serum Profiles Discriminates Prostate Cancer from Non-Cancer Patients. Clinical Chemistry. In press.
- Adam et al. 2002. Serum Protein Fingerprinting Coupled with a Pattern Matching Algorithm that Distinguishes Prostate Cancer from Benign Prostate Hyperplasia and Healthy Men. Cancer Research 62:3609-3614.
49Summary
- Wavelets approach: Does not require peak identification (black-box classification)
- Boosting decision tree: Requires peak identification first. Useful for both classification and protein mass identification
50Final Summary
- The methods developed in the past two years are mainly for Phase 1 and 2 studies, reflecting the current needs of EDRN.
- EDRN DMCC statisticians are working on key design and analysis issues in early detection research.
- More work remains to be done (e.g., in classification, consider the mislabeling of prostate cancer by BPH exam; gene-by-environment interactions).