Statistical, Computational, and Informatics Tools for Biomarker Analysis

Provided by: marktho
Learn more at: http://www.bios.unc.edu

1
Statistical, Computational, and Informatics Tools
for Biomarker Analysis
  • Methodology Development at the Data Management and
    Coordinating Center of the Early Detection Research Network

2
Early Detection Research Network
EDRN ORGANIZATIONAL STRUCTURE
An infrastructure for supporting
collaborative research on molecular, genetic and
other biomarkers in human cancer detection and
risk assessment.
3
Early Detection Research Network
INFRASTRUCTURE
BIOREPOSITORY
  • Specimens with matching controls and
    epidemiological data
  • Infrastructure to provide preneoplastic
    tissues
    - Prostate
    - Lung
    - Ovarian
    - Colon
    - Breast

4
Early Detection Research Network
INFRASTRUCTURE
LABORATORY CAPACITY
  • Capability in high-throughput molecular and
    biochemical assays
  • Ability to respond to evolving technologies for
    EDRN needs
  • Extensive experience and scale-up ability in
    proteomics and molecular assays
  • Outstanding infrastructure for handling
    multiple assays and validation requests

5
Early Detection Research Network
INFRASTRUCTURE
DATA STORAGE AND MINING
  • Outstanding track record in biomarker research
  • Statistical and data mining technology
  • Statistical and predictive models for multiple
    biomarkers
  • Novel statistical methods to interpret
    high-throughput data

6
Early Detection Research Network
INFRASTRUCTURE
DATA EXCHANGE AND SHARING
  • Improving informatics and information flow
  • Network web sites
    - public web site
    - secure web site
  • Early Detection Research Network Exchange (ERNE)
  • Standardization of data reporting: common data
    elements (CDEs) developed

7

Early Detection Research Network
(EDRN) INFORMATICS AND INFORMATION FLOW
8
EARLY DETECTION RESEARCH NETWORK
COLLABORATION
How to Become an Associate Member
  • Contact one of the EDRN Principal Investigators
    to serve as a sponsor for an application. Three
    types of collaborative opportunities are
    available:
  • Type A: Novel research ideas complementing ongoing
    EDRN efforts; one year of funding at $100,000
  • Type B: Share tools, technology, and resources; no
    time limit
  • Type C: Participate in the EDRN
    meetings and workshops
  • For details on how to apply, see
    http://www.cancer.gov/edrn

9
DMCC Statisticians
  • Margaret Pepe, Lead of Methodology Group
  • Ziding Feng, Principal Investigator
  • Yinsheng Qu
  • Mary Lou Thompson
  • Mark Thornquist
  • Yutaka Yasui

10
Biomarker Lab Collaborators at Eastern Virginia
Medical School
  • Bao-Ling Adam
  • John Semmes
  • George Wright

11
Focus of Presentation
  • Design: Phase Structure for Biomarker Research
  • Analysis: Statistical Methods for Biomarker
    Discovery from High-Dimensional Data Sets

12
Design: Phase Structure for Biomarker Research
  • The three-phase structure for therapeutic trials
    is well established
  • This structure promotes coherent, thorough, and
    efficient development
  • A similar structure needs to be developed for
    biomarker research

13
Biomarker Development
  • Categorize process into 5 phases
  • Define objectives for each phase
  • Define ideal study designs, evaluation and
    criteria for proceeding further
  • Standardize the process to promote efficiency and
    rigor

14
(No Transcript)
15
The Details of Study Design
  • Specific Aims
  • Subject/Specimen Selection
  • Outcome measures
  • Evaluation of Results
  • Sample Size Calculations
  • Limitations / Pitfalls

16
Specific Aims
  • Phase 1
    - Identify leads for potentially useful biomarkers
    - Prioritize these leads
  • Phase 2
    - Determine the sensitivity and specificity or ROC
      curve for the clinical biomarker assay in
      discriminating clinical cancer from controls

17
Specimen Selection -- Cases
  • Phase 1
    - Cancers that are ultimately serious if not
      treated early, but treatable in early stage
    - Spectrum of sub-types
    - Collected at diagnosis
  • Phase 2: same criteria as for Phase 1
    - Wide spectrum of cases
    - Clinical specimen at diagnosis
    - From target screening population

18
Specimen Selection -- Controls
  • Phase 1
    - Non-cancer tissue, same organ, same patient
    - Normal tissue, non-cancer patient
    - Benign growth tissue, non-cancer patient
  • Phase 2
    - From potential target population for screening

19
Outcome Measures
  • Phase 1
    - True positive and false positive rates (binary
      result)
    - True positive rate at the threshold yielding an
      acceptable false positive rate
    - ROC curve
  • Phase 2
    - Results of clinical biomarker assay
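
The second Phase 1 outcome measure above can be illustrated with a small sketch. It assumes that higher marker values indicate cancer and that the threshold is read off the empirical distribution of the controls. The function name `tpr_at_fpr` and the toy values are illustrative and do not come from the presentation.

```python
def tpr_at_fpr(case_values, control_values, max_fpr):
    """Return (threshold, tpr, fpr) for the rule 'positive if value > threshold'.

    The threshold is a control value chosen so that the empirical false
    positive rate among controls does not exceed max_fpr.
    """
    controls = sorted(control_values)
    n_c = len(controls)
    # Number of controls allowed to fall above the threshold.
    allowed = min(int(max_fpr * n_c), n_c - 1)
    # Use the (allowed + 1)-th largest control value as the threshold.
    threshold = controls[n_c - allowed - 1]
    fpr = sum(v > threshold for v in controls) / n_c
    tpr = sum(v > threshold for v in case_values) / len(case_values)
    return threshold, tpr, fpr


# Toy data: five controls and five cases.
threshold, tpr, fpr = tpr_at_fpr([2, 4, 6, 8, 10], [1, 2, 3, 4, 5], max_fpr=0.2)
print(threshold, tpr, fpr)
```

Repeating the calculation over a grid of `max_fpr` values traces out the empirical ROC curve.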

20
Evaluation of Results
  • Phase 1
    - Algorithms select and prioritize markers that
      best distinguish tumor from non-tumor tissue
    - Initial exploratory studies need confirmation
      with new validation specimens
  • Phase 2
    - ROC curves
    - ROC regression to determine whether characteristics
      of cases and/or characteristics of controls affect
      the biomarker's discriminatory capacity

21
Sample Size
  • Phase 1
    - Should be large enough that very promising
      biomarkers are likely to be selected for Phase 2
      development
  • Phase 2
    - Based on confidence intervals for the TPR or
      FPR, or confidence intervals for the ROC curve at
      selected critical points
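
One common way to size a study from a confidence interval is the normal approximation for a binomial proportion. The sketch below assumes that approach for a TPR (or FPR) estimate; the presentation does not say which method the EDRN calculations use, and the function name and inputs are hypothetical.

```python
import math
from statistics import NormalDist


def n_for_proportion_ci(p, half_width, confidence=0.95):
    """Subjects needed so that the normal-approximation confidence
    interval for a proportion p has at most the given half-width."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return math.ceil(z * z * p * (1 - p) / half_width ** 2)


# Cases needed to estimate an anticipated TPR of 0.80 to within +/- 0.10.
print(n_for_proportion_ci(0.80, 0.10))
```

The same function applied to the anticipated FPR gives the number of controls, which is one reason equal group sizes need not be the most efficient choice.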

22
Findings: Sample Size Estimation
  • For Phase 1 microarray experiments, use of ROC
    curves is more efficient than comparing means
  • For Phase 2 studies, equal numbers of cases and
    controls are often not optimally efficient
  • Sample size calculations and look-up tables are
    now on the EDRN website

23
  • Pepe et al. Phases of biomarker development for
    early detection of cancer. Journal of the
    National Cancer Institute 93(14):1054-61, 2001.
  • Pepe et al. Elements of Study Design for
    Biomarker Development. In: Tumor Markers,
    Diamandis, Fritsche, Lilja, Chan, and Schwartz,
    eds. AACC Press, Washington, DC. 2002.
  • Pepe. Statistical Evaluation of Diagnostic
    Tests and Biomarkers. Oxford U. Press, 2003.

24
Selecting Differentially Expressed Genes from
Microarray Experiments. Lead: Margaret Pepe
  • Context
    - gene expression arrays for $n_D$ tumor tissues and
      $n_C$ normal tissues
    - $Y_{ig}$ = logarithm of relative intensity at gene g
      for tissue i
  • For which genes are $Y_{ig}$ different in some/most
    cases from the normals?
  • How many tissues, $n_D$ and $n_C$, should be evaluated
    in these experiments?
  • Illustrated with ovarian cancer data

25
Statistical Measures for Gene Selection
  • Typically use a two-sample t-test for each gene
  • We argue that sensitivity and specificity
    are more directly relevant for cancer biomarker
    research
  • Focus attention on high specificity
    (or high sensitivity)
  • Use the partial area under the ROC curve to rank
    genes, instead of the t-test
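
A minimal sketch of ranking genes by partial area under the ROC curve follows. It assumes an empirical estimator that averages the true positive rate over thresholds set at the largest control values, so that only the high-specificity region (false positive rate up to `max_fpr`) contributes. The function names and toy data are illustrative; this is not the estimator from the cited paper.

```python
def partial_auc(cases, controls, max_fpr):
    """Empirical partial AUC over false positive rates in [0, max_fpr]."""
    n_c = len(controls)
    k = int(max_fpr * n_c)  # number of controls in the high-specificity region
    top_controls = sorted(controls, reverse=True)[:k]
    # For each of the k largest controls, the fraction of cases exceeding it.
    total = sum(sum(y > c for y in cases) / len(cases) for c in top_controls)
    return total / n_c


def rank_genes(expr_cases, expr_controls, max_fpr):
    """Return gene names ordered from highest to lowest partial AUC."""
    scores = {g: partial_auc(expr_cases[g], expr_controls[g], max_fpr)
              for g in expr_cases}
    return sorted(scores, key=scores.get, reverse=True)


# Toy data: gene "a" separates all cases, gene "b" only half of them.
expr_cases = {"a": [5, 6, 7, 8], "b": [1, 2, 9, 10]}
expr_controls = {"a": [1, 2, 3, 4], "b": [3, 4, 5, 6]}
print(rank_genes(expr_cases, expr_controls, max_fpr=0.5))
```

With this estimator the largest attainable value equals `max_fpr`, reached when every case exceeds every control in the region.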
26
Example
Gene rank (among 100 genes)
                 gene 5   gene 97
  t-test           10        4
  partial AUC       3       31

27
(No Transcript)
28
Sample Sizes for Gene Discovery Studies
  • Traditional calculations are based on statistical
    hypothesis testing
  • These are exploratory studies, so new methods are
    needed
  • We propose to base calculations on the probability
    that a differentially expressed gene will rank
    high among all genes
  • Use computer simulation for sample size
    calculations
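
The simulation idea can be sketched as follows. For simplicity this sketch assumes independent, normally distributed log intensities with unit variance, a single truly differentially expressed gene, and ranking by difference in means rather than by partial AUC. All names and parameter values are hypothetical and are not taken from the presentation.

```python
import random


def prob_top_rank(n_cases, n_controls, n_genes, effect, top_k,
                  n_sims=200, seed=0):
    """Estimate the probability that the one truly differentially
    expressed gene (gene 0) ranks within the top_k genes."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sims):
        scores = []
        for g in range(n_genes):
            shift = effect if g == 0 else 0.0
            case_mean = sum(rng.gauss(shift, 1.0)
                            for _ in range(n_cases)) / n_cases
            control_mean = sum(rng.gauss(0.0, 1.0)
                               for _ in range(n_controls)) / n_controls
            scores.append(case_mean - control_mean)
        # Rank of gene 0 = 1 + number of null genes scoring higher.
        rank = 1 + sum(s > scores[0] for s in scores[1:])
        hits += rank <= top_k
    return hits / n_sims


# Compare two candidate sample sizes for the same effect size.
print(prob_top_rank(5, 5, 50, effect=1.0, top_k=5))
print(prob_top_rank(20, 20, 50, effect=1.0, top_k=5))
```

The sample size is then chosen as the smallest one for which the estimated probability reaches the desired level.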

29
  • With 50 tumor and 50 normal tissues, we can be
    83.6% sure that the top 30 genes will rank in the
    top 100 in the experiment.

30
  • Pepe et al. Selecting differentially expressed
    genes from microarray experiments. Biometrics (in
    press)

31
Summary
  • The methods we developed for selecting genes and
    calculating sample sizes are more appropriate for
    the purpose of diagnosis and early detection

32
Analysis: Statistical Methods for Biomarker
Discovery from High-Dimensional Data Sets
  • Method development motivated by SELDI data from
    John Semmes/George Wright at Eastern Virginia
    Medical School
  • Data consist of protein intensities at tens of
    thousands of mass/charge points on each of 297
    individuals
  • Developed three approaches to biomarker
    discovery: wavelets, boosted decision trees, and
    automated peak identification

33
The EVMS prostate cancer biomarker project
  • Prostate cancer patients: N = 99 early-stage,
    N = 98 late-stage
  • Normal controls: N = 96
  • Serum samples for proteomic analysis by Surface
    Enhanced Laser Desorption/Ionization (SELDI)
  • Goal: to discover protein signals that
    distinguish cancers from normals

34
An example of SELDI output
48,000 mass/charge points (?200K Da)
35
The design of the biomarker analysis
  • Normal: N = 96
  • PCa-early: N = 99
  • PCa-late: N = 98
36
Wavelet Analysis. Lead: Yinsheng Qu
  • Steps in the wavelet analysis
  • Represent original data plot with a set of
    wavelets (dimension reduction)
  • Determine those wavelets that distinguish between
    subgroups (information criterion)
  • Define discriminating functions based on the
    distinguishing wavelets (Fisher discrimination)
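
The first step, representing a spectrum by wavelet coefficients, can be illustrated with a plain Haar transform. This is a minimal sketch that assumes an orthonormal Haar basis and a signal whose length is a power of two. The function name is illustrative, and the coefficient selection and Fisher discrimination steps are not shown.

```python
import math


def haar_dwt(signal):
    """Full orthonormal Haar decomposition of a power-of-two-length signal.

    Returns [overall approximation, coarsest detail, ..., finest details].
    """
    n = len(signal)
    if n == 0 or n & (n - 1):
        raise ValueError("signal length must be a power of two")
    details = []
    approx = list(signal)
    while len(approx) > 1:
        pairs = list(zip(approx[0::2], approx[1::2]))
        level_detail = [(a - b) / math.sqrt(2) for a, b in pairs]
        approx = [(a + b) / math.sqrt(2) for a, b in pairs]
        # Coarser levels go in front of finer ones.
        details = level_detail + details
    return approx + details


coeffs = haar_dwt([4.0, 2.0, 6.0, 8.0])
print(coeffs)
```

Because the transform is orthonormal, the sum of squared coefficients equals the sum of squared intensities, so dropping small coefficients gives a controlled dimension reduction.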

37
(No Transcript)
38
(No Transcript)
39
Three-Group Classification: Normal, Cancer, BPH
  • 12,352 mass spectrum data points, reduced to
  • 3,420 Haar wavelet coefficients, of which
  • 17 coefficients distinguish between the three
    groups.
  • 2 classification functions generated.

                     Truth
  Predicted    Normal   Cancer   BPH
  Normal         14        0       0
  Cancer          1       27       7
  BPH             0        3       8
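
The table can be summarized by overall accuracy and by the fraction of each true group that was classified correctly. The sketch below simply re-computes these from the counts shown above, with rows as predicted class and columns as truth. The function names are illustrative.

```python
labels = ["Normal", "Cancer", "BPH"]
# Rows = predicted class, columns = true class, in the order of `labels`.
confusion = [
    [14, 0, 0],
    [1, 27, 7],
    [0, 3, 8],
]


def overall_accuracy(matrix):
    """Fraction of all samples on the diagonal of the confusion matrix."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


def per_class_sensitivity(matrix, names):
    """Fraction of each true class (column) that was predicted correctly."""
    result = {}
    for j, name in enumerate(names):
        column_total = sum(row[j] for row in matrix)
        result[name] = matrix[j][j] / column_total
    return result


print(overall_accuracy(confusion))
print(per_class_sensitivity(confusion, labels))
```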

40
(No Transcript)
41
  • Qu Y et al. Data reduction using discrete wavelet
    transform in discriminant analysis with very high
    dimension. Biometrics, in press.

42
Boosted Decision Tree Method. Lead: Yinsheng
Qu/Yutaka Yasui
  • This method combines multiple weak learners into
    a very accurate classifier
  • It can be used in cancer detection
  • It can also be used in identification of tumor
    markers
  • Using this method we can separate controls, BPH,
    and PCa without error in the test set

43
Outline of boosting decision tree
  • The combined classifier is a committee with the
    decision stumps, the base classifiers, as its
    members. It makes decisions by majority vote.
  • The base classifiers are constructed on weighted
    examples; misclassified examples have their
    weights increased in the next round.
  • The 2nd stump's specialty is to correct the 1st
    stump's mistakes, the 3rd stump's specialty
    is to correct the 2nd stump's mistakes, and so
    on.
  • The combined classifier with dozens or even
    hundreds of decision stumps will be accurate.
  • The boosting technique is resistant to overfitting.

44
Classifier 2: A boosted decision stump classifier
with 21 peaks (potential markers)
45
The Boosting procedure
  • $y_i \in \{\text{cancer}, \text{normal}\} = \{1, -1\}$; $f_m(x_i) \in \{1, -1\}$
  • Initial weights ($m = 1$): $w_i = 1$ ($i = 1, \ldots, N$).
  • Choose first peak and threshold $c$.
  • For $m = 1$ to $M$: $w_i \leftarrow w_i \exp\{a_m I(\text{incorrect})\}$
  • where $a_m = \ln\{(1 - \mathrm{err})/\mathrm{err}\}$ and err is the
    classification error rate at the current stage
  • normalize the weights so they sum to $N$.
  • choose a peak and $c$ (i-th subject with weight $w_i$)
  • Final classifier: $f(x) = \sum_{m=1}^{M} a_m f_m(x)$.
    $f(x_i) > 0 \Rightarrow$ i-th subject classified as cancer
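
The procedure above can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation. It assumes each row of `X` holds peak intensities, labels are 1 (cancer) or -1 (normal), candidate thresholds are midpoints between observed values, and every peak has at least two distinct values. The clipping of the error rate is an added guard against taking the logarithm of zero, and all function names and toy data are hypothetical.

```python
import math


def stump_predict(stump, row):
    """Predict +1 or -1 from one peak (feature j) and threshold c."""
    _, j, c, sign = stump
    return sign if row[j] > c else -sign


def fit_stump(X, y, w):
    """Pick the peak, threshold, and direction with the lowest weighted error."""
    best = None
    for j in range(len(X[0])):
        values = sorted(set(row[j] for row in X))
        for c in [(a + b) / 2 for a, b in zip(values, values[1:])]:
            for sign in (1, -1):
                err = sum(wi for row, yi, wi in zip(X, y, w)
                          if (sign if row[j] > c else -sign) != yi)
                if best is None or err < best[0]:
                    best = (err, j, c, sign)
    return best


def adaboost(X, y, n_rounds):
    """Return a list of (a_m, stump) pairs built on reweighted examples."""
    n = len(X)
    w = [1.0] * n  # initial weights w_i = 1
    ensemble = []
    for _ in range(n_rounds):
        stump = fit_stump(X, y, w)
        # Weighted error rate, clipped so the logarithm stays finite.
        err = min(max(stump[0] / sum(w), 1e-10), 1 - 1e-10)
        a_m = math.log((1 - err) / err)
        # Increase the weights of misclassified subjects only.
        w = [wi * math.exp(a_m) if stump_predict(stump, row) != yi else wi
             for row, yi, wi in zip(X, y, w)]
        total = sum(w)
        w = [wi * n / total for wi in w]  # normalize so weights sum to N
        ensemble.append((a_m, stump))
    return ensemble


def score(ensemble, row):
    """f(x) = sum over m of a_m * f_m(x)."""
    return sum(a_m * stump_predict(stump, row) for a_m, stump in ensemble)


def classify(ensemble, row):
    return 1 if score(ensemble, row) > 0 else -1


# Toy data: two peaks per subject; the first peak separates the classes.
X = [[1.0, 5.0], [2.0, 1.0], [3.0, 4.0], [4.0, 2.0]]
y = [-1, -1, 1, 1]
model = adaboost(X, y, n_rounds=3)
print([classify(model, row) for row in X])
```

The peaks chosen by the stumps are the candidate markers, which is why the same fit serves both classification and marker identification.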

46
When to stop iteration?
  • Minimal margin = minimum of $y_i f(x_i)$ over all N
    subjects
  • The minimal margin in the training sample
    measures how well the two classes are separated
    by the classifier.
  • Even after the classifier reaches zero error on the
    training sample, if further iterations still increase
    the minimal margin, they can improve prediction in
    future samples.
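
A stopping rule based on the minimal margin could look like the sketch below. It assumes the margin is recorded after every boosting round and that iteration stops once the margin has failed to improve for a fixed number of rounds. The `patience` and `tol` parameters and both function names are hypothetical, since the presentation does not specify the exact rule.

```python
def minimal_margin(labels, scores):
    """Minimum of y_i * f(x_i) over all subjects."""
    return min(y * f for y, f in zip(labels, scores))


def stopping_round(margins_by_round, patience=5, tol=1e-9):
    """Return the 1-based round after which the minimal margin
    last improved, scanning until `patience` rounds pass without gain."""
    best = float("-inf")
    best_round = 0
    for m, margin in enumerate(margins_by_round, start=1):
        if margin > best + tol:
            best, best_round = margin, m
        elif m - best_round >= patience:
            break
    return best_round


print(minimal_margin([1, -1, 1], [0.5, -2.0, 0.1]))
print(stopping_round([0.1, 0.2, 0.3, 0.3, 0.3], patience=2))
```

A common variant divides each score by the sum of the stump weights so that margins from different rounds are on the same scale.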

47
(No Transcript)
48
  • Qu et al. 2002. Boosted Decision Tree Analysis of
    SELDI Mass Spectral Serum Profiles Discriminates
    Prostate Cancer from Non-Cancer Patients.
    Clinical Chemistry. In press.
  • Adam et al. 2002. Serum Protein Fingerprinting
    Coupled with a Pattern Matching Algorithm that
    Distinguishes Prostate Cancer from Benign
    Prostate Hyperplasia and Healthy Men. Cancer
    Research 62:3609-3614.

49
Summary
  • Wavelet approach: Does not require peak
    identification (black-box classification)
  • Boosted decision tree: Requires peak
    identification first. Useful for both
    classification and protein mass identification

50
Final Summary
  • The methods developed in the past two years are
    mainly for Phase 1 and 2 studies, reflecting the
    current needs of EDRN.
  • EDRN DMCC statisticians are working on key design
    and analysis issues in early detection research.
  • More work remains to be done (e.g., in
    classification, consider the mislabeling of
    prostate cancer by BPH exam; gene-by-environment
    interactions).