Active Sampling for Knowledge Discovery from Biomedical Data - PowerPoint PPT Presentation

Transcript and Presenter's Notes

Title: Active Sampling for Knowledge Discovery from Biomedical Data


1
  • Active Sampling for Knowledge Discovery from
    Biomedical Data
  • Sriharsha Veeramachaneni, ITC-IRST
  • Francesca Demichelis, ITC-IRST, Harvard Medical
    School
  • Emanuele Olivetti, ITC-IRST
  • Paolo Avesani, ITC-IRST

ECML/PKDD 2005, Porto, Portugal
2
Outline
  • Biomarker Evaluation for Cancer Characterization
  • Costs Involved
  • Formal Definition of Cost-Constrained Biomarker
    Evaluation
  • Outline of Solution
  • Experimental Results
  • Conclusions

3
Cancer Characterization
  • Development of cancer diagnostic/prognostic
    models from data:
      • clinical features (e.g. status of the patient,
        relapse)
      • histological features (given by the pathologist)
          • grade of disease
          • tumor dimension
          • number of lymph nodes
          • ...
      • molecular biomarker features
          • protein expression by immunohistochemistry
          • gene amplification by fluorescence in situ
            hybridization

4
Molecular biomarker features
  • Biomarkers are molecular reagents used to test
    biological samples.
  • They add information to the clinical and
    pathology data (improving the predictive model
    for diagnosis/prognosis).
  • They should impact clinical care.

5
Tissue Micro Array (TMA)
  • A TMA block is a paraffin block with 100-1000
    tissue cores arranged in an array
  • each core is 0.6 mm Ø, taken from regions of
    interest of the donor block
  • each TMA block can be sliced into 100-500 thin
    slides
  • each slide is tested independently
    (immunohistochemistry, fluorescence in situ
    hybridization)

6
[Figure: donor blocks, TMA block, TMA slides]
7
Biomarker Evaluation for Clinical Prediction: Process
  • Collect biological samples from monitored
    patients along with features (clinical,
    histological).
  • Set up an initial dataset. Each record contains
      • C: clinical information (e.g. dead/alive,
        relapse)
      • X: histological features
  • Prepare TMA blocks from the biological samples.
  • Add biomarkers and measure outcomes Y (intensity,
    extension).
  • Analyze the ability of the biomarker Y to predict
    C given X.
  • NOTE: This is a one-shot process (i.e. one
    experiment).
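
The last analysis step can be illustrated with a small sketch. The slides do not say which classifier or error estimator is used, so the code below assumes discrete features and a simple plug-in rule that predicts the majority class in each feature cell. The function name `plugin_error` and the toy records are hypothetical, and the error is a resubstitution estimate, which is optimistic.

```python
from collections import Counter

def plugin_error(records, use_y=True):
    """Resubstitution error of the majority-class rule in each feature cell.

    records: iterable of (c, x, y) tuples with discrete, hashable values.
    use_y:   if False, the rule only sees X (the baseline without Y).
    """
    records = list(records)
    counts = Counter()
    for c, x, y in records:
        cell = (x, y) if use_y else (x,)
        counts[(cell, c)] += 1
    # For each cell, the majority class gets the largest count right.
    best = {}
    for (cell, c), n in counts.items():
        best[cell] = max(best.get(cell, 0), n)
    return 1.0 - sum(best.values()) / len(records)

# Toy data (made up): each record is (C, X, Y), all binary.
toy = [
    (0, 0, 0), (0, 0, 0), (1, 0, 1), (1, 0, 1),
    (0, 1, 0), (1, 1, 1), (1, 1, 1), (0, 1, 1),
]
error_x_only = plugin_error(toy, use_y=False)
error_x_and_y = plugin_error(toy, use_y=True)
```

Comparing `error_x_only` with `error_x_and_y` gives a rough sense of how much Y adds beyond X on this toy data.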

8
Cost Issues
  • Limitations in clinical analysis of tissues:
      • limited availability of tissues
      • cumbersome and time-consuming procedure
      • cost of biochemical reagents

9
Active Sampling for Feature Evaluation
  • Start from the data set.
  • Choose the most informative (C, X) at which to
    sample Y.
  • Sample Y at (C, X).
  • Evaluate the predictive ability of Y.
  • Budget exhausted?
      • No: go back and choose the next (C, X).
      • Yes: output the estimate of feature utility.
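
The flow on this slide can be sketched as a generic loop. This is only an illustration under stated assumptions: the function `active_sampling` and its `score`, `probe` and `evaluate` arguments are hypothetical names, and the toy stand-ins at the bottom are placeholders, not the selection criterion or the measurements used by the authors.

```python
from collections import Counter

def active_sampling(pool, budget, score, probe, evaluate):
    """Run the choose / sample / evaluate loop until the budget is spent.

    pool     : (c, x) records whose Y has not been measured yet
    budget   : number of Y measurements we can afford
    score    : score(candidate, sampled) -> float, higher = more informative
    probe    : probe(candidate) -> y, the costly measurement
    evaluate : evaluate(sampled) -> current estimate of the utility of Y
    """
    pool = list(pool)
    sampled = []      # (c, x, y) records paid for so far
    estimate = None
    while budget > 0 and pool:
        candidate = max(pool, key=lambda cx: score(cx, sampled))
        pool.remove(candidate)
        c, x = candidate
        sampled.append((c, x, probe(candidate)))
        estimate = evaluate(sampled)
        budget -= 1
    return estimate, sampled

# Toy stand-ins, all hypothetical: prefer the least-sampled (c, x) cell,
# simulate the measurement, and score Y by how often it agrees with C.
def least_sampled_first(candidate, sampled):
    return -sum(1 for c, x, _ in sampled if (c, x) == candidate)

def simulated_probe(candidate):
    c, x = candidate
    return c ^ x  # placeholder for a real TMA measurement

def agreement(sampled):
    return sum(1 for c, _, y in sampled if y == c) / len(sampled)

pool = [(0, 0), (0, 1), (1, 0), (1, 1)] * 3
estimate, sampled = active_sampling(
    pool, budget=8, score=least_sampled_first,
    probe=simulated_probe, evaluate=agreement)
cell_counts = Counter((c, x) for c, x, _ in sampled)
```

Keeping the selection rule, the measurement and the evaluation as separate arguments means the same loop can be reused with a more principled `score` function.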
10
Active Sampling for Biomarker Evaluation
  1. Collect known features from monitored patients
     (clinical, histological).
  2. Set up an initial dataset (C: clinical, X:
     histological).
  3. Create an initial model to predict C given the
     already known X.
  4. Intelligently select the TMA block/slide that
     would most improve the current model, and pay
     for that information.
  5. Add biomarkers and measure outcomes (Y).
  6. Build a new model to predict class labels C given
     the X and Y features.
  7. Iterate, going back to (4), until the budget is
     exhausted.
  8. Estimate the error rate of the classifier that
     uses both X and Y.
  • NOTE: This is an iterative process.

11
Active Sampling: Choosing the Most Informative Sample
  • We want to accurately estimate the error rate (e)
    of the classifier that uses both X and Y.
  • We should probe where we think sampling will
    improve the estimate of the error rate the most.
  • We do this by computing, for each candidate
    (C, X), the expected MSE of the estimate after
    sampling there, and choosing the candidate with
    the lowest value.
  • We show that this criterion is equivalent to
    maximizing the expected difference in the
    estimates before and after sampling at (C, X).
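
A minimal sketch of the second form of the criterion follows, under several assumptions that are not stated on the slide: Y is binary, P(C, X) is known from the full dataset, P(Y | C, X) is estimated from the samples so far with Laplace smoothing, and "difference" is read as squared difference. All function names are hypothetical.

```python
from itertools import product

def p_y_given(counts, c, x, y, alpha=1.0):
    """Laplace-smoothed estimate of P(Y=y | C=c, X=x) for binary Y."""
    n_y = counts.get((c, x, y), 0)
    n = counts.get((c, x, 0), 0) + counts.get((c, x, 1), 0)
    return (n_y + alpha) / (n + 2 * alpha)

def error_estimate(p_cx, counts):
    """Estimated error of the Bayes rule that predicts C from (X, Y)."""
    classes = sorted({c for c, _ in p_cx})
    xs = sorted({x for _, x in p_cx})
    correct = 0.0
    for x, y in product(xs, (0, 1)):
        correct += max(
            p_cx.get((c, x), 0.0) * p_y_given(counts, c, x, y)
            for c in classes)
    return 1.0 - correct

def expected_squared_change(p_cx, counts, c, x):
    """Expected squared change in the error estimate if Y is sampled at (c, x)."""
    before = error_estimate(p_cx, counts)
    benefit = 0.0
    for y in (0, 1):
        prob = p_y_given(counts, c, x, y)   # how likely this outcome is
        trial = dict(counts)
        trial[(c, x, y)] = trial.get((c, x, y), 0) + 1
        benefit += prob * (error_estimate(p_cx, trial) - before) ** 2
    return benefit

def choose_next(p_cx, counts):
    """Pick the (c, x) cell with the largest expected squared change."""
    return max(p_cx, key=lambda cx: expected_squared_change(p_cx, counts, *cx))

# Toy setting (made up): cells with x=0 are already well sampled,
# cells with x=1 have no Y measurements yet.
p_cx = {(0, 0): 0.25, (1, 0): 0.25, (0, 1): 0.25, (1, 1): 0.25}
counts = {(0, 0, 0): 20, (1, 0, 1): 20}
next_cell = choose_next(p_cx, counts)
```

In this toy setting the rule selects a cell with x=1, where a single new measurement can still move the error estimate noticeably.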

12
Active Sampling Issues
  • Finding a heuristic to approximate the
    theoretically optimal sampling strategy.
  • Dealing with the bias of the sampling scheme.
  • Computational considerations.

13
Experiments: Breast Cancer Dataset Description
  • 2 alternative class labels (clinical data)
      • C1: status of the patient (dead/alive)
      • C2: presence/absence of tumor relapse
  • 3 known features (histological data)
      • X1: tumor type
      • X2: pathologist's evaluation of lymph nodes
      • X3: pathologist's evaluation of morphology
  • 4 biomarker features
      • Y1: nuclei expressing ER
      • Y2: nuclei expressing PGR
      • Y3: score (colour intensity, stained area) by
        P53
      • Y4: score (colour intensity, stained area) by
        cerbB
  • Size of dataset: 400 records, with missing values
    (see later)
  • All features reduced to binary values based on
    domain conventions.

14
Experiments: Description

  Exp.  Class Label (C)  Known Features (X)            New Feature (Y)  Size
  I     dead/alive       all histological information  PGR              160
  II    dead/alive       all histological information  P53              164
  III   dead/alive       all histological information  ER               152
  IV    dead/alive       all histological information  cerbB            170
  V     relapse          all histological information  PGR              157
  VI    relapse          all histological information  P53              161
  VII   relapse          all histological information  ER               149
  VIII  relapse          all histological information  cerbB            167
  IX    dead/alive       PGR, P53, ER                  cerbB            196
  X     relapse          PGR, P53, ER                  cerbB            198

15
Some Results
16
Conclusions
  • We presented a preliminary solution for
    cost-constrained biomarker evaluation
  • For the same accuracy in the prediction of
    feature relevance, we need fewer samples.
  • For the application, this means using up less of
    the valuable biological tissue resource.
  • It also allows testing more biomarkers on the
    same amount of tissue.

17
Future Work
  • Extend the algorithm to
      • other classifiers
      • continuous features
      • batch active learning, where we sample more
        than one sample at a time
  • Study how much worse a myopic sampling strategy
    is than a non-myopic one.

18
Thank You