A NESTED UNSUPERVISED APPROACH TO IDENTIFYING NOVEL MOLECULAR SUBTYPES

1
A NESTED UNSUPERVISED APPROACH TO IDENTIFYING
NOVEL MOLECULAR SUBTYPES
  • ELIZABETH GARRETT-MAYER
  • ONCOLOGY BIOSTATISTICS
  • JOHNS HOPKINS UNIVERSITY
  • "MCMSki": The Past, Present, and Future of Gibbs
    Sampling
  • Bormio, Italy
  • January 12-14, 2005

2
INTRODUCTION: MOLECULAR SUBTYPING IN LUNG CANCER
  • Lung cancer remains the leading cause of cancer
    deaths for men and women
  • Lung cancer diagnosis includes evaluation of:
  • type of cancer (e.g. non-small cell,
    adenocarcinoma)
  • location and size
  • lymph node involvement
  • evidence of metastases outside the lungs.
  • But tumors with identical diagnoses often:
  • progress differently,
  • respond to therapy differently,
  • result in different long-term outcomes.

3
MOLECULAR SUBTYPING IN LUNG CANCER
  • Genome-wide analyses of gene expression profiles
    show promise: different subclasses of tumors
    correspond to distinct gene expression patterns
  • Multiple studies in lung cancer have found gene
    expression profiles for lung cancer subtypes.
  • Bhattacharjee et al. (PNAS 2001)
  • Beer et al. (Nature Medicine 2002)
  • Garber et al. (PNAS 2001)
  • and more.
  • Some overlap and some disagreement between
    profiles.
  • Different technologies used (e.g. Affymetrix
    versus cDNA chips)
  • Different genes on the arrays.
  • Different statistical methods for developing
    profiles
  • Validation: some profiles have been validated,
    but most have not.

4
MOLECULAR CLASSIFICATION
  • Goal: to use expression data to identify or
    hypothesize subtypes of cancer that are as yet
    undefined.
  • Eventually, we'd like to be able to offer
    individualized prognoses and therapy based on
    molecular profiles
  • Success story: Gefitinib (Iressa)
  • Non-small cell lung cancers
  • Those with EGFR protein mutation have high
    probability of response
  • Clinical test developed for screening lung cancer
    patients
  • We need additional new classes that are:
  • Interpretable (biologically)
  • Amenable to further analyses
  • Translatable into clinical tools

5
STAGES OF MOLECULAR CLASSIFICATION
  • Dimension reduction
  • We start with too many genes, so we need to pare
    them down
  • Subtype identification
  • Identify homogeneous clusters of samples
  • Ideally, based on outcome data
  • Expert elicitation
  • We do not want all genes related to subtypes
  • Ideally small, non-redundant set of genes that
    is highly predictive of subtype/outcome

6
DESIGN OF MICROARRAY STUDIES
  • Samples included
  • All cancers
  • Cancers plus some normals or other types (e.g.
    non-malignant disease)
  • Often few samples
  • Sometimes we have outcome data
  • Time to progression
  • Time to death
  • Response rate
  • Our data example: 156 lung samples
    (Bhattacharjee et al., 2001)
  • Affymetrix chips used for measuring expression
  • 139 adenocarcinomas and 17 normal samples
  • 5665 genes available for analysis
  • no outcome data available

7
COMMON WAY OF SEEING MICROARRAY DATA PRESENTED
Garber et al. 2001, PNAS
8
MOLECULAR PROFILE OF THREE GENES
Profile       Gene A   Gene B   Gene C
Profile 1       -1       -1       -1
Profile 2       -1       -1        0
Profile 3       -1       -1        1
Profile 4        .        .        .
   .             .        .        .
Profile 26       1        1        0
Profile 27       1        1        1
where -1 = underexpressed, 0 = normally expressed,
1 = overexpressed
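
The 27 profiles are simply every combination of three states across three genes (3^3 = 27). A minimal sketch of that enumeration follows; the names GENES, STATES, and enumerate_profiles are illustrative and do not come from the talk.

```python
from itertools import product

GENES = ("Gene A", "Gene B", "Gene C")
STATES = (-1, 0, 1)  # -1 underexpressed, 0 normally expressed, 1 overexpressed


def enumerate_profiles(n_genes=3):
    """Return every combination of expression states for n_genes genes."""
    return list(product(STATES, repeat=n_genes))


profiles = enumerate_profiles(len(GENES))
for i, profile in enumerate(profiles, start=1):
    print(f"Profile {i:2d}: " + "  ".join(f"{s:2d}" for s in profile))
```

The number of profiles grows as 3^G for G genes, which is one reason a small, non-redundant gene set is desirable.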
9
(No Transcript)
10
(No Transcript)
11
(No Transcript)
12
LATENT EXPRESSION CLASSES
The proportion of underexpressed and
overexpressed samples for each gene g are
defined by
13
POE: PROBABILITY OF EXPRESSION
Variation across samples (population variation)
14
(No Transcript)
15
POE: PROBABILITY OF EXPRESSION
  • Sometimes we have relatively few samples
  • Borrow strength across genes
  • Bayesian hierarchical model for gene-specific
    parameters
  • Constrain parameters such that

16
Special Case: Normal samples included
  • If tumor sample t is normal, then
  • If tumor sample t is not normal, then
  • Allows us to define the normal component of the
    mixture distribution

17
(No Transcript)
18
ESTIMATION PROCEDURE
  • MCMC with Metropolis-Hastings algorithm in R.
  • Takes too long (overnight with 200 samples, 10000
    genes)
  • Currently being reprogrammed in C
  • Tried WinBUGS, but could not program a mixture of
    one normal and two uniforms.
  • Data are augmented with a trichotomous indicator
    e_gt for each a_gt (Diebolt and Robert, 1994)
  • e_gt is not fully missing: e_gt = 0 if sample t
    is a normal sample
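
The data-augmentation step can be sketched for a single observation as follows. This is an illustration under stated assumptions, not the talk's R code: the uniform components are assumed to sit on either side of the normal mean, on (mu - kappa_minus, mu) and (mu, mu + kappa_plus), and every name (mu, sigma, kappa_minus, kappa_plus, pi_minus, pi_plus, draw_indicator) is hypothetical.

```python
import math
import random


def class_probabilities(a, mu, sigma, kappa_minus, kappa_plus, pi_minus, pi_plus):
    """Conditional probabilities that a belongs to classes (-1, 0, 1)."""
    # Assumed component densities: two uniforms flanking one normal.
    d_minus = 1.0 / kappa_minus if mu - kappa_minus <= a < mu else 0.0
    d_plus = 1.0 / kappa_plus if mu < a <= mu + kappa_plus else 0.0
    z = (a - mu) / sigma
    d_zero = math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))
    weights = (pi_minus * d_minus,
               (1.0 - pi_minus - pi_plus) * d_zero,
               pi_plus * d_plus)
    total = sum(weights)
    if total == 0.0:
        # Outside both uniforms with an underflowed normal density:
        # only the normal component can have produced the point.
        return (0.0, 1.0, 0.0)
    return tuple(w / total for w in weights)


def draw_indicator(a, params, is_normal, rng):
    """Draw the trichotomous indicator; it is fixed at 0 for normal samples."""
    if is_normal:
        return 0
    probs = class_probabilities(a, **params)
    return rng.choices((-1, 0, 1), weights=probs)[0]


params = dict(mu=0.0, sigma=1.0, kappa_minus=5.0, kappa_plus=5.0,
              pi_minus=0.1, pi_plus=0.1)
rng = random.Random(0)
probs = class_probabilities(2.0, **params)
tumor_draw = draw_indicator(2.0, params, is_normal=False, rng=rng)
normal_draw = draw_indicator(2.0, params, is_normal=True, rng=rng)
```

In a full sampler this draw would be repeated for every gene and sample at each iteration, alternating with updates of the gene-specific parameters.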

19
ESTIMATION PROCEDURE
  • Sampling of ? parameters
  • where ? represents the full set of parameters,
    and ? is ? with ? removed.
  • ? ? e?,? combine so that we are sampling
    them from ?, e?
  • Facilitates mixing of the ? parameters (can be a
    problem if there are few or no samples in the
    uniform components)

20
ESTIMATION PROCEDURE
  • Why a mixture of uniforms and normals?
  • Mathematically
  • Identifiability is an issue due to small sample
    size
  • Fewer parameters than a mixture of three normals
  • Three-component normal mixture has 6 parameters
    (µ1, σ1, µ2, σ2, µ3, σ3)
  • Our parameterization has 4 parameters (?, ?-, µ,
    σ)
  • No points are assigned very low densities
  • Estimates are more stable
  • Practically
  • Gaussian errors are reasonable for measuring gene
    expression
  • Cancer is often thought to be caused by failure
    of some biological mechanism -> expression in
    cancer can take a broad range of values
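
The point that no observation is assigned a very low density can be seen with a small numerical comparison. This is only an illustration: the standard normal, the uniform range (0, 8), and the value x = 6 are arbitrary choices, not numbers from the talk.

```python
import math


def normal_pdf(x, mu, sigma):
    """Density of a normal distribution with mean mu and s.d. sigma."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def uniform_pdf(x, lo, hi):
    """Density of a uniform distribution on [lo, hi]."""
    return 1.0 / (hi - lo) if lo <= x <= hi else 0.0


x = 6.0                           # an extreme expression value
tail = normal_pdf(x, 0.0, 1.0)    # density in the tail of a normal
flat = uniform_pdf(x, 0.0, 8.0)   # density under a flat component
print(f"normal tail density: {tail:.3e}")
print(f"uniform density:     {flat:.3f}")
```

Under a normal component the extreme point has a density many orders of magnitude smaller than under the flat component, so a few such points can dominate the fit of a normal mixture; the uniform treats every value in its range alike.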

21
POE TRANSFORMATION
  • Each data point, a_gt, is transformed to the POE
    scale

22
POE TRANSFORMATION
  • Does not depend on original units of measure
    (e.g. absolute expression versus log-ratios)
  • Probability scale (loosely)
  • Long-term goal: studies using different
    technologies can be represented on the same
    unit-free scale
  • Denoises!
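
The unit-free property can be illustrated with a sketch. It assumes that the POE value is the probability of over-expression minus the probability of under-expression given the data, and that the mixture has uniforms on (mu - kappa_minus, mu) and (mu, mu + kappa_plus) around a normal; both the formula and the parameter names are assumptions made for this example.

```python
import math


def poe(a, mu, sigma, kappa_minus, kappa_plus, pi_minus, pi_plus):
    """Signed probability of differential expression for one value a (assumed form)."""
    w_minus = pi_minus / kappa_minus if mu - kappa_minus <= a < mu else 0.0
    w_plus = pi_plus / kappa_plus if mu < a <= mu + kappa_plus else 0.0
    z = (a - mu) / sigma
    w_zero = ((1.0 - pi_minus - pi_plus)
              * math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi)))
    total = w_minus + w_zero + w_plus
    if total == 0.0:
        return 0.0  # density underflow outside both uniforms: treat as typical
    return (w_plus - w_minus) / total


base = dict(mu=0.0, sigma=1.0, kappa_minus=5.0, kappa_plus=5.0,
            pi_minus=0.1, pi_plus=0.1)
c = 100.0  # change of measurement units
scaled = dict(base, sigma=c * base["sigma"],
              kappa_minus=c * base["kappa_minus"],
              kappa_plus=c * base["kappa_plus"])

original_units = poe(2.0, **base)
rescaled_units = poe(c * 2.0, **scaled)
print(original_units, rescaled_units)
```

Rescaling the data and the scale parameters together leaves the value unchanged, and it always lies between -1 and 1, which is why the scale can be read loosely as a probability.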

23
SIMULATED DATA EXAMPLE
Original scale
POE scale
24
LUNG CANCER DATA EXAMPLE
25
EVALUATING DIAGNOSTIC CHARACTERISTICS OF GENES
  • For each gene, determine based on a fixed
    threshold p0 (e.g. p0 = 0.50)
  • Calculate sensitivities and specificities for
    each gene
  • Knowing which samples are normal allows us to
    compute these quantities
  • We can screen genes at this stage, discarding
    genes with poor predictive power

26
EVALUATING DIAGNOSTIC CHARACTERISTICS OF GENES
Assume
27
EVALUATING DIAGNOSTIC CHARACTERISTICS OF GENES
  • Better approach exploits MCMC estimation
  • We spent all this (computational) time sampling
    e_gt at each iteration of the chain! Let's make
    better use of those draws.
  • Calculate sensitivities and specificities as part
    of the chain, using sampled trichotomous
    indicators.
  • Better estimates of sensitivities and
    specificities
  • Posterior distributions
  • Does not rely on (arbitrary) cutoff p0

28
SPECIFICITY
SENSITIVITY
29
CLASSIFICATION GENE MINING
  • 1. Choose an expression pattern of interest. The
    idea is to state a target for how many samples
    are expected to show low expression and how many
    to show high expression for a gene. For
    example, the pattern (0.05, 0.20) indicates that
    5% of samples should be low and 20% should be
    high for a gene. The remaining 75% would then be
    in the "typical" component of the mixture.
  • 2. Sort genes according to consistency with the
    "low-high" distribution defined in step 1.
    Using the estimates of p_gt we can calculate, for
    each gene g, the probability that the
    distribution of over- and under-expression among
    the samples is the same as in the specified
    low-high distribution. We sort genes by this
    probability.
  • 3. Choose as the "seed" gene the gene with the
    largest probability from step 2 that is also
    sufficiently coherent (i.e., r_gg > r_c, where
    r_c is the cutoff for gene coherence).
  • 4. Choose genes that show substantial agreement
    with the seed gene, using either a fixed agreement
    cutoff or a cutoff set as a proportion of the
    coherence of the seed gene. Add these genes to
    the "group" seeded by the gene chosen in step 3.
  • 5. Remove the genes in the group defined in step
    4 from further consideration. Repeat steps 3 and
    4 to identify the remaining groups.
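
Steps 2 to 5 can be sketched as a greedy loop. The sketch assumes the step 2 probabilities and the pairwise agreement and coherence values have already been computed; the inputs scores and r, the helper agreement, and the function mine_gene_groups are hypothetical names, and only the fixed-cutoff variant of step 4 is shown.

```python
def agreement(r, g, h):
    """Look up the symmetric agreement r_gh; r_gg is the coherence of gene g."""
    return r[tuple(sorted((g, h)))]


def mine_gene_groups(scores, r, r_c, agree_cut):
    """Greedy seeding and grouping of genes (steps 2 to 5)."""
    # Step 2: sort genes by consistency with the target low-high pattern.
    remaining = sorted(scores, key=scores.get, reverse=True)
    groups = []
    while True:
        # Step 3: best-scoring remaining gene that is coherent enough.
        seed = next((g for g in remaining if agreement(r, g, g) > r_c), None)
        if seed is None:
            break
        # Step 4: genes agreeing with the seed above a fixed cutoff.
        members = [g for g in remaining
                   if g != seed and agreement(r, seed, g) >= agree_cut]
        group = [seed] + members
        groups.append(group)
        # Step 5: remove the group from further consideration.
        remaining = [g for g in remaining if g not in group]
    return groups


# Toy inputs, invented for illustration only.
scores = {"g1": 0.9, "g2": 0.8, "g3": 0.7, "g4": 0.6}
r = {("g1", "g1"): 0.8, ("g2", "g2"): 0.9, ("g3", "g3"): 0.2, ("g4", "g4"): 0.7,
     ("g1", "g2"): 0.6, ("g1", "g3"): 0.1, ("g1", "g4"): 0.2,
     ("g2", "g3"): 0.1, ("g2", "g4"): 0.3, ("g3", "g4"): 0.1}
groups = mine_gene_groups(scores, r, r_c=0.5, agree_cut=0.5)
print(groups)
```

The loop terminates because each pass removes at least the seed gene. Genes that are never coherent enough to seed a group and never agree with a seed are left unassigned.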

30
GENE PROFILES
  • Three genes selected for profiling
  • BRCA1 (breast cancer 1) tumor suppressor gene
    related to familial breast/ovarian cancer and
    other cancers
  • MEIS1 (myeloid ecotropic viral integration)
    transcription factor related to oncogenesis
  • FGF7 (fibroblast growth factor 7) related to
    lung development

31
GENE PROFILES
MEIS1
FGF7
BRCA1
32
OTHER POINTS
  • CAVEAT: Weak Identifiability
  • ?s only meaningful when enough samples in
    over- and under-expression components
  • If sample size is small.
  • Future/Other work
  • "Normal" does not have to be normal
  • Gefitinib analogy
  • Applications in breast cancer, lung cancer, AML.

33
ACKNOWLEDGEMENTS AND REFERENCES
  • Giovanni Parmigiani
  • Ed Gabrielson
  • Jiang Huang
  • Xiaogang Zhong
  • Garrett, E.S., Parmigiani, G. A nested
    unsupervised approach to identifying novel
    molecular subtypes. Bernoulli, 10(6), 2004.
  • Garrett, E.S., Parmigiani, G. POE: Statistical
    Methods for Qualitative Analysis of Gene
    Expression. In: The Analysis of Gene Expression
    Data: Methods and Software (eds. G. Parmigiani,
    E.S. Garrett, R.A. Irizarry, S.L. Zeger), Chapter
    16, Springer, New York, 2003.
  • Parmigiani, G., Garrett, E., Anbazhagan, R.,
    Gabrielson, E. A Statistical Framework for
    Expression-Based Molecular Classification in
    Cancer. Journal of the Royal Statistical Society,
    Series B (with discussion), 64: 717-736, 2002.
  • Scharpf, R., Garrett, E.S., Hu, J., Parmigiani,
    G. Statistical Modeling and Visualization of
    Molecular Profiles in Cancer. Biotechniques, 34:
    S22-S29, 2003.