A NESTED UNSUPERVISED APPROACH TO IDENTIFYING NOVEL MOLECULAR SUBTYPES

1
A NESTED UNSUPERVISED APPROACH TO IDENTIFYING
NOVEL MOLECULAR SUBTYPES

ELIZABETH GARRETT-MAYER
ONCOLOGY BIOSTATISTICS
JOHNS HOPKINS UNIVERSITY
"MCMSki": The Past, Present, and Future of Gibbs Sampling
Bormio, Italy
January 12-14, 2005

2
INTRODUCTION: MOLECULAR SUBTYPING IN LUNG CANCER

Lung cancer remains the leading cause of cancer deaths for men and women.
Lung cancer diagnosis includes evaluation of:
  type of cancer (e.g. non-small cell, adenocarcinoma)
  location and size
  lymph node involvement
  evidence of metastases outside the lungs
But tumors with identical diagnoses often:
  progress differently
  respond to therapy differently
  result in different long-term outcomes

3
MOLECULAR SUBTYPING IN LUNG CANCER

Genome-wide analyses of gene expression profiles show promise: different subclasses of tumors correspond to distinct gene expression patterns.
Multiple studies in lung cancer have found gene expression profiles for lung cancer subtypes:
  Bhattacharjee et al. (PNAS 2001)
  Beer et al. (Nature Medicine 2002)
  Garber et al. (PNAS 2001)
  and more
There is some overlap and some disagreement between profiles:
  Different technologies used (e.g. Affymetrix versus cDNA chips)
  Different genes on the arrays
  Different statistical methods for developing profiles
Validation: some profiles have been validated, but most have not.

4
MOLECULAR CLASSIFICATION

Goal: to use expression data to identify or hypothesize subtypes of cancer that are as yet undefined.
Eventually, we'd like to be able to offer individualized prognoses and therapy based on molecular profiles.
Success story: Gefitinib (Iressa)
  Non-small cell lung cancers
  Those with an EGFR mutation have a high probability of response
  Clinical test developed for screening lung cancer patients
We need additional new classes that are:
  Interpretable (biologically)
  Amenable to further analyses
  Translatable into clinical tools

5
STAGES OF MOLECULAR CLASSIFICATION

Dimension reduction
  We start with too many genes; we need to pare them down
Subtype identification
  Identify homogeneous clusters of samples
  Ideally, based on outcome data
Expert elicitation
  We do not want all genes related to subtypes
  Ideally, a small, non-redundant set of genes that is highly predictive of subtype/outcome

6
DESIGN OF MICROARRAY STUDIES

Samples included
  All cancers
  Cancers plus some normals or other types (e.g. non-malignant disease)
  Often few samples
Sometimes we have outcome data
  Time to progression
  Time to death
  Response rate
Our data example: 156 lung samples (Bhattacharjee et al., 2001)
  Affymetrix chips used for measuring expression
  139 adenocarcinomas and 17 normal samples
  5665 genes available for analysis
  No outcome data available

7
COMMON WAY OF SEEING MICROARRAY DATA PRESENTED
Garber et al. 2001, PNAS
8
MOLECULAR PROFILE OF THREE GENES
              Gene A   Gene B   Gene C
Profile 1       -1       -1       -1
Profile 2       -1       -1        0
Profile 3       -1       -1        1
Profile 4        .        .        .
   .             .        .        .
Profile 26       1        1        0
Profile 27       1        1        1
where -1 = underexpressed, 0 = normally expressed, 1 = overexpressed
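The 27 rows of the table are all combinations of three states over three genes (3^3 = 27). A minimal sketch of that enumeration follows; the function name `enumerate_profiles` is hypothetical and only serves the illustration.

```python
from itertools import product

# Expression states: -1 = underexpressed, 0 = normally expressed, 1 = overexpressed.
STATES = (-1, 0, 1)


def enumerate_profiles(n_genes):
    """Return all 3**n_genes molecular profiles, last gene varying fastest."""
    return list(product(STATES, repeat=n_genes))


profiles = enumerate_profiles(3)

# Print the rows that are spelled out in the table (1-3 and 26-27).
for number, profile in enumerate(profiles, start=1):
    if number <= 3 or number >= 26:
        print(f"Profile {number}: {profile}")
```

With G genes the number of possible profiles grows as 3^G, which is one reason the later slides reduce the gene set before profiling.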
9
(No Transcript)
10
(No Transcript)
11
(No Transcript)
12
LATENT EXPRESSION CLASSES
The proportions of underexpressed and overexpressed samples for each gene g are defined by
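The formula from this slide is not in the transcript. One way to write the definition, consistent with the trichotomous indicator e_gt used later in the talk, is sketched below. The symbols \pi_g^- and \pi_g^+ are assumptions borrowed from the notation of the POE papers listed in the references, not recovered from the slide.

```latex
e_{gt} \in \{-1, 0, 1\}, \qquad
\pi_g^{-} = P(e_{gt} = -1), \qquad
\pi_g^{+} = P(e_{gt} = 1), \qquad
P(e_{gt} = 0) = 1 - \pi_g^{-} - \pi_g^{+}
```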
13
POE: PROBABILITY OF EXPRESSION
Variation across samples (population variation)
14
(No Transcript)
15
POE: PROBABILITY OF EXPRESSION

Sometimes we have relatively few samples
Borrow strength across genes
Bayesian hierarchical model for gene-specific parameters
Constrain parameters such that

16
Special Case: Normal samples included

If tumor sample t is normal, then e_gt = 0 for every gene g
If tumor sample t is not normal, then e_gt can be -1, 0, or 1
Allows us to define the normal component of the mixture distribution

17
(No Transcript)
18
ESTIMATION PROCEDURE

MCMC with Metropolis-Hastings algorithm in R.
  Takes too long (overnight with 200 samples, 10000 genes)
  Currently being reprogrammed in C
  Tried WinBUGS, but could not program a mixture of one normal and two uniforms
Data are augmented with a trichotomous indicator e_gt for each a_gt (Diebolt and Robert, 1994)
  e_gt is not fully missing: e_gt = 0 if sample t is a normal sample
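The data-augmentation step can be sketched as follows. This is a simplified illustration, not the talk's R or C implementation: it assumes a single gene, ignores sample-specific effects, and uses hypothetical parameter names (`pi_minus`, `pi_plus` for the mixing proportions, `kappa_minus`, `kappa_plus` for the widths of the two uniform components).

```python
import math
import random


def component_densities(a, mu, sigma, kappa_minus, kappa_plus):
    """Densities of a under the under-, normally-, and over-expressed components."""
    under = 1.0 / kappa_minus if mu - kappa_minus <= a < mu else 0.0
    normal = math.exp(-0.5 * ((a - mu) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))
    over = 1.0 / kappa_plus if mu < a <= mu + kappa_plus else 0.0
    return under, normal, over


def sample_indicator(a, is_normal_sample, pi_minus, pi_plus, mu, sigma,
                     kappa_minus, kappa_plus, rng=random):
    """Draw e_gt in {-1, 0, 1} from its conditional distribution given a_gt."""
    if is_normal_sample:
        # Not missing: known normal samples are fixed in the normal component.
        return 0
    under, normal, over = component_densities(a, mu, sigma, kappa_minus, kappa_plus)
    weights = [pi_minus * under, (1.0 - pi_minus - pi_plus) * normal, pi_plus * over]
    return rng.choices([-1, 0, 1], weights=weights)[0]


# Example: one value far above the gene mean, drawn for a tumor sample.
draw = sample_indicator(8.0, False, pi_minus=0.05, pi_plus=0.20,
                        mu=0.0, sigma=1.0, kappa_minus=10.0, kappa_plus=10.0)
print(draw)
```

In a full sampler this draw would alternate with Metropolis-Hastings updates of the gene-specific parameters; only the indicator step is shown here.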

19
ESTIMATION PROCEDURE

Sampling of ? parameters
where ? represents the full set of parameters,
and ? is ? with ? removed.
? ? e?,? combine so that we are sampling
them from ?, e?
Facilitates mixing of the ? parameters (can be a
problem if there are few or no samples in the
uniform components)

20
ESTIMATION PROCEDURE

Why a mixture of uniforms and normals?
Mathematically
  Identifiability is an issue due to small sample size
  Fewer parameters than a mixture of three normals
    Three-component normal mixture has 6 parameters (µ1, σ1, µ2, σ2, µ3, σ3)
    Our parameterization has 4 parameters (κ+, κ-, µ, σ)
  No points are assigned very low densities
  Estimates are more stable
Practically
  Gaussian errors are reasonable for measuring gene expression
  Cancer is often thought to be caused by failure of some biological mechanism -> expression in cancer can take a broad range of values

21
POE TRANSFORMATION

Each data point, a_gt, is transformed to the POE scale
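The transformation formula itself is not in the transcript. A minimal sketch is given below, assuming the signed form used in the POE papers (posterior probability of overexpression minus posterior probability of underexpression) and plugging in fixed parameter values instead of averaging over MCMC draws. All parameter names are hypothetical, and the input is assumed to lie within the support of the mixture.

```python
import math


def poe(a, pi_minus, pi_plus, mu, sigma, kappa_minus, kappa_plus):
    """Signed probability of expression for one data point a_gt, in [-1, 1]."""
    # Weighted density of each mixture component at a.
    under = pi_minus / kappa_minus if mu - kappa_minus <= a < mu else 0.0
    over = pi_plus / kappa_plus if mu < a <= mu + kappa_plus else 0.0
    normal = ((1.0 - pi_minus - pi_plus)
              * math.exp(-0.5 * ((a - mu) / sigma) ** 2)
              / (sigma * math.sqrt(2.0 * math.pi)))
    total = under + normal + over
    # P(e = 1 | a) - P(e = -1 | a)
    return (over - under) / total


params = dict(pi_minus=0.05, pi_plus=0.20, mu=0.0, sigma=1.0,
              kappa_minus=10.0, kappa_plus=10.0)
for a in (-6.0, 0.0, 1.0, 6.0):
    print(a, round(poe(a, **params), 3))
```

Values near the gene mean map close to 0, while values far in either tail map close to -1 or 1, whatever the original measurement units were.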

22
POE TRANSFORMATION

Does not depend on original units of measure (e.g. absolute expression versus log-ratios)
Probability scale (loosely)
Long-term goal: studies using different technologies can be represented on the same unit-free scale
Denoises!

23
SIMULATED DATA EXAMPLE
Original scale
POE scale
24
LUNG CANCER DATA EXAMPLE
25
EVALUATING DIAGNOSTIC CHARACTERISTICS OF GENES

For each gene, determine each sample's expression class based on a fixed threshold p0 (e.g. p0 = 0.50)
Calculate sensitivities and specificities for each gene
  Knowing which samples are normal allows us to compute these quantities
We can screen genes at this stage, discarding genes with poor predictive power
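One plausible reading of this screening step is sketched below. It assumes a sample is called "abnormal" for a gene when the absolute value of its POE exceeds p0, that sensitivity is the fraction of tumor samples called abnormal, and that specificity is the fraction of normal samples not called abnormal. These definitions are assumptions for illustration; the slide's exact formulas are not in the transcript.

```python
def sensitivity_specificity(poe_values, is_normal, p0=0.5):
    """Per-gene sensitivity and specificity from POE values and known sample labels."""
    # A sample is called abnormal for this gene when |POE| exceeds the threshold.
    calls = [abs(p) > p0 for p in poe_values]
    tumor_calls = [c for c, normal in zip(calls, is_normal) if not normal]
    normal_calls = [c for c, normal in zip(calls, is_normal) if normal]
    sensitivity = sum(tumor_calls) / len(tumor_calls)
    specificity = sum(not c for c in normal_calls) / len(normal_calls)
    return sensitivity, specificity


# Toy data for one gene: four tumor samples followed by two normal samples.
poe_values = [0.9, -0.8, 0.1, 0.7, 0.05, -0.6]
is_normal = [False, False, False, False, True, True]
sens, spec = sensitivity_specificity(poe_values, is_normal, p0=0.5)
print(sens, spec)
```

Genes whose sensitivity or specificity falls below a chosen minimum could then be dropped before the classification stage.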

26
EVALUATING DIAGNOSTIC CHARACTERISTICS OF GENES
Assume
27
EVALUATING DIAGNOSTIC CHARACTERISTICS OF GENES

Better approach: exploit the MCMC estimation
  We spent all this (computational) time sampling e_gt at each iteration of the chain! Let's make better use of the samples.
Calculate sensitivities and specificities as part of the chain, using the sampled trichotomous indicators.
  Better estimates of sensitivities and specificities
  Posterior distributions
  Does not rely on an (arbitrary) cutoff p0

28
SPECIFICITY
SENSITIVITY
29
CLASSIFICATION: GENE MINING

1. Choose an expression pattern of interest. The idea is to state a target for how many samples are expected to show low expression and how many to show high expression for a gene. For example, the pattern (0.05, 0.20) indicates that 5% of samples should be low and 20% should be high for a gene. The remaining 75% would then be in the "typical" component of the mixture.
2. Sort genes according to consistency with the "low-high" distribution defined in step 1. Using the estimates of p_gt we can calculate, for each gene g, the probability that the distribution of over- and underexpression among the samples is the same as in the specified low-high distribution. We sort genes by this probability.
3. Choose as the "seed" gene the gene with the largest probability from step 2 that is also sufficiently coherent (i.e., r_gg > r_c, where r_c is the cutoff for gene coherence).
4. Choose genes that show substantial agreement with the seed gene, using either a fixed agreement cutoff or a proportion of the coherence of the seed gene. Add these genes to the "group" seeded by the gene chosen in step 3.
5. Remove the genes in the group defined in step 4 from further consideration. Repeat steps 3 and 4 to identify remaining groups.
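The five steps can be sketched as a greedy loop. This is a simplified illustration with stand-in measures, not the authors' implementation: the step 2 probability is replaced by a closeness score between a gene's estimated low/high fractions and the target, and the agreement r_gh is taken to be the mean product of signed POE values across samples (so coherence r_gg is the mean squared POE). The function name and the parameters `r_c` and `agree_frac` are hypothetical.

```python
def mine_gene_groups(poe, target=(0.05, 0.20), r_c=0.1, agree_frac=0.5):
    """Greedy gene grouping; poe maps gene name -> list of signed POE values."""
    n = len(next(iter(poe.values())))

    def agreement(g, h):
        # Stand-in for r_gh: mean product of signed POE values across samples.
        return sum(x * y for x, y in zip(poe[g], poe[h])) / n

    def score(g):
        # Stand-in for the step 2 probability: closeness to the target pattern.
        low = sum(max(-p, 0.0) for p in poe[g]) / n
        high = sum(max(p, 0.0) for p in poe[g]) / n
        return 1.0 - (abs(low - target[0]) + abs(high - target[1])) / 2.0

    # Step 2: sort genes by consistency with the target low-high pattern.
    remaining = sorted(poe, key=score, reverse=True)
    groups = []
    while remaining:
        # Step 3: best-scoring remaining gene that is sufficiently coherent.
        seed = next((g for g in remaining if agreement(g, g) > r_c), None)
        if seed is None:
            break
        # Step 4: genes agreeing with the seed, relative to the seed's coherence.
        cutoff = agree_frac * agreement(seed, seed)
        group = [g for g in remaining if agreement(seed, g) >= cutoff]
        groups.append(group)
        # Step 5: remove the group from consideration and repeat.
        remaining = [g for g in remaining if g not in group]
    return groups


# Toy POE values for four hypothetical genes across eight samples.
toy = {
    "geneA": [0.9, 0.9, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
    "geneB": [0.8, 0.9, 0.0, 0.0, 0.0, 0.0, 0.3, 0.0],
    "geneC": [0.0, 0.0, 0.0, 0.0, -0.9, -0.9, 0.0, 0.0],
    "geneD": [0.05, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
}
groups = mine_gene_groups(toy)
print(groups)
```

Genes that are never coherent enough to seed a group and never agree with a seed (geneD in the toy data) are left unassigned.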

30
GENE PROFILES

Three genes selected for profiling
  BRCA1 (breast cancer 1): tumor suppressor gene related to familial breast/ovarian cancer and other cancers
  MEIS1 (myeloid ecotropic viral integration): transcription factor related to oncogenesis
  FGF7 (fibroblast growth factor 7): related to lung development

31
GENE PROFILES
MEIS1
FGF7
BRCA1
32
OTHER POINTS

CAVEAT: Weak identifiability
?s only meaningful when enough samples in
over- and under-expression components
If sample size is small.
Future/Other work
Normal does not have to be normal
Gefitinib analogy
Applications in breast cancer, lung cancer, AML.

33
ACKNOWLEDGEMENTS AND REFERENCES

Giovanni Parmigiani
Ed Gabrielson
Jiang Huang
Xiaogang Zhong

Garrett, E.S., Parmigiani, G. A nested unsupervised approach to identifying novel molecular subtypes. Bernoulli, 10(6), 2004.
Garrett, E.S., Parmigiani, G. POE: Statistical Methods for Qualitative Analysis of Gene Expression. In The Analysis of Gene Expression Data: Methods and Software (eds. G. Parmigiani, E.S. Garrett, R.A. Irizarry, S.L. Zeger), Chapter 16, Springer, New York, 2003.
Parmigiani, G., Garrett, E., Anbazhagan, R., Gabrielson, E. A Statistical Framework for Expression-Based Molecular Classification in Cancer. Journal of the Royal Statistical Society, Series B, with discussion, 64: 717-736, 2002.
Scharpf, R., Garrett, E.S., Hu, J., Parmigiani, G. Statistical Modeling and Visualization of Molecular Profiles in Cancer. Biotechniques, 34: S22-S29, 2003.