Title: An Introduction to the Statistical Analysis of Microarray Gene Expression Data
1An Introduction to the Statistical Analysis of
Microarray Gene Expression Data
Geoff McLachlan and Liat Ben-Tovim Jones
Department of Mathematics and Institute for Molecular Bioscience, University of Queensland
Bioinformatics Symposium HELP 2005
2Bioinformaticians
Richard Bean: PhD Discrete Maths, UQ
Liat Ben-Tovim Jones: PhD Biochemistry, MD Medicine, Cambridge University
3Outline of Talk
- Introduction
- The Microarray Experiment
- The Problem of Detecting Differentially Expressed Genes
- Cluster Analysis: Clustering Tissues and Genes
4Although the definition of bioinformatics as a
discipline is still emerging, it is undoubtedly
clear that the information sciences
(statistics, mathematics, IT) are a major part
of it.
5A microarray is a new technology which allows the
measurement of the expression levels of thousands
of genes simultaneously.
- (1) Sequencing of the genome (human, mouse, and others)
- (2) Improvement in technology to generate high-density arrays on chips (glass slides or nylon membrane)
The entire genome of an organism can be probed at
a single point in time.
6Draft of the Human Genome
Public Sequence: Nature, Feb. 2001
Celera Sequence: Science, Feb. 2001
7High Throughput "omics" Experiments
- Genomics: study of DNA (MICROARRAYS)
- Proteomics: study of Proteins
- Metabolomics: study of Metabolites
- Experiments on a Genome-Wide scale!
8Microarrays are difficult to analyze
Microarrays present new problems for statistics
because the data are very high dimensional with
very little replication.
The challenge is to extract useful information
and discover knowledge from the data, such as
gene functions, gene interactions, regulatory
pathways, metabolic pathways etc.
9Outline of Talk
- Introduction
- The Microarray Experiment
- The Problem of Detecting Differentially Expressed Genes
- Cluster Analysis: Clustering Tissues and Genes
10The Central Dogma of Molecular Biology
Every cell contains the same DNA.
The activity of a gene (expression) can be
determined by the presence of its complementary
mRNA.
Cells differ in the DNA (gene) which is active at
any one time.
Genes code for proteins through the intermediary
of mRNA.
11Gene Expression Studies
- Pattern of genes expressed in a cell is
characteristic of its current state
- Virtually all differences in cell state or type
are correlated with changes in mRNA levels of
many genes
- DNA microarray technology originally conceived
in order to detect expression of thousands of
genes simultaneously
12The Microarray Experiment
- DNA complementary to the genes of interest is
generated and laid out in microscopic quantities
on solid surfaces at defined positions
- DNA (mRNA) from samples is eluted over the surface; complementary DNA binds
- Presence of bound DNA is detected by
fluorescence following laser excitation
13Steps in the Microarray Experiment
- mRNA is extracted from the cell
- mRNA is reverse transcribed to cDNA (mRNA itself
is unstable)
- cDNA is labeled with fluorescent dye (the TARGET)
- The sample is hybridized to known DNA sequences on the array (tens of thousands of genes): the PROBE
- If present, complementary target binds to probe DNA (complementary base pairing)
- Target bound to probe DNA fluoresces
14Spotted cDNA Microarray
Compare the gene expression levels for two cell
populations on a single microarray.
16 Microarray Image
- Red: High expression in target labelled with cyanine 5 dye
- Green: High expression in target labelled with cyanine 3 dye
- Yellow: Similar expression in both target samples
17Assumptions
Gene Expression -> mRNA -> Fluorescence Intensity
(1) Gene Expression to mRNA: cellular mRNA levels directly reflect gene expression.
(2) mRNA to Fluorescence Intensity: intensity of bound target is a measure of the abundance of the mRNA in the sample.
18Experimental Error
Sample contamination
Poor quality/insufficient mRNA
Reverse transcription bias
Fluorescent labeling bias
Hybridization bias
Cross-linking of DNA (double strands)
Poor probe design (cross-hybridization)
Defective chips (scratches, degradation)
Background from non-specific hybridization
19The Microarray Technologies
Spotted Microarray (cDNA array)
- Use robot to spot glass slides at precise points with gene/EST sequences (homemade)
- relative gene expressions
Affymetrix GeneChip
- Uses photolithography to synthesize short oligonucleotides onto glass wafer
- absolute gene expressions
Each with its own advantages and disadvantages
20Aims of a Microarray Experiment
- observe changes in a gene in response to external stimuli (cell samples exposed to hormones, drugs, toxins)
- compare gene expressions between different tissue types (tumour vs normal cell samples)
- To gain understanding of
  - function of unknown genes
  - disease process at the molecular level
- Ultimately to use as tools in Clinical Medicine for diagnosis, prognosis and therapeutic management.
21Importance of Experimental Design
- Good DNA microarray experiments should have clear objectives.
- not performed as aimless data mining in search of unanticipated patterns that will provide answers to unasked questions
- (Richard Simon, BioTechniques 34, S16-S21, 2003)
22Extracting Data from the Microarray
- Cleaning
- Image processing
- Filtering
- Missing value estimation
- Normalization
- Remove sources of systematic variation.
23Gene Expressions from Measured Intensities
Spotted Microarray: $\log_2(\text{Intensity Cy5} / \text{Intensity Cy3})$
Affymetrix: $(\text{Perfect Match Intensity} - \text{Mismatch Intensity})$
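The two expression measures on this slide can be sketched in a few lines of Python. This is only an illustration: the function names and the toy intensities are hypothetical, and averaging the Perfect Match minus Mismatch differences over the probe pairs of a probe set is an assumption (the slide gives only the difference itself).

```python
import numpy as np

def spotted_log_ratio(cy5, cy3):
    """log2(Cy5 / Cy3) for each spot on a two-colour (spotted) array."""
    cy5 = np.asarray(cy5, dtype=float)
    cy3 = np.asarray(cy3, dtype=float)
    return np.log2(cy5 / cy3)

def affy_average_difference(pm, mm):
    """Mean of (PM - MM) over the probe pairs (columns) of each probe set (rows).

    Averaging over probe pairs is an assumption made for this sketch.
    """
    pm = np.asarray(pm, dtype=float)
    mm = np.asarray(mm, dtype=float)
    return (pm - mm).mean(axis=1)

# Toy intensities (made up for illustration only).
log_ratios = spotted_log_ratio([1000, 250, 500], [500, 500, 500])
print(log_ratios)  # [ 1. -1.  0.]

avg_diff = affy_average_difference([[900, 1100], [300, 320]],
                                   [[400, 600], [290, 330]])
print(avg_diff)    # [500.   0.]
```

A log ratio of 1 corresponds to twice as much signal in the Cy5 channel as in the Cy3 channel, and a log ratio of -1 to half as much.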
24Data Transformation
Rocke and Durbin (2001), Munson (2001), Durbin et
al. (2002), and Huber et al. (2002)
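26Gene Expression Data represented as N x M Matrix
Columns: Sample 1, Sample 2, ..., Sample M
Rows: Gene 1, Gene 2, ..., Gene N
N rows correspond to the N genes. M columns correspond to the M samples (microarray experiments).
Expression Signature: a column of the matrix (one sample across all genes)
Expression Profile: a row of the matrix (one gene across all samples)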
27Representation of Data from M Microarray Experiments
Columns: Sample 1, Sample 2, ..., Sample M
Rows: Gene 1, Gene 2, ..., Gene N
Assume we have extracted gene expression values from intensities.
Gene expressions can be shown as Heat Maps.
Expression Signature (column), Expression Profile (row)
30Large-scale gene expression studies are not a
passing fashion, but are instead one aspect of
new work of biological experimentation, one
involving large-scale, high throughput assays.
Speed et al., 2002, Statistical Analysis of Gene
Expression Microarray Data, Chapman and Hall/ CRC
31Growth of microarray and microarray methodology literature listed in PubMed from 1995 to 2003.
The category "all microarray papers" includes those found by searching PubMed for "microarray" OR "gene expression profiling". The category "statistical microarray papers" includes those found by searching PubMed for "statistical method" OR "statistical techniq*" OR "statistical approach" AND "microarray" OR "gene expression profiling".
32Mehta et al. (Nature Genetics, Sept. 2004)
The field of expression data analysis is particularly active with novel analysis strategies and tools being published weekly, and the value of many of these methods is questionable. Some results produced by using these methods are so anomalous that a breed of forensic statisticians (Ambroise and McLachlan, 2002; Baggerly et al., 2003), who doggedly detect and correct other HDB (high-dimensional biology) investigators' prominent mistakes, has been created.
33Selection bias in gene extraction on the basis of microarray gene-expression data
Ambroise and McLachlan, Proceedings of the National Academy of Sciences, Vol. 99, Issue 10, 6562-6566, May 14, 2002
http://www.pnas.org/cgi/content/full/99/10/6562
34 Selection Bias
Bias that occurs when a subset of the variables is selected (dimension reduction) in some optimal way, and then the predictive capability of this subset is assessed in the usual way, i.e. using an ordinary measure for a set of variables.
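A small simulation can make the selection bias concrete. The sketch below is an illustration only, not the procedure used by Ambroise and McLachlan: it assumes pure-noise data (no gene is truly related to the class), a simple nearest-centroid rule, and leave-one-out cross-validation, and all function names are hypothetical. Selecting genes once on all samples and then cross-validating tends to give an optimistic error rate, whereas repeating the selection inside each fold does not.

```python
import numpy as np

def t_scores(x, y):
    """Absolute pooled two-sample t-statistic per gene; x is samples x genes."""
    a, b = x[y == 0], x[y == 1]
    na, nb = len(a), len(b)
    pooled = ((na - 1) * a.var(axis=0, ddof=1)
              + (nb - 1) * b.var(axis=0, ddof=1)) / (na + nb - 2)
    return np.abs(a.mean(axis=0) - b.mean(axis=0)) / np.sqrt(pooled * (1 / na + 1 / nb))

def nearest_centroid_predict(x_train, y_train, x_test):
    """Assign each test sample to the class with the closer mean vector."""
    c0 = x_train[y_train == 0].mean(axis=0)
    c1 = x_train[y_train == 1].mean(axis=0)
    d0 = ((x_test - c0) ** 2).sum(axis=1)
    d1 = ((x_test - c1) ** 2).sum(axis=1)
    return (d1 < d0).astype(int)

def loo_error(x, y, n_genes, select_inside):
    """Leave-one-out error with gene selection inside or outside the folds."""
    n = len(y)
    if not select_inside:
        # Biased: genes chosen once, using every sample including the test one.
        keep_all = np.argsort(t_scores(x, y))[-n_genes:]
    errors = 0
    for i in range(n):
        train = np.arange(n) != i
        if select_inside:
            # Selection repeated on the training samples of this fold only.
            keep = np.argsort(t_scores(x[train], y[train]))[-n_genes:]
        else:
            keep = keep_all
        pred = nearest_centroid_predict(x[train][:, keep], y[train], x[i:i + 1, keep])
        errors += int(pred[0] != y[i])
    return errors / n

rng = np.random.default_rng(0)
x = rng.normal(size=(30, 1000))   # 30 samples, 1000 noise "genes"
y = np.repeat([0, 1], 15)         # class labels unrelated to x
biased = loo_error(x, y, n_genes=10, select_inside=False)
unbiased = loo_error(x, y, n_genes=10, select_inside=True)
print(biased, unbiased)
```

Because the labels carry no information here, an honest error estimate should sit near 50%; an error estimate well below that from the first call is produced entirely by the selection step.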
35Nature Reviews Cancer, Feb. 2005
36Supervised Classification (Two Classes)
Columns: Sample 1, ..., Sample n
Rows: Gene 1, ..., Gene p
Samples are labelled Class 1 (good prognosis) or Class 2 (poor prognosis)
37Supervised Classification of Tissue Samples
38Outline of Talk
- Introduction
- The Microarray Experiment
- The Problem of Detecting Differentially Expressed Genes
- Cluster Analysis: Clustering Tissues and Genes
40Class 1
Class 2
41Fold Change is the Simplest Method
Calculate the log ratio between the two classes and consider all genes that differ by more than an arbitrary cutoff value to be differentially expressed. A two-fold difference is often chosen.
Fold change is not a statistical test.
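As a minimal sketch of the fold-change rule, assuming the expression values are already on the log2 scale and that class means are compared (both assumptions, as are the function and variable names), a two-fold cutoff corresponds to an absolute log2 difference greater than 1:

```python
import numpy as np

def fold_change_calls(log2_expr, is_class2, cutoff=1.0):
    """Flag genes whose mean log2 expression differs between classes by more than cutoff.

    log2_expr: genes x samples array of log2 expression values.
    is_class2: boolean array marking the samples that belong to Class 2.
    """
    log2_expr = np.asarray(log2_expr, dtype=float)
    is_class2 = np.asarray(is_class2, dtype=bool)
    mean1 = log2_expr[:, ~is_class2].mean(axis=1)
    mean2 = log2_expr[:, is_class2].mean(axis=1)
    log_fc = mean2 - mean1
    return log_fc, np.abs(log_fc) > cutoff

# Toy data: 3 genes, 2 samples per class (values made up for illustration).
expr = [[8.0, 8.0, 10.0, 10.0],
        [5.0, 5.0, 5.5, 5.5],
        [9.0, 9.0, 7.0, 7.0]]
log_fc, called = fold_change_calls(expr, [False, False, True, True])
print(log_fc)   # [ 2.   0.5 -2. ]
print(called)   # [ True False  True]
```

Note that the rule ignores the variability of each gene across samples, which is why it is not a statistical test.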
42Test of a Single Hypothesis
For gene j, let $H_j = 0$ denote that the null hypothesis of no association between its expression level and its class membership holds, where $j = 1, \dots, N$.
$H_j = 0$: Null hypothesis for the jth gene holds.
$H_j = 1$: Null hypothesis for the jth gene does not hold.
43Gene Statistics: Two-Sample t-Statistic
Student's t-statistic
Pooled form of the Student's t-statistic, assuming a common variance in the two classes
Modified t-statistic of Tusher et al. (2001)
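The three statistics named on this slide can be sketched as follows. The formulas are the standard textbook forms and are assumptions here, since the slide's own equations were not transcribed. In particular, the modified statistic is written as the mean difference divided by (pooled standard error + s0), and the constant `s0` is supplied by hand rather than chosen by the data-driven rule of Tusher et al. (2001).

```python
import numpy as np

def two_sample_stats(x1, x2, s0=0.0):
    """Per-gene two-sample statistics; x1 and x2 are genes x samples arrays."""
    x1 = np.asarray(x1, dtype=float)
    x2 = np.asarray(x2, dtype=float)
    n1, n2 = x1.shape[1], x2.shape[1]
    diff = x1.mean(axis=1) - x2.mean(axis=1)
    v1 = x1.var(axis=1, ddof=1)
    v2 = x2.var(axis=1, ddof=1)

    # Unpooled form: separate variance estimate for each class.
    unpooled_t = diff / np.sqrt(v1 / n1 + v2 / n2)

    # Pooled form: common variance assumed in the two classes.
    pooled_var = ((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2)
    pooled_se = np.sqrt(pooled_var * (1.0 / n1 + 1.0 / n2))
    pooled_t = diff / pooled_se

    # Modified form: a positive constant s0 in the denominator damps genes
    # whose variance estimate happens to be very small.
    modified_t = diff / (pooled_se + s0)
    return unpooled_t, pooled_t, modified_t

x1 = [[1.0, 2.0, 3.0]]
x2 = [[4.0, 5.0, 6.0]]
unpooled_t, pooled_t, _ = two_sample_stats(x1, x2)
_, _, modified_t = two_sample_stats(x1, x2, s0=1.0)
print(unpooled_t, pooled_t, modified_t)
```

With equal sample sizes and equal sample variances, as in this toy input, the pooled and unpooled forms coincide, while the modified statistic is smaller in absolute value.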
44Multiplicity Problem
Perform a test for each gene to determine the
statistical significance of differential
expression for that gene.
Problem: When many hypotheses are tested, the probability of a type I error (false positive) increases sharply with the number of hypotheses.
Further: Genes are co-regulated, so there is correlation between the test statistics.
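The growth of the false positive probability can be illustrated with a short calculation. The sketch assumes independent tests that are all true nulls, which the slide itself notes is not realistic for co-regulated genes, so the numbers are only indicative. Under that assumption the chance of at least one false positive among m tests at level alpha is 1 - (1 - alpha)^m.

```python
def prob_at_least_one_false_positive(alpha, m):
    """P(at least one type I error) for m independent tests of true nulls at level alpha."""
    return 1.0 - (1.0 - alpha) ** m

for m in (1, 10, 100, 1000):
    print(m, prob_at_least_one_false_positive(0.05, m))
```

Already at m = 100 the probability exceeds 0.99, and microarray studies test thousands of genes.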
45Methods for dealing with the Multiplicity Problem
- The Bonferroni Method
  - controls the family-wise error rate (FWER), i.e. the probability that at least one false positive error will be made
  - Too strict for gene expression data: tries to make it unlikely that even one false rejection of the null is made, which may lead to missed findings
- The False Discovery Rate (FDR)
  - emphasizes the proportion of false positives among the identified differentially expressed genes.
  - Good for gene expression data: says something about the chosen genes
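A minimal sketch of the Bonferroni method: each P-value is compared with alpha divided by the number of tests. The function name and the example P-values are hypothetical.

```python
import numpy as np

def bonferroni_reject(pvalues, alpha=0.05):
    """Reject the null for P-values at or below alpha / (number of tests)."""
    p = np.asarray(pvalues, dtype=float)
    return p <= alpha / p.size

pvals = [0.0001, 0.004, 0.02, 0.03, 0.5]
bonf = bonferroni_reject(pvals)
print(bonf)  # [ True  True False False False]
```

With N in the thousands the per-gene threshold becomes very small, which is the sense in which the method is too strict for gene expression data.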
46False Discovery Rate Benjamini and Hochberg
(1995)
The FDR is essentially the expectation of the
proportion of false positives among the
identified differentially expressed genes.
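One common way to control the FDR is the step-up procedure of Benjamini and Hochberg (1995), sketched below. The function name and example P-values are hypothetical, and the procedure's guarantee is usually stated for independent test statistics (or certain forms of positive dependence).

```python
import numpy as np

def benjamini_hochberg(pvalues, q=0.05):
    """Step-up procedure: reject the k smallest P-values, where k is the
    largest rank i such that p_(i) <= q * i / m."""
    p = np.asarray(pvalues, dtype=float)
    m = p.size
    order = np.argsort(p)
    below = p[order] <= q * np.arange(1, m + 1) / m
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = np.max(np.nonzero(below)[0])
        reject[order[:k + 1]] = True
    return reject

pvals = [0.0001, 0.004, 0.02, 0.03, 0.5]
bh = benjamini_hochberg(pvals, q=0.05)
print(bh)  # [ True  True  True  True False]
```

On these five P-values a Bonferroni threshold of 0.05 / 5 = 0.01 would keep only the two smallest, whereas the step-up procedure keeps four.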
47Null Distribution of the Test Statistic
Permutation Method
The null distribution has a resolution on the
order of the number of permutations. If we
perform B permutations, then the P-value will be
estimated with a resolution of 1/B. If we
assume that each gene has the same null
distribution and combine the permutations, then
the resolution will be 1/(NB) for the pooled
null distribution.
48Using just the B permutations of the class labels for the gene-specific statistic $T_j$, the P-value for $T_j = t_j$ is assessed as
$$p_j = \frac{\#\{b : |t_{0j}^{(b)}| \ge |t_j|\}}{B},$$
where $t_{0j}^{(b)}$ is the null version of $t_j$ after the $b$th permutation of the class labels.
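The gene-specific permutation P-value, that is, the fraction of the B permuted statistics that are at least as large in absolute value as the observed one, can be sketched as below. Assumptions in this sketch: the statistic is the pooled two-sample t, the permutations are drawn at random (so some may repeat) rather than enumerated, and the data are simulated.

```python
import numpy as np

def pooled_t(x, labels):
    """Pooled two-sample t-statistic for each gene (row of x)."""
    a, b = x[:, labels == 0], x[:, labels == 1]
    na, nb = a.shape[1], b.shape[1]
    pooled_var = ((na - 1) * a.var(axis=1, ddof=1)
                  + (nb - 1) * b.var(axis=1, ddof=1)) / (na + nb - 2)
    return (a.mean(axis=1) - b.mean(axis=1)) / np.sqrt(pooled_var * (1.0 / na + 1.0 / nb))

def permutation_pvalues(x, labels, n_perm=200, seed=0):
    """P-value per gene: share of permuted |t| values >= the observed |t|."""
    rng = np.random.default_rng(seed)
    t_obs = np.abs(pooled_t(x, labels))
    exceed = np.zeros(x.shape[0])
    for _ in range(n_perm):
        t_null = np.abs(pooled_t(x, rng.permutation(labels)))
        exceed += t_null >= t_obs
    return exceed / n_perm

rng = np.random.default_rng(42)
labels = np.repeat([0, 1], 6)        # 6 samples per class
x = rng.normal(size=(5, 12))         # 5 genes, 12 samples
x[0, labels == 1] += 5.0             # gene 0 is strongly differentially expressed
p = permutation_pvalues(x, labels)
print(p)
```

Every P-value returned is a multiple of 1/B, which is the resolution limit described on the previous slide.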
49As there are only 10 distinct permutations here,
the null distribution based on these permutations
is too granular. Hence consideration is given
to permuting the labels of each of the other
genes and estimating the null distribution of a
gene based on the pooled permutations so
obtained. But there is a problem with this
method in that the null values of the test
statistic for each gene do not necessarily
have the theoretical null distribution that we
are trying to estimate.
50If we pool over all N genes, then
$$p_j = \frac{\sum_{i=1}^{N} \#\{b : |t_{0i}^{(b)}| \ge |t_j|\}}{NB}.$$
51Two-component mixture model
$$f(w_j) = \pi_0 f_0(w_j) + \pi_1 f_1(w_j)$$
$\pi_0$ is the proportion of genes that are not differentially expressed, and $\pi_1 = 1 - \pi_0$ is the proportion that are. Efron et al. (2001)
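52Two-component mixture model
$$f(w_j) = \pi_0 f_0(w_j) + \pi_1 f_1(w_j)$$
$\pi_0$ is the proportion of genes that are not differentially expressed, and $\pi_1 = 1 - \pi_0$ is the proportion that are.
Then
$$\tau_0(w_j) = \frac{\pi_0 f_0(w_j)}{f(w_j)}$$
is the posterior probability that gene j is not differentially expressed.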
53Procedure
1. Form a statistic $w_j$ for each gene. A large positive value of $w_j$ corresponds to a gene that is DE.
2. Fit to $w_1, \dots, w_N$ a mixture of two normal densities, with the 1st component a standard normal (genes that are not DE). Assume that the $w_j$ have been transformed so that they are approximately normally distributed, e.g. for the ANOVA statistic F (Broet et al., 2004).
3. Let $\hat{\tau}_0(w_j)$ denote the (estimated) posterior probability that gene j belongs to the first component of the mixture.
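Steps 2 and 3 of the procedure can be sketched with a short EM algorithm. This is a minimal illustration under stated assumptions, not the authors' software: the null component is held fixed at N(0, 1), the second component is a single normal with free mean and standard deviation, the starting values are ad hoc, and the data are simulated. All function names are hypothetical.

```python
import numpy as np

def normal_pdf(w, mean, sd):
    return np.exp(-0.5 * ((w - mean) / sd) ** 2) / (sd * np.sqrt(2.0 * np.pi))

def posterior_null(w, pi0, mu1, sd1):
    """Posterior probability that each w_j comes from the N(0, 1) null component."""
    f0 = pi0 * normal_pdf(w, 0.0, 1.0)
    f1 = (1.0 - pi0) * normal_pdf(w, mu1, sd1)
    return f0 / (f0 + f1)

def fit_two_component(w, n_iter=300):
    """EM for pi0 * N(0, 1) + (1 - pi0) * N(mu1, sd1^2); the null is not re-estimated."""
    w = np.asarray(w, dtype=float)
    pi0, mu1, sd1 = 0.9, np.quantile(w, 0.95), 1.0   # ad hoc starting values (assumption)
    for _ in range(n_iter):
        tau0 = posterior_null(w, pi0, mu1, sd1)       # E-step
        tau1 = 1.0 - tau0
        pi0 = tau0.mean()                             # M-step
        mu1 = np.sum(tau1 * w) / np.sum(tau1)
        sd1 = max(np.sqrt(np.sum(tau1 * (w - mu1) ** 2) / np.sum(tau1)), 1e-3)
    return pi0, mu1, sd1, posterior_null(w, pi0, mu1, sd1)

# Simulated statistics: 1800 null genes and 200 DE genes shifted to mean 3.
rng = np.random.default_rng(1)
w = np.concatenate([rng.normal(0.0, 1.0, 1800), rng.normal(3.0, 1.0, 200)])
pi0, mu1, sd1, tau0 = fit_two_component(w)
print(pi0, mu1, sd1)
```

The returned `tau0` values play the role of the estimated posterior probabilities of step 3: genes with a small value are the candidates for being declared DE.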
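54If we conclude that gene j is differentially expressed if
$$\hat{\tau}_0(w_j) \le c_0,$$
then this decision minimizes the (estimated) Bayes risk
$$\text{Risk} = (1 - c_0)\,\hat{\pi}_0\, e_{01} + c_0\,\hat{\pi}_1\, e_{10},$$
where $e_{01}$ is the probability of a false positive and $e_{10}$ is the probability of a false negative.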
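55Estimated FDR
$$\widehat{FDR} = \frac{\sum_{j=1}^{N} \hat{\tau}_0(w_j)\, I_{[0, c_0]}(\hat{\tau}_0(w_j))}{N_r},$$
where $N_r = \sum_{j=1}^{N} I_{[0, c_0]}(\hat{\tau}_0(w_j))$ is the number of selected genes and $I_A(x)$ is the indicator function, equal to 1 if $x \in A$ and 0 otherwise.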
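Given estimated posterior probabilities of non-differential expression, one natural FDR estimate for the rule "select gene j if its posterior probability is at most c0" is the average of those probabilities over the selected genes. The sketch below assumes that form; the function name and the example values are hypothetical.

```python
import numpy as np

def estimated_fdr(tau0, c0):
    """Average posterior null probability over the genes with tau0 <= c0.

    Returns (estimated FDR, number of selected genes)."""
    tau0 = np.asarray(tau0, dtype=float)
    selected = tau0 <= c0
    n_selected = int(selected.sum())
    if n_selected == 0:
        return 0.0, 0
    return float(tau0[selected].sum() / n_selected), n_selected

tau0_example = [0.01, 0.05, 0.2, 0.6, 0.95]
fdr_strict, n_strict = estimated_fdr(tau0_example, c0=0.1)
fdr_loose, n_loose = estimated_fdr(tau0_example, c0=0.5)
print(fdr_strict, n_strict)
print(fdr_loose, n_loose)
```

Raising c0 selects more genes at the price of a higher estimated FDR, and the estimate can never exceed c0 because every selected gene has a posterior probability of at most c0.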
56Hedenfalk Breast Cancer Data
Hedenfalk et al. (2001) used cDNA arrays to
obtain gene expression profiles of tumours from
carriers of either the BRCA1 or BRCA2 mutation
(hereditary breast cancers), as well as sporadic
breast cancer. We consider their data set of M = 15 patients, comprising two patient groups, BRCA1 (7) versus BRCA2 mutation-positive (8), with N = 3,226 genes.
The problem is to find genes that are
differentially expressed between the BRCA1 and
BRCA2 patients.
Hedenfalk et al. (2001) NEJM, 344, 539-547
57Two component model for the Breast Cancer Data
Fit to the N values of $w_j$ (based on pooled two-sample t-statistic)
jth gene is taken to be differentially expressed if $\hat{\tau}_0(w_j) \le c_0$
58Estimated FDR for various levels of c0
59 Significant Genes (Hedenfalk Breast Cancer Data)
- 175 genes selected as significant by our method
- 137 of these over-expressed in BRCA-1 relative to BRCA-2, including MSH2 (DNA repair), PDCD5 (apoptosis)
Compare Storey and Tibshirani (160 genes) and Hedenfalk (176 genes): gives 23 genes unique to our set.
Storey and Tibshirani (2003) PNAS, 100, 9440-9445
60 Uniquely Identified Genes
61 SAM (v. 2) Method for finding DE genes
- 210 genes selected as significant with an FDR of 5%
- Compare to our 174 genes: 152 common genes
- Compare to 160 (Storey and Tibshirani): 132 common
SAM method of Tusher et al. (2001) PNAS, 98, 5116-5121
62Conclusions
- Mixture-model based approach to finding DE genes can yield new information
- Gives a measure of the posterior probability that a gene is not DE (i.e. a local FDR rather than global)
- Can be used in the spirit of the q-value, to bound the FDR at a desired level
63Outline of Talk
- Introduction
- The Microarray Experiment
- The Problem of Detecting Differentially Expressed Genes
- Cluster Analysis: Clustering Tissues and Genes
64Gene Expression Data represented as N x M Matrix
Columns: Sample 1, Sample 2, ..., Sample M
Rows: Gene 1, Gene 2, ..., Gene N
N rows correspond to the N genes. M columns correspond to the M samples (microarray experiments).
Expression Signature (column), Expression Profile (row)
65Two Clustering Problems
- Clustering of genes on basis of tissues: genes not independent (n = N, p = M)
- Clustering of tissues on basis of genes: latter is a nonstandard problem in cluster analysis (n = M, p = N, so n << p)
66UNSUPERVISED CLASSIFICATION (CLUSTER ANALYSIS)
INFER CLASS LABELS z1, ..., zn of y1, ..., yn
Initially, hierarchical distance-based methods of cluster analysis were used to cluster the tissues and the genes: Eisen, Spellman, Brown, Botstein (1998, PNAS)
67The notion of a cluster is not easy to
define. There is a very large literature devoted
to clustering when there is a metric known in
advance e.g. k-means. Usually, there is no a
priori metric (or equivalently a user-defined
distance matrix) for a cluster analysis. That
is, the difficulty is that the shape of the
clusters is not known until the clusters have
been identified, and the clusters cannot be
effectively identified unless the shapes are
known.
68In this case, one attractive feature of adopting
mixture models with elliptically symmetric
components such as the normal or t densities, is
that the implied clustering is invariant under
affine transformations of the data (that is,
under operations relating to changes in location,
scale, and rotation of the data). Thus the
clustering process does not depend on irrelevant
factors such as the units of measurement or the
orientation of the clusters in space.
69MIXTURE OF g NORMAL COMPONENTS
$$f(y) = \sum_{i=1}^{g} \pi_i\, \phi(y; \mu_i, \Sigma_i)$$
70PROVIDES A MODEL-BASED APPROACH TO CLUSTERING
McLachlan, Bean, and Peel, 2002, A Mixture Model-Based Approach to the Clustering of Microarray Expression Data, Bioinformatics 18, 413-422
http://www.bioinformatics.oupjournals.org/cgi/screenpdf/18/3/413.pdf
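To illustrate how a normal mixture yields a clustering, the sketch below fits g normal components by EM and assigns each observation to the component with the highest posterior probability. It is a toy illustration and not EMMIX-GENE: it assumes low-dimensional data with many more observations than variables, unrestricted covariance matrices, a crude starting partition along the first coordinate, and a small ridge on each covariance. All names are hypothetical.

```python
import numpy as np

def mvn_pdf(y, mean, cov):
    """Multivariate normal density evaluated at each row of y."""
    d = y.shape[1]
    diff = y - mean
    quad = np.einsum("ij,jk,ik->i", diff, np.linalg.inv(cov), diff)
    return np.exp(-0.5 * quad) / np.sqrt((2.0 * np.pi) ** d * np.linalg.det(cov))

def e_step(y, props, means, covs):
    """Posterior probabilities of component membership for each observation."""
    dens = np.column_stack(
        [props[i] * mvn_pdf(y, means[i], covs[i]) for i in range(len(props))]
    )
    return dens / dens.sum(axis=1, keepdims=True)

def fit_normal_mixture(y, g, n_iter=100):
    n, d = y.shape
    # Crude start (assumption): split observations into g chunks along coordinate 0.
    chunks = np.array_split(np.argsort(y[:, 0]), g)
    means = np.array([y[c].mean(axis=0) for c in chunks])
    covs = np.array([np.cov(y.T) for _ in range(g)])
    props = np.full(g, 1.0 / g)
    for _ in range(n_iter):
        tau = e_step(y, props, means, covs)
        nk = tau.sum(axis=0)
        props = nk / n
        means = (tau.T @ y) / nk[:, None]
        for i in range(g):
            diff = y - means[i]
            # Small ridge keeps each covariance invertible (assumption).
            covs[i] = (tau[:, i, None] * diff).T @ diff / nk[i] + 1e-6 * np.eye(d)
    tau = e_step(y, props, means, covs)
    return tau.argmax(axis=1), props, means, covs

# Two well-separated simulated groups of 50 observations in 2 dimensions.
rng = np.random.default_rng(0)
y = np.vstack([rng.normal(0.0, 1.0, size=(50, 2)),
               rng.normal(6.0, 1.0, size=(50, 2))])
labels, props, means, covs = fit_normal_mixture(y, g=2)
print(props)
print(means)
```

When clustering tissues on the basis of thousands of genes (n << p), unrestricted covariance matrices like these cannot be estimated directly, so some reduction in the number of genes or in the form of the covariance matrices is needed first.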
72Example: Microarray Data
Colon Data of Alon et al. (1999)
M = 62 (40 tumours, 22 normals) tissue samples of N = 2,000 genes in a 2,000 × 62 matrix.
76Clustering of COLON Data Genes using EMMIX-GENE
77Grouping for Colon Data
80Clustering of COLON Data Tissues using EMMIX-GENE
81Grouping for Colon Data
82"Microarray to be used as routine clinical screen" by C. M. Schubert, Nature Medicine 9, 9, 2003.
The Netherlands Cancer Institute in Amsterdam is
to become the first institution in the world to
use microarray techniques for the routine
prognostic screening of cancer patients. Aiming
for a June 2003 start date, the center will use a
panoply of 70 genes to assess the tumor profile
of breast cancer patients and to determine which
women will receive adjuvant treatment after
surgery.
83 van 't Veer and De Jong (2002, Nature Medicine 8): The microarray way to tailored cancer treatment
84Heat Map Displaying the Reduced Set of 4,869
Genes on the 98 Breast Cancer Tumours
85Heat Map of Top 1867 Genes
91where i = group number, $m_i$ = number in group i, $U_i = -2 \log \lambda_i$
92Heat Map of Genes in Group G1
93Heat Map of Genes in Group G2
94Heat Map of Genes in Group G3
96Use of Microarray Data via Model-Based Classification in the Study and Prediction of Survival from Lung Cancer
Liat Ben-Tovim Jones, Shu-Kay Ng, Christophe Ambroise, Katrina Monico, Nazim Khan and Geoff McLachlan
The 4th Critical Assessment of Microarray Data Analysis Conference (CAMDA03)
97Institute for Molecular Bioscience, University
of Queensland