An Introduction to the Statistical Analysis of Microarray Gene Expression Data - PowerPoint PPT Presentation

1
An Introduction to the Statistical Analysis of
Microarray Gene Expression Data
Geoff McLachlan and Liat Ben-Tovim Jones
Department of Mathematics, Institute for Molecular Bioscience
University of Queensland
Bioinformatics Symposium, HELP 2005
2
Bioinformaticians
Richard Bean: PhD Discrete Maths, UQ
Liat Ben-Tovim Jones: PhD Biochemistry, MD Medicine, Cambridge University
3
Outline of Talk
  • Introduction
  • The Microarray Experiment
  • The Problem of Detecting Differentially
    Expressed Genes
  • Cluster Analysis: Clustering Tissues and Genes

4
Although the definition of bioinformatics as a
discipline is still emerging, it is undoubtedly
clear that the information sciences
(statistics, mathematics, IT) are a major part
of it.
5
A microarray is a new technology which allows the
measurement of the expression levels of thousands
of genes simultaneously. Two developments made this possible:
  • (1) Sequencing of the genome (human, mouse, and
    others)
  • (2) Improvement in technology to generate
    high-density arrays on chips (glass slides or nylon
    membrane)

The entire genome of an organism can be probed at
a single point in time.
6
Draft of the Human Genome
Public Sequence: Nature, Feb. 2001
Celera Sequence: Science, Feb. 2001
7
High-Throughput "omics" Experiments
  • Genomics: study of DNA (MICROARRAYS)
  • Proteomics: study of Proteins
  • Metabolomics: study of Metabolites
  • Experiments on a Genome-Wide scale!

8
Microarrays are difficult to analyze
Microarrays present new problems for statistics
because the data are very high dimensional with
very little replication.
The challenge is to extract useful information
and discover knowledge from the data, such as
gene functions, gene interactions, regulatory
pathways, metabolic pathways etc.
9
Outline of Talk
  • Introduction
  • The Microarray Experiment
  • The Problem of Detecting Differentially
    Expressed Genes
  • Cluster Analysis: Clustering Tissues and Genes

10
The Central Dogma of Molecular Biology
Every cell contains the same DNA.
The activity of a gene (expression) can be
determined by the presence of its complementary
mRNA.
Cells differ in the DNA (gene) which is active at
any one time.
Genes code for proteins through the intermediary
of mRNA.
11
Gene Expression Studies
  • Pattern of genes expressed in a cell is
    characteristic of its current state
  • Virtually all differences in cell state or type
    are correlated with changes in mRNA levels of
    many genes
  • DNA microarray technology originally conceived
    in order to detect expression of thousands of
    genes simultaneously

12
The Microarray Experiment
  • DNA complementary to the genes of interest is
    generated and laid out in microscopic quantities
    on solid surfaces at defined positions
  • DNA (mRNA) from samples is eluted over the
    surface; complementary DNA binds
  • Presence of bound DNA is detected by
    fluorescence following laser excitation

13
Steps in the Microarray Experiment
  • mRNA is extracted from the cell
  • mRNA is reverse transcribed to cDNA (mRNA itself
    is unstable)
  • cDNA is labeled with fluorescent dye (the TARGET)
  • The sample is hybridized to known DNA sequences
    on the array (tens of thousands of genes; the PROBE)
  • If present, complementary target binds to probe
    DNA (complementary base pairing)
  • Target bound to probe DNA fluoresces

14
Spotted cDNA Microarray
Compare the gene expression levels for two cell
populations on a single microarray.
15
(No Transcript)
16
Microarray Image
Red: High expression in target labelled with cyanine 5 dye
Green: High expression in target labelled with cyanine 3 dye
Yellow: Similar expression in both target samples
17
Assumptions
(1) Gene Expression -> mRNA:
cellular mRNA levels directly reflect gene
expression
(2) mRNA -> Fluorescence Intensity:
intensity of bound target is a measure of the
abundance of the mRNA in the sample.
18
Experimental Error
Sample contamination
Poor quality/insufficient mRNA
Reverse transcription bias
Fluorescent labeling bias
Hybridization bias
Cross-linking of DNA (double strands)
Poor probe design (cross-hybridization)
Defective chips (scratches, degradation)
Background from non-specific hybridization
19
The Microarray Technologies
Spotted Microarray (cDNA array)
  • Use robot to spot glass slides at precise points
    with gene/EST sequences (homemade)
  • Gives relative gene expressions
Affymetrix GeneChip
  • Uses photolithography to synthesize short
    oligonucleotides onto glass wafer
  • Gives absolute gene expressions
Each with its own advantages and disadvantages
20
Aims of a Microarray Experiment
  • observe changes in a gene in response to
    external stimuli (cell samples exposed to hormones,
    drugs, toxins)
  • compare gene expressions between different
    tissue types (tumour vs normal cell samples)
  • To gain understanding of
      • function of unknown genes
      • disease process at the molecular level
  • Ultimately to use as tools in Clinical Medicine
    for diagnosis, prognosis and therapeutic management.

21
Importance of Experimental Design
  • Good DNA microarray experiments should have
    clear objectives.
  • They should not be performed as aimless data mining
    in search of unanticipated patterns that will provide
    answers to unasked questions
    (Richard Simon, BioTechniques 34:S16-S21, 2003)

22
Extracting Data from the Microarray
  • Cleaning
  • Image processing
  • Filtering
  • Missing value estimation
  • Normalization: remove sources of systematic variation.

23
Gene Expressions from Measured Intensities
Spotted Microarray
$\log_2(\text{Intensity Cy5} / \text{Intensity Cy3})$
Affymetrix
(Perfect Match Intensity - Mismatch Intensity)
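The two intensity summaries on this slide can be sketched in a few lines of Python. This is only an illustration: the function names and the intensity values are hypothetical, and real pipelines apply background correction and normalization first.

```python
import math

def log_ratio(cy5, cy3):
    """Spotted array: log2(Cy5 / Cy3) for one spot (both intensities must be positive)."""
    if cy5 <= 0 or cy3 <= 0:
        raise ValueError("intensities must be positive")
    return math.log2(cy5 / cy3)

def pm_minus_mm(pm, mm):
    """Affymetrix-style probe-pair signal: perfect match minus mismatch intensity."""
    return pm - mm

# Hypothetical intensities, for illustration only.
print(log_ratio(2000.0, 500.0))    # red channel four times the green channel
print(log_ratio(500.0, 500.0))     # equal channels
print(pm_minus_mm(1200.0, 300.0))  # one probe pair
```

A log ratio of 0 corresponds to a yellow spot (similar expression in both samples); positive values lean red and negative values lean green.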
24
Data Transformation
Rocke and Durbin (2001), Munson (2001), Durbin et
al. (2002), and Huber et al. (2002)
25
(No Transcript)
26
Gene Expression Data represented as N x M Matrix
Columns: Sample 1, Sample 2, ..., Sample M
Rows: Gene 1, Gene 2, ..., Gene N
N rows correspond to the N genes. M columns
correspond to the M samples (microarray
experiments).
Expression Signature: a column (all N genes in one sample)
Expression Profile: a row (one gene across the M samples)
27
Representation of Data from M Microarray
Experiments
Columns: Sample 1, Sample 2, ..., Sample M
Rows: Gene 1, Gene 2, ..., Gene N
Assume we have extracted gene expression values
from intensities.
Gene expressions can be shown as Heat Maps
Expression Signature: a column (all N genes in one sample)
Expression Profile: a row (one gene across the M samples)
28
(No Transcript)
29
(No Transcript)
30
Large-scale gene expression studies are not a
passing fashion, but are instead one aspect of
new work of biological experimentation, one
involving large-scale, high throughput assays.
Speed et al., 2002, Statistical Analysis of Gene
Expression Microarray Data, Chapman and Hall/CRC
31
Growth of microarray and microarray methodology
literature listed in PubMed from 1995 to 2003.
The category "all microarray papers" includes
those found by searching PubMed for "microarray"
OR "gene expression profiling". The category
"statistical microarray papers" includes those
found by searching PubMed for "statistical
method" OR "statistical techniq" OR
"statistical approach" AND "microarray" OR "gene
expression profiling".
32
Mehta et al. (Nature Genetics, Sept. 2004)
The field of expression data analysis is
particularly active with novel analysis
strategies and tools being published weekly, and
the value of many of these methods is
questionable. Some results produced by using
these methods are so anomalous that a breed of
forensic statisticians (Ambroise and McLachlan,
2002; Baggerly et al., 2003), who doggedly detect
and correct other HDB (high-dimensional biology)
investigators' prominent mistakes, has been
created.
33
"Selection bias in gene extraction on the basis
of microarray gene-expression data"
Ambroise and McLachlan, Proceedings of the
National Academy of Sciences, Vol. 99, Issue 10,
6562-6566, May 14, 2002
http://www.pnas.org/cgi/content/full/99/10/6562
34
Selection Bias: bias that occurs
when a subset of the variables is selected
(dimension reduction) in some optimal way, and
then the predictive capability of this subset is
assessed in the usual way, i.e. using an ordinary
measure for a set of variables, as if the subset
had been fixed in advance.
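A small simulation can make the selection-bias definition concrete. The sketch below is not the analysis of Ambroise and McLachlan; it assumes pure-noise data (no gene is truly associated with class), a simple mean-difference gene ranking, a nearest-centroid classifier, and leave-one-out cross-validation, all chosen here for brevity.

```python
import random

def top_genes(X, y, samples, k):
    """Return the k genes with the largest absolute difference in class means,
    computed only from the listed samples."""
    scored = []
    for g, row in enumerate(X):
        a = [row[i] for i in samples if y[i] == 0]
        b = [row[i] for i in samples if y[i] == 1]
        scored.append((abs(sum(a) / len(a) - sum(b) / len(b)), g))
    scored.sort(reverse=True)
    return [g for _, g in scored[:k]]

def predict(X, y, train, test, genes):
    """Nearest-centroid rule on the chosen genes, trained without the test sample."""
    dist = []
    for c in (0, 1):
        members = [i for i in train if y[i] == c]
        d = 0.0
        for g in genes:
            centroid = sum(X[g][i] for i in members) / len(members)
            d += (X[g][test] - centroid) ** 2
        dist.append(d)
    return 0 if dist[0] <= dist[1] else 1

def loocv_error(X, y, k, select_inside):
    """Leave-one-out error; gene selection is redone per fold only if select_inside."""
    samples = list(range(len(y)))
    genes_all = top_genes(X, y, samples, k)  # this selection has seen every sample
    wrong = 0
    for t in samples:
        train = [i for i in samples if i != t]
        genes = top_genes(X, y, train, k) if select_inside else genes_all
        wrong += predict(X, y, train, t, genes) != y[t]
    return wrong / len(y)

# Pure noise: N genes by M samples, labels unrelated to the data.
rng = random.Random(1)
N, M = 500, 20
y = [0] * 10 + [1] * 10
X = [[rng.gauss(0.0, 1.0) for _ in range(M)] for _ in range(N)]

biased = loocv_error(X, y, k=10, select_inside=False)
honest = loocv_error(X, y, k=10, select_inside=True)
print("selection outside CV:", biased)
print("selection inside CV: ", honest)
```

Because the labels carry no information, an honest error estimate should sit near 50%; the estimate that selects genes before cross-validating is typically far lower, which is the bias the slide describes.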
35
Nature Reviews Cancer, Feb. 2005
36
Supervised Classification (Two Classes)
Columns: Sample 1, ..., Sample n
Rows: Gene 1, ..., Gene p
Class 1 (good prognosis)
Class 2 (poor prognosis)
37
Supervised Classification of Tissue Samples
38
Outline of Talk
  • Introduction
  • The Microarray Experiment
  • The Problem of Detecting Differentially
    Expressed Genes
  • Cluster Analysis: Clustering Tissues and Genes

39
(No Transcript)
40
Class 1
Class 2
41
Fold Change is the Simplest Method
Calculate the log ratio between the two classes and consider
all genes that differ by more than an arbitrary
cutoff value to be differentially expressed. A
two-fold difference is often chosen.
Fold change is not a statistical test.
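As a rough sketch of the fold-change rule, assume each gene's values are already on the log2 scale, so that the log ratio between classes is the difference of class means. The function names and the toy values below are hypothetical.

```python
import math

def log2_fold_change(class1, class2):
    """Difference in mean log2 expression between the two classes for one gene."""
    return sum(class1) / len(class1) - sum(class2) / len(class2)

def flag_by_fold_change(expr1, expr2, fold=2.0):
    """Flag genes whose absolute log2 fold change exceeds log2(fold)."""
    cutoff = math.log2(fold)
    return [abs(log2_fold_change(a, b)) > cutoff for a, b in zip(expr1, expr2)]

# Two hypothetical genes measured in two classes (log2 scale).
class1 = [[8.0, 8.5, 9.0], [7.0, 7.2]]
class2 = [[6.0, 6.5, 7.0], [7.1, 7.3]]
flags = flag_by_fold_change(class1, class2)
print(flags)
```

The rule ignores within-class variability entirely, which is why the slide stresses that it is not a statistical test.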
42
Test of a Single Hypothesis
For gene j, let $H_j$ denote the null
hypothesis of no association between its
expression level and its class membership,
where $j = 1, \ldots, N$.
$H_j = 0$: Null hypothesis for the jth gene
holds.
$H_j = 1$: Null hypothesis for the jth gene
does not hold.
43
Gene Statistics: Two-Sample t-Statistic
Student's t-statistic
Pooled form of the Student's t-statistic, assumed
common variance in the two classes
Modified t-statistic of Tusher et al. (2001)
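The three statistics named on this slide can be sketched as follows. The formulas did not survive in the transcript, so this uses the standard textbook forms: the unequal-variance two-sample t, the pooled-variance t, and a SAM-style statistic that adds a constant `s0` to the pooled standard error. Here `s0` is simply passed in as an assumed parameter rather than estimated from the data.

```python
import math
from statistics import mean, variance

def two_sample_t(x, y):
    """Two-sample t-statistic with separate variance estimates for each class."""
    se = math.sqrt(variance(x) / len(x) + variance(y) / len(y))
    return (mean(x) - mean(y)) / se

def pooled_t(x, y, s0=0.0):
    """Pooled-variance t-statistic; s0 > 0 gives a SAM-style modified statistic."""
    n1, n2 = len(x), len(y)
    sp2 = ((n1 - 1) * variance(x) + (n2 - 1) * variance(y)) / (n1 + n2 - 2)
    se = math.sqrt(sp2 * (1.0 / n1 + 1.0 / n2))
    return (mean(x) - mean(y)) / (se + s0)

# Hypothetical expression values for one gene in two classes.
x = [1.0, 2.0, 3.0]
y = [4.0, 5.0, 6.0]
print(two_sample_t(x, y))
print(pooled_t(x, y))
print(pooled_t(x, y, s0=1.0))  # shrunk towards zero by the added constant
```

With equal class sizes the first two statistics coincide; the added constant matters most for genes with very small variance, whose ordinary t-statistics would otherwise be inflated.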
44
Multiplicity Problem
Perform a test for each gene to determine the
statistical significance of differential
expression for that gene.
Problem: When many hypotheses are tested, the
probability of at least one type I error (false positive)
increases sharply with the number of hypotheses.
Further: Genes are co-regulated, so
there is correlation between the test statistics.
45
Methods for dealing with the Multiplicity Problem
  • The Bonferroni Method
      • controls the family-wise error rate (FWER), i.e.
        the probability that at least one false positive
        error will be made
      • Too strict for gene expression data: it tries to
        make it unlikely that even one false rejection of
        the null is made, which may lead to missed findings
  • The False Discovery Rate (FDR)
      • emphasizes the proportion of false positives
        among the identified differentially expressed
        genes.
      • Good for gene expression data: it says something
        about the chosen genes

46
False Discovery Rate: Benjamini and Hochberg
(1995)
The FDR is essentially the expectation of the
proportion of false positives among the
identified differentially expressed genes.
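To contrast the two approaches to multiplicity, here is a minimal sketch of the Bonferroni rule and the Benjamini-Hochberg step-up procedure applied to a list of P-values. The P-values and the level `alpha = 0.05` are made up for illustration.

```python
def bonferroni(pvals, alpha=0.05):
    """Reject H_j when p_j <= alpha / N; controls the family-wise error rate."""
    n = len(pvals)
    return [p <= alpha / n for p in pvals]

def benjamini_hochberg(pvals, alpha=0.05):
    """Step-up procedure: find the largest rank k with p_(k) <= k * alpha / N
    and reject the k smallest P-values; controls the FDR at level alpha."""
    n = len(pvals)
    order = sorted(range(n), key=lambda i: pvals[i])
    k = 0
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= rank * alpha / n:
            k = rank
    reject = [False] * n
    for i in order[:k]:
        reject[i] = True
    return reject

# Hypothetical P-values for ten genes.
pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
print(sum(bonferroni(pvals)), "rejected by Bonferroni")
print(sum(benjamini_hochberg(pvals)), "rejected by Benjamini-Hochberg")
```

On the same P-values the FDR procedure rejects at least as many hypotheses as Bonferroni, which is the sense in which Bonferroni is described as too strict for gene expression data.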
47
Null Distribution of the Test Statistic
Permutation Method
The null distribution has a resolution on the
order of the number of permutations. If we
perform B permutations, then the P-value will be
estimated with a resolution of 1/B. If we
assume that each gene has the same null
distribution and combine the permutations, then
the resolution will be 1/(NB) for the pooled
null distribution.
48
Using just the B permutations of the class labels
for the gene-specific statistic $T_j$, the P-value
for $T_j = t_j$ is assessed as
$p_j = \#\{b : |t_{0j}^{(b)}| \ge |t_j|\} / B$,
where $t_{0j}^{(b)}$ is the null version of $t_j$ after the
bth permutation of the class labels.
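A gene-specific permutation P-value of this kind can be sketched as below. The sketch assumes a two-sided comparison of absolute values, the pooled t-statistic as the gene statistic, and random (rather than exhaustively enumerated) permutations; the data values are hypothetical.

```python
import math
import random
from statistics import mean, variance

def pooled_t(x, y):
    """Pooled-variance two-sample t-statistic."""
    n1, n2 = len(x), len(y)
    sp2 = ((n1 - 1) * variance(x) + (n2 - 1) * variance(y)) / (n1 + n2 - 2)
    return (mean(x) - mean(y)) / math.sqrt(sp2 * (1.0 / n1 + 1.0 / n2))

def permutation_pvalue(values, labels, B=1000, seed=0):
    """Fraction of B label permutations whose |t| is at least the observed |t|."""
    rng = random.Random(seed)

    def stat(lab):
        x = [v for v, l in zip(values, lab) if l == 1]
        y = [v for v, l in zip(values, lab) if l == 2]
        return pooled_t(x, y)

    t_obs = abs(stat(labels))
    perm = list(labels)
    count = 0
    for _ in range(B):
        rng.shuffle(perm)  # class sizes are preserved; only the assignment changes
        if abs(stat(perm)) >= t_obs:
            count += 1
    return count / B

# One hypothetical gene: six samples in class 1, six in class 2.
values = [5.1, 4.9, 5.3, 5.0, 5.2, 5.4, 1.0, 1.2, 0.9, 1.1, 0.8, 1.3]
labels = [1] * 6 + [2] * 6
p = permutation_pvalue(values, labels, B=1000)
print(p)
```

The returned value is always a multiple of 1/B, which is the resolution limit discussed on the previous slide.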
49
As there are only 10 distinct permutations here,
the null distribution based on these permutations
is too granular. Hence consideration is given
to permuting the labels of each of the other
genes and estimating the null distribution of a
gene based on the pooled permutations so
obtained. But there is a problem with this
method in that the null values of the test
statistic for each gene do not necessarily
have the theoretical null distribution that we
are trying to estimate.
50
If we pool over all N genes, then
$p_j = \sum_{i=1}^{N} \#\{b : |t_{0i}^{(b)}| \ge |t_j|\} / (NB)$
51
Two-component mixture model
$f(w_j) = \pi_0 f_0(w_j) + \pi_1 f_1(w_j)$
$\pi_0$ is the proportion of genes that are not
differentially expressed, and
$\pi_1 = 1 - \pi_0$ is the proportion that are. Efron et al. (2001)
52
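Two-component mixture model
$f(w_j) = \pi_0 f_0(w_j) + \pi_1 f_1(w_j)$
$\pi_0$ is the proportion of genes that are not
differentially expressed, and
$\pi_1 = 1 - \pi_0$ is the proportion that are.
Then
$\tau_0(w_j) = \pi_0 f_0(w_j) / f(w_j)$
is the posterior probability that gene j is not
differentially expressed.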
53
Procedure
1. Form a statistic $w_j$ for each gene. A large
positive value of $w_j$ corresponds to a gene that
is DE.

2. Fit to $w_1, \ldots, w_N$ a mixture of two normal
densities with the 1st component as a standard
normal (genes that are not DE). Assume that the $w_j$
have been transformed so that they are approx.
normally distributed, e.g. for the ANOVA statistic
F (Broet et al., 2004).
3. Let $\hat\tau_0(w_j)$ denote the (estimated) posterior
probability that gene j belongs to the first
component of the mixture.
54
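If we conclude that gene j is differentially
expressed if
$\hat\tau_0(w_j) \le c_0$,
then this decision minimizes the (estimated)
Bayes risk
$\text{Risk} = (1 - c_0)\,\hat\pi_0\, e_{01} + c_0\,\hat\pi_1\, e_{10}$,
where $e_{01}$ is the probability of a false positive
and $e_{10}$ is the probability of a false negative.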
55
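Estimated FDR
$\widehat{FDR} = \sum_{j=1}^{N} \hat\tau_0(w_j)\, I_{[0,c_0]}(\hat\tau_0(w_j)) / N_r$,
where $N_r = \sum_{j=1}^{N} I_{[0,c_0]}(\hat\tau_0(w_j))$ is the number of
selected genes and $I_A(x)$ is the indicator function,
equal to 1 if x is in A and 0 otherwise.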
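The mixture procedure and the FDR estimate can be sketched end to end on simulated data. This is an illustrative EM implementation, not EMMIX or the authors' software: it assumes a fixed standard normal null component, a single normal alternative component, simple starting values, a fixed number of iterations, and simulated statistics with 10% DE genes.

```python
import math
import random

def norm_pdf(w, mu=0.0, sigma=1.0):
    z = (w - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def posterior_null(w, pi0, mu1, sigma1):
    """tau0(w_j): posterior probability that gene j belongs to the N(0,1) null component."""
    out = []
    for wj in w:
        a = pi0 * norm_pdf(wj)
        b = (1.0 - pi0) * norm_pdf(wj, mu1, sigma1)
        out.append(a / (a + b))
    return out

def fit_two_component(w, n_iter=300):
    """EM for pi0 * N(0,1) + (1 - pi0) * N(mu1, sigma1^2); the null component is held fixed."""
    pi0, mu1, sigma1 = 0.8, max(w) / 2.0, 1.0  # crude starting values (an assumption)
    for _ in range(n_iter):
        tau0 = posterior_null(w, pi0, mu1, sigma1)      # E-step
        r1 = [1.0 - t for t in tau0]
        s1 = sum(r1)
        pi0 = 1.0 - s1 / len(w)                          # M-step
        mu1 = sum(r * wj for r, wj in zip(r1, w)) / s1
        sigma1 = math.sqrt(sum(r * (wj - mu1) ** 2 for r, wj in zip(r1, w)) / s1)
    return pi0, mu1, sigma1, posterior_null(w, pi0, mu1, sigma1)

def estimated_fdr(tau0, c0):
    """Average of tau0 over the genes selected by tau0 <= c0, with the number selected."""
    selected = [t for t in tau0 if t <= c0]
    if not selected:
        return 0.0, 0
    return sum(selected) / len(selected), len(selected)

# Simulated statistics: 1800 null genes and 200 DE genes shifted upwards.
rng = random.Random(0)
w = [rng.gauss(0.0, 1.0) for _ in range(1800)] + [rng.gauss(3.0, 1.0) for _ in range(200)]

pi0, mu1, sigma1, tau0 = fit_two_component(w)
fdr, n_selected = estimated_fdr(tau0, c0=0.1)
print("pi0 estimate:", round(pi0, 3))
print("genes selected:", n_selected, "estimated FDR:", round(fdr, 3))
```

Since every selected gene has a posterior null probability of at most c0, the estimated FDR of the selected set can never exceed c0; lowering c0 trades fewer selected genes for a smaller estimated FDR.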
56
Hedenfalk Breast Cancer Data
Hedenfalk et al. (2001) used cDNA arrays to
obtain gene expression profiles of tumours from
carriers of either the BRCA1 or BRCA2 mutation
(hereditary breast cancers), as well as sporadic
breast cancer. We consider their data set of M =
15 patients, comprising two patient groups:
BRCA1 (7) versus BRCA2 mutation-positive (8),
with N = 3,226 genes.
The problem is to find genes that are
differentially expressed between the BRCA1 and
BRCA2 patients.
Hedenfalk et al. (2001) NEJM, 344, 539-547
57
Two-component model for the Breast Cancer Data
Fit to the N values of $w_j$ (based on pooled
two-sample t-statistic)
The jth gene is taken to be differentially expressed
if $\hat\tau_0(w_j) \le c_0$
58
Estimated FDR for various levels of c0
59
Significant Genes (Hedenfalk Breast Cancer Data)
  • 175 genes selected as significant by our method
  • 137 of these over-expressed in BRCA-1 relative to
    BRCA-2, including MSH2 (DNA repair),
    PDCD5 (apoptosis)

Comparison with Storey and Tibshirani (160 genes) and
Hedenfalk (176 genes) gives 23 genes unique to
our set.
Storey and Tibshirani (2003) PNAS, 100, 9440-9445
60
Uniquely Identified Genes
61
SAM (v. 2) Method for finding DE genes
  • 210 genes selected as significant with an FDR of
    5%
  • Compare to our 174 genes, 152 common genes
  • Compare to 160 (Storey and Tibshirani), 132
    common

SAM method of Tusher et al. (2001) PNAS, 98,
5116-5121
62
Conclusions
  • Mixture-model based approach to finding DE genes
    can yield new information
  • Gives a measure of the posterior probability
    that a gene is not DE (i.e. a local FDR rather
    than global)
  • Can be used in the spirit of the q-value, to
    bound the FDR at a desired level

63
Outline of Talk
  • Introduction
  • The Microarray Experiment
  • The Problem of Detecting Differentially Expressed
    Genes
  • Cluster Analysis: Clustering Tissues and Genes

64
Gene Expression Data represented as N x M Matrix
Columns: Sample 1, Sample 2, ..., Sample M
Rows: Gene 1, Gene 2, ..., Gene N
N rows correspond to the N genes. M columns
correspond to the M samples (microarray
experiments).
Expression Signature: a column (all N genes in one sample)
Expression Profile: a row (one gene across the M samples)
65
Two Clustering Problems
  • Clustering of genes on basis of tissues:
    genes not independent
    (n = N, p = M)
  • Clustering of tissues on basis of genes:
    latter is a nonstandard problem in
    cluster analysis
    (n = M, p = N, so n << p)
66
UNSUPERVISED CLASSIFICATION (CLUSTER ANALYSIS)
INFER CLASS LABELS $z_1, \ldots, z_n$ of $y_1, \ldots, y_n$
Initially, hierarchical distance-based methods of
cluster analysis were used to cluster the tissues
and the genes: Eisen, Spellman, Brown, Botstein
(1998, PNAS)
67
The notion of a cluster is not easy to
define. There is a very large literature devoted
to clustering when there is a metric known in
advance e.g. k-means. Usually, there is no a
priori metric (or equivalently a user-defined
distance matrix) for a cluster analysis. That
is, the difficulty is that the shape of the
clusters is not known until the clusters have
been identified, and the clusters cannot be
effectively identified unless the shapes are
known.
68
In this case, one attractive feature of adopting
mixture models with elliptically symmetric
components such as the normal or t densities, is
that the implied clustering is invariant under
affine transformations of the data (that is,
under operations relating to changes in location,
scale, and rotation of the data). Thus the
clustering process does not depend on irrelevant
factors such as the units of measurement or the
orientation of the clusters in space.
69
MIXTURE OF g NORMAL COMPONENTS
70
PROVIDES A MODEL-BASED APPROACH TO CLUSTERING
McLachlan, Bean, and Peel, 2002, A
Mixture Model-Based Approach to the Clustering of
Microarray Expression Data, Bioinformatics 18,
413-422
http://www.bioinformatics.oupjournals.org/cgi/screenpdf/18/3/413.pdf
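The formula for the mixture of g normal components did not survive in the transcript. As a hedged reconstruction in standard notation (the symbols $\pi_i$, $\mu_i$, $\Sigma_i$ and $\tau_i$ are the usual textbook choices and are assumed here, not copied from the slide), the density, the posterior probabilities of component membership, and the resulting cluster assignment can be written as:

```latex
\[
  f(y_j) = \sum_{i=1}^{g} \pi_i \, \phi(y_j; \mu_i, \Sigma_i),
  \qquad \pi_i \ge 0, \quad \sum_{i=1}^{g} \pi_i = 1,
\]
\[
  \tau_i(y_j) = \frac{\pi_i \, \phi(y_j; \mu_i, \Sigma_i)}{f(y_j)},
  \qquad
  \hat{z}_j = \arg\max_{i} \, \hat{\tau}_i(y_j),
\]
```

where $\phi(y; \mu, \Sigma)$ denotes the multivariate normal density with mean $\mu$ and covariance matrix $\Sigma$, and hats denote quantities evaluated at the fitted parameters. Each observation is assigned to the component for which its estimated posterior probability is largest.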
71
(No Transcript)
72
Example: Microarray Data
Colon Data of Alon et al. (1999)
M = 62 (40 tumours; 22 normals) tissue samples
of N = 2,000 genes in a 2,000 x 62 matrix.
73
(No Transcript)
74
(No Transcript)
75
(No Transcript)
76
Clustering of COLON Data Genes using EMMIX-GENE
77
Grouping for Colon Data
78
(No Transcript)
79
(No Transcript)
80
Clustering of COLON Data Tissues using EMMIX-GENE
81
Grouping for Colon Data
82
"Microarray to be used as routine clinical
screen" by C. M. Schubert, Nature Medicine 9, 9,
2003.
The Netherlands Cancer Institute in Amsterdam is
to become the first institution in the world to
use microarray techniques for the routine
prognostic screening of cancer patients. Aiming
for a June 2003 start date, the center will use a
panoply of 70 genes to assess the tumor profile
of breast cancer patients and to determine which
women will receive adjuvant treatment after
surgery.
83
van 't Veer and De Jong (2002, Nature Medicine
8): The microarray way to tailored cancer
treatment
84
Heat Map Displaying the Reduced Set of 4,869
Genes on the 98 Breast Cancer Tumours
85
Heat Map of Top 1867 Genes
86
(No Transcript)
87
(No Transcript)
88
(No Transcript)
89
(No Transcript)
90
(No Transcript)
91
where i = group number, $m_i$ = number in group
i, $U_i = -2 \log \lambda_i$
92
Heat Map of Genes in Group G1
93
Heat Map of Genes in Group G2
94
Heat Map of Genes in Group G3
95
(No Transcript)
96
Use of Microarray Data via Model-Based
Classification in the Study and Prediction of
Survival from Lung Cancer
Liat Ben-Tovim Jones, Shu-Kay Ng, Christophe Ambroise,
Katrina Monico, Nazim Khan and Geoff McLachlan
The 4th Critical Assessment of Microarray Data
Analysis Conference (CAMDA03)
97
Institute for Molecular Bioscience, University
of Queensland