Clustering of Time Course Gene-Expression Data via Mixture Regression Models

1
Clustering of Time Course Gene-Expression Data via Mixture Regression Models
Geoff McLachlan (joint with Angus Ng and Sam Wang)
Department of Mathematics & Institute for Molecular Bioscience, University of Queensland
ARC Centre of Excellence in Bioinformatics
http://www.maths.uq.edu.au/gjm
2
Institute for Molecular Bioscience, University
of Queensland
3
Time-Course Data
Time-course microarray experiments are being
increasingly used to characterize dynamic
biological processes. (Microarray technology
provides the ability to measure the
expression levels of thousands of genes at
once.) In these experiments,
gene-expression levels are measured at different
time points, possibly in different biological
conditions (e.g. treatment-control). The focus
here is on the analysis of gene-expression
profiles consisting of short time series of log
expression ratios for each of the genes
represented on the microarrays.

4
CLUSTERING OF GENE PROFILES

can provide new insight into the biological process of interest (coexpressed genes can contribute to our understanding of the regulatory network of gene expression).
can also assist in assigning functions to genes that have not yet been functionally annotated.
A secondary concern is the need for imputation of missing data.

5
The biological rationale underlying the clustering of microarray data is the fact that many coexpressed genes are coregulated. Clustering becomes a way of identifying sets of genes that are putatively coregulated, thereby generating testable hypotheses; see Boutros and Okey (2005). It assists with
the functional annotation of uncharacterised genes
the identification of transcription factor binding sites
the elucidation of complete biological pathways

6
Outline of Talk

1. Mixture model-based approach to analysis of
gene-expressions
2. Normal Mixtures
3. Modifications for high-dimensional and/or
structured data
Mixtures of linear mixed models
Clustering of gene profiles

9
Finite Mixture Models

Provide an arbitrarily accurate estimate of the underlying density with g sufficiently large.
Provide a probabilistic clustering of the data into g clusters; an outright clustering is obtained by assigning each data point to the component to which it has the greatest posterior probability of belonging.

10
Definition

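We let $Y_1, \ldots, Y_n$ denote a random sample of size n, where $Y_j$ is a p-dimensional random vector with probability density function
$$f(y_j) = \sum_{i=1}^{g} \pi_i f_i(y_j),$$
where the $f_i(y_j)$ are densities and the $\pi_i$ are nonnegative quantities that sum to one.

By Bayes' Theorem,
$$\tau_i(y_j; \Psi^{(k)}) = \frac{\pi_i^{(k)} f_i(y_j)}{\sum_{h=1}^{g} \pi_h^{(k)} f_h(y_j)}$$
for $i = 1, \ldots, g$; $j = 1, \ldots, n$, where $\Psi^{(k)}$ denotes the value of the parameter vector $\Psi$ at the kth iteration and the component densities are evaluated at that value.

The quantity $\tau_i(y_j; \Psi^{(k)})$ is the posterior probability that the jth member of the sample with observed value $y_j$ belongs to the ith component of the mixture.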

12
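A soft (probabilistic) clustering is given in terms of the estimated posterior probabilities of component membership.
A hard (outright) clustering is given by assigning each $y_j$ to the component to which it has the highest estimated posterior probability of belonging; that is, given by the $\hat{z}_{ij}$, where
$$\hat{z}_{ij} = \begin{cases} 1 & \text{if } i = \arg\max_h \hat{\tau}_h(y_j), \\ 0 & \text{otherwise,} \end{cases}$$
and $\hat{\tau}_h(y_j)$ denotes the estimated posterior probability that $y_j$ belongs to the hth component.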
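
The soft and hard clusterings described above can be sketched in a few lines. This is a minimal illustration, assuming univariate normal components (p = 1) and made-up parameter values; the function names are hypothetical and are not taken from EMMIX.

```python
import math

def normal_pdf(y, mean, var):
    """Univariate normal density (p = 1 keeps the sketch short)."""
    return math.exp(-0.5 * (y - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def posterior_probs(y, props, means, variances):
    """tau_i(y) = pi_i f_i(y) / sum_h pi_h f_h(y), by Bayes' theorem."""
    weighted = [p * normal_pdf(y, m, v) for p, m, v in zip(props, means, variances)]
    total = sum(weighted)
    return [w / total for w in weighted]

def hard_assign(tau):
    """Index of the component with the highest posterior probability."""
    return max(range(len(tau)), key=lambda i: tau[i])

# Hypothetical two-component mixture and three observations.
props, means, variances = [0.5, 0.5], [0.0, 4.0], [1.0, 1.0]
for y in (-0.3, 2.0, 4.4):
    tau = posterior_probs(y, props, means, variances)
    print(y, [round(t, 3) for t in tau], hard_assign(tau))
```

The observation at 2.0 lies midway between the two means, so its posterior probabilities are equal; the hard clustering then hides an uncertainty that the soft clustering makes visible.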
13
Multivariate Mixture Models

Day (Biometrika, 1969)
Wolfe (NORMIX, 1965, 1967, 1970)
It was the publication of the seminal paper
of Dempster, Laird, and Rubin (1977) on the EM
algorithm that greatly stimulated interest in the
use of finite mixture distributions to model
heterogeneous data.

14
Ganesalingam and McLachlan (Biometrika, 1978)

Everitt and Hand (1981)
Titterington, Smith, and Makov (1985)
McLachlan and Basford (1988)
Lindsay (1995)
McLachlan and Peel (2000)
Böhning (2000)
Frühwirth-Schnatter (2006)

17
Normal Mixtures
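Suppose that the density of the random vector $Y_j$ has a g-component normal mixture form
$$f(y_j; \Psi) = \sum_{i=1}^{g} \pi_i \, \phi(y_j; \mu_i, \Sigma_i),$$
where $\phi(y_j; \mu_i, \Sigma_i)$ denotes the p-variate normal density with mean $\mu_i$ and covariance matrix $\Sigma_i$, and $\Psi$ is the vector containing the unknown parameters.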
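
A normal mixture of this form is typically fitted by the EM algorithm. The following is a minimal sketch, assuming univariate data (p = 1), a fixed number of iterations and hypothetical starting values; it is an illustration of the E- and M-steps, not the EMMIX implementation.

```python
import math

def normal_pdf(y, mean, var):
    return math.exp(-0.5 * (y - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def em_normal_mixture(data, props, means, variances, n_iter=100):
    """EM for a univariate g-component normal mixture (illustrative only)."""
    props, means, variances = list(props), list(means), list(variances)
    g, n = len(props), len(data)
    for _ in range(n_iter):
        # E-step: posterior probabilities tau[j][i] at the current parameters.
        tau = []
        for y in data:
            w = [props[i] * normal_pdf(y, means[i], variances[i]) for i in range(g)]
            s = sum(w)
            tau.append([wi / s for wi in w])
        # M-step: weighted proportions, means and variances.
        for i in range(g):
            n_i = sum(t[i] for t in tau)
            props[i] = n_i / n
            means[i] = sum(t[i] * y for t, y in zip(tau, data)) / n_i
            var_i = sum(t[i] * (y - means[i]) ** 2 for t, y in zip(tau, data)) / n_i
            variances[i] = max(var_i, 1e-6)  # floor guards against a degenerate component
    return props, means, variances

# Made-up data with two visibly separated groups.
data = [-1.2, -0.4, 0.1, 0.3, 0.9, 3.6, 4.1, 4.4, 5.0, 5.3]
props, means, variances = em_normal_mixture(data, [0.5, 0.5], [0.0, 5.0], [1.0, 1.0])
print(props, means, variances)
```

The variance floor is a design choice of this sketch: without some safeguard, the likelihood of a normal mixture with unequal variances is unbounded when a component collapses onto a single point.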

18
One attractive feature of adopting mixture models
with elliptically symmetric components, such as
the normal or t densities, is that the implied
clustering is invariant under affine
transformations of the data, i.e., under
operations relating to changes in location,
scale, and rotation of the data. Thus the
clustering process does not depend on irrelevant
factors such as the units of measurement or the
orientation of the clusters in space.
19
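Microarray Data represented as N x M Matrix
[Figure: N x M expression matrix with rows Gene 1, Gene 2, ..., Gene N (N rows, of order 10^4 genes) and columns Sample 1, Sample 2, ..., Sample M (M columns, of order 10^2 samples). A column is an expression signature; a row is an expression profile.]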
20
Clustering of Microarray Data

Clustering of tissues on basis of genes: this is a nonstandard problem in cluster analysis ($n = M \ll p = N$)
Clustering of genes on basis of tissues: genes (observations) not independent and structure on the tissues (variables) ($n = N \gg p = M$)

The component-covariance matrix $\Sigma_i$ is highly parameterized with $p(p+1)/2$ parameters.

$\Sigma_i = \sigma^2 I_p$ (equal spherical)
$\Sigma_i = \sigma_i^2 I_p$ (unequal spherical)
$\Sigma_i = D$ (equal diagonal)
$\Sigma_i = D_i$ (unequal diagonal)
$\Sigma_i = \Sigma$ (equal)
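
To see how much each restriction saves, the number of free covariance parameters under the structures listed above can be tallied directly. This is a small arithmetic sketch; the function name and the choice p = 17, g = 4 are illustrative assumptions.

```python
def covariance_param_counts(p, g):
    """Free covariance parameters for g components in p dimensions."""
    full = p * (p + 1) // 2  # one unrestricted symmetric p x p matrix
    return {
        "equal spherical": 1,
        "unequal spherical": g,
        "equal diagonal": p,
        "unequal diagonal": g * p,
        "equal": full,
        "unrestricted": g * full,
    }

counts = covariance_param_counts(p=17, g=4)
for name, k in counts.items():
    print(f"{name:18s} {k}")
```

With p = 17 an unrestricted component-covariance matrix already has 153 parameters, so four unrestricted components need 612 for the covariances alone.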
22

Banfield and Raftery (1993) introduced a parameterization of the component-covariance matrix $\Sigma_i$ based on a variant of the standard spectral decomposition of $\Sigma_i$ ($i = 1, \ldots, g$).

However, if p is large relative to the sample size n, it may not be possible to use this decomposition to infer an appropriate model for the component-covariance matrices.
Even if it is possible, the results may not be reliable due to potential problems with near-singular estimates of the component-covariance matrices when p is large relative to n.

Hence, in fitting normal mixture models to high-dimensional data, we should first consider
some form of dimension reduction and/or
some form of regularization.

25
Mixture Software: EMMIX
EMMIX for UNIX
McLachlan, Peel, Adams, and Basford
http://www.maths.uq.edu.au/gjm/emmix/emmix.html
26
PROVIDES A MODEL-BASED APPROACH TO CLUSTERING
McLachlan, Bean, and Peel (2002). A Mixture Model-Based Approach to the Clustering of Microarray Expression Data. Bioinformatics 18, 413-422.
http://www.bioinformatics.oupjournals.org/cgi/screenpdf/18/3/413.pdf
29

In applying the normal mixture model to cluster multivariate (continuous) data, it is assumed, as in most typical cluster analyses using any other method, that
(a) there are no replications on any particular entity specifically identified as such
(b) all the observations on the entities are independent of one another

30
For example, [equations not captured in the transcript]
31
Clustering of gene expression profiles

Longitudinal (with or without replication, for example time-course)
Cross-sectional data

EMMIX-WIRE: EM-based MIXture analysis With Random Effects
Ng, McLachlan, Wang, Ben-Tovim Jones, and Ng (2006, Bioinformatics)
Supplementary information: http://www.maths.uq.edu.au/gjm/bioinf0602_supp.pdf
32
In the ith component of the mixture, the profile vector $y_j$ for the jth gene follows the model
33
$N(\mu_i, B_i)$, with
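
The model equation is not reproduced in this transcript. As a rough illustration only, the sketch below assumes a linear mixed model of the form y_j = X beta_i + U b_ij + V c_i + e_ij for genes in component i, with fixed effects beta_i, a gene-specific random effect b_ij, a cluster-specific random effect c_i and independent noise e_ij. The choices U = column of ones and V = identity, the function name and all numerical values are assumptions made for this example, not taken from the slides.

```python
import random

def simulate_cluster(X, beta, n_genes, sd_b, sd_c, sd_e, rng):
    """Simulate profiles y_j = X beta + U b_j + V c + e_j for one component.

    Assumes U is a column of ones (b_j shifts the whole profile of gene j)
    and V is the identity (c is shared by every gene in the component).
    """
    p = len(X)
    fixed = [sum(x * b for x, b in zip(row, beta)) for row in X]
    c = [rng.gauss(0.0, sd_c) for _ in range(p)]   # drawn once per cluster
    profiles = []
    for _ in range(n_genes):
        b_j = rng.gauss(0.0, sd_b)                 # drawn once per gene
        profiles.append(
            [fixed[t] + b_j + c[t] + rng.gauss(0.0, sd_e) for t in range(p)]
        )
    return profiles, c

rng = random.Random(1)
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]           # hypothetical 3 x 2 design
profiles, c = simulate_cluster(X, [2.0, -1.0], n_genes=5,
                               sd_b=0.3, sd_c=0.5, sd_e=0.1, rng=rng)
for y in profiles:
    print([round(v, 2) for v in y])
```

Because c is drawn once and added to every profile in the component, genes in the same cluster are correlated with one another, which is exactly what the usual independence assumption rules out.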
34

Celeux et al. (2005). Mixtures of linear mixed models for clustering gene expression profiles from repeated microarray measurements. Statistical Modelling 5, 243-267.
Qin and Self (2006). The clustering of regression models method with applications in gene expression data. Biometrics 62, 526-533.
Booth et al. (2008). Clustering using objective functions and stochastic search. J R Statist Soc B 70, 119-139.

Yeast cell cycle data of Cho et al. (1998)
n = 237 genes at p = 17 time points, categorized into 4 MIPS (Munich Information Centre for Protein Sequences) functional groups.
The yeast system is useful because of our ability to control and monitor the progression of cells through the cell cycle (temperature-based synchronization with temperature-sensitive genes whose product is essential for cell-cycle progression).

High-density oligonucleotide arrays were used to quantitate mRNA transcript levels in synchronized yeast cells at regular intervals (10 min) during the cell cycle (genes with cell-cycle dependent periodicity).
Samples of yeast cultures were taken at 17 time points after their cell cycle phase had been synchronized.
The data were reduced to a short time series of log expression ratios for each of the yeast genes represented on the microarrays (expression ratios were calculated by dividing each intensity measurement by the average for that gene).
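
The reduction to log expression ratios can be sketched as follows. The intensities are made up, and the use of the natural logarithm is an assumption (the slides do not state the base; base 2 is also common for microarray data).

```python
import math

def log_expression_ratios(intensities):
    """Divide each intensity by the gene's average over time, then take logs."""
    mean = sum(intensities) / len(intensities)
    return [math.log(x / mean) for x in intensities]

# Hypothetical intensities for one gene at three time points.
ratios = log_expression_ratios([100.0, 200.0, 300.0])
print([round(r, 3) for r in ratios])
```

A measurement equal to the gene's average maps to zero, so each profile describes variation around that gene's own typical expression level rather than its absolute intensity.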

37
Example: Clustering of yeast cell cycle time-course data
n = 237 genes, p = 17 time points
where
38
In the ith cluster,
39
Estimated T following Booth et al. (2004)
Time points (min): 0, 10, 20, ..., 160
T is the period, estimated to be 73 min.
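
With the period fixed, periodic expression can be represented through trigonometric regressors evaluated at the sampling times. The sketch below builds such a design matrix for the 17 time points and T = 73 min quoted above; the two-column cosine and sine form is an assumption for illustration, since the model equation itself is not in the transcript.

```python
import math

T = 73.0                                # estimated period in minutes (from the slide)
times = [10.0 * k for k in range(17)]   # 0, 10, ..., 160

# Assumed form: one cosine and one sine regressor per time point.
X = [[math.cos(2.0 * math.pi * t / T), math.sin(2.0 * math.pi * t / T)]
     for t in times]

for t, row in zip(times, X):
    print(int(t), [round(v, 3) for v in row])
```

Using a cosine and a sine column together lets a linear fit capture both the amplitude and the phase of a sinusoid, since a*cos(wt) + b*sin(wt) can represent any phase shift.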
40
Table 1: Values of BIC for Various Levels of the Number of Components g
g      2      3      4      5      6      7
BIC    10883  10848  10837  10865  10890  10918
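
Choosing g from these values amounts to a one-line comparison. The sketch assumes BIC is reported in the form where smaller is better (for example, minus twice the log likelihood plus a penalty), which the slide does not state explicitly.

```python
# BIC values for g = 2, ..., 7 as given in Table 1.
bic = {2: 10883, 3: 10848, 4: 10837, 5: 10865, 6: 10890, 7: 10918}

best_g = min(bic, key=bic.get)  # assumes smaller BIC is preferred
print(best_g)  # 4
```

Under that assumption the selected number of components is g = 4, which matches the number of clusters used in the subsequent results.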
41

Cluster-specific random effects term

42
Table 2: Summary of Clustering Results for g = 4 Clusters
Model   Rand Index   Adjusted Rand Index   Error Rate
1       0.7808       0.5455                0.2910
2       0.7152       0.4442                0.3160
3       0.7133       0.3792                0.4093
Wong    0.7087       0.3697                NA
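
For readers unfamiliar with the agreement measures in Table 2, the sketch below computes the Rand index and the adjusted Rand index (in the usual Hubert and Arabie form) from two label vectors. The function name and the toy labels are illustrative; this does not reproduce the table's values, which require the actual gene labels.

```python
from collections import Counter
from math import comb

def rand_indices(labels_a, labels_b):
    """Return (Rand index, adjusted Rand index) for two labelings.

    The adjusted index is undefined (division by zero) when both
    labelings put every item in a single group.
    """
    n = len(labels_a)
    pairs = comb(n, 2)
    joint = Counter(zip(labels_a, labels_b))
    sum_joint = sum(comb(c, 2) for c in joint.values())
    sum_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    # Pairs on which the partitions agree: together in both or apart in both.
    rand = (pairs + 2 * sum_joint - sum_a - sum_b) / pairs
    expected = sum_a * sum_b / pairs
    max_index = 0.5 * (sum_a + sum_b)
    adjusted = (sum_joint - expected) / (max_index - expected)
    return rand, adjusted

truth = [0, 0, 0, 1, 1, 1]
clustering = [1, 1, 0, 0, 0, 0]   # cluster labels are arbitrary
print(rand_indices(truth, clustering))
```

Both indices depend only on which pairs of genes are grouped together, so relabelling the clusters leaves them unchanged; the adjusted version subtracts the agreement expected by chance.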
43

The use of the cluster-specific random effects terms $c_i$ leads to a clustering that corresponds more closely to the underlying functional groups than without their use.

44
Figure 1 Clusters of gene-profiles obtained by
mixture of linear mixed models with
cluster-specific random effects
45
Figure 2 Clusters of gene-profiles obtained by
mixture of linear mixed models without
cluster-specific random effects
46
Figure 3 Clusters of gene-profiles obtained by
mixtures of linear mixed models with and without
cluster-specific random effects
47
Figure 4 Plots of gene profiles grouped
according to their functional grouping
48
Figure 5 Plots of clustered gene profiles versus
functional grouping
49
Figure 6 Clusters of gene-profiles obtained by
k-means
50
Figure 7 Plots of Clusters of gene-profiles
Model-based clustering versus k-means
51
Another Yeast Cell Cycle Dataset
Spellman et al. (1998) used α-factor (pheromone) synchronization, where the yeast cells were sampled at 7-minute intervals for 119 minutes.
The period of the cell cycle was estimated using least squares to be T = 53 min.
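
One simple way to estimate a period by least squares is a grid search: for each candidate T, fit a cosine and sine by ordinary least squares and keep the T with the smallest residual sum of squares. The sketch below does this on a synthetic, noise-free profile; the grid, the two-regressor form and the data are assumptions for illustration and are not claimed to be the procedure used for the Spellman data.

```python
import math

def sse_for_period(times, y, T):
    """Residual sum of squares after a least-squares fit of a*cos + b*sin."""
    c = [math.cos(2.0 * math.pi * t / T) for t in times]
    s = [math.sin(2.0 * math.pi * t / T) for t in times]
    # Normal equations for the two coefficients (a 2 x 2 linear system).
    cc = sum(ci * ci for ci in c)
    ss = sum(si * si for si in s)
    cs = sum(ci * si for ci, si in zip(c, s))
    cy = sum(ci * yi for ci, yi in zip(c, y))
    sy = sum(si * yi for si, yi in zip(s, y))
    det = cc * ss - cs * cs
    a = (cy * ss - sy * cs) / det
    b = (sy * cc - cy * cs) / det
    return sum((yi - a * ci - b * si) ** 2 for yi, ci, si in zip(y, c, s))

times = [7.0 * k for k in range(18)]                      # 0, 7, ..., 119 minutes
y = [math.cos(2.0 * math.pi * t / 53.0) for t in times]   # synthetic, true period 53
grid = range(40, 71)                                      # candidate periods in minutes
T_hat = min(grid, key=lambda T: sse_for_period(times, y, T))
print(T_hat)
```

Sampling every 7 minutes from 0 to 119 gives 18 time points. With real, noisy profiles the residual sums of squares would be pooled over many genes before choosing T.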
52
Example: Clustering of time-course data
n = 612 genes, p = 18 time points
where
53
Clustering Results for Spellman Yeast Cell Cycle
Data
54
Mixtures of linear mixed models