1
Clustering of Time Course Gene-Expression Data
via Mixture Regression Models
Geoff McLachlan (joint with Angus Ng and Sam Wang)
Department of Mathematics & Institute for Molecular Bioscience
University of Queensland
ARC Centre of Excellence in Bioinformatics
http://www.maths.uq.edu.au/gjm
2
Institute for Molecular Bioscience, University
of Queensland
3
Time-Course Data
Time-course microarray experiments are being
increasingly used to characterize dynamic
biological processes. (Microarray technology
provides the ability to measure the
expression levels of thousands of genes at
once.) In these experiments,
gene-expression levels are measured at different
time points, possibly in different biological
conditions (e.g. treatment-control). The focus
here is on the analysis of gene-expression
profiles consisting of short time series of log
expression ratios for each of the genes
represented on the microarrays.

4
CLUSTERING OF GENE PROFILES
  • can provide new insight into the biological process
    of interest (coexpressed genes can contribute to our
    understanding of the regulatory network of gene
    expression).
  • can also assist in assigning functions to genes that
    have not yet been functionally annotated.
  • a secondary concern is the need for imputation of
    missing data.

5
The biological rationale underlying the
clustering of microarray data is the fact that
many coexpressed genes are coregulated. Clustering
becomes a way of identifying sets of genes that
are putatively coregulated, thereby
generating testable hypotheses; see Boutros and
Okey (2005). It assists with
  • the functional annotation of uncharacterised genes
  • the identification of transcription factor
    binding sites
  • the elucidation of complete biological pathways

6
Outline of Talk
  • 1. Mixture model-based approach to the analysis of
    gene-expression data
  • 2. Normal Mixtures
  • 3. Modifications for high-dimensional and/or
    structured data
  • Mixtures of linear mixed models
  • Clustering of gene profiles


9
Finite Mixture Models
  • Provide an arbitrarily accurate estimate of the
    underlying density with g sufficiently large
  • Provide a probabilistic clustering of the data
    into g clusters; an outright clustering is obtained
    by assigning each data point to the component to
    which it has the greatest posterior probability of
    belonging.

10
Definition
  • We let $Y_1, \ldots, Y_n$ denote a random sample of size n,
    where $Y_j$ is a p-dimensional random vector with
    probability density function
    $$f(y_j) = \sum_{i=1}^{g} \pi_i f_i(y_j),$$
  • where the $f_i(y_j)$ are densities and the $\pi_i$ are
    nonnegative quantities that sum to one.

11
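  • By Bayes' Theorem,
    $$\tau_i(y_j; \Psi^{(k)}) = \frac{\pi_i^{(k)} f_i(y_j)}{\sum_{h=1}^{g} \pi_h^{(k)} f_h(y_j)}$$
  • for $i = 1, \ldots, g$; $j = 1, \ldots, n$, with the mixing
    proportions and component densities evaluated at the
    parameter value $\Psi^{(k)}$.
  • The quantity $\tau_i(y_j; \Psi^{(k)})$ is the posterior
    probability that the jth member of the sample
    with observed value $y_j$ belongs to the ith
    component of the mixture.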

12
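A soft (probabilistic) clustering is given in
terms of the estimated posterior probabilities of
component membership.
A hard (outright) clustering is given by assigning
each $y_j$ to the component to which it has the highest
estimated posterior probability of belonging; that is,
given by the $\hat{z}_{ij}$, where
$$\hat{z}_{ij} = \begin{cases} 1 & \text{if } i = \arg\max_h \hat{\tau}_h(y_j), \\ 0 & \text{otherwise.} \end{cases}$$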
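The soft and hard clusterings described above can be sketched in a few lines. This is a minimal illustration, assuming normal component densities and made-up parameter values; the function names (`normal_pdf`, `posterior_probabilities`, `hard_clustering`) are hypothetical and are not taken from the talk or from EMMIX.

```python
import numpy as np

def normal_pdf(y, mean, cov):
    """p-variate normal density evaluated at each row of y."""
    p = mean.shape[0]
    diff = y - mean
    quad = np.sum(diff * np.linalg.solve(cov, diff.T).T, axis=1)
    _, logdet = np.linalg.slogdet(cov)
    return np.exp(-0.5 * (quad + logdet + p * np.log(2.0 * np.pi)))

def posterior_probabilities(y, pis, means, covs):
    """Return an n x g matrix whose (j, i) entry is tau_i(y_j)."""
    weighted = np.column_stack([
        pi * normal_pdf(y, m, c) for pi, m, c in zip(pis, means, covs)
    ])
    # Bayes' theorem: divide each weighted density by the mixture density.
    return weighted / weighted.sum(axis=1, keepdims=True)

def hard_clustering(tau):
    """Assign each observation to the component with the largest tau."""
    return np.argmax(tau, axis=1)

# Illustrative two-component mixture in p = 2 dimensions (assumed values).
pis = [0.5, 0.5]
means = [np.array([0.0, 0.0]), np.array([4.0, 4.0])]
covs = [np.eye(2), np.eye(2)]
y = np.array([[0.1, -0.2], [3.9, 4.2], [2.0, 2.0]])

tau = posterior_probabilities(y, pis, means, covs)   # soft clustering
labels = hard_clustering(tau)                        # hard clustering
```

The third point lies midway between the two component means, so its posterior probabilities are equal; the soft clustering keeps that ambiguity visible, while the hard clustering has to break the tie.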
14
Multivariate Mixture Models
  • Day (Biometrika, 1969)
  • Wolfe (NORMIX, 1965, 1967, 1970)
  • It was the publication of the seminal paper
    of Dempster, Laird, and Rubin (1977) on the EM
    algorithm that greatly stimulated interest in the
    use of finite mixture distributions to model
    heterogeneous data.
  • Ganesalingam and McLachlan (Biometrika, 1978)

16
  • Everitt and Hand (2001)
  • Titterington, Smith, and Makov (1985)
  • McLachlan and Basford (1988)
  • Lindsay (1996)
  • McLachlan and Peel (2000)
  • Böhning (2000)
  • Frühwirth-Schnatter (2006)

17
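Normal Mixtures
Suppose that the density of the random vector $Y_j$
has a g-component normal mixture form
$$f(y_j; \Psi) = \sum_{i=1}^{g} \pi_i \, \phi(y_j; \mu_i, \Sigma_i),$$
  • where $\phi(y; \mu, \Sigma)$ denotes the p-variate normal
    density with mean $\mu$ and covariance matrix $\Sigma$, and
    $\Psi$ is the vector containing the unknown
    parameters.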
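A normal mixture of this form is commonly fitted by maximum likelihood with the EM algorithm of Dempster, Laird, and Rubin (1977). The following is a minimal sketch under stated assumptions: unrestricted component-covariance matrices, user-supplied starting means, a fixed number of iterations, and synthetic data. The function names are hypothetical, and this is not the EMMIX implementation.

```python
import numpy as np

def log_normal_pdf(y, mean, cov):
    """Log of the p-variate normal density at each row of y."""
    p = y.shape[1]
    diff = y - mean
    quad = np.sum(diff * np.linalg.solve(cov, diff.T).T, axis=1)
    _, logdet = np.linalg.slogdet(cov)
    return -0.5 * (quad + logdet + p * np.log(2.0 * np.pi))

def em_normal_mixture(y, init_means, n_iter=50):
    """Fit a g-component normal mixture by EM from given starting means."""
    n, p = y.shape
    means = np.array(init_means, dtype=float)
    g = means.shape[0]
    pis = np.full(g, 1.0 / g)
    # Start every component from the overall sample covariance matrix.
    covs = np.array([np.cov(y, rowvar=False) for _ in range(g)])
    loglik = []
    for _ in range(n_iter):
        # E-step: posterior probabilities of component membership.
        log_w = np.column_stack([
            np.log(pis[i]) + log_normal_pdf(y, means[i], covs[i])
            for i in range(g)
        ])
        log_f = np.logaddexp.reduce(log_w, axis=1)
        loglik.append(log_f.sum())
        tau = np.exp(log_w - log_f[:, None])
        # M-step: weighted proportions, means, and covariance matrices.
        n_i = tau.sum(axis=0)
        pis = n_i / n
        means = (tau.T @ y) / n_i[:, None]
        for i in range(g):
            diff = y - means[i]
            covs[i] = (tau[:, i][:, None] * diff).T @ diff / n_i[i]
    return pis, means, covs, np.array(loglik)

# Synthetic data: two well-separated groups in p = 2 dimensions.
rng = np.random.default_rng(0)
y = np.vstack([rng.normal(0.0, 1.0, (100, 2)), rng.normal(5.0, 1.0, (100, 2))])
pis_hat, means_hat, covs_hat, loglik = em_normal_mixture(y, [[1.0, 1.0], [4.0, 4.0]])
```

The recorded log likelihood should not decrease from one iteration to the next, which is a convenient check that the E- and M-steps are coded consistently.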

18
One attractive feature of adopting mixture models
with elliptically symmetric components, such as
the normal or t densities, is that the implied
clustering is invariant under affine
transformations of the data, i.e., under
operations relating to changes in location,
scale, and rotation of the data. Thus the
clustering process does not depend on irrelevant
factors such as the units of measurement or the
orientation of the clusters in space.
19
Microarray Data represented as N x M Matrix
[Figure: data matrix with rows Gene 1, Gene 2, ..., Gene N and columns Sample 1, Sample 2, ..., Sample M; labels "Expression Signature" and "Expression Profile"]
M columns (samples) ~ 10^2
N rows (genes) ~ 10^4
20
Clustering of Microarray Data
  • Clustering of tissues on basis of genes:
    this is a nonstandard problem in
    cluster analysis ($n = M \ll p = N$)
  • Clustering of genes on basis of tissues:
    genes (observations) not independent and
    structure on the tissues (variables) ($n = N \gg p = M$)

21
  • The component-covariance matrix $\Sigma_i$ is highly
    parameterized with $p(p+1)/2$ parameters.

$\Sigma_i = \sigma^2 I_p$ (equal spherical)
$\Sigma_i = \sigma_i^2 I_p$ (unequal spherical)
$\Sigma_i = D$ (equal diagonal)
$\Sigma_i = D_i$ (unequal diagonal)
$\Sigma_i = \Sigma$ (equal)
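To see how much each restriction saves, the number of free covariance parameters across all g components can be counted directly. A small sketch; the function name and the structure labels are hypothetical, and the values p = 17 and g = 4 are borrowed from the yeast example later in the talk purely for illustration.

```python
def covariance_parameter_count(structure, p, g):
    """Free covariance parameters over all g components for each structure."""
    full = p * (p + 1) // 2  # one unrestricted symmetric p x p matrix
    counts = {
        "unrestricted": g * full,   # a separate Sigma_i per component
        "equal": full,              # common Sigma
        "unequal diagonal": g * p,  # D_i
        "equal diagonal": p,        # D
        "unequal spherical": g,     # sigma_i^2 I_p
        "equal spherical": 1,       # sigma^2 I_p
    }
    return counts[structure]

p, g = 17, 4
summary = {
    name: covariance_parameter_count(name, p, g)
    for name in ["unrestricted", "equal", "unequal diagonal",
                 "equal diagonal", "unequal spherical", "equal spherical"]
}
```

With p = 17 a single unrestricted matrix already has 153 parameters, so the count grows quickly with both p and g.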
22
  • Banfield and Raftery (1993) introduced a
    parameterization of the component-covariance
    matrix $\Sigma_i$ based on a variant of the standard
    spectral decomposition of $\Sigma_i$ ($i = 1, \ldots, g$).
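The decomposition itself is not in the transcript. As a sketch, the form usually attributed to Banfield and Raftery (1993) is shown below; the symbols $\lambda_i$, $D_i$, and $A_i$ are notation assumed here, and the exact normalization of $A_i$ differs between authors.

```latex
\Sigma_i = \lambda_i \, D_i A_i D_i^{T}, \qquad i = 1, \ldots, g,
% \lambda_i : positive scalar (controls the volume of the ith cluster)
% D_i       : orthogonal matrix of eigenvectors of \Sigma_i (orientation)
% A_i       : diagonal matrix with entries proportional to the
%             eigenvalues of \Sigma_i (shape)
```

Constraining some of $\lambda_i$, $D_i$, $A_i$ to be equal across components gives a family of models between the fully restricted and unrestricted cases.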

23
  • However, if p is large relative to the sample
    size n, it may not be possible to use this
    decomposition to infer an appropriate model for
    the component-covariance matrices.
  • Even if it is possible, the results may not be
    reliable due to potential problems with
    near-singular estimates of the component-covariance
    matrices when p is large relative to n.

24
  • Hence, in fitting normal mixture models to
    high-dimensional data, we should first consider
  • some form of dimension reduction and/or
  • some form of regularization

25
Mixture Software: EMMIX
EMMIX for UNIX
McLachlan, Peel, Adams, and Basford
http://www.maths.uq.edu.au/gjm/emmix/emmix.html
26
PROVIDES A MODEL-BASED APPROACH TO CLUSTERING
McLachlan, Bean, and Peel (2002). A Mixture
Model-Based Approach to the Clustering of
Microarray Expression Data. Bioinformatics 18,
413-422.
http://www.bioinformatics.oupjournals.org/cgi/screenpdf/18/3/413.pdf
29
  • In applying the normal mixture model to
    cluster multivariate (continuous) data, it is
    assumed, as in most typical cluster analyses using
    any other method, that
  • (a) there are no replications on any particular
    entity specifically identified as such
  • (b) all the observations on the entities
    are independent of one another

30
For example, where and where
31
Clustering of gene expression profiles
  • Longitudinal (with or without replication, for
    example time-course)
  • Cross-sectional data

EMMIX-WIRE: EM-based MIXture analysis With Random
Effects
Ng, McLachlan, Wang, Ben-Tovim Jones, and Ng
(2006, Bioinformatics)
Supplementary information: http://www.maths.uq.edu.au/gjm/bioinf0602_supp.pdf
32
In the ith component of the mixture, the profile
vector $y_j$ for the jth gene follows the model
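The model equation did not survive in the transcript. As a hedged sketch, the linear mixed model in the cited EMMIX-WIRE paper (Ng et al., 2006) has the following general form; the symbols are assumed from that paper rather than transcribed from this slide.

```latex
y_j = X \beta_i + U b_{ij} + V c_i + \varepsilon_{ij}
% X, U, V          : known design matrices
% \beta_i          : fixed effects for the ith component
% b_{ij}           : gene-specific random effects
% c_i              : cluster-specific random effects, shared by all
%                    genes belonging to the ith component
% \varepsilon_{ij} : measurement error
```

Because $c_i$ is shared within a component, profiles in the same cluster are correlated, which relaxes the independence assumption made by an ordinary normal mixture.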
33
$N(\mu_i, B_i)$, with
34
  • Celeux et al. (2005). Mixtures of linear mixed
    models for clustering gene expression profiles
    from repeated microarray measurements.
    Statistical Modelling 5, 243-267.
  • Qin and Self (2006). The clustering of
    regression models method with applications in
    gene expression data. Biometrics 62, 526-533.
  • Booth et al. (2008). Clustering using objective
    functions and stochastic search. J R Statist Soc
    B 70, 119-139.

35
  • Yeast cell cycle data of Cho et al. (1998)
  • n = 237 genes at p = 17 time points
  • categorized into 4 MIPS (Munich Information
    Centre for Protein Sequences) functional groups.
  • The yeast system is useful because of our ability
    to control and monitor the progression of cells
    through the cell cycle (temperature-based
    synchronization with temperature-sensitive genes
    whose product is essential for cell-cycle
    progression).

36
  • High-density oligonucleotide arrays were used to
    quantitate mRNA transcript levels in synchronized
    yeast cells at regular intervals (10 min) during
    the cell cycle
    (genes with cell-cycle dependent periodicity).
  • Samples of yeast cultures were taken at 17 time
    points after their cell cycle phase had been
    synchronized.
  • The data were reduced to a short time series of
    log expression ratios for each of the yeast genes
    represented on the microarrays (expression ratios
    were calculated by dividing each intensity
    measurement by the average for that gene).
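The reduction to log expression ratios described above can be sketched as follows. The intensity values are made up, the function name is hypothetical, and the natural logarithm is an assumption, since the base of the logarithm is not stated in the slides.

```python
import numpy as np

def log_expression_ratios(intensities):
    """Divide each gene's intensities by that gene's average, then take logs.

    `intensities` is a genes x time-points array of positive values.
    """
    intensities = np.asarray(intensities, dtype=float)
    gene_means = intensities.mean(axis=1, keepdims=True)
    return np.log(intensities / gene_means)

# Two illustrative genes measured at three time points.
raw = np.array([[1.0, 2.0, 3.0],
                [10.0, 10.0, 10.0]])
profiles = log_expression_ratios(raw)
```

A gene whose intensity never changes gets a flat profile of zeros, so the transformation removes differences in overall expression level and keeps only the shape of the time course.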

37
Example. Clustering of yeast cell cycle
time-course data
n = 237 genes, p = 17 time points
where
38
In the ith cluster,
39
Estimated T following Booth et al. (2004)
Time points (min): 0, 10, 20, ..., 160
T is the period, estimated to be 73 min.
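One way to use the estimated period in a regression model is through a design matrix of sinusoidal terms evaluated at the sampling times. The sketch below assumes first-order Fourier terms, cos(2πt/T) and sin(2πt/T); the design matrix actually used in the talk is not in the transcript, and the coefficients in the example are made up.

```python
import numpy as np

T = 73.0                      # estimated period in minutes
t = np.arange(0, 161, 10)     # sampling times 0, 10, ..., 160 (17 points)

# Assumed design matrix: one cosine and one sine column at period T.
X = np.column_stack([np.cos(2.0 * np.pi * t / T),
                     np.sin(2.0 * np.pi * t / T)])

# A synthetic noise-free profile with made-up coefficients.
true_beta = np.array([1.0, -0.5])
profile = X @ true_beta

# Ordinary least squares recovers the coefficients of the periodic fit.
beta_hat, *_ = np.linalg.lstsq(X, profile, rcond=None)
```

Using both a cosine and a sine column lets the fitted curve have any phase, so genes that peak at different points in the cycle are described by different coefficient vectors rather than by different design matrices.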
40
Table 1 Values of BIC for Various Levels of the
Number of Components g

Number of components g   2      3      4      5      6      7
BIC                      10883  10848  10837  10865  10890  10918
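Choosing g from Table 1 can be sketched as below. This assumes the convention BIC = -2 log L + d log n, under which smaller values are preferred; that convention is an assumption, although it is consistent with the g = 4 clusters reported in Table 2. The helper `bic_value` is hypothetical.

```python
import math

def bic_value(loglik, n_params, n_obs):
    """BIC under the smaller-is-better convention: -2 log L + d log n."""
    return -2.0 * loglik + n_params * math.log(n_obs)

# BIC values from Table 1, keyed by the number of components g.
bic_table = {2: 10883, 3: 10848, 4: 10837, 5: 10865, 6: 10890, 7: 10918}

# Select the number of components with the smallest BIC.
best_g = min(bic_table, key=bic_table.get)
```

The penalty term d log n grows with the number of parameters, which is why BIC starts to rise again beyond g = 4 even though the likelihood keeps improving as components are added.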
41
  • Cluster-specific random effects term

42
Table 2 Summary of Clustering Results for g = 4
Clusters

Model   Rand Index   Adjusted Rand Index   Error Rate
1       0.7808       0.5455                0.2910
2       0.7152       0.4442                0.3160
3       0.7133       0.3792                0.4093
Wong    0.7087       0.3697                NA
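The Rand index and adjusted Rand index in Table 2 compare a clustering with a reference grouping (here, the functional groups). A minimal sketch of both measures, computed from pair counts; the function name and the toy labelings are illustrative only and do not reproduce the table's values.

```python
from collections import Counter
from math import comb

def rand_indices(labels_a, labels_b):
    """Return (Rand index, adjusted Rand index) for two labelings."""
    n = len(labels_a)
    total_pairs = comb(n, 2)
    # Pairs placed together in both labelings, in a only, and in b only.
    sum_joint = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    sum_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    # Agreements: pairs together in both, plus pairs apart in both.
    rand = (total_pairs + 2 * sum_joint - sum_a - sum_b) / total_pairs
    # Adjustment for chance (undefined if both labelings are a single group).
    expected = sum_a * sum_b / total_pairs
    max_index = 0.5 * (sum_a + sum_b)
    adjusted = (sum_joint - expected) / (max_index - expected)
    return rand, adjusted

same = rand_indices([0, 0, 1, 1], [1, 1, 0, 0])      # same partition, relabelled
crossed = rand_indices([0, 0, 1, 1], [0, 1, 0, 1])   # partitions that disagree
```

Both indices ignore the label names, so a relabelled copy of the same partition scores 1. The adjusted index can fall below zero when agreement is worse than expected by chance, which makes it the stricter of the two columns in Table 2.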
43
  • The use of the cluster-specific random effects
    terms $c_i$ leads to a clustering that
    corresponds more closely to the underlying
    functional groups than without their use.

44
Figure 1 Clusters of gene-profiles obtained by
mixture of linear mixed models with
cluster-specific random effects
45
Figure 2 Clusters of gene-profiles obtained by
mixture of linear mixed models without
cluster-specific random effects
46
Figure 3 Clusters of gene-profiles obtained by
mixtures of linear mixed models with and without
cluster-specific random effects
47
Figure 4 Plots of gene profiles grouped
according to their functional grouping
48
Figure 5 Plots of clustered gene profiles versus
functional grouping
49
Figure 6 Clusters of gene-profiles obtained by
k-means
50
Figure 7 Plots of clusters of gene-profiles:
model-based clustering versus k-means
51
Another Yeast Cell Cycle Dataset
Spellman (1998) used α-factor (pheromone)
synchronization, where the yeast cells were
sampled at 7 minute intervals for 119 minutes;
the period of the cell cycle was estimated using
least squares to be T = 53 min.
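Least-squares estimation of a period can be sketched as a search over candidate values of T, keeping the one whose sinusoidal fit leaves the smallest residual sum of squares. The slide says only that least squares was used, so the single-sinusoid model, the grid of candidates, and the synthetic profile below are all assumptions made for illustration.

```python
import numpy as np

def residual_ss(t, y, period):
    """Residual sum of squares of an intercept + cos + sin fit at one period."""
    X = np.column_stack([np.ones(len(t)),
                         np.cos(2.0 * np.pi * t / period),
                         np.sin(2.0 * np.pi * t / period)])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return float(resid @ resid)

def estimate_period(t, y, candidates):
    """Return the candidate period with the smallest residual sum of squares."""
    rss = [residual_ss(t, y, T) for T in candidates]
    return float(candidates[int(np.argmin(rss))])

# Sampling scheme from the slide: every 7 minutes from 0 to 119 (18 points).
t = np.arange(0, 120, 7)
# Synthetic noise-free profile with a true period of 53 minutes.
y = 0.3 + np.cos(2.0 * np.pi * t / 53.0 - 1.0)

candidates = np.arange(40.0, 81.0, 1.0)
T_hat = estimate_period(t, y, candidates)
```

For a fixed period the model is linear in its coefficients, so only T needs to be searched; a finer grid or a numerical optimizer could replace the one-minute grid used here.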
52
Example. Clustering of time-course data
n = 612 genes, p = 18 time points
where
53
Clustering Results for Spellman Yeast Cell Cycle
Data
54
Mixtures of linear mixed models
  • Useful in modelling biological processes that
    exhibit periodicity at different temporal scales
    (not restricted to cell cycle data, e.g.
    changes in core body temperature, heart rate,
    blood pressure).
  • In summary, they provide a flexible tool to
    cluster high-dimensional data (which may be
    correlated and structured) for a wide range of
    experimental designs, e.g.
    - longitudinal data (with or without
      replication)
    - cross-sectional data (multiple samples at
      one time point).
  • Provide an integrated framework for the analysis
    of microarray data by incorporating experimental
    design information and (biological or clinical)
    covariates.