1
Clustering gene expression profiles following
Chinese restaurant process
  • Steve Qin
  • Bioinformatics workshop

2
Background
3
Goal
  • Partition the data such that
  • data within each class are similar, and
  • different classes are dissimilar from one another.

4
Popular clustering methods
5
Challenges
  • Simple and model-free.
  • Depend on the choice of distance metric.
  • Less flexible in handling noise and missing data.
  • Expensive to compute for large datasets.
  • No probabilistic foundation, so it is hard to do
    inference on the results.

6
Model-based clustering
  • Finite mixture model
  • McLachlan and Basford, 1988.
  • Banfield and Raftery, 1993.
  • K is determined using BIC, then clustering is
    performed conditional on K using the EM algorithm.

7
Discussion
  • Sound probabilistic foundation; able to handle
    large datasets.
  • Separates clustering from estimating the number
    of clusters.

8
Dirichlet process mixture model
  • Infinite mixture model.
  • Does not require K a priori.
  • Chinese restaurant process.
  • Due to Dubins and Pitman.
  • Aldous (1985), Pitman (1996).

9
Chinese restaurant process


10
Chinese restaurant process
The probability of joining these tables: customer n + 1
joins occupied table k with probability n_k / (n + α),
where n_k is the number of customers already at table k,
and starts a new table with probability α / (n + α).
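
A minimal simulation of this seating process is sketched below (illustrative only; the concentration parameter α and the running table counts are the only state):

```python
import numpy as np

# Chinese restaurant process: customer n+1 joins occupied table k
# with probability n_k / (n + alpha) and opens a new table with
# probability alpha / (n + alpha).
def crp(n_customers, alpha=1.0, seed=0):
    rng = np.random.default_rng(seed)
    tables = []                        # tables[k] = customers at table k
    for n in range(n_customers):
        probs = np.array(tables + [alpha]) / (n + alpha)
        k = rng.choice(len(probs), p=probs)
        if k == len(tables):
            tables.append(1)           # open a new table
        else:
            tables[k] += 1             # join existing table k
    return tables

print(crp(100, alpha=1.0))             # a few tables with skewed sizes
```

The number of occupied tables grows on the order of α log n, matching the O(log n) growth claimed on the DP slide below.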


11
About DP
  • Let (Θ, B) be a measurable space, G0 be a
    probability measure on the space, and α be a
    positive real number.
  • A Dirichlet process is any distribution of a
    random probability measure G over (Θ, B) such
    that, for all finite partitions (A1, ..., Ar) of Θ,
    (G(A1), ..., G(Ar)) ~ Dirichlet(αG0(A1), ..., αG0(Ar)).
  • Draws from G are generally not all distinct.
  • The number of distinct values grows as O(log n).

12
Exchangeable
  • In general, an infinite set of random variables
    is said to be infinitely exchangeable if every
    finite subset x1, ..., xn is exchangeable.
  • Using de Finetti's theorem, it is possible to show
    that our draws θ are infinitely exchangeable.
  • Thus the mixture components may be sampled in any
    order.

13
General scheme
14
Dirichlet process
  • G ~ DP(α, G0).
  • G0 is continuous, so the probability that any
    two samples from G0 are equal is precisely zero.
  • However, G is a discrete distribution, made up of
    a countably infinite number of point masses
    (Blackwell).
  • Therefore, there is always a non-zero
    probability of two samples from G colliding.
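
To make the point-mass structure concrete, here is a truncated stick-breaking sketch of a draw G ~ DP(α, G0); truncating at a finite number of atoms is an approximation for illustration:

```python
import numpy as np

# Truncated stick-breaking sketch of G ~ DP(alpha, G0): G is a
# countable mixture of point masses sum_k w_k * delta(theta_k),
# with weights from breaking a unit stick and atoms drawn from G0.
def stick_breaking(alpha=1.0, g0=np.random.standard_normal,
                   n_atoms=1000, seed=0):
    rng = np.random.default_rng(seed)
    betas = rng.beta(1.0, alpha, size=n_atoms)   # stick-break proportions
    stick = np.concatenate([[1.0], np.cumprod(1.0 - betas[:-1])])
    weights = betas * stick          # w_k = beta_k * prod_{l<k}(1 - beta_l)
    atoms = g0(n_atoms)              # atoms theta_k ~ G0
    return weights, atoms

w, theta = stick_breaking(alpha=1.0)
samples = np.random.default_rng(1).choice(theta, size=10, p=w / w.sum())
print(samples)   # repeated values appear: samples from G collide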

15
Dirichlet process
16
History
  • Polya urn process.
  • Stick breaking.
  • Infinite mixture model.
  • Bayesian nonparametric model.
  • Historical references
  • Ferguson 1973.
  • Blackwell and MacQueen 1973.
  • Antoniak 1974.

17
References
  • FMM
  • McLachlan et al. 2002, Yeung et al. 2001.
  • IMM
  • Medvedovic and Sivaganesan 2002.
  • Yeung et al. 2003, Medvedovic 2004.
  • Tadesse et al. 2005, Kim et al. 2006.

18
Notation
  • N: number of genes.
  • M: number of experiments.
  • K: number of clusters (unknown).
  • X = {xij}: expression profiles.
  • E = {E(i)}: indicators of cluster membership.

19
Model
  • Gene profiles within a cluster follow the same
    set of Gaussian distributions.
  • Likelihood
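
The likelihood formula itself did not survive the transcript; a plausible reconstruction, assuming (as stated in Remarks (II)) that experiments are independent given the cluster labels:

```latex
% Reconstruction (assumption): conditional on the cluster labels E,
% each entry is an independent Gaussian with cluster-specific
% per-experiment mean and variance.
P(X \mid E, \mu, \sigma^2)
  = \prod_{i=1}^{N} \prod_{j=1}^{M}
    \frac{1}{\sqrt{2\pi\sigma_{E(i)j}^{2}}}
    \exp\!\left\{-\frac{(x_{ij}-\mu_{E(i)j})^{2}}{2\sigma_{E(i)j}^{2}}\right\}
```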

20
Marginal likelihood
  • Conjugate priors

Integrating out the nuisance parameters yields a
predictive updating scheme (Liu 1994; Chen and Liu 1995).
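
The prior is not shown in the transcript; one standard conjugate choice, consistent with the hyperparameters a and b quoted in Remarks (III), is a Normal-Inverse-Gamma setup:

```latex
% Assumed conjugate setup (the exact prior is not shown on the slide):
% an Inverse-Gamma prior on each within-cluster variance and a
% conditionally Normal prior on each within-cluster mean.
\sigma_{kj}^{2} \sim \mathrm{Inv\mbox{-}Gamma}(a, b), \qquad
\mu_{kj} \mid \sigma_{kj}^{2} \sim N\!\big(\mu_{0},\; \sigma_{kj}^{2}/\lambda\big)
```

Under this choice, integrating out (μ, σ²) leaves a product of Student-t predictive densities, which is what predictive updating evaluates gene by gene.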
21
Posterior inference
  • Prior
  • Posterior

Weighted Chinese restaurant process (Lo 2005).
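
The prior and posterior formulas were slide images; a standard reconstruction of the weighted-CRP form they correspond to is:

```latex
% CRP prior on the membership of gene i, given all other labels:
P\big(E(i)=k \mid E_{-i}\big) \propto
  \begin{cases}
    n_{-i,k} & \text{existing cluster } k,\\
    \alpha   & \text{new cluster.}
  \end{cases}
% Posterior: each option is reweighted by the predictive (marginal)
% likelihood of x_i under that cluster:
P\big(E(i)=k \mid E_{-i}, X\big) \propto
  \begin{cases}
    n_{-i,k}\, f\big(x_i \mid X_{k,-i}\big) & \text{existing cluster } k,\\
    \alpha\, f_{0}(x_i) & \text{new cluster.}
  \end{cases}
```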
22
Algorithm
  • Initialization
  • Randomly assign the genes into an arbitrary number
    of clusters K0, 1 ≤ K0 ≤ N.
  • For each gene i, perform the following
    reassignment
  • Remove gene i from its current cluster; given the
    current assignment of all the other genes,
    calculate the probability of this gene joining
    each of the existing clusters, as well as being
    alone.
  • Assign gene i to one of the K + 1 possible clusters
    according to these probabilities, and update the
    indicator variable E(i) based on the assignment.
  • Repeat the above two steps for every gene, and
    repeat for a large number of rounds until
    convergence (a sketch in code follows this list).
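
Below is a minimal sketch of this Gibbs reassignment loop. For brevity it assumes a known unit within-cluster variance with a N(0, τ²) prior on each cluster mean, so the predictive density has a simple Gaussian form; the talk's model instead integrates out both mean and variance under conjugate priors.

```python
import numpy as np

def log_predictive(x, members, tau2=1.0):
    """Log predictive density of profile x under a cluster whose
    current members are the rows of `members` (possibly empty)."""
    n = members.shape[0]
    s = members.sum(axis=0) if n > 0 else np.zeros_like(x)
    post_var = 1.0 / (n + 1.0 / tau2)   # posterior variance of the mean
    post_mean = post_var * s            # posterior mean of the mean
    pred_var = post_var + 1.0           # predictive variance (unit noise)
    return -0.5 * np.sum(np.log(2 * np.pi * pred_var)
                         + (x - post_mean) ** 2 / pred_var)

def gibbs_sweep(X, z, alpha, rng):
    """One sweep: remove each gene, then reassign it to an existing
    cluster or a new singleton with CRP-weighted probabilities."""
    for i in range(len(X)):
        z[i] = -1                       # remove gene i from its cluster
        labels = sorted(set(z) - {-1})
        logp = [np.log(np.sum(z == k)) + log_predictive(X[i], X[z == k])
                for k in labels]        # join an existing cluster
        logp.append(np.log(alpha) + log_predictive(X[i], X[:0]))  # be alone
        logp = np.array(logp)
        p = np.exp(logp - logp.max())
        k = rng.choice(len(p), p=p / p.sum())
        z[i] = labels[k] if k < len(labels) else max(labels, default=-1) + 1
    return z

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 10))           # 60 genes, 10 experiments
z = rng.integers(0, 5, size=60)         # K0 = 5 random initial clusters
for _ in range(50):                     # iterate until convergence
    z = gibbs_sweep(X, z, alpha=1.0, rng=rng)
print("number of clusters:", len(set(z)))
```

Emptied clusters disappear automatically, since a label with no remaining members never appears in the candidate set on the next pass.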

23
Correlations
24
Add a model selection step
Try to fit different versions of the profile vector to
all clusters.
25
Remarks (I)
  • For each gene, provide the posterior probability of
    joining its current cluster.
  • For each cluster, provide a likelihood ratio to
    measure its tightness.
  • For each pair of clusters, provide the log likelihood
    ratio as a distance measure; a dendrogram can be
    drawn for all clusters.

26
Remarks (II)
  • Assumes all experiments are independent.
  • Naive, but it still works.
  • Can easily add covariance structure if needed,
    e.g., for time-course data.
  • Tolerates sporadic missing data.
  • If data are missing for an experiment, give equal
    probability of joining each existing cluster for
    that experiment.

27
Remarks (III)
  • Choice of priors
  • Data dependent: a = 0.5, b = 2·sd(x).
  • Fix the tuning parameter α = 1; a higher α
    produces more clusters.
  • Start from K0 initial clusters.
  • Run 20 parallel chains, each running through 100
    cycles.

28
Simulation study
  • 400 genes, 20 experiments, 5 clusters.

K = 1, 2, 3.
29
Trace plots
30
Adjusted Rand Index
  • Hubert and Arabie 1985.
  • Ranges over (0, 1).
  • 0 = random partition; 1 = perfect match.

Yeung and Ruzzo 2001.
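
For reference, a compact implementation of the index from its contingency-table definition (scikit-learn's adjusted_rand_score computes the same quantity):

```python
import numpy as np
from math import comb

# Adjusted Rand index (Hubert and Arabie 1985): agreement between
# two partitions, corrected for chance.
def adjusted_rand_index(labels_a, labels_b):
    a_vals, a_inv = np.unique(labels_a, return_inverse=True)
    b_vals, b_inv = np.unique(labels_b, return_inverse=True)
    table = np.zeros((len(a_vals), len(b_vals)), dtype=int)
    for i, j in zip(a_inv, b_inv):      # contingency table of co-memberships
        table[i, j] += 1
    n = int(table.sum())
    sum_ij = sum(comb(int(x), 2) for x in table.ravel())
    sum_a = sum(comb(int(x), 2) for x in table.sum(axis=1))
    sum_b = sum(comb(int(x), 2) for x in table.sum(axis=0))
    expected = sum_a * sum_b / comb(n, 2)    # chance agreement
    max_index = (sum_a + sum_b) / 2
    return (sum_ij - expected) / (max_index - expected)

truth = [0, 0, 1, 1, 2, 2]
print(adjusted_rand_index(truth, [1, 1, 0, 0, 2, 2]))  # 1.0: relabeled match
```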
31
Results
For comparison, hierarchical clustering on the complex
dataset achieves an adjusted Rand index of 57.4.
32
Galactose dataset
  • Microarrays were used to measure the mRNA
    expression profiles of yeast growing under 20
    different perturbations to the GAL pathway
    (Ideker et al. 2001).
  • 205 genes whose expression patterns reflect 4 GO
    functional categories.
  • 4 replicates.
  • 8 missing data points, imputed by a k-nearest-
    neighbor (k = 12) approach (Troyanskaya et al.
    2002); a sketch follows.
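
A sketch of that imputation step, using scikit-learn's KNNImputer in place of the original KNNimpute software:

```python
import numpy as np
from sklearn.impute import KNNImputer

# Each missing entry is replaced by the mean of that column over the
# k nearest rows (nan-aware Euclidean distance).
X = np.array([[1.0, 2.0, np.nan],
              [1.1, 1.9, 0.5],
              [0.9, 2.1, 0.4],
              [5.0, 5.2, 4.8]])
imputer = KNNImputer(n_neighbors=2)     # the talk uses k = 12
X_full = imputer.fit_transform(X)       # nan filled from nearest neighbors
print(X_full)
```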

33
Trace plot
34
Results
35
Results
GIMM with replicates: 84.69 (4), 95.29 (5), 95.01 (6);
without replicates: 56–67.
36
Discussion
  • GIMM
  • Hierarchical model,
  • Models covariance,
  • Models replicates,
  • Measures the frequency of co-occurrences, then
    performs hierarchical clustering.
  • This algorithm
  • Independent model,
  • Predictive updating,
  • Allows missing data,
  • Allows complex relationships,
  • No distance defined.

37
Discussion
  • Distance-based clustering relies on gene-gene
    comparisons, O(n^2); model-based clustering performs
    gene-cluster comparisons, O(n log(n) m), which is
    more efficient for large datasets.
  • Combining cluster number estimation and the actual
    clustering in one unified process seems to be
    advantageous.

38
Robustness of cluster size: correlation between
different K
39
Discussion
  • Distance-based clustering is more vulnerable to
    adverse complications such as missing data and
    substantial noise, e.g., when the data from one
    experiment are corrupted.

40
Limitations
  • Currently, each gene belongs to exactly one cluster.
    In reality, one gene can participate in multiple
    pathways.
  • The magnitude of the data dominates clustering
    decisions; the trend should also be an important
    factor for consideration.

41
Acknowledgement
  • Michael Elliott
  • Debashis Ghosh
  • Mario Medvedovic

42
Thank You