On the Theory and Practice of Co-clustering

1
On the Theory and Practice of Co-clustering
  • Srujana Merugu, Gunjan Gupta, Joydeep Ghosh

2
Overview
  • What is co-clustering?
  • Why co-clustering on gene expression data?
  • Defining quality measures for co-clusters.
  • Structural types of co-clusters.
  • Algorithmic strategies.

3
The dual problem in Clustering
  • Often, we can cluster points based on features,
    or features based on points. For example:
    • Market basket data: customers vs. products.
    • Web: documents vs. words.
    • Genes: gene expressions vs. experimental
      conditions.
  • Good clusters of documents depend on relevant
    words, but the choice of relevant words depends on
    the documents being clustered.
4
Can we simultaneously cluster/optimize for both?
  • A brief history:
    • 1965: Problem first described formally by I. J.
      Good
    • 1972: First solution, Direct Clustering (J. A.
      Hartigan, JASA)
    • 2000: First use in bioinformatics, Bi-Clustering
      (Cheng & Church, AAAI)

5
Gene Expression Data from DNA Micro Arrays
          Condition 1   ...   Condition J   ...   Condition M
Gene 1    A11           ...   A1J           ...   A1M
...       ...                 ...                 ...
Gene I    AI1           ...   AIJ           ...   AIM
...       ...                 ...                 ...
Gene N    AN1           ...   ANJ           ...   ANM
6
An Example: Genes vs. individual cancer samples
23,100 x 67 matrix from 67 tumors, 17,108
genes. Source: PNAS, David Botstein, 2001
7
Objectives when analyzing Gene Expression Data
Identifying new metabolic processes or regulation,
or identifying genes involved in a process, by, say:
  • Grouping genes by similar expression under some
    conditions
  • Classification of a new gene G4 showing similar
    expression
[Figure: two panels with genes (G1, G2, G3, and the new gene G4) as rows and conditions C1, C2, C5, C6 as columns; G1, G2 and G3 are grouped as process X, labeled by a biologist.]
  • Similarly, we can solve the dual problem:
    finding that a new condition C7 also involves
    the identified biological process X.

8
Important Point: Many-to-many and simultaneous
mapping of both genes and conditions to Metabolic
Processes (clusters)
A subset of Genes under a specific subset of
Conditions forms one metabolic process.
Clusters from traditional clustering and
partitioning are mutually exclusive, so they cannot find
both Process 1 and 2 simultaneously.
9
Summary: Why Gene Expression data and
Co-clustering?
  • Overlapping clusters natural to many
    co-clustering techniques
  • Only subsets of conditions/genes may be catered
    to
  • Good for Sparse data
  • Noisy data
  • Hard to control all experimental conditions
    (origin of cells, temperature, osmotic stress,
    centrifuge process etc.)
  • smoothing

10
Co-clustering publications in Bioinformatics
At least 14 papers since Cheng & Church's paper
in 2000, ALL on gene expression data, including:
  • 2000, 4 papers
    • Cheng & Church (yeast and human microarray)
    • Califano, Stolovitzky & Tu (phenotype
      classification, gene microarray data)
    • Getz, Levine & Domany (gene expression data on
      cancer)
    • Lazzeroni & Owen (yeast gene expression data)
  • 2001, 2 papers
  • 2002, 3 papers, including
    • Ben-Dor et al. (breast tumor set, gene expression
      data)
  • 2003, 5 papers, all on cancer or yeast gene
    expression data
  • 2004, 1 paper, yeast and cancer gene expression data
    • Cho, Dhillon, Guan & Sra, Minimum Sum-Squared
      Residue Co-clustering of Gene Expression Data

11
Types of co-clusters in Bioinformatics (1)
12
Types of co-clusters in Bioinformatics (2)
13
A model for various types of co-clusters
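Models for discovering a co-cluster $a(I,J)$, with $I \subseteq \{1..N\}$
and $J \subseteq \{1..M\}$, for a set of N genes (rows)
and M conditions (columns), of types A, B, C, D and E:

Additive model (1): $a_{ij} = \mu + \alpha_i + \beta_j$, or
Multiplicative model (2): $a_{ij} = \mu \cdot \alpha_i \cdot \beta_j$

where $\mu$ is the global mean,
$\beta_j$ is the adjustment for column $j \in J$, and
$\alpha_i$ is the adjustment for row $i \in I$.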
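As a rough illustration of the two models, the sketch below builds a small co-cluster from a global mean, row adjustments and column adjustments. It assumes the additive form a_ij = mu + alpha_i + beta_j and the multiplicative form a_ij = mu * alpha_i * beta_j; the function names and the numeric values are hypothetical and chosen only for this example.

```python
import numpy as np

def additive_cocluster(mu, row_adj, col_adj):
    """Build a_ij = mu + alpha_i + beta_j for all rows i and columns j."""
    alpha = np.asarray(row_adj, dtype=float)[:, None]   # one value per row
    beta = np.asarray(col_adj, dtype=float)[None, :]    # one value per column
    return mu + alpha + beta

def multiplicative_cocluster(mu, row_adj, col_adj):
    """Build a_ij = mu * alpha_i * beta_j for all rows i and columns j."""
    alpha = np.asarray(row_adj, dtype=float)[:, None]
    beta = np.asarray(col_adj, dtype=float)[None, :]
    return mu * alpha * beta

# Hypothetical values: 3 genes (rows), 2 conditions (columns).
additive = additive_cocluster(1.0, [0.0, 1.0, 2.0], [0.0, 3.0])
multiplicative = multiplicative_cocluster(1.0, [1.0, 2.0, 3.0], [1.0, 2.0])

# With positive entries, taking logs turns the multiplicative model
# into an additive one on the log scale.
log_view = np.log(multiplicative)
```

In the additive case every pair of rows differs by a constant offset across the columns; in the multiplicative case every pair of rows differs by a constant factor.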
14
Types of Co-clusters in Bioinformatics (1)
15
Types of Co-clusters in Bioinformatics (2)
Green and red are coherently regulated, even if out
of phase.
F. Coherent Evolutionary (row, column, or both
axes)
[Figure: example of coherent co-evolution on columns; a 4 x 4 matrix whose values, in extraction order, are 10, 19, 13, 70, 35, 49, 40, 29, 15, 27, 20, 40, 12, 20, 15, 90.]
Co-clustering with Noise
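A noise term can be introduced. Cheng & Church
introduced one such concept called the Mean
Squared Residue. The residue of element (i, j)
is defined as
$r_{ij} = a_{ij} - a_{iJ} - a_{Ij} + a_{IJ}$
where $a_{ij}$ is the noisy value, $a_{iJ}$ is the row i mean,
$a_{Ij}$ is the column j mean, and $a_{IJ}$ is the co-cluster mean:
$a_{iJ} = \frac{1}{|J|} \sum_{j \in J} a_{ij}$,
$a_{Ij} = \frac{1}{|I|} \sum_{i \in I} a_{ij}$,
and
$a_{IJ} = \frac{1}{|I||J|} \sum_{i \in I, j \in J} a_{ij}$
Then the Mean Squared Residue of co-cluster (I,J)
is given by
$H(I,J) = \frac{1}{|I||J|} \sum_{i \in I, j \in J} r_{ij}^2$
Minimize this for a good co-cluster.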
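The Mean Squared Residue can be computed directly from its definition. The sketch below assumes the usual Cheng & Church form, where the residue of an element is its value minus the row mean, minus the column mean, plus the co-cluster mean, and the score is the average squared residue. The function name and the example matrix are hypothetical.

```python
import numpy as np

def mean_squared_residue(A, rows, cols):
    """Mean Squared Residue of the co-cluster (rows, cols) of matrix A."""
    sub = np.asarray(A, dtype=float)[np.ix_(rows, cols)]
    row_means = sub.mean(axis=1, keepdims=True)   # a_iJ, one per row
    col_means = sub.mean(axis=0, keepdims=True)   # a_Ij, one per column
    overall = sub.mean()                          # a_IJ, co-cluster mean
    residue = sub - row_means - col_means + overall
    return float((residue ** 2).mean())

# Hypothetical data: rows 0 and 1 follow an additive pattern, row 2 does not.
A = np.array([[1.0, 2.0, 3.0],
              [2.0, 3.0, 4.0],
              [5.0, 0.0, 9.0]])

perfect = mean_squared_residue(A, [0, 1], [0, 1, 2])      # additive rows only
noisy = mean_squared_residue(A, [0, 1, 2], [0, 1, 2])     # includes row 2
```

A perfectly additive co-cluster scores zero, so a lower score indicates a more coherent co-cluster.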
17
Flavors of co-clustering structures and their
relevance to Gene Array data
Various algorithms, irrespective of the
similarity type, discover one or more of the
following co-cluster structures, in increasing
order of generality:
  • Single
  • Exclusive row & column
  • Checkerboard
  • Exclusive row, or Exclusive column
  • Non-overlapping, non-exclusive
  • Non-overlapping, tree structure
  • Overlapping with hierarchical structure
  • Arbitrarily positioned overlapping
18
Various Algorithmic approaches for searching
optimal co-clusters
  • Iterative row and column clustering combination
    • CTWC (Coupled Two-Way Clustering), ITWC
      (Interrelated Two-Way Clustering), Double
      Conjugated Clustering (DCC) and more.
    • One example: CTWC
      1. Partition rows into R clusters and columns into C
         clusters separately, using T as in annealing to
         control size. Lower T results in smaller clusters.
      2. Define co-clusters as (R,C) tuples. Prune
         unstable (R,C) tuples, defined as clusters that
         break up quickly within a short T interval. Stop
         when the pruned (R,C) set no longer contains large
         stable clusters.
      3. Decrease T and repeat Step 1 on each (R,C) tuple,
         keeping track of the parent tuple.
  • Divide-and-conquer
  • Greedy iterative search: add or remove points based
    on local gain, for example Cheng & Church using
    Mean Squared Residue.
  • Exhaustive enumeration: kept from being exponential by
    limiting the number of columns in which a gene's activity
    is non-zero, which is relevant to gene expression.
  • Distribution parameter identification: assume a
    parametric distribution of clusters and search
    for the parameters for the K clusters.
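The greedy iterative search idea can be sketched as a loop that removes one row or column at a time until the Mean Squared Residue falls below a threshold. This is a simplified, single-node-deletion sketch in the spirit of Cheng & Church, not a reproduction of their full algorithm; the function names, the threshold `delta` and the example matrix are assumptions made for illustration.

```python
import numpy as np

def squared_residues(sub):
    """Squared residue of every element of a sub-matrix."""
    row_means = sub.mean(axis=1, keepdims=True)
    col_means = sub.mean(axis=0, keepdims=True)
    return (sub - row_means - col_means + sub.mean()) ** 2

def greedy_delete(A, delta):
    """Drop the worst row or column until the mean squared residue <= delta."""
    A = np.asarray(A, dtype=float)
    rows = list(range(A.shape[0]))
    cols = list(range(A.shape[1]))
    while len(rows) > 1 and len(cols) > 1:
        sq = squared_residues(A[np.ix_(rows, cols)])
        if sq.mean() <= delta:
            break                       # co-cluster is coherent enough
        row_scores = sq.mean(axis=1)    # contribution of each row
        col_scores = sq.mean(axis=0)    # contribution of each column
        # Local gain: remove whichever row or column contributes most.
        if row_scores.max() >= col_scores.max():
            del rows[int(row_scores.argmax())]
        else:
            del cols[int(col_scores.argmax())]
    return rows, cols

# Hypothetical data: rows 0-2 are additive, row 3 is an outlier.
A = np.array([[1.0, 2.0, 3.0],
              [2.0, 3.0, 4.0],
              [3.0, 4.0, 5.0],
              [9.0, 0.0, 7.0]])
kept_rows, kept_cols = greedy_delete(A, delta=1e-9)
```

On this example the outlier row has the largest residue contribution and is removed first, after which the remaining rows and columns form a perfectly additive co-cluster.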

19
Another way of looking at Co-clustering: a
Bipartite graph
Example: SAMBA, using exhaustive enumeration for a
limited degree of a gene node.
Minimize the weighted/binary edge cut to produce the
heaviest bipartite sub-graphs, which give the co-clusters.
Can find overlapping sub-graphs.
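To make the bipartite view concrete, the sketch below treats a binary gene-by-condition matrix as a bipartite graph (an entry of 1 is an edge between a gene node and a condition node) and scores a candidate sub-graph by its weight. The scoring rule, +1 per edge and a penalty per missing edge, is an illustrative assumption and does not reproduce SAMBA's weighting scheme; the function name, the penalty parameter and the matrix are hypothetical.

```python
import numpy as np

def subgraph_weight(B, genes, conditions, non_edge_penalty=1.0):
    """Weight of the bipartite sub-graph induced by genes and conditions.

    Each present edge counts +1 and each missing edge counts
    -non_edge_penalty (an assumed, illustrative weighting).
    """
    sub = np.asarray(B)[np.ix_(genes, conditions)]
    edges = int(sub.sum())
    non_edges = sub.size - edges
    return edges - non_edge_penalty * non_edges

# Hypothetical binary matrix: rows are genes, columns are conditions.
B = np.array([[1, 1, 0, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 1],
              [0, 1, 1, 1]])

dense = subgraph_weight(B, [0, 1], [0, 1])          # complete sub-graph
overlapping = subgraph_weight(B, [1, 3], [1, 3])    # shares gene 1 and condition 1
mixed = subgraph_weight(B, [0, 1, 2], [0, 1, 2])    # includes missing edges
```

Gene 1 and condition 1 belong to both heavy sub-graphs here, which is the kind of overlap a partition-based method cannot express.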
20
To summarize existing work, and some comments:
  • Coherent additive or coherent co-evolutionary
    approaches are quite popular and adaptive enough
    for biology.
  • All co-clustering papers in bioinformatics have
    looked at 2-D gene array data.
  • Quality measures are diverse and vary substantially
    by algorithm; only some of them are relevant to
    gene-expression data. Choose carefully.
  • Clustering structures are diverse and vary
    substantially by algorithm; only some of them are
    relevant to gene-expression data. Choose
    carefully.
  • However, more flexible and overlapping clustering
    topologies have higher computational complexity.
    Can we put lower bounds on these?

21
Discovering coherent co-evolution co-clusters
  • Co-clusters that form gene groups that have
    coherent behavior, even if out of phase: if genes
    A and B show opposite expression trends (when one
    increases, the other decreases), then they
    show opposite trends for all columns. See Figure
    F.
  • Various kinds of co-evolutionary similarity
    methods for co-clusters:
    • Ben-Dor et al.: Order Preserving Sub-Matrix,
      same ordering across all columns (see example F)
    • Similarly, Liu & Wang: Order Preserving Cluster
    • Murali & Kasif: state-preserving xMOTIFs, all
      genes in the same state in a row.
    • SAMBA: two-state expression of genes across all
      conditions in the co-cluster.
    • Cho, Dhillon et al.: sum-squared residue
      coherent clusters.
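The order-preserving idea can be checked with a few lines of code: a candidate co-cluster qualifies if sorting the selected columns by value gives the same column order in every selected row. This is only a verification sketch, not the search procedure of Ben-Dor et al.; the function name and the example matrix are hypothetical, and ties between equal values are not treated specially.

```python
import numpy as np

def is_order_preserving(A, rows, cols):
    """True if every selected row ranks the selected columns identically."""
    sub = np.asarray(A, dtype=float)[np.ix_(rows, cols)]
    # For each row, the column positions sorted by increasing value.
    orders = np.argsort(sub, axis=1, kind="stable")
    return bool((orders == orders[0]).all())

# Hypothetical data: rows 0-2 share the ordering col0 < col2 < col1,
# while row 3 follows a different ordering.
A = np.array([[10.0, 30.0, 20.0],
              [ 1.0,  9.0,  4.0],
              [ 5.0, 50.0,  7.0],
              [ 8.0,  2.0,  6.0]])

coherent = is_order_preserving(A, [0, 1, 2], [0, 1, 2])
broken = is_order_preserving(A, [0, 1, 2, 3], [0, 1, 2])
```

Note that the values themselves can differ widely between rows; only the relative ordering across columns has to agree.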

   
22
Results: Sparsity (Dhillon, 2003)
Before Co-clustering
(light regions are zeros)
23
Results: Sparsity (Dhillon, 2003)
After Co-clustering
(light regions are zeros)
Large empty regions not relevant to co-clustering
get separated out