TRICLUSTER: An Effective Algorithm for Mining Coherent Clusters in 3D Microarray Data

1
TRICLUSTER: An Effective Algorithm for Mining
Coherent Clusters in 3D Microarray Data
Mohammed J. Zaki and Lizhuang Zhao
Department of Computer Science, Rensselaer Polytechnic
Institute (RPI), Troy, NY
{zhaol2, zaki}@cs.rpi.edu
2
Microarray Data
  • Essential source of information about gene
    expression within a cell
  • Typically 2D: Genes x Samples (or Genes x Time)
  • Measures the expression level of genes in
    different samples
  • Labeled samples: classification (cancer vs.
    non-cancer)
  • Non-labeled samples: clustering (biclusters)
  • Goal: identify expression patterns that provide
    clues to the gene regulatory networks
    within a cell

3
Why Biclustering?
  • Some genes are similarly expressed in only some samples
  • Full-space cluster: genes (g2, g4, g5) over all samples

          s1   s2   s3   s4   s5
     g2   v21  v22  v23  v24  v25
     g4   v41  v42  v43  v44  v45
     g5   v51  v52  v53  v54  v55

  • Bicluster: genes (g2, g4, g5) x samples (s2, s3, s5)

          s2   s3   s5
     g2   v22  v23  v25
     g4   v42  v43  v45
     g5   v52  v53  v55
4
Different Homogeneity or Similarity Criteria
  • Constant (all values, per row, or per column)

      All      Row      Col
     2 2 2    1 1 1    1 2 5
     2 2 2    2 2 2    1 2 5
     2 2 2    5 5 5    1 2 5

  • Scaling/Shifting (more general)

     Scaling (Scale = 1.4)    Shifting (Shift = 0.4)
     1.0 1.4 2.0              1.0 1.4 2.0
     2.0 2.8 4.0              2.0 2.4 3.0
     2.5 3.5 5.0              2.5 2.9 3.5

  • Order Preserving (more general still); column order 2 1 3 in every row

     4 1 7
     3 2 5
     6 3 8

Note: small noise ε is allowed in all expression
values
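The scaling, shifting, and order-preserving criteria can be checked directly on the example matrices from this slide. The following is a minimal Python sketch that assumes noise-free values; the function names are illustrative and are not part of TriCluster.

```python
from itertools import combinations

# Example matrices from the slide (rows = genes, columns = samples).
SCALING = [[1.0, 1.4, 2.0], [2.0, 2.8, 4.0], [2.5, 3.5, 5.0]]
SHIFTING = [[1.0, 1.4, 2.0], [2.0, 2.4, 3.0], [2.5, 2.9, 3.5]]
ORDERED = [[4, 1, 7], [3, 2, 5], [6, 3, 8]]


def is_scaling(m, tol=1e-9):
    """True if every pair of columns has the same ratio in every row."""
    for a, b in combinations(range(len(m[0])), 2):
        ratios = [row[b] / row[a] for row in m]
        if max(ratios) - min(ratios) > tol:
            return False
    return True


def is_shifting(m, tol=1e-9):
    """True if every pair of columns has the same difference in every row."""
    for a, b in combinations(range(len(m[0])), 2):
        diffs = [row[b] - row[a] for row in m]
        if max(diffs) - min(diffs) > tol:
            return False
    return True


def is_order_preserving(m):
    """True if the columns sort into the same order in every row."""
    def order(row):
        return sorted(range(len(row)), key=lambda j: row[j])
    return all(order(row) == order(m[0]) for row in m)


print(is_scaling(SCALING), is_shifting(SHIFTING), is_order_preserving(ORDERED))
```

The tolerance `tol` only absorbs floating-point rounding; it is not the noise threshold used by the algorithm.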
5
Why TriCluster?
  • Typical microarray data is 2D (gene x sample)
  • Temporal expression is a very important tool
      • How does gene expression evolve in time?
      • Find clusters over genes x samples x time
  • Spatial expression is also of interest
      • How does gene expression differ in space (e.g.,
        different regions of mouse brain)?
      • Find clusters over genes x samples x space
  • Combine temporal and spatial expression
      • Find clusters over genes x time x space, etc.
  • There is an emerging need to mine 3D data

6
TriCluster: Our Contributions
  • First algorithm to mine triclusters in 3D
    microarray data
  • Complete and deterministic
  • Mines maximal clusters satisfying given
    homogeneity criteria
      • Constant: column, row, all
      • Scaling and shifting
  • Clusters can overlap; optionally
    delete/merge clusters having large overlap
  • Proposes a set of metrics for cluster evaluation
  • Uses Gene Ontology (GO) to assess biological
    significance

7
Definitions
  • G is a set of genes {g0, g1, ..., gn-1}
  • S is a set of samples {s0, s1, ..., sm-1}
  • T is a set of time points {t0, t1, ..., tl-1}
  • 3D real-valued dataset D = {dijk}, indexed by G x S x T
  • dijk is the expression value of gene gi in sample
    sj at time tk
  • A triCluster is a maximal submatrix of D that
    satisfies some homogeneity conditions
      • C = X x Y x Z = {cijk}
      • X ⊆ G, Y ⊆ S, Z ⊆ T
      • Given homogeneity conditions
8
Scaling triCluster Example
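  • Genes x Samples slice at each of three time points (time ratios 1 : 2 : 4)

     Time (x1)    Time (x2)    Time (x4)
     1  3  4      2  6  8      4 12 16
     2  6  8      4 12 16      8 24 32
     5 15 20     10 30 40     20 60 80

  • Gene ratios: 1 : 2 : 5
  • Sample ratios: 1 : 3 : 4
Note: small noise ε is allowed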
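The example can be reproduced as a product of the three ratio vectors: each cell is (gene ratio) x (sample ratio) x (time ratio). The sketch below uses the ratios given on the slide; the function name is hypothetical.

```python
gene_ratios = [1, 2, 5]
sample_ratios = [1, 3, 4]
time_ratios = [1, 2, 4]


def scaling_tricluster(g, s, t):
    """Return slices[k][i][j] = g[i] * s[j] * t[k], one gene x sample slice per time."""
    return [[[gi * sj * tk for sj in s] for gi in g] for tk in t]


slices = scaling_tricluster(gene_ratios, sample_ratios, time_ratios)
for tk, sl in zip(time_ratios, slices):
    print("time ratio", tk, sl)
```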
9
TriCluster Concepts
  • C = X x Y x Z = {cijk} is a triCluster iff
      • C is maximal (no valid C' ⊃ C)
      • C has sufficient size: |X| ≥ mg, |Y| ≥ ms, |Z| ≥ mt
      • Noise/error threshold ε is satisfied for any C22
          • C22 is an arbitrary 2x2 submatrix
            of C (rows i, j and columns a, b)
          • Let ri = |cia/cib| and rj = |cja/cjb|
          • max(ri, rj) / min(ri, rj) - 1 ≤ ε
      • Range threshold δa is satisfied for each dimension a
          • Let δ = |cijk - cxyz|
          • If j = y and k = z, then δ ≤ δg (similarly define δs,
            δt)
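The noise condition on 2x2 submatrices can be sketched as follows. The code checks a single gene x sample slice only, whereas the definition applies to every 2x2 submatrix in every pair of dimensions; it also assumes no zero values. Function names are illustrative.

```python
from itertools import combinations


def ratio_ok(c_ia, c_ib, c_ja, c_jb, eps):
    """2x2 test: max(ri, rj) / min(ri, rj) - 1 <= eps."""
    ri = abs(c_ia / c_ib)
    rj = abs(c_ja / c_jb)
    return max(ri, rj) / min(ri, rj) - 1 <= eps


def slice_is_coherent(m, eps):
    """Apply the 2x2 test to every pair of rows and every pair of columns of m."""
    rows, cols = range(len(m)), range(len(m[0]))
    return all(ratio_ok(m[i][a], m[i][b], m[j][a], m[j][b], eps)
               for i, j in combinations(rows, 2)
               for a, b in combinations(cols, 2))


def has_sufficient_size(X, Y, Z, mg, ms, mt):
    """Size condition: |X| >= mg, |Y| >= ms, |Z| >= mt."""
    return len(X) >= mg and len(Y) >= ms and len(Z) >= mt


exact = [[1, 3, 4], [2, 6, 8], [5, 15, 20]]
noisy = [[1, 3, 4], [2, 6.1, 8], [5, 15, 20]]   # one perturbed cell
print(slice_is_coherent(exact, 0.01))
print(slice_is_coherent(noisy, 0.01), slice_is_coherent(noisy, 0.05))
```

The perturbed cell changes the affected ratios by about 1.7%, so the slice fails at ε = 0.01 but passes at ε = 0.05.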

10
TriCluster Flexibility
  • Cluster definition is symmetric
      • Any ordering of dimensions allowed
      • A/C = B/D ⇔ A/B = C/D ⇔ AD = BC
  • Can mine several types of clusters
      • Typically ε ≈ 0 to allow small noise/error
      • Approx. constant cluster: δg ≈ 0 and δs ≈ 0 and δt
        ≈ 0
      • Approx. single-dim constant: δg ≈ 0 or δs ≈ 0 or
        δt ≈ 0
      • Approx. two-dim constant: (δg ≈ 0 and δs ≈ 0) or
        (δg ≈ 0 and δt ≈ 0) or (δs ≈ 0 and δt ≈ 0)
      • Scaling cluster: δg, δs and δt are
        unconstrained
      • Shifting cluster: if e^C is a scaling cluster, then C is a
        shifting cluster

[Figure: a 2x2 submatrix (A B / C D) and its transpose (A C / B D)]
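The last point, that a shifting cluster becomes a scaling cluster after exponentiation, follows from e^(x + d) = e^x · e^d: a constant column shift d turns into a constant column scale e^d. A small sketch, using the shifting matrix from the earlier homogeneity slide; the function name is an assumption for the example.

```python
import math

# Shifting pattern: column 2 = column 1 + 0.4, column 3 = column 1 + 1.0.
SHIFTING = [[1.0, 1.4, 2.0], [2.0, 2.4, 3.0], [2.5, 2.9, 3.5]]


def max_ratio_deviation(m):
    """Largest max(ri, rj) / min(ri, rj) - 1 over all 2x2 submatrices of m."""
    worst = 0.0
    for i in range(len(m)):
        for j in range(i + 1, len(m)):
            for a in range(len(m[0])):
                for b in range(a + 1, len(m[0])):
                    ri = abs(m[i][a] / m[i][b])
                    rj = abs(m[j][a] / m[j][b])
                    worst = max(worst, max(ri, rj) / min(ri, rj) - 1)
    return worst


exp_c = [[math.exp(v) for v in row] for row in SHIFTING]

# The raw matrix is far from a scaling pattern, but exp(C) satisfies the
# ratio test up to floating-point rounding.
print(max_ratio_deviation(SHIFTING))
print(max_ratio_deviation(exp_c))
```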

11
TriCluster Algorithm
  • Compute maximal biclusters on G x S for each time
    slice t ∈ T
      • Construct range multigraph
      • Find maximal cliques
  • Compute triclusters from biclusters
      • Construct new multigraph (T x biclusters)
      • Find maximal cliques
  • Merge/prune overlapping clusters

12
Maximal Biclusters
  • Mine each GxS time-slice for maximal biclusters
  • For each pair of samples, find the valid ratio ranges
    (within ε) and their gene-sets
  • Construct a Range Multigraph
  • Mine maximal cliques
  • Each clique/cluster can contribute to some valid
    tricluster

13
Valid Ratio Ranges: Each Column Pair
[Figure: range example; original data and the same data after row/column permutation]
  • Take the ratio of s0 and s6 and construct valid ranges
  • A valid range contains at least mg values within ε (noise
    threshold)
  • With ε = 0.05 and mg = 3: 3.0 x (1 + ε) = 3.15, giving the range
    [3, 3.15]
  • Other ranges: [3.3, 3.465], and so on
  • Construct gene-sets: [3, 3.15] has genes g1, g4,
    g8
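One way to sketch the range construction is a sliding window over the sorted per-gene ratios: a window starting at ratio r extends to r x (1 + ε) and is kept if it holds at least mg genes. This is a simplified illustration, not the full procedure; it assumes positive, non-zero values, and the column data below are invented so that the first range matches the slide's [3, 3.15] with genes g1, g4, g8.

```python
def valid_ratio_ranges(col_a, col_b, eps, mg):
    """Return [((lo, hi), gene_set), ...] for ratios col_a[g] / col_b[g]."""
    ratios = sorted((a / b, g) for g, (a, b) in enumerate(zip(col_a, col_b)))
    ranges = []
    right = 0        # exclusive end of the current window
    last_right = 0   # end of the last recorded window
    for left in range(len(ratios)):
        lo = ratios[left][0]
        hi = lo * (1 + eps)
        right = max(right, left)
        while right < len(ratios) and ratios[right][0] <= hi:
            right += 1
        # Keep windows that are large enough and not contained in the previous one.
        if right - left >= mg and right > last_right:
            ranges.append(((lo, hi), {g for _, g in ratios[left:right]}))
            last_right = right
    return ranges


# Hypothetical values for samples s0 and s6 over genes g0..g9.
s0 = [1.0, 6.0, 4.0, 9.0, 6.2, 1.0, 6.7, 6.8, 6.24, 6.9]
s6 = [2.0] * 10
for (lo, hi), genes in valid_ratio_ranges(s0, s6, eps=0.05, mg=3):
    print(round(lo, 3), round(hi, 3), sorted(genes))
```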

14
Range Multigraph: Pair of Samples
  • Construct valid ratios and gene-sets for s1/s4
      • Ratio 1/1, gene-set {g2, g6, g0, g9, g7}
      • Ratio 5/4, gene-set {g4, g8, g1}
  • Construct ratios/gene-sets for other pairs

[Figure: multigraph]
15
Range Multigraph: Complete
  • Construct ratios/gene-sets for all sample pairs

16
Maximal Clique Mining
[Figure: range multigraph over sample vertices s0 to s6]
  • Perform recursive depth-first search
  • Maintain valid gene-sets for each node
  • Intersect gene-sets with each outgoing edge
      • {g2, g6, g0, g9, g7} ∩ {g2, g6, g0, g9} = {g2, g6, g0, g9}
  • Prune if various criteria are not met (size, dim
    range)
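The depth-first search with gene-set intersection can be sketched as below. The multigraph is assumed to be a dictionary mapping an ordered sample pair to the list of gene-sets of its valid ranges; this representation, the function name, and the toy edges are assumptions for illustration, and only the size criteria (mg, ms) are used for pruning.

```python
from itertools import product


def mine_maximal_biclusters(samples, edges, all_genes, mg, ms):
    """DFS over samples; edges[(sa, sb)] lists gene-sets for sa < sb."""
    found = []  # (gene_set, sample_set) pairs, kept maximal

    def record(genes, cols):
        if any(genes <= g and cols <= c for g, c in found):
            return  # subsumed by an existing cluster
        found[:] = [(g, c) for g, c in found
                    if not (g <= genes and c <= cols)]
        found.append((genes, cols))

    def dfs(genes, cols, candidates):
        if len(cols) >= ms:
            record(genes, frozenset(cols))
        for idx, s in enumerate(candidates):
            # One gene-set must be chosen on the edge from every current sample to s.
            options = [edges.get((c, s), []) for c in cols]
            for combo in product(*options):
                new_genes = genes.intersection(*combo)
                if len(new_genes) >= mg:  # prune clusters that are too small
                    dfs(new_genes, cols + [s], candidates[idx + 1:])

    dfs(frozenset(all_genes), [], sorted(samples))
    return found


# Toy multigraph; the (1, 4) edge carries the two gene-sets from the s1/s4 example.
edges = {
    (1, 4): [{0, 2, 6, 7, 9}, {1, 4, 8}],
    (1, 6): [{0, 2, 6, 9}],
    (4, 6): [{0, 2, 3, 6, 9}],
}
clusters = mine_maximal_biclusters([1, 4, 6], edges, range(10), mg=3, ms=3)
print(clusters)
```

Trying every combination of parallel edges is exponential in the worst case; the sketch favors clarity over efficiency.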

17
Mine triClusters
  • Let Bt be the set of maximal biclusters for time
    slice t
  • Construct new multigraph
      • Each time point is a vertex
      • Each pair of highly overlapping biclusters
        (gene-set, samples) forms an edge between time ti
        and tj
  • Call maximal clique mining to obtain maximal
    triclusters
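A simplified sketch of this step follows. Instead of building the multigraph explicitly, it runs the depth-first search over time points directly and intersects the gene-sets and sample-sets of one bicluster per time slice. It checks only the size thresholds and does not re-check coherence along the time dimension, so it illustrates the search structure rather than the complete method; all names and data are hypothetical, and mt is assumed to be at least 1.

```python
def mine_triclusters(biclusters_by_time, mg, ms, mt):
    """biclusters_by_time[t] is a list of (gene_set, sample_set) frozensets."""
    times = sorted(biclusters_by_time)
    found = []  # (genes, samples, times) triples, kept maximal

    def record(genes, samples, used):
        cand = (genes, samples, frozenset(used))
        if any(all(a <= b for a, b in zip(cand, f)) for f in found):
            return  # subsumed by an existing tricluster
        found[:] = [f for f in found
                    if not all(a <= b for a, b in zip(f, cand))]
        found.append(cand)

    def dfs(genes, samples, used, start):
        if len(used) >= mt:
            record(genes, samples, used)
        for idx in range(start, len(times)):
            t = times[idx]
            for g, s in biclusters_by_time[t]:
                new_genes = g if genes is None else genes & g
                new_samples = s if samples is None else samples & s
                if len(new_genes) >= mg and len(new_samples) >= ms:
                    dfs(new_genes, new_samples, used + [t], idx + 1)

    dfs(None, None, [], 0)
    return found


fs = frozenset
biclusters_by_time = {
    0: [(fs({0, 2, 6, 9}), fs({1, 4, 6}))],
    1: [(fs({0, 2, 5, 6, 9}), fs({1, 4, 6})), (fs({1, 4, 8}), fs({1, 4}))],
    2: [(fs({0, 2, 6, 7}), fs({1, 4}))],
}
triclusters = mine_triclusters(biclusters_by_time, mg=3, ms=2, mt=3)
print(triclusters)
```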

18
Constructing triClusters
[Figure]
19
Constructing triClusters
[Figure: time slices ti, tj, tk]
20
Constructing triClusters
[Figure: time slices ti, tj, tk]
21
Prune and Merge
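[Figure: clusters A, B, Ai, Aj illustrating the merge and prune cases]
  • Merge A and B if |L_{(A+B)-A-B}| / |L_{A+B}| < γ
  • Prune B if |L_{B-A}| / |L_B| < η
  • Prune B if |L_{B - ∪i Ai}| / |L_B| < η
  • Cluster Span
  • L_C = {(i, j, k) | gi, sj, tk ∈ C}
  • L_{A∪B} = L_A ∪ L_B
  • L_{A-B} = L_A - L_B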
  • LAB (LA LB) ? (LB LA) ? (LA ? LB)
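The prune and merge tests can be sketched with spans represented as sets of (gene, sample, time) index triples. Two assumptions are made here and should be read as such: A+B is taken to be the smallest cluster containing both A and B, (X_A ∪ X_B) x (Y_A ∪ Y_B) x (Z_A ∪ Z_B), and the two thresholds are given the hypothetical parameter names `eta` (prune) and `gamma` (merge).

```python
from itertools import product


def span(c):
    """Set of (i, j, k) cells covered by cluster c = (X, Y, Z)."""
    X, Y, Z = c
    return set(product(X, Y, Z))


def bounding(a, b):
    """Assumed meaning of A+B: union taken separately along each dimension."""
    return tuple(da | db for da, db in zip(a, b))


def should_prune(b, covers, eta):
    """Prune b if the fraction of its span not covered by `covers` is below eta."""
    lb = span(b)
    covered = set().union(*(span(a) for a in covers))
    return len(lb - covered) / len(lb) < eta


def should_merge(a, b, gamma):
    """Merge if the cells A+B adds beyond A and B are a small fraction of A+B."""
    lab = span(bounding(a, b))
    return len(lab - span(a) - span(b)) / len(lab) < gamma


A = ({0, 1, 2, 3}, {0, 1, 2}, {0, 1})
B = ({1, 2, 3, 4}, {0, 1, 2}, {0, 1})
D = ({5, 6}, {3}, {1})
print(should_merge(A, B, gamma=0.1), should_merge(A, D, gamma=0.1))
print(should_prune(B, [A], eta=0.3), should_prune(B, [A], eta=0.2))
```

Here B has 6 of its 24 cells outside A (25%), so it is pruned at `eta = 0.3` but kept at `eta = 0.2`.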

22
Metrics for Measuring Clustering Quality
  • NumClusters: number of clusters
  • Span: Span(X x Y x Z) = |X| x |Y| x |Z|
  • ElementSum: sum of all cluster spans (cells in several
    clusters are counted multiple times)
  • Coverage: size of the union of all cluster spans (each cell counted once)
  • Overlap: (ElementSum - Coverage) / Coverage

We want high coverage with small overlap
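These metrics are straightforward to compute once each cluster's span is materialized as a set of cells. A minimal sketch, assuming clusters are given as (X, Y, Z) index sets and that at least one cluster is non-empty:

```python
from itertools import product


def cluster_metrics(clusters):
    """Compute NumClusters, ElementSum, Coverage and Overlap for (X, Y, Z) clusters."""
    spans = [set(product(X, Y, Z)) for X, Y, Z in clusters]
    element_sum = sum(len(s) for s in spans)   # shared cells counted repeatedly
    coverage = len(set().union(*spans))        # each cell counted once
    return {
        "NumClusters": len(clusters),
        "ElementSum": element_sum,
        "Coverage": coverage,
        "Overlap": (element_sum - coverage) / coverage,
    }


example = [
    ({0, 1, 2, 3}, {0, 1, 2}, {0, 1}),   # span 4 * 3 * 2 = 24
    ({1, 2, 3, 4}, {0, 1, 2}, {0, 1}),   # span 24, shares 18 cells with the first
]
metrics = cluster_metrics(example)
print(metrics)
```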
23
Synthetic Data Generation
  • Experiments: 1.4 GHz, 448 MB, Linux/VMware
  • Synthetic data for parameter evaluation
  • Input parameters
      • |G| = 4000, |S| = 30, |T| = 20
      • Number of clusters to embed: 10
      • Overlap among clusters: 20%
      • Noise for expression values: 3%
      • Cluster size range: 150 x 6 x 4 (some variation)
  • Generate clusters with values within some range
  • Fill rest of cells with random noise
  • Do random permutations along each dimension
  • We vary one parameter and keep others fixed
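A generator in this spirit might look like the sketch below. It is a reduced illustration: it embeds a single scaling cluster rather than ten overlapping ones, and it picks random positions for the cluster instead of permuting each dimension afterwards. The value ranges, function name, and parameter names are all assumptions.

```python
import random


def make_synthetic(n_genes, n_samples, n_times, shape, noise, seed=0):
    """Return (data, (genes, samples, times)) with one embedded scaling cluster."""
    rng = random.Random(seed)
    # Background cells: uniform random values.
    data = [[[rng.uniform(1.0, 100.0) for _ in range(n_times)]
             for _ in range(n_samples)]
            for _ in range(n_genes)]
    # Random cluster positions stand in for the permutation step.
    genes = sorted(rng.sample(range(n_genes), shape[0]))
    samples = sorted(rng.sample(range(n_samples), shape[1]))
    times = sorted(rng.sample(range(n_times), shape[2]))
    # One scaling factor per gene, sample and time point.
    fg = {i: rng.uniform(1.0, 5.0) for i in genes}
    fs = {j: rng.uniform(1.0, 5.0) for j in samples}
    ft = {k: rng.uniform(1.0, 5.0) for k in times}
    for i in genes:
        for j in samples:
            for k in times:
                jitter = 1.0 + rng.uniform(-noise, noise)  # relative noise
                data[i][j][k] = fg[i] * fs[j] * ft[k] * jitter
    return data, (genes, samples, times)


data, (gx, sy, tz) = make_synthetic(40, 6, 5, shape=(5, 3, 2), noise=0.01, seed=1)
print(gx, sy, tz)
```

With relative noise of 1%, any 2x2 ratio inside the embedded cluster deviates by at most (1.01 / 0.99)^2 - 1, roughly 4.1%.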

24
Results on Synthetic Datasets
[Figure: six plots of running time (sec) versus number of genes, number of samples, number of time-points, number of clusters, variation (%), and overlap (%)]
25
Results on Yeast Cell Cycle Dataset
  • http://genome-www.stanford.edu/cellcycle
  • Elutriation experiment
      • 7679 genes
      • 14 time points (0 to 390 min at 30 min gaps)
      • No real samples: use raw expression values of 13
        attributes as samples (Cyc3, Cyc5, ratios, etc.)
      • G x S x T = 7679 x 13 x 14
  • Note: actual 3D data will become publicly
    available soon (e.g., Mouse Brain Atlas: genes x
    space x time)
  • Run TriCluster: mg = 50, ms = 4, mt = 5, ε = 0.03
  • Found 5 clusters in 28 s, overlap = 0, coverage = 6250
  • 2D view of cluster C0 (51 x 4 x 5) shown next

26
2D Views of cluster C0 on yeast data
[Figure: three panels of expression values for cluster C0 (sample curves, time curves, gene curves); samples sCH2I, sCH2D, sCH2IN, sCH2DN; time points t120, t210, t270, t330, t390; axes: genes, time points, expression values]
27
Results on Yeast Cell Cycle Dataset: Gene Ontology
Significant (p-value < 0.01) shared Gene Ontology
(GO) terms (Process, Function, Cellular Location) for
genes in different clusters

C0 (51 genes)
  • Process: ubiquitin cycle (n=3, p=0.00346), protein polyubiquitination (n=2, p=0.00796), carbohydrate biosynthesis (n=3, p=0.00946)
C1 (52 genes)
  • Process: G1/S transition of mitotic cell cycle (n=3, p=0.00468), mRNA polyadenylylation (n=2, p=0.00826)
  • Function: protein phosphatase regulator activity (n=2, p=0.00397), phosphatase regulator activity (n=2, p=0.00397)
C2 (57 genes)
  • Process: lipid transport (n=2, p=0.0089)
  • Function: oxidoreductase activity (n=7, p=0.00239), lipid transporter activity (n=2, p=0.00627), antioxidant activity (n=2, p=0.00797)
  • Cellular Location: cytoplasm (n=41, p=0.00052), microsome (n=2, p=0.00627), vesicular fraction (n=2, p=0.00627), microbody (n=3, p=0.00929), peroxisome (n=3, p=0.00929)
C3 (97 genes)
  • Process: physiological process (n=76, p=0.0017), organelle organization and biogenesis (n=15, p=0.00173), localization (n=21, p=0.00537)
  • Function: MAP kinase activity (n=2, p=0.00209), deaminase activity (n=2, p=0.00804), hydrolase activity, acting on carbon-nitrogen, but not peptide, bonds (n=4, p=0.00918), receptor signaling protein serine/threonine kinase activity (n=2, p=0.00964)
  • Cellular Location: membrane (n=29, p=9.36e-06), cell (n=86, p=0.0003), endoplasmic reticulum (n=13, p=0.00112), vacuolar membrane (n=6, p=0.0015), cytoplasm (n=63, p=0.00169), intracellular (n=79, p=0.00209), endoplasmic reticulum membrane (n=6, p=0.00289), integral to endoplasmic reticulum membrane (n=3, p=0.00328), nuclear envelope-endoplasmic reticulum network (n=6, p=0.00488)
C4 (66 genes)
  • Process: pantothenate biosynthesis (n=2, p=0.00246), pantothenate metabolism (n=2, p=0.00245), transport (n=16, p=0.00332), localization (n=16, p=0.00453)
  • Function: ubiquitin conjugating enzyme activity (n=2, p=0.00833), lipid transporter activity (n=2, p=0.00833)
  • Cellular Location: Golgi vesicle (n=2, p=0.00729)
28
Results on Yeast Cell Cycle: Specific Cluster
C3 (97 genes)
  • Process: physiological process (n=76, p=0.0017), organelle organization and biogenesis (n=15, p=0.00173), localization (n=21, p=0.00537)
Different clusters show different shared terms.
Results could potentially be biologically
significant.
29
Summary
  • Contributions
      • First algorithm to mine triclusters from 3D
        microarrays
      • Complete, deterministic
      • Allows small noise
      • Flexible: constant, single/two-dim constant, scaling,
        shifting
      • Allows arbitrary overlap (merge/prune)
      • Potentially biologically significant clusters
        (GO)!
  • Future Work
      • Extend from 3D to k-D datasets
      • Allow different pattern types along different
        axes (scaling along G x S, shifting along T, etc.)
      • Enhance the clique mining step on multigraphs