Pattern-based Clustering - PowerPoint PPT Presentation

Slides: 33
Provided by: djia3
Learn more at: https://cse.buffalo.edu

1
Pattern-based Clustering
  • How to cluster the five objects?
  • Hard to define a global similarity measure

2
What Is Pattern-based Clustering?
  • A cluster: a set of objects following the same
    pattern in a subset of dimensions (Wang et al.,
    2002)

3
Challenges
  • Most clustering approaches do not address the
    temporal variations in time-series gene
    expression data, an important feature that
    affects performance.
  • Previous approaches try to find coherent patterns
    and clusters w.r.t. the entire set of attributes
  • Patterns may be embedded in sub attribute spaces
  • Only a subset of genes participates in any
    cellular process of interest
  • Any cellular process may take place only in a
    subset of experiment conditions.

[Figure: (a) raw data, (b) shifting patterns, (c) scaling patterns]
4
Gene-Sample-Time (GST) Microarray Data
A collection of samples
2D time-series data
  • The GST microarray data consist of three
    dimensions: genes, samples, and time points
  • The samples often exhibit various phenotypes,
    e.g., cancer vs. control

3D gene-sample-time data
5
Challenges of Mining GST Data
  • Most clustering algorithms were designed for 2D
    data, and cannot be directly extended for 3D data.

Challenges       2D data                  3D data
Mining process   Partition genes          Partition genes and samples simultaneously
Cluster model    Two types of variables   Three types of variables
6
Coherent Gene Cluster
A coherent gene cluster
The 2D representation
A 3D GST data set
  • The group of samples (sj1, sj2, sj3) may
    exhibit the same phenotype
  • The group of genes (gi1, gi2, gi3) may be strongly
    correlated with the phenotype shared by (sj1, sj2,
    sj3)

7
Results from a Real Data Set
  • The Multiple Sclerosis (MS) data consist of:
  • 4324 genes
  • 13 MS patients
  • 10 time points before and after IFN-β treatment
  • 25 coherent gene clusters were reported

An example of coherent gene clusters (107 genes,
8 samples)
8
Other Types of Coherent Clusters
9
Problem Definition
  • Given a GST microarray data matrix M, a maximal
    coherent gene cluster C = (G × S) is a combination of
    a subset of genes G and a subset of samples S
    such that
  • Coherent: the genes in G are coherent
    across the samples in S
  • Significant: |G| ≥ min_g and |S| ≥ min_s, where min_g and
    min_s are user-specified parameters
  • Maximal: inserting any g ∉ G or s ∉ S makes C
    no longer coherent.
  • The problem of mining coherent gene clusters is
    to find the complete set of maximal coherent gene
    clusters in M.

10
Coherence Measure
  • Various coherence measures exist.
  • Measure selection is application dependent.
  • A general coherence model:
  • Given a coherence measure sim() and a
    user-specified threshold δ,
  • A gene ga is coherent on samples si and sj if
    sim(pai, paj) ≥ δ, where pai is the time-series
    pattern of ga on sample si.
  • Coherent gene matrix (G1 × S1): every gene gi ∈
    G1 is coherent across the samples in S1.
  • Trivial coherent gene matrices: ({gi} × {sj}) and
    (G × {sj})
  • We choose Pearson's correlation coefficient.
  • Other coherence measures are also applicable.
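As a rough illustration of the coherence model on this slide, the sketch below computes Pearson's correlation coefficient between two time-series patterns of one gene and compares it with the user-specified threshold. The function names, the threshold name delta, its default of 0.8, and the sample values are assumptions made for this example, not taken from the slides.

```python
from math import sqrt

def pearson(x, y):
    """Pearson's correlation coefficient of two equal-length series."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    if norm_x == 0 or norm_y == 0:
        return 0.0  # assumption: a flat series is treated as uncorrelated
    return cov / (norm_x * norm_y)

def is_coherent(p_i, p_j, delta=0.8):
    """A gene is coherent on two samples if sim(p_i, p_j) >= delta."""
    return pearson(p_i, p_j) >= delta

# Hypothetical expression levels of one gene over four time points.
base = [1.0, 2.0, 4.0, 3.0]
shifted = [v + 10 for v in base]   # shifting pattern
scaled = [v * 3 for v in base]     # scaling pattern

print(is_coherent(base, shifted), is_coherent(base, scaled))
```

Pearson's coefficient is unchanged by adding a constant or multiplying by a positive factor, so both the shifted and the scaled series pass the threshold.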

11
Related Work
  • Clustering algorithms on Gene-Sample or
    Gene-Time microarray data
  • The cluster model is completely different
  • Subspace clustering
  • Find subsets of objects coherent with subsets of
    attributes
  • Frequent pattern mining
  • Find subsets of items frequently appearing in
    transaction databases

12
Algorithm Outline
  • Phase 1 (pre-processing): for each gene g, find
    the complete set of maximal coherent sample sets
    of gene g.
  • Phase 2: compute the complete set of maximal
    coherent gene clusters based on the pre-processing
    results.

13
Coherent Sample Sets
  • Given a gene g, a maximal coherent sample set of
    g is a subset of samples Si such that
  • Coherent: g is coherent across Si
  • Significant: |Si| ≥ min_s
  • Maximal: there exists no superset S' ⊃ Si such
    that g is also coherent across S'.
  • ({g} × Si) is a building block for coherent gene
    clusters that include g.

14
Preprocessing Phase
Suppose min_s = 3
The coherence matrix of gene g:
     s1  s2  s3  s4  s5  s6
s1    1   1   0   1   0   0
s2    1   1   0   0   0   0
s3    0   0   1   1   1   1
s4    1   0   1   1   1   1
s5    0   0   1   1   1   1
s6    0   0   1   1   1   1
{s3, s4, s5, s6} is a coherent sample set of gene g
[Figure: the coherence graph of gene g]
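One way to read the coherence graph is that a maximal coherent sample set is a maximal clique with at least three samples (the minimum size supposed on this slide). The sketch below takes that reading as an assumption and uses the Bron-Kerbosch algorithm, which the slides do not name; the function and variable names are illustrative only.

```python
SAMPLES = ["s1", "s2", "s3", "s4", "s5", "s6"]
# Coherence matrix of gene g, copied from the slide (1 = coherent pair).
MATRIX = [
    [1, 1, 0, 1, 0, 0],
    [1, 1, 0, 0, 0, 0],
    [0, 0, 1, 1, 1, 1],
    [1, 0, 1, 1, 1, 1],
    [0, 0, 1, 1, 1, 1],
    [0, 0, 1, 1, 1, 1],
]

def coherence_graph(samples, matrix):
    """Adjacency sets of the coherence graph (self-loops dropped)."""
    return {
        a: {b for j, b in enumerate(samples) if i != j and matrix[i][j]}
        for i, a in enumerate(samples)
    }

def maximal_coherent_sample_sets(adj, min_s):
    """Maximal cliques of size >= min_s, found with basic Bron-Kerbosch."""
    results = []

    def expand(clique, candidates, excluded):
        if not candidates and not excluded:
            if len(clique) >= min_s:
                results.append(sorted(clique))
            return
        for v in sorted(candidates):
            expand(clique | {v}, candidates & adj[v], excluded & adj[v])
            candidates = candidates - {v}
            excluded = excluded | {v}

    expand(set(), set(adj), set())
    return results

graph = coherence_graph(SAMPLES, MATRIX)
print(maximal_coherent_sample_sets(graph, min_s=3))
# prints [['s3', 's4', 's5', 's6']]
```

The smaller cliques {s1, s2} and {s1, s4} are also maximal but are dropped because they have fewer than three samples.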
15
Sample-gene Search
  • Set enumeration tree
  • Enumerate all subsets of samples systematically.
  • Each node on the tree corresponds to a subset of
    samples.
  • For each node S
  • Find the maximal set of genes Gs which is
    coherent with S

16
Set Enumeration Tree
The set enumeration tree for {a, b, c, d}
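A set enumeration tree can be walked depth-first by extending each node only with items that come later in a fixed order, so every subset appears exactly once. The following is a minimal sketch of that idea; the function name is hypothetical.

```python
def enumerate_subsets(items):
    """Return all subsets of items in depth-first set-enumeration order."""
    order = []

    def visit(node, start):
        order.append(node)
        # Children extend the node with items that come after its last item.
        for i in range(start, len(items)):
            visit(node + [items[i]], i + 1)

    visit([], 0)
    return order

nodes = enumerate_subsets(["a", "b", "c", "d"])
for node in nodes:
    print("{" + ", ".join(node) + "}")
```

For four items the walk visits 16 nodes, starting with {}, {a}, {a, b}, {a, b, c}, {a, b, c, d}, {a, b, d}.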
17
Find the Maximal Coherent Subset of Genes
  • After the pre-processing phase
  • Given a subset of samples S, how to find the
    maximal coherent set of genes GS?
  • Expensive approach: scan the table once for each S.
  • For each S, GS can be derived by a single
    scan of the maximal coherent sample sets of all
    genes: if S ⊆ Sj for some maximal coherent sample
    set Sj of gene g, then g ∈ GS.
  • Efficient approach: use the inverted list.

g1: {s1, s2, s3, s4, s5}
g2: {s1, s2, s4}, {s1, s5}
g3: {s1, s2, s3, s4, s5}
g4: {s1, s2, s3}, {s5, s6}
g5: {s1, s5, s6}
18
The Inverted List
The table of maximal coherent sample sets for genes
(bk labels the k-th set of a gene, e.g. g2.b1 and g2.b2):
Gene   Maximal coherent sample sets
g1     b1 = {s1, s2, s3, s4, s5}
g2     b1 = {s1, s2, s4}, b2 = {s1, s5}
g3     b1 = {s1, s2, s3, s4, s5}
g4     b1 = {s1, s2, s3}, b2 = {s5, s6}
g5     b1 = {s1, s5, s6}

The table of inverted lists for samples:
Sample   Inverted list
s1       g1.b1, g2.b1, g2.b2, g3.b1, g4.b1, g5.b1
s2       g1.b1, g2.b1, g3.b1, g4.b1
s3       g1.b1, g3.b1, g4.b1
s4       g1.b1, g2.b1, g3.b1
s5       g1.b1, g2.b2, g3.b1, g4.b2, g5.b1
s6       g4.b2, g5.b1
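The inverted lists can be derived mechanically from the table of maximal coherent sample sets. The sketch below shows one possible construction; the dictionary layout and the function name are assumptions, while the data and the g.bk labels follow the slide.

```python
# Maximal coherent sample sets per gene, as listed on the slide.
TABLE = {
    "g1": [["s1", "s2", "s3", "s4", "s5"]],
    "g2": [["s1", "s2", "s4"], ["s1", "s5"]],
    "g3": [["s1", "s2", "s3", "s4", "s5"]],
    "g4": [["s1", "s2", "s3"], ["s5", "s6"]],
    "g5": [["s1", "s5", "s6"]],
}

def build_inverted_lists(table):
    """Map each sample to the labelled sample sets (gene.bk) that contain it."""
    inverted = {}
    for gene, sample_sets in table.items():
        for k, sample_set in enumerate(sample_sets, start=1):
            for sample in sample_set:
                inverted.setdefault(sample, []).append(f"{gene}.b{k}")
    return inverted

inverted = build_inverted_lists(TABLE)
for sample, blocks in inverted.items():
    print(sample, ", ".join(blocks))
```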
19
Intersection Instead of Scanning
  • Given a subset of samples S = {si1, ..., sik},
    intersect the inverted lists of si1, ..., sik.
  • For example, given S = {s1, s2, s3},
    Ls1 ∩ Ls2 ∩ Ls3 = {g1.b1, g3.b1, g4.b1}, so
    GS = {g1, g3, g4}.
  • Suppose the parent of S is S' = {si1, ..., si(k-1)};
    then LS = LS' ∩ Lsik.
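The intersection step can be sketched with ordinary set operations on the inverted lists from the earlier slide. The helper names are hypothetical; the last lines illustrate reusing the parent's list so that a child node needs only one extra intersection.

```python
# Inverted lists copied from the slide "The Inverted List".
INVERTED = {
    "s1": ["g1.b1", "g2.b1", "g2.b2", "g3.b1", "g4.b1", "g5.b1"],
    "s2": ["g1.b1", "g2.b1", "g3.b1", "g4.b1"],
    "s3": ["g1.b1", "g3.b1", "g4.b1"],
    "s4": ["g1.b1", "g2.b1", "g3.b1"],
    "s5": ["g1.b1", "g2.b2", "g3.b1", "g4.b2", "g5.b1"],
    "s6": ["g4.b2", "g5.b1"],
}

def intersect_lists(samples, inverted=INVERTED):
    """Intersect the inverted lists of all samples in the subset."""
    blocks = set(inverted[samples[0]])
    for sample in samples[1:]:
        blocks &= set(inverted[sample])
    return blocks

def genes_of(blocks):
    """Strip the .bk suffix to get the distinct genes."""
    return sorted({b.split(".")[0] for b in blocks})

# Incremental form: the child reuses the list of its parent {s1, s2}.
parent = intersect_lists(["s1", "s2"])
child = parent & set(INVERTED["s3"])
print(sorted(child), genes_of(child))
# prints ['g1.b1', 'g3.b1', 'g4.b1'] ['g1', 'g3', 'g4']
```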

20
Anti-monotonic Property
  • Given a combination (G × S),
  • if G is not coherent on S,
  • then for any superset S' ⊃ S, G cannot be coherent
    on S'.
  • For any descendant S' of S on the tree,
  • let GS be the maximal coherent gene set of S,
  • let GS' be the maximal coherent gene set of S';
  • since S ⊂ S', we have GS' ⊆ GS.

21
Pruning Irrelevant Samples
  • Given a subset of samples S = {si1, ..., sik}, a
    sample sj ∈ tail_S if
  • j > ik
  • there exist at least min_g genes g such that g
    is coherent with S ∪ {sj}
  • Samples sl ∉ tail_S (irrelevant samples) cannot be
    used to extend S.
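The tail of a node can be computed from the inverted lists by testing each later sample in turn. This is a sketch under two assumptions: the gene threshold is set to 3 purely for illustration (the slides do not state a value for this example), and the function names are hypothetical.

```python
# Inverted lists copied from the slide "The Inverted List".
INVERTED = {
    "s1": ["g1.b1", "g2.b1", "g2.b2", "g3.b1", "g4.b1", "g5.b1"],
    "s2": ["g1.b1", "g2.b1", "g3.b1", "g4.b1"],
    "s3": ["g1.b1", "g3.b1", "g4.b1"],
    "s4": ["g1.b1", "g2.b1", "g3.b1"],
    "s5": ["g1.b1", "g2.b2", "g3.b1", "g4.b2", "g5.b1"],
    "s6": ["g4.b2", "g5.b1"],
}
ORDER = list(INVERTED)  # s1 ... s6, the enumeration order

def blocks_of(samples):
    """Intersection of the inverted lists of the given samples."""
    blocks = set(INVERTED[samples[0]])
    for sample in samples[1:]:
        blocks &= set(INVERTED[sample])
    return blocks

def tail(samples, min_g=3):
    """Later samples that still leave at least min_g coherent genes."""
    current = blocks_of(samples)
    last = ORDER.index(samples[-1])
    kept = []
    for candidate in ORDER[last + 1:]:
        extended = current & set(INVERTED[candidate])
        genes = {b.split(".")[0] for b in extended}
        if len(genes) >= min_g:
            kept.append(candidate)
    return kept

print(tail(["s1"]))        # prints ['s2', 's3', 's4', 's5']
print(tail(["s1", "s2"]))  # prints ['s3', 's4']
```

With this threshold, s6 is irrelevant for {s1} because only g5 is coherent with {s1, s6}.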

22
Pruning Unpromising Nodes
  • Given a subset of samples S = {si1, ..., sik},
  • if |S ∪ tail_S| < min_s, then prune the subtree of
    S.
  • let the maximal coherent subset of genes of S
    be GS,
  • if there exists a cluster (G' × S') such that
  • (S ∪ tail_S) ⊆ S'
  • GS ⊆ G',
  • then prune the subtree of S

23
Determination of Maximal Coherent Gene Clusters
  • The depth-first search strategy
  • For any superset S' of S, S' is either
  • visited before S
  • or a descendant of S.
  • To determine whether a coherent gene cluster
    (GS × S) is maximal,
  • check (GS × S) after visiting all its descendants,
  • report (GS × S) if it is not subsumed.
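Putting the pieces together, the sketch below runs a simplified depth-first sample-gene search over the inverted lists from the earlier slides. It applies the tail filter and the size-based pruning, and checks each node for subsumption only after its subtree has been visited; the subsumption-based subtree pruning is left out for brevity. The thresholds (3 genes, 3 samples) and all names are assumptions for illustration, not the authors' implementation.

```python
# Inverted lists copied from the slide "The Inverted List".
INVERTED = {
    "s1": ["g1.b1", "g2.b1", "g2.b2", "g3.b1", "g4.b1", "g5.b1"],
    "s2": ["g1.b1", "g2.b1", "g3.b1", "g4.b1"],
    "s3": ["g1.b1", "g3.b1", "g4.b1"],
    "s4": ["g1.b1", "g2.b1", "g3.b1"],
    "s5": ["g1.b1", "g2.b2", "g3.b1", "g4.b2", "g5.b1"],
    "s6": ["g4.b2", "g5.b1"],
}

def genes_of(blocks):
    """Distinct genes behind a set of gene.bk labels."""
    return {b.split(".")[0] for b in blocks}

def sample_gene_search(inverted, min_g, min_s):
    samples = list(inverted)
    reported = []  # (gene set, sample set) pairs found so far

    def visit(node, blocks):
        last = samples.index(node[-1])
        # Tail: later samples that keep at least min_g coherent genes.
        tail = [s for s in samples[last + 1:]
                if len(genes_of(blocks & set(inverted[s]))) >= min_g]
        if len(node) + len(tail) < min_s:
            return  # no node in this subtree can reach min_s samples
        for s in tail:
            visit(node + [s], blocks & set(inverted[s]))
        # Post-order check: all supersets of node were visited already.
        genes = genes_of(blocks)
        if len(node) >= min_s and len(genes) >= min_g:
            subsumed = any(set(node) <= s2 and genes <= g2
                           for g2, s2 in reported)
            if not subsumed:
                reported.append((genes, set(node)))

    for s in samples:
        visit([s], set(inverted[s]))
    return [(sorted(g), sorted(s)) for g, s in reported]

clusters = sample_gene_search(INVERTED, min_g=3, min_s=3)
for genes, sample_set in clusters:
    print(genes, sample_set)
```

On this toy data the search reports two clusters: genes {g1, g3, g4} on samples {s1, s2, s3}, and genes {g1, g2, g3} on samples {s1, s2, s4}.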

24
Sample   Inverted list
s1       g1.b1, g2.b1, g2.b2, g3.b1, g4.b1, g5.b1
s2       g1.b1, g2.b1, g3.b1, g4.b1
s3       g1.b1, g3.b1, g4.b1
s4       g1.b1, g2.b1, g3.b1
s5       g1.b1, g2.b2, g3.b1, g4.b2, g5.b1
s6       g4.b2, g5.b1

Nodes of the set enumeration tree (tail and inverted list given where the slide shows them):
Node            Tail                Inverted list
{s1}            {s2, s3, s4, s5}    -
{s2}            {s3, s4}            -
{s3}            -                   -
{s4}            -                   -
{s1, s2}        {s3, s4}            g1.b1, g2.b1, g3.b1, g4.b1
{s1, s3}        -                   g1.b1, g3.b1, g4.b1
{s1, s4}        -                   g1.b1, g2.b1, g3.b1
{s2, s3}        -                   g1.b1, g3.b1, g4.b1
{s2, s4}        -                   g1.b1, g2.b1, g3.b1
{s1, s2, s3}    -                   g1.b1, g3.b1, g4.b1
{s1, s2, s4}    -                   g1.b1, g2.b1, g3.b1
25
Mining Coherent Gene Clusters
  • Systematic enumeration of genes and samples
  • Sample-Gene Search
  • Gene-Sample Search
  • Pruning rules
  • Determination of whether a coherent gene cluster
    (G × S) is maximal

26
Gene-sample Search
                                  Sample-Gene Search                     Gene-Sample Search
Subjects to enumerate             samples                                genes
Number of subjects to enumerate   10^1 to 10^2                           10^3 to 10^4
Coherent objects                  Single set of maximal coherent genes   Single or multiple sets of maximal coherent samples
Efficiency on GST data            High                                   Low
27
Experiment Data Sets
  • Real-world gene expression data
  • 4324 genes
  • 13 multiple sclerosis (MS) patients
  • before and at 1, 2, 4, 8, 24, 48, 120 and 168 hours
    after IFN-β treatment
  • Synthetic data
  • Given the number of genes NG, samples NS and
    coherent gene clusters NC
  • Simulate the pre-processing results
  • Embed NC maximal coherent gene clusters (G × S)

28
A Coherent Gene Cluster from Real Data
29
Effect of Parameters
Number of clusters vs. min_g (min_s = 3, δ = 0.8)
Number of clusters vs. min_s (min_g = 10, δ = 0.8)
Number of clusters vs. δ (min_g = 10, min_s = 3)
30
Scalability
Scalability w.r.t. number of genes (number of
samples = 30)
Scalability w.r.t. number of samples (number of
genes = 3,000)
Scalability of phase 1
31
Conclusion
  • We define the new problem of mining coherent
    gene clusters from the novel gene-sample-time
    microarray data.
  • We propose two approaches: the sample-gene
    search and the gene-sample search.
  • We conduct an extensive empirical evaluation on
    both real and synthetic data sets.

32
Future Work
  • New problems from the gene-sample-time
    microarray data
  • Coherent sample clusters (G × S):
  • for each s ∈ S, any pair of genes gi, gj ∈ G has
    coherent patterns.
  • Coherent gene-sample clusters (G × S):
  • both a coherent gene cluster and a coherent
    sample cluster.