Title: δ-Clusters: Capturing Subspace Correlation in a Large Data Set
1 δ-Clusters: Capturing Subspace Correlation in a Large Data Set
- Authors: Jiong Yang, Wei Wang, et al. (ICDE 2002)
- Presenter: Xuehua Shen
- xshen_at_uiuc.edu
2 Presentation Layout
- Overview of Clustering
- Related Work of δ-Clusters
- δ-Clusters Model
- FLOC algorithm
3 Clustering
- Clustering: the process of grouping a set of objects into classes of similar objects
- Objects are similar to one another within the same cluster
- Objects are dissimilar to the objects in other clusters
4 Major Clustering Methods
- Partitioning algorithms
- Hierarchical algorithms
- Density-based
- Grid-based
- Model-based
5 Similarity
- Clustering: the process of grouping a set of objects into classes of similar objects
- But how to define similarity?
6 Similarity cont.
- Traditional clustering models are based on distance functions
- Some popular ones include the Minkowski distance:
  $d(i,j) = \left( |x_{i1} - x_{j1}|^q + |x_{i2} - x_{j2}|^q + \cdots + |x_{ip} - x_{jp}|^q \right)^{1/q}$
- where $i = (x_{i1}, x_{i2}, \ldots, x_{ip})$ and $j = (x_{j1}, x_{j2}, \ldots, x_{jp})$ are two p-dimensional data objects, and q is a positive integer
- But strong correlations may still exist among a set of objects even if they are far apart from each other as measured by the distance function
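The point of the last bullet can be checked numerically. The sketch below is illustrative only: the function name `minkowski` and the two objects are made up for this example. Object `b` is object `a` shifted by a constant, so the two are perfectly correlated, yet any Minkowski distance between them is large.

```python
def minkowski(x, y, q):
    """Minkowski distance of order q between two equal-length sequences."""
    if len(x) != len(y):
        raise ValueError("objects must have the same number of dimensions")
    return sum(abs(a - b) ** q for a, b in zip(x, y)) ** (1.0 / q)


# Hypothetical objects: b is a shifted by a constant 100 on every dimension.
a = [1.0, 5.0, 3.0, 8.0]
b = [v + 100.0 for v in a]

manhattan = minkowski(a, b, 1)  # q = 1
euclidean = minkowski(a, b, 2)  # q = 2
print(manhattan, euclidean)
```

With q = 1 this is the Manhattan distance and with q = 2 the Euclidean distance; both grow with the size of the shift even though the two objects rise and fall together.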
7 Similarity cont.
- δ-Clusters model: objects are similar when they exhibit a coherent pattern on a subset of dimensions
- Can cluster objects that show a shifting pattern or a scaling pattern
8 Similarity cont.
- Example of Coherent Pattern
- Shifting Pattern
- Scaling Pattern
9 Subspace Clustering
- From high-dimensional clustering (problematic) to subspace clustering
- Not restricted to a fixed ordering of columns, in contrast to patterns in time-series data
- Challenge: the curse of dimensionality!
10 Subspace Clustering cont.
- Example of subspace clustering
         CH11   CH1B   CH1D   CH2I   CH2B
CTFC3    4392    284   4108    280    228
VPS8      401    281    120    275    298
EFB1      318    280     37    277    215
SSA1      401    292    109    580    238
FUN14    2857    285   2576    271    226
SP07      228    290     48    285    224
MDM10     538    272    266    277    236
CYS3      322    288     41    278    219

         CH11   CH1D   CH2B
VPS8      401    120    298
EFB1      318     37    215
CYS3      322     41    219
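A small check makes the example concrete: on the three selected columns, the rows VPS8, EFB1 and CYS3 differ from one another by a constant offset, while on all five columns they do not. The helper names below (`offsets_from_first`, `is_shifting`) are made up for this sketch; the numbers are copied from the tables above.

```python
columns = ["CH11", "CH1B", "CH1D", "CH2I", "CH2B"]
data = {
    "VPS8": [401, 281, 120, 275, 298],
    "EFB1": [318, 280, 37, 277, 215],
    "CYS3": [322, 288, 41, 278, 219],
}


def offsets_from_first(data, cols):
    """Per-column difference of every row from the first row, on cols only."""
    idx = [columns.index(c) for c in cols]
    names = list(data)
    ref = [data[names[0]][i] for i in idx]
    return {n: [data[n][i] - r for i, r in zip(idx, ref)] for n in names[1:]}


def is_shifting(data, cols):
    """True if every row equals the first row plus one constant offset."""
    return all(len(set(o)) == 1 for o in offsets_from_first(data, cols).values())


subspace = ["CH11", "CH1D", "CH2B"]
print(offsets_from_first(data, subspace))
print(is_shifting(data, subspace), is_shifting(data, columns))
```

On the subspace the offsets from VPS8 are a constant -83 for EFB1 and a constant -79 for CYS3, which is exactly a shifting pattern.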
11 Applications
- Microarray Data Analysis in Biology
- E-Commerce
12 Microarray Data Analysis
- Matrix (dense)
- Rows: genes
- Columns: various samples (experiment conditions or tissues)
- Values in the matrix: expression level, i.e. the relative abundance of the mRNA of a gene under a specific condition
13 Microarray Data Analysis cont.
- From Scaling Pattern to Shifting Pattern
- Red: gene of interest; green: control gene
- Investigations show that several genes contribute to a disease, which motivates researchers to identify a subset of genes whose expression levels rise and fall coherently under a subset of conditions
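The slide does not say how a scaling pattern is turned into a shifting one. A logarithm is the usual way to do it, since log(c·x) = log(c) + log(x), and the sketch below assumes that is what is meant. The expression values are hypothetical.

```python
import math

# Hypothetical expression levels: `scaled` is `base` multiplied by 3 everywhere.
base = [2.0, 8.0, 4.0, 16.0]
scaled = [3.0 * v for v in base]

# After a log transform the constant factor becomes a constant offset.
log_base = [math.log2(v) for v in base]
log_scaled = [math.log2(v) for v in scaled]
diffs = [s - b for s, b in zip(log_scaled, log_base)]
print(diffs)
```

Every difference equals log2(3), so a model that only handles shifting patterns also covers scaling patterns once the data are log-transformed.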
14 E-Commerce
- Example: movie ratings (1 = lowest rating, 10 = highest rating)
- Shifting pattern
- If a new movie comes out and the 1st viewer rates it 7 and the 3rd viewer rates it 9, the 2nd viewer will probably like this movie too
          Movie1  Movie2  Movie3  Movie4
Viewer1      1       2       3       6
Viewer2      2       3       4       7
Viewer3      4       5       6       9
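The prediction in the last bullet can be sketched as follows. This is one simple way to exploit the shifting pattern, not a method taken from the paper: each known rating of the new movie is re-centred from its viewer's average to the target viewer's average, and the estimates are averaged. The function name `predict` is made up for this example.

```python
ratings = {
    "Viewer1": [1, 2, 3, 6],
    "Viewer2": [2, 3, 4, 7],
    "Viewer3": [4, 5, 6, 9],
}


def mean(xs):
    return sum(xs) / len(xs)


def predict(target, known_new, ratings):
    """Estimate target's rating of a new movie from other viewers' ratings,
    assuming all viewers follow the same pattern up to a constant shift."""
    target_base = mean(ratings[target])
    estimates = [r - mean(ratings[v]) + target_base for v, r in known_new.items()]
    return mean(estimates)


prediction = predict("Viewer2", {"Viewer1": 7, "Viewer3": 9}, ratings)
print(prediction)
```

Viewer1's rating of 7 suggests 8 for Viewer2 and Viewer3's rating of 9 suggests 7, so the combined estimate is 7.5, a high rating on the 1 to 10 scale.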
15 Presentation Layout
- Overview of Clustering
- Related Work of δ-Clusters
- δ-Clusters Model
- FLOC algorithm
16 Related Work
- CLIQUE, ORCLUS, PROCLUS (subspace clustering)
- They can capture neither the shifting pattern nor the scaling pattern
- Bicluster model: proposed as a measure of coherence of genes and conditions in a submatrix of a DNA array
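17 Bicluster
- Model: mean squared residue score of a submatrix $A_{IJ}$
  $H(I,J) = \frac{1}{|I||J|} \sum_{i \in I, j \in J} (a_{ij} - a_{iJ} - a_{Ij} + a_{IJ})^2$, where $a_{iJ}$, $a_{Ij}$ and $a_{IJ}$ are the row mean, column mean and overall mean of the submatrix
- A submatrix $A_{IJ}$ is called a δ-bicluster if $H(I,J) \le \delta$
- Algorithm: a randomized algorithm that gives an approximate answer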
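The mean squared residue score can be computed directly from its definition. The sketch below assumes a fully specified submatrix given as a list of rows; the function name and the two test matrices are illustrative only.

```python
def mean_squared_residue(a):
    """H(I, J) of a fully specified submatrix given as a list of rows."""
    n_rows, n_cols = len(a), len(a[0])
    row_mean = [sum(row) / n_cols for row in a]
    col_mean = [sum(a[i][j] for i in range(n_rows)) / n_rows for j in range(n_cols)]
    all_mean = sum(row_mean) / n_rows
    total = 0.0
    for i in range(n_rows):
        for j in range(n_cols):
            total += (a[i][j] - row_mean[i] - col_mean[j] + all_mean) ** 2
    return total / (n_rows * n_cols)


# A perfect shifting pattern scores (numerically) zero ...
shifted = [[1, 2, 3, 6], [2, 3, 4, 7], [4, 5, 6, 9]]
# ... and breaking one entry raises the score.
noisy = [[1, 2, 3, 6], [2, 3, 4, 7], [4, 5, 6, 0]]
print(mean_squared_residue(shifted), mean_squared_residue(noisy))
```

A submatrix qualifies as a δ-bicluster when this score does not exceed the chosen δ.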
18 Weakness of bicluster
- Missing Values
- Constraints
19 Presentation Layout
- Overview
- Related Work
- δ-Clusters Model
- FLOC algorithm
20 Occupancy Threshold
- α: a parameter that controls the percentage of missing values allowed in a submatrix
- For every object i in the δ-cluster: $|J'_i| / |J| \ge \alpha$
- $|J'_i|$ is the number of specified attributes for object i in the δ-cluster
- $|J|$ is the number of attributes in the δ-cluster
21 Occupancy Threshold cont.
- A similar occupancy threshold α applies to each attribute j in the δ-cluster: at least a fraction α of the objects must have attribute j specified
- Example: α = 0.6
1 3
4 5
3 4
1 3 3
3 4 5
3 4 4
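The occupancy test can be sketched as below. Missing values are represented by `None`, which is a choice made for this example. The positions of the missing entries in `sparse` are hypothetical, since the slide text does not preserve where the blanks were.

```python
def satisfies_occupancy(matrix, alpha):
    """True if every row and every column of the submatrix has at least a
    fraction alpha of specified (non-None) entries."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    for row in matrix:
        if sum(v is not None for v in row) / n_cols < alpha:
            return False
    for j in range(n_cols):
        if sum(matrix[i][j] is not None for i in range(n_rows)) / n_rows < alpha:
            return False
    return True


# Hypothetical 3 x 3 submatrices; None marks a missing value.
sparse = [[1, None, 3], [None, 4, 5], [3, 4, None]]          # 2/3 specified everywhere
too_sparse = [[1, None, None], [None, 4, 5], [3, 4, None]]   # first row only 1/3
full = [[1, 3, 3], [3, 4, 5], [3, 4, 4]]

print(satisfies_occupancy(sparse, 0.6), satisfies_occupancy(too_sparse, 0.6))
```

With α = 0.6, a 3 x 3 submatrix may miss at most one entry in each row and each column, because 2/3 is above the threshold and 1/3 is below it.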
22 Volume
- The volume of a δ-cluster (I,J) is the number of specified entries $d_{ij}$ in (I,J)
- Example: the volume is 3 × 3 = 9
1 3 3
3 4 5
3 4 4
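Counting the volume is a one-liner once missing values have a representation. As an assumption of this sketch, `None` marks an unspecified entry; the second matrix is hypothetical.

```python
def volume(matrix):
    """Number of specified (non-None) entries in the submatrix."""
    return sum(v is not None for row in matrix for v in row)


full = [[1, 3, 3], [3, 4, 5], [3, 4, 4]]               # the example above
with_gaps = [[1, None, 3], [None, 4, 5], [3, 4, None]]  # hypothetical
print(volume(full), volume(with_gaps))
```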
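23 Base
- Object base $d_{iJ}$: the average of the specified entries of object i over the attributes in J
- Attribute base $d_{Ij}$: the average of the specified entries of attribute j over the objects in I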
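24 Base cont.
- δ-cluster base $d_{IJ}$: the average of all specified entries in (I,J)
- For a perfect δ-cluster: $d_{ij} = d_{iJ} + d_{Ij} - d_{IJ}$ for every specified entry, where $d_{iJ}$ and $d_{Ij}$ are the bases of object i and attribute j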
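The three bases can be sketched as plain averages over specified entries. That reading, the use of `None` for missing values, and the function names are all assumptions of this example. A cluster is treated as perfect when every specified entry equals object base plus attribute base minus cluster base.

```python
def avg(values):
    """Average of the specified (non-None) values."""
    vals = [v for v in values if v is not None]
    return sum(vals) / len(vals)


def bases(m):
    """Return (object bases, attribute bases, cluster base) of submatrix m."""
    n_rows, n_cols = len(m), len(m[0])
    object_base = [avg(row) for row in m]
    attribute_base = [avg(m[i][j] for i in range(n_rows)) for j in range(n_cols)]
    cluster_base = avg(v for row in m for v in row)
    return object_base, attribute_base, cluster_base


def is_perfect(m, tol=1e-9):
    """True if d_ij = d_iJ + d_Ij - d_IJ holds for every specified entry."""
    obj, att, base = bases(m)
    return all(
        abs(v - obj[i] - att[j] + base) <= tol
        for i, row in enumerate(m)
        for j, v in enumerate(row)
        if v is not None
    )


ratings = [[1, 2, 3, 6], [2, 3, 4, 7], [4, 5, 6, 9]]
obj, att, base = bases(ratings)
print(obj, base, is_perfect(ratings))
```

For the movie-rating matrix the object bases are 3, 4 and 6, which are exactly the per-viewer offsets that make the shifting pattern visible.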
25 Residue
- Entry residue $r_{ij}$:
- $r_{ij} = d_{ij} - d_{iJ} - d_{Ij} + d_{IJ}$ if $d_{ij}$ is specified
- $r_{ij} = 0$ otherwise
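26 Residue cont.
- δ-cluster residue: $r_{IJ} = \frac{1}{v_{IJ}} \sum_{i \in I, j \in J} |r_{ij}|$, the mean absolute entry residue, where $v_{IJ}$ is the volume of the cluster
- An r-residue δ-cluster is a δ-cluster whose residue is equal to or smaller than r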
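A minimal sketch of the residue computation follows. It assumes the cluster residue is the mean absolute entry residue over the specified entries (a squared mean is another possible choice), that bases are averages over specified entries, and that `None` marks a missing value. All function names are made up for this example.

```python
def avg(values):
    """Average of the specified (non-None) values."""
    vals = [v for v in values if v is not None]
    return sum(vals) / len(vals)


def entry_residues(m):
    """Residue of every entry; unspecified entries contribute 0."""
    obj = [avg(row) for row in m]
    att = [avg(row[j] for row in m) for j in range(len(m[0]))]
    base = avg(v for row in m for v in row)
    return [
        [0.0 if v is None else v - obj[i] - att[j] + base for j, v in enumerate(row)]
        for i, row in enumerate(m)
    ]


def cluster_residue(m):
    """Mean absolute entry residue over the specified entries (assumed form)."""
    vol = sum(v is not None for row in m for v in row)
    return sum(abs(r) for row in entry_residues(m) for r in row) / vol


def is_r_residue(m, r):
    return cluster_residue(m) <= r


shifted = [[1, 2, 3, 6], [2, 3, 4, 7], [4, 5, 6, 9]]
noisy = [[1, 2, 3, 6], [2, 3, 4, 7], [4, 5, 6, 0]]
print(cluster_residue(shifted), cluster_residue(noisy))
```

The perfect shifting pattern has residue zero, while the matrix with one broken entry has residue 1.5, so it is a 2-residue cluster but not a 1-residue cluster.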
27 Presentation Layout
- Overview of Clustering
- Related Work of δ-Clusters
- δ-Clusters Model
- FLOC algorithm (Flexible Overlapping Clustering)
28 Flow Chart
- Generate initial clusters
- Determine the best action for each row and each column
- Perform the best actions sequentially
- Improved? If yes, determine the best actions again; if no, stop
29 Initial Clusters
- Randomly generate k initial clusters
- Different values of the parameter ρ produce clusters of different sizes
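One plausible way to generate the initial clusters is sketched below. It assumes the size parameter acts as an inclusion probability: each row and each column joins a cluster independently with that probability. The parameter name `rho`, the function name and the seed argument are choices made for this example.

```python
import random


def initial_clusters(n_rows, n_cols, k, rho, seed=0):
    """Generate k random clusters as (row set, column set) pairs.

    Each row and column is included independently with probability rho
    (an assumed reading of the size parameter), so a cluster is expected
    to hold about rho * n_rows rows and rho * n_cols columns.
    """
    rng = random.Random(seed)
    clusters = []
    for _ in range(k):
        rows = {i for i in range(n_rows) if rng.random() < rho}
        cols = {j for j in range(n_cols) if rng.random() < rho}
        clusters.append((rows, cols))
    return clusters


clusters = initial_clusters(n_rows=8, n_cols=5, k=3, rho=0.5)
print(clusters)
```

Clusters may overlap, because every cluster draws its rows and columns independently of the others.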
30 Choose Best Actions
- For every object or attribute, there are k candidate actions, one per cluster: change its membership in that cluster
- Choose the best action among the k candidates according to gain
- Gain is the difference between the original residue and the residue assuming the action is performed on the cluster
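The gain computation can be sketched as follows, under several simplifying assumptions: an action toggles the membership of one row in one cluster, the data matrix has no missing values, the residue is the mean absolute entry residue, and an empty cluster has residue 0. The function names and the data are hypothetical.

```python
def residue(m, rows, cols):
    """Mean absolute residue of the submatrix of m on (rows, cols)."""
    if not rows or not cols:
        return 0.0  # assumption: an empty cluster has residue 0
    row_base = {i: sum(m[i][j] for j in cols) / len(cols) for i in rows}
    col_base = {j: sum(m[i][j] for i in rows) / len(rows) for j in cols}
    base = sum(row_base.values()) / len(rows)
    total = sum(abs(m[i][j] - row_base[i] - col_base[j] + base)
                for i in rows for j in cols)
    return total / (len(rows) * len(cols))


def gain(m, rows, cols, row):
    """Residue before minus residue after toggling `row` in the cluster."""
    return residue(m, rows, cols) - residue(m, rows ^ {row}, cols)


def best_action(m, clusters, row):
    """Index of the cluster whose membership change for `row` has the largest gain."""
    gains = [gain(m, rows, cols, row) for rows, cols in clusters]
    return max(range(len(clusters)), key=lambda c: gains[c]), gains


# Rows 0-2 follow a shifting pattern; row 3 does not.
m = [[1, 2, 3, 6], [2, 3, 4, 7], [4, 5, 6, 9], [9, 1, 8, 2]]
all_cols = {0, 1, 2, 3}
clusters = [({0, 1, 3}, all_cols), ({0, 1, 2}, all_cols)]
choice, gains = best_action(m, clusters, row=3)
print(choice, gains)
```

For row 3, removing it from the first cluster has positive gain and adding it to the second has negative gain, so the first action is chosen. The same pattern applies to columns by toggling the column set instead.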
31 Choose Best Actions cont.
- Sometimes even the best gain is negative
- The action is still performed, so that the search can move toward the global optimum
32 Perform the Actions Sequentially
- Generate the action sequence
- 1) the same order in all iterations
- 2) random order sequence
- 3) weighted random order sequence
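The three orderings can be sketched as below. The slide does not define the weighting, so this example assumes that actions with a larger gain should tend to come earlier, and implements that with a standard weighted shuffle (sorting by `u ** (1 / w)` for a uniform random `u`). Shifting the gains to make them positive is an additional choice made here, as are the function names.

```python
import random


def fixed_order(actions):
    """1) The same order in every iteration."""
    return list(actions)


def random_order(actions, rng):
    """2) A uniformly random permutation."""
    shuffled = list(actions)
    rng.shuffle(shuffled)
    return shuffled


def weighted_random_order(actions, gains, rng):
    """3) A random permutation in which higher-gain actions tend to come first.

    Gains may be negative, so they are shifted to positive weights
    (an assumption of this sketch) before the weighted shuffle.
    """
    low = min(gains)
    weights = [g - low + 1e-9 for g in gains]
    keyed = [(rng.random() ** (1.0 / w), a) for a, w in zip(actions, weights)]
    keyed.sort(key=lambda pair: pair[0], reverse=True)
    return [a for _, a in keyed]


actions = ["a", "b", "c", "d"]      # hypothetical action labels
gains = [0.5, -0.2, 1.0, 0.1]       # hypothetical gains
rng = random.Random(0)
order = weighted_random_order(actions, gains, rng)
print(fixed_order(actions), random_order(actions, rng), order)
```

Each strategy returns a permutation of the same actions; only the order in which they are applied differs.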
33 Output the Best Clusters
- After some iterations, when the minimum residue no longer improves, the algorithm stops and outputs the k best clusters
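Putting the phases together, the overall loop can be sketched as below. This is a heavily simplified illustration, not the authors' implementation. It assumes a matrix without missing values, a gain based on residue only, a fixed action order, initial clusters of a fixed size, and a minimum cluster size (`min_size`) that blocks degenerate actions. All names and the data are hypothetical.

```python
import random


def residue(m, rows, cols):
    """Mean absolute residue of the submatrix of m on (rows, cols)."""
    row_base = {i: sum(m[i][j] for j in cols) / len(cols) for i in rows}
    col_base = {j: sum(m[i][j] for i in rows) / len(rows) for j in cols}
    base = sum(row_base.values()) / len(rows)
    total = sum(abs(m[i][j] - row_base[i] - col_base[j] + base)
                for i in rows for j in cols)
    return total / (len(rows) * len(cols))


def floc(m, k, rho=0.5, max_iter=20, seed=0, min_size=2):
    rng = random.Random(seed)
    n_rows, n_cols = len(m), len(m[0])
    # Phase 1: k random initial clusters of roughly rho * n rows and columns.
    clusters = []
    for _ in range(k):
        rows = set(rng.sample(range(n_rows), max(min_size, round(rho * n_rows))))
        cols = set(rng.sample(range(n_cols), max(min_size, round(rho * n_cols))))
        clusters.append((rows, cols))

    def average_residue(cs):
        return sum(residue(m, r, c) for r, c in cs) / len(cs)

    def toggled(cluster, axis, x):
        rows, cols = cluster
        return (rows ^ {x}, cols) if axis == 0 else (rows, cols ^ {x})

    def allowed(cluster):
        return len(cluster[0]) >= min_size and len(cluster[1]) >= min_size

    best, best_score = list(clusters), average_residue(clusters)
    for _ in range(max_iter):
        # Phase 2a: pick the best of the k actions for every row and column.
        actions = []
        for axis, n in ((0, n_rows), (1, n_cols)):
            for x in range(n):
                candidates = []
                for ci, cluster in enumerate(clusters):
                    after = toggled(cluster, axis, x)
                    if allowed(after):
                        g = residue(m, *cluster) - residue(m, *after)
                        candidates.append((g, ci))
                if candidates:
                    actions.append((axis, x, max(candidates)[1]))
        # Phase 2b: perform the chosen actions sequentially (fixed order).
        for axis, x, ci in actions:
            after = toggled(clusters[ci], axis, x)
            if allowed(after):
                clusters[ci] = after
        # Stop as soon as an iteration brings no improvement.
        score = average_residue(clusters)
        if score >= best_score:
            break
        best, best_score = list(clusters), score
    return best, best_score


data = [[1, 2, 3, 6, 5], [2, 3, 4, 7, 1], [4, 5, 6, 9, 8],
        [9, 1, 8, 2, 7], [0, 1, 2, 5, 3]]
best, score = floc(data, k=2, seed=1)
print(best, score)
```

Because the gain here looks at residue alone, clusters tend to shrink toward the minimum size; a practical objective would also reward volume.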
34 End