δ-Clusters: Capturing Subspace Correlation in a Large Data Set

Slides: 35

Provided by: xuehu

1
δ-Clusters: Capturing Subspace Correlation in a Large Data Set

Authors: Jiong Yang, Wei Wang, et al. (ICDE 2002)
Presenter: Xuehua Shen
xshen_at_uiuc.edu

2
Presentation Layout

Overview of Clustering
Related Work of δ-Clusters
δ-Clusters Model
FLOC algorithm

3
Clustering

Clustering: the process of grouping a set of objects into classes of similar objects
Objects are similar to one another within the same cluster
Objects are dissimilar to the objects in other clusters

4
Major Clustering Methods

Partitioning algorithms
Hierarchical algorithms
Density-based
Grid-based
Model-based

5
Similarity

Clustering: the process of grouping a set of objects into classes of similar objects
But how should similarity be defined?

6
Similarity cont.

Traditional clustering models are based on distance functions
Some popular ones include the Minkowski distance:
$d(i, j) = \left( |x_{i1} - x_{j1}|^q + |x_{i2} - x_{j2}|^q + \cdots + |x_{ip} - x_{jp}|^q \right)^{1/q}$
where $i = (x_{i1}, x_{i2}, \ldots, x_{ip})$ and $j = (x_{j1}, x_{j2}, \ldots, x_{jp})$ are two p-dimensional data objects, and q is a positive integer
But strong correlations may still exist among a set of objects even if they are far apart from each other as measured by the distance function
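
A minimal sketch of the point above, in plain Python. The two objects `a` and `b` are made-up values, not data from the presentation: `b` is `a` shifted by a constant, so the Minkowski distance between them is large even though they follow exactly the same pattern.

```python
def minkowski(x, y, q):
    """Minkowski distance of order q between two equal-length vectors."""
    return sum(abs(a - b) ** q for a, b in zip(x, y)) ** (1.0 / q)

# Hypothetical objects: b is a shifted by a constant 100 on every dimension.
a = [1.0, 5.0, 3.0, 8.0]
b = [v + 100.0 for v in a]

manhattan = minkowski(a, b, 1)  # q = 1
euclidean = minkowski(a, b, 2)  # q = 2

# The per-dimension offsets are all identical, so the two objects are
# perfectly correlated although both distances are large.
offsets = {bv - av for av, bv in zip(a, b)}
```

Here q = 1 gives the Manhattan distance and q = 2 the Euclidean distance; a distance-based method would place `a` and `b` in different clusters.
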

7
Similarity cont.

δ-Clusters model: objects are similar when they exhibit a coherent pattern on a subset of dimensions
Can cluster objects that show a shifting pattern or a scaling pattern
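
A small illustration of the two pattern types, assuming the usual reading: a shifting pattern means a constant difference on every dimension, a scaling pattern means a constant ratio. The helper names and the sample vectors are hypothetical.

```python
def is_shifting(x, y, tol=1e-9):
    """True if y[k] - x[k] is the same constant on every dimension."""
    diffs = [b - a for a, b in zip(x, y)]
    return max(diffs) - min(diffs) <= tol

def is_scaling(x, y, tol=1e-9):
    """True if y[k] / x[k] is the same constant on every dimension.

    Assumes x contains no zeros.
    """
    ratios = [b / a for a, b in zip(x, y)]
    return max(ratios) - min(ratios) <= tol

# Hypothetical objects used only for illustration.
base = [1.0, 4.0, 2.0, 8.0]
shifted = [v + 5.0 for v in base]   # shifting pattern relative to base
scaled = [v * 3.0 for v in base]    # scaling pattern relative to base
```
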

8
Similarity cont.

Example of coherent patterns: shifting pattern and scaling pattern

9
Subspace Clustering

From high-dimensional clustering (problematic) to subspace clustering
Not restricted to a fixed ordering of columns, in contrast to patterns in time-series data
Challenge: the curse of dimensionality!

10
Subspace Clustering cont.

Example of subspace clustering

          CH1I   CH1B   CH1D   CH2I   CH2B
CTFC3     4392    284   4108    280    228
VPS8       401    281    120    275    298
EFB1       318    280     37    277    215
SSA1       401    292    109    580    238
FUN14     2857    285   2576    271    226
SP07       228    290     48    285    224
MDM10      538    272    266    277    236
CYS3       322    288     41    278    219

Submatrix on a subset of genes and conditions:

          CH1I   CH1D   CH2B
VPS8       401    120    298
EFB1       318     37    215
CYS3       322     41    219

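
The submatrix can be checked directly. The sketch below copies the values from the table and selects columns by position (1st, 3rd and 5th condition), which is an illustration choice. On those three conditions the genes VPS8, EFB1 and CYS3 differ by a constant offset, while on all five conditions they do not.

```python
# Expression values copied from the table (rows: genes, columns: conditions).
genes = {
    "CTFC3": [4392, 284, 4108, 280, 228],
    "VPS8":  [401, 281, 120, 275, 298],
    "EFB1":  [318, 280, 37, 277, 215],
    "SSA1":  [401, 292, 109, 580, 238],
    "FUN14": [2857, 285, 2576, 271, 226],
    "SP07":  [228, 290, 48, 285, 224],
    "MDM10": [538, 272, 266, 277, 236],
    "CYS3":  [322, 288, 41, 278, 219],
}

def offsets(gene_x, gene_y, cols):
    """Per-condition differences gene_y - gene_x on the chosen columns."""
    return [genes[gene_y][c] - genes[gene_x][c] for c in cols]

subspace = [0, 2, 4]        # 1st, 3rd and 5th condition columns
all_cols = [0, 1, 2, 3, 4]

sub_vps8_efb1 = offsets("EFB1", "VPS8", subspace)   # constant offset
sub_vps8_cys3 = offsets("CYS3", "VPS8", subspace)   # constant offset
full_vps8_efb1 = offsets("EFB1", "VPS8", all_cols)  # not constant
```
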
11
Applications

Microarray Data Analysis in Biology
E-Commerce

12
Microarray Data Analysis

Matrix (Dense)
Rows: genes
Columns: various samples (experimental conditions or tissues)
Values in matrix: expression levels (relative abundance of the mRNA of a gene under a specific condition)

13
Microarray Data Analysis cont.

From scaling pattern to shifting pattern
Red: gene of interest; green: control gene
Investigations show that several genes contribute to a disease, which motivates researchers to identify a subset of genes whose expression levels rise and fall coherently under a subset of conditions
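
The slide does not say how a scaling pattern is turned into a shifting pattern. A common way, assumed here, is to take logarithms: a constant ratio becomes a constant difference. The two gene profiles are made-up positive values.

```python
import math

def log_transform(row):
    """Natural log of every (strictly positive) expression value."""
    return [math.log(v) for v in row]

# Hypothetical profiles: gene_b is gene_a scaled by a factor of 3.
gene_a = [2.0, 8.0, 4.0, 16.0]
gene_b = [v * 3.0 for v in gene_a]

log_a = log_transform(gene_a)
log_b = log_transform(gene_b)

# After the transform every offset equals log(3): a shifting pattern.
offsets = [lb - la for la, lb in zip(log_a, log_b)]
```
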

14
E-Commerce

Example: ratings of movies (1 is the lowest rating, 10 the highest)
Shifting pattern
If a new movie comes out and the 1st viewer rates it 7 and the 3rd viewer rates it 9, the 2nd viewer will probably like this movie too

           Movie1   Movie2   Movie3   Movie4
Viewer1         1        2        3        6
Viewer2         2        3        4        7
Viewer3         4        5        6        9

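
One way to make the prediction concrete, as a sketch only: learn each viewer's average offset from the table, then shift the known ratings of the new movie by those offsets. Averaging the two estimates is an illustration choice, not something stated in the presentation.

```python
# Ratings copied from the table (columns: Movie1 .. Movie4).
ratings = {
    "Viewer1": [1, 2, 3, 6],
    "Viewer2": [2, 3, 4, 7],
    "Viewer3": [4, 5, 6, 9],
}

def mean_offset(target, other):
    """Average rating difference target - other over the rated movies."""
    diffs = [t - o for t, o in zip(ratings[target], ratings[other])]
    return sum(diffs) / len(diffs)

def predict(target, known):
    """Average of (other viewer's new rating + learned offset to target)."""
    estimates = [r + mean_offset(target, other) for other, r in known.items()]
    return sum(estimates) / len(estimates)

# New movie: Viewer1 rates 7, Viewer3 rates 9 (numbers from the slide).
new_movie = {"Viewer1": 7, "Viewer3": 9}
prediction = predict("Viewer2", new_movie)
```

Viewer2 rates one point above Viewer1 and two points below Viewer3 on every movie in the table, so the two estimates are 8 and 7, and the sketch returns their mean.
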
15
Presentation Layout

Overview of clustering
Related Work of δ-Clusters
δ-Clusters Model
FLOC algorithm

16
Related Work

CLIQUE, ORCLUS, PROCLUS (subspace clustering)
Can capture neither the shifting pattern nor the scaling pattern
Bicluster model: proposed as a measure of coherence of genes and conditions in a submatrix of a DNA array

17
Bicluster

Model: mean squared residue score $H(I, J)$ of a submatrix
A submatrix $A_{IJ}$ is called a δ-bicluster if $H(I, J) \le \delta$
Algorithm: a randomized algorithm that gives an approximate answer
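
A sketch of the score, assuming the standard bicluster definition: each entry's residue is the entry minus its row mean minus its column mean plus the overall mean, and H(I, J) is the mean of the squared residues. The submatrix is assumed to be fully specified; the `noisy` matrix is a made-up variant of the movie ratings with one value changed.

```python
def mean_squared_residue(matrix):
    """H(I, J) for a fully specified submatrix given as a list of rows."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    row_mean = [sum(row) / n_cols for row in matrix]
    col_mean = [sum(row[j] for row in matrix) / n_rows for j in range(n_cols)]
    all_mean = sum(row_mean) / n_rows
    total = 0.0
    for i in range(n_rows):
        for j in range(n_cols):
            residue = matrix[i][j] - row_mean[i] - col_mean[j] + all_mean
            total += residue ** 2
    return total / (n_rows * n_cols)

def is_delta_bicluster(matrix, delta):
    """True if the mean squared residue does not exceed delta."""
    return mean_squared_residue(matrix) <= delta

# A perfect shifting pattern scores (numerically) zero.
shifting = [[1, 2, 3, 6], [2, 3, 4, 7], [4, 5, 6, 9]]
# Hypothetical variant with one entry changed from 3 to 9.
noisy = [[1, 2, 3, 6], [2, 9, 4, 7], [4, 5, 6, 9]]
```
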

18
Weakness of bicluster

Missing Values
Constraints

19
Presentation Layout

Overview
Related Work
δ-Clusters Model
FLOC algorithm

20
Occupancy Threshold

A parameter α controls the percentage of missing values allowed in a submatrix
For every object i in the δ-cluster: $|J_i| / |J| \ge \alpha$
$J_i$ is the set of specified attributes for object i in the δ-cluster
$|J|$ is the number of attributes in the δ-cluster

21
Occupancy Threshold cont.

A similar occupancy threshold applies to each attribute j in the δ-cluster
Example: α = 0.6

1 3
4 5
3 4

1 3 3
3 4 5
3 4 4

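
A sketch of the occupancy check, with `None` standing for an unspecified entry. It assumes the threshold means "the fraction of specified entries in every row and every column is at least α". The positions of the missing entries in `sparse` are hypothetical, since the slide layout did not survive.

```python
def satisfies_occupancy(matrix, alpha):
    """True if every row and every column has at least a fraction alpha
    of specified (non-None) entries."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    rows_ok = all(
        sum(v is not None for v in row) / n_cols >= alpha for row in matrix
    )
    cols_ok = all(
        sum(row[j] is not None for row in matrix) / n_rows >= alpha
        for j in range(n_cols)
    )
    return rows_ok and cols_ok

# Hypothetical placement of the missing entries (one per row and column).
sparse = [
    [1, None, 3],
    [None, 4, 5],
    [3, 4, None],
]
full = [
    [1, 3, 3],
    [3, 4, 5],
    [3, 4, 4],
]
```

With two of three entries specified in each row and column, `sparse` has occupancy 2/3 and passes α = 0.6 but would fail α = 0.7.
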
22
Volume

The volume of a δ-cluster (I, J) is the number of specified entries $d_{ij}$ in (I, J)
Example: the volume is 3 × 3 = 9

1 3 3
3 4 5
3 4 4
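
Counting the volume is a one-liner when unspecified entries are stored as `None` (a representation chosen for this sketch). The matrix `with_missing` is a hypothetical variant of the example with two entries removed.

```python
def volume(matrix):
    """Number of specified (non-None) entries in the submatrix."""
    return sum(v is not None for row in matrix for v in row)

full = [[1, 3, 3], [3, 4, 5], [3, 4, 4]]
# Hypothetical variant with two unspecified entries.
with_missing = [[1, None, 3], [3, 4, 5], [3, 4, None]]
```
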
23
Base

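Object base: $d_{iJ}$, the average of the specified entries of object i over the attributes in J
Attribute base: $d_{Ij}$, the average of the specified entries of attribute j over the objects in I

24
Base cont.

δ-cluster base: $d_{IJ}$, the average of all specified entries in the δ-cluster
For a perfect δ-cluster: $d_{ij} = d_{iJ} + d_{Ij} - d_{IJ}$ for every specified entry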
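
A sketch of the three bases, assuming each base is the average over specified entries only (unspecified entries are `None` and are skipped). The function names are hypothetical. For a perfect cluster such as the movie-rating table, object base plus attribute base minus cluster base reproduces every entry.

```python
def mean_specified(values):
    """Average of the non-None values; assumes at least one is specified."""
    present = [v for v in values if v is not None]
    return sum(present) / len(present)

def object_base(matrix, i):
    """Average of the specified entries in row (object) i."""
    return mean_specified(matrix[i])

def attribute_base(matrix, j):
    """Average of the specified entries in column (attribute) j."""
    return mean_specified([row[j] for row in matrix])

def cluster_base(matrix):
    """Average of all specified entries in the cluster."""
    return mean_specified([v for row in matrix for v in row])

def reconstruct(matrix, i, j):
    """Entry predicted by the bases; exact for a perfect cluster."""
    return object_base(matrix, i) + attribute_base(matrix, j) - cluster_base(matrix)

# The movie-rating table is a perfect shifting pattern.
perfect = [[1, 2, 3, 6], [2, 3, 4, 7], [4, 5, 6, 9]]
```
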

25
Residue

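Entry residue: $r_{ij} = d_{ij} - d_{iJ} - d_{Ij} + d_{IJ}$ if $d_{ij}$ is specified, where $d_{iJ}$, $d_{Ij}$ and $d_{IJ}$ are the object, attribute and δ-cluster bases
Otherwise $r_{ij} = 0$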

26
Residue cont.

δ-cluster residue
A δ-cluster is an r-residue δ-cluster if its residue is less than or equal to r
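
A sketch of both residues. The entry residue is the entry minus the object base minus the attribute base plus the cluster base, and 0 for an unspecified entry. The slide text does not preserve how entry residues are aggregated; the mean of absolute entry residues over the specified entries is assumed here (a mean of squares would be the other common choice).

```python
def mean_specified(values):
    """Average of the non-None values; assumes at least one is specified."""
    present = [v for v in values if v is not None]
    return sum(present) / len(present)

def entry_residue(matrix, i, j):
    """d_ij - d_iJ - d_Ij + d_IJ if d_ij is specified, otherwise 0."""
    if matrix[i][j] is None:
        return 0.0
    d_iJ = mean_specified(matrix[i])                         # object base
    d_Ij = mean_specified([row[j] for row in matrix])        # attribute base
    d_IJ = mean_specified([v for row in matrix for v in row])  # cluster base
    return matrix[i][j] - d_iJ - d_Ij + d_IJ

def cluster_residue(matrix):
    """Assumed aggregation: mean absolute entry residue over specified entries."""
    specified = [
        (i, j)
        for i, row in enumerate(matrix)
        for j, v in enumerate(row)
        if v is not None
    ]
    return sum(abs(entry_residue(matrix, i, j)) for i, j in specified) / len(specified)

def is_r_residue_cluster(matrix, r):
    """True if the cluster residue is less than or equal to r."""
    return cluster_residue(matrix) <= r

example = [[1, 3, 3], [3, 4, 5], [3, 4, 4]]
perfect = [[1, 2, 3, 6], [2, 3, 4, 7], [4, 5, 6, 9]]
```

Under this assumed aggregation, the 3 × 3 example matrix has six entries with residue ±1/3 and three with residue 0, giving a cluster residue of 2/9.
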

27
Presentation Layout

Overview of Clustering
Related Work of δ-Clusters
δ-Clusters Model
FLOC algorithm (Flexible Overlapping Clustering)

28
Flow Chart

Generate initial clusters
Determine the best action for each row and each column
Perform the best actions sequentially
Improved? If yes, repeat from the second step; otherwise stop

29
Initial Cluster

Randomly generate k initial clusters
Different values of the parameter ρ produce clusters of different sizes
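
A sketch of random initialization under one plausible reading: each row and each column joins a cluster independently with some probability, so a larger probability yields larger clusters. The name `rho`, its interpretation as an inclusion probability, and the `seed` argument are assumptions made for this example.

```python
import random

def initial_clusters(n_rows, n_cols, k, rho, seed=0):
    """Generate k random clusters as (row set, column set) pairs.

    Each row and each column is included independently with probability rho.
    """
    rng = random.Random(seed)
    clusters = []
    for _ in range(k):
        rows = {i for i in range(n_rows) if rng.random() < rho}
        cols = {j for j in range(n_cols) if rng.random() < rho}
        clusters.append((rows, cols))
    return clusters

clusters = initial_clusters(n_rows=5, n_cols=4, k=3, rho=0.5)
```
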

30
Choose best actions

For every object or attribute there are k candidate actions, one per cluster
Choose the best action among the k candidates according to gain
Gain is the difference between the original residue of the cluster and its residue assuming the action is performed on it
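
A sketch of gain-based action selection for a row. It assumes that an action toggles the row's membership in a cluster (insert if absent, remove if present), that the residue is the mean absolute entry residue, and that all entries are specified. The data matrix and the two clusters are made up: rows 0 to 2 follow a shifting pattern and row 3 does not.

```python
def residue(data, rows, cols):
    """Mean absolute residue of cluster (rows, cols); all entries specified."""
    rows, cols = sorted(rows), sorted(cols)
    if not rows or not cols:
        return 0.0
    row_base = {i: sum(data[i][j] for j in cols) / len(cols) for i in rows}
    col_base = {j: sum(data[i][j] for i in rows) / len(rows) for j in cols}
    base = sum(row_base.values()) / len(rows)
    total = sum(
        abs(data[i][j] - row_base[i] - col_base[j] + base)
        for i in rows
        for j in cols
    )
    return total / (len(rows) * len(cols))

def toggle(members, x):
    """Remove x if it is a member, otherwise add it."""
    return members - {x} if x in members else members | {x}

def row_gain(data, cluster, row):
    """Original residue minus the residue after toggling the row."""
    rows, cols = cluster
    return residue(data, rows, cols) - residue(data, toggle(rows, row), cols)

def best_row_action(data, clusters, row):
    """Index of the cluster whose action on this row has the largest gain."""
    gains = [row_gain(data, c, row) for c in clusters]
    best = max(range(len(clusters)), key=lambda c: gains[c])
    return best, gains[best]

# Hypothetical data: rows 0-2 are coherent, row 3 is not.
data = [[1, 2, 3, 6], [2, 3, 4, 7], [4, 5, 6, 9], [9, 1, 8, 2]]
clusters = [({0, 1, 3}, {0, 1, 2, 3}), ({0, 1}, {0, 1, 2, 3})]

best_cluster, best_gain = best_row_action(data, clusters, 3)
```

For row 3 the best action is to remove it from the first cluster, which leaves a perfect shifting pattern; inserting it into the second cluster would raise that cluster's residue and therefore has negative gain.
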

31
Choose Best Actions cont.