
1
CACTUS Clustering Categorical Data Using
Summaries
  • By Venkatesh Ganti, Johannes Gehrke and Raghu
    Ramakrishnan
  • RongEn Li
  • School of Informatics, Edinburgh

2
Overview
  • Introduction and motivation
  • Existing tools for clustering categorical data
    STIRR and ROCK
  • Definition of a cluster over categorical data
  • The algorithm CACTUS
  • Experiments and results
  • Summary

3
Introduction and motivation
  • Numeric data: 1, 2, 3, 4, 5, …
  • Categorical data: LFD, PMR, DME, …
  • Categorical attributes usually have a small number
    of attribute values in their domains; large domains
    make it hard to infer useful information.
  • Use relations! Relations contain different
    attributes, but the cross product of the attribute
    domains can be large.
  • CACTUS: a fast summarisation-based algorithm
    which uses summary information to find well-defined
    clusters.

4
Existing tools for clustering categorical data
  • STIRR
  • Each attribute value is represented as a weighted
    vertex in a graph.
  • Multiple copies b1, …, bm (basins) of the weighted
    vertices are maintained; the same vertex can have
    different weights in different basins.
  • Starting step: assign a set of weights to all
    vertices in all basins.
  • Iterative step: increment the weight in basin bi
    on vertex tj, for each tuple t = ⟨t1, …, tn⟩
    containing tj, using a function that combines the
    weights of the vertices other than tj in bi.
  • At the fixed point, the large positive weights and
    small negative weights across the basins isolate
    two groups of attribute values on each attribute.
  • ROCK
  • Starts with each tuple in its own cluster.
  • Merges close clusters until a required number
    (user-specified) of clusters remains; closeness is
    defined by a similarity function.
  • STIRR is used for comparison with CACTUS.
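The iterative step above can be sketched roughly as follows. This is an illustrative, much-simplified single-basin weight update; the dataset, attribute labels and the additive combiner are assumptions (STIRR supports several combiner functions and multiple basins).

```python
# Rough sketch of one STIRR-style weight update on a single basin.
# Dataset, attribute labels and the additive combiner are invented
# for illustration; the real algorithm iterates to a fixed point.
D = [("a", "x"), ("a", "y"), ("b", "y")]

# One weight per (attribute, value) vertex, all starting at 1.0.
w = {("A1", "a"): 1.0, ("A1", "b"): 1.0,
     ("A2", "x"): 1.0, ("A2", "y"): 1.0}

def stirr_step(D, w):
    new = {k: 0.0 for k in w}
    for t in D:
        vertices = [("A1", t[0]), ("A2", t[1])]
        for v in vertices:
            # combine the weights of the other vertices in the tuple
            new[v] += sum(w[u] for u in vertices if u != v)
    # normalise so the weights stay bounded across iterations
    norm = max(abs(x) for x in new.values()) or 1.0
    return {k: x / norm for k, x in new.items()}

w = stirr_step(D, w)
print(w[("A1", "a")])  # "a" appears in two tuples, so it ends up heaviest
```

After one step the frequently co-occurring values ("a" and "y") carry the largest weights, which is the effect the fixed point amplifies.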

5
Definitions: interval region, support and
belonging
  • {A1, …, An} is a set of categorical attributes with
    domains D1, …, Dn respectively. D is a set of
    tuples where each tuple t ∈ D1 × … × Dn.
  • Interval region: S = S1 × … × Sn where Si ⊆ Di
    for all i ∈ {1, …, n}. Equivalent to intervals in
    numeric data.
  • Support: the support of a value pair is
    σD(ai, aj) = |{t ∈ D : t.Ai = ai and t.Aj = aj}|.
    The support σD(S) of a region S is the number of
    tuples in D contained in S.
  • Belonging: a tuple t = ⟨t.A1, …, t.An⟩ ∈ D belongs
    to a region S if for all i ∈ {1, …, n}, t.Ai ∈ Si.
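These two support definitions can be sketched directly in code; the tiny dataset and attribute values below are invented for illustration.

```python
# Tiny sketch of the support definitions above; the dataset and
# attribute values are invented for illustration.
D = [("a", "x"), ("a", "y"), ("b", "x"), ("a", "x")]

def support_pair(D, i, j, ai, aj):
    """Number of tuples t with t.Ai == ai and t.Aj == aj."""
    return sum(1 for t in D if t[i] == ai and t[j] == aj)

def support_region(D, S):
    """Number of tuples contained in the region S = S1 x ... x Sn."""
    return sum(1 for t in D if all(t[k] in Sk for k, Sk in enumerate(S)))

print(support_pair(D, 0, 1, "a", "x"))         # 2 tuples have (a, x)
print(support_region(D, [{"a"}, {"x", "y"}]))  # 3 tuples fall in {a} x {x, y}
```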

6
Definitions: expected support, strongly connected
  • The expected support under the
    attribute-independence assumption:
  • Of a region: E[σD(S)] =
    |D| · (|S1| × … × |Sn|) / (|D1| × … × |Dn|)
  • Of a pair ai and aj: E[σD(ai, aj)] =
    α · |D| / (|Di| × |Dj|)
  • α is normally set to 2 or 3
  • Strongly connected:
  • ai and aj are strongly connected if
    σD(ai, aj) > E[σD(ai, aj)]; in that case
    σ*D(ai, aj) = σD(ai, aj), otherwise
    σ*D(ai, aj) = 0.
  • ai ∈ Di is strongly connected with Sj ⊆ Dj if for
    all x ∈ Sj, ai and x are strongly connected.
  • Si and Sj are strongly connected if each ai ∈ Si is
    strongly connected with each aj ∈ Sj and each
    aj ∈ Sj is strongly connected with each ai ∈ Si.
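The expected-support test for a value pair can be sketched as below, assuming the count-based support from the previous slide; the numbers (taken from the experiment scale later in the talk) are illustrative.

```python
# Sketch of the expected-support threshold for a value pair,
# assuming count-based support; the numbers are illustrative.
def expected_support_pair(n_tuples, d_i, d_j, alpha=2):
    """alpha * |D| / (|Di| * |Dj|), per the formula above."""
    return alpha * n_tuples / (d_i * d_j)

def sigma_star(sigma, expected):
    """sigma* keeps the support of a strongly connected pair, else 0."""
    return sigma if sigma > expected else 0

# 1,000,000 tuples and 100-value domains give E = 2 * 10**6 / 10**4 = 200.
e = expected_support_pair(1_000_000, 100, 100, alpha=2)
print(sigma_star(350, e))  # 350 > 200: strongly connected -> 350
print(sigma_star(120, e))  # 120 <= 200 -> 0
```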

7
Definitions: cluster, cluster-projection,
sub-cluster and subspace cluster
  • C = ⟨C1, …, Cn⟩ is a cluster over {A1, …, An} if:
  • 1. Ci and Cj are strongly connected, for all i ≠ j
  • 2. There exists no Ci′ such that Ci′ is a proper
    superset of Ci and, for all j ≠ i, Ci′ and Cj are
    strongly connected
  • 3. The support σD(C) of C is greater than α times
    the expected support of C under the
    attribute-independence assumption
  • Ci is a cluster-projection of C on Ai.
  • C is a sub-cluster if it only satisfies 1 and 3.
  • A cluster C over a proper subset of the attributes
    S ⊂ {A1, …, An} is a subspace cluster on S.

8
Definitions: similarity, inter-attribute
summaries, intra-attribute summaries
  • Similarity: γj(a1, a2) = |{x ∈ Dj : σD(a1, x) > 0
    and σD(a2, x) > 0}|
  • Inter-attribute summary:
  • Σij = {(ai, aj, σ*D(ai, aj)) : ai ∈ Di, aj ∈ Dj,
    and σ*D(ai, aj) > 0}
  • The strongly connected attribute value pairs, where
    each pair has attribute values from different
    attributes
  • Intra-attribute summary:
  • Σii = {(a1, a2, γj(a1, a2)) : a1, a2 ∈ Di, and
    γj(a1, a2) > 0}
  • Similarities between attribute values of the same
    attribute
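A toy sketch of the two summaries, assuming the count-based support from the earlier slides; the dataset and threshold are invented.

```python
from collections import Counter

# Toy sketch of the inter-attribute summary and the similarity
# function on this slide; dataset and threshold are invented.
D = [("a", "x"), ("a", "y"), ("b", "y"), ("b", "y")]

def inter_summary(D, i, j, expected):
    """Keep value pairs whose support exceeds the expected support."""
    counts = Counter((t[i], t[j]) for t in D)
    return {pair: s for pair, s in counts.items() if s > expected}

def similarity(D, i, j, a1, a2):
    """gamma_j(a1, a2): values of Aj co-occurring with both a1 and a2."""
    co = lambda a: {t[j] for t in D if t[i] == a}
    return len(co(a1) & co(a2))

print(inter_summary(D, 0, 1, expected=1))   # {('b', 'y'): 2}
print(similarity(D, 0, 1, "a", "b"))        # both co-occur with 'y' -> 1
```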

9
CACTUS vs STIRR: clusters found by CACTUS
10
CACTUS vs STIRR: clusters found by STIRR
11
CACTUS: CAtegorical ClusTering Using Summaries
  • Central idea: a summary of the data (the inter-
    and intra-attribute summaries) is sufficient to
    find candidate clusters, which can then be
    validated.
  • A three-phase clustering algorithm
  • Summarisation
  • Clustering
  • Validation

12
Summarisation Phase
  • Assumption: the inter- and intra-attribute
    summaries of any pair of attributes fit easily into
    main memory.
  • Inter-attribute summaries
  • Use a counter, initially set to 0, for each pair
    (ai, aj) ∈ Di × Dj.
  • Scan the dataset, incrementing the counter for each
    pair encountered.
  • After the scan, compute σD(ai, aj) and reset the
    counters of those pairs whose support is less than
    E[σD(ai, aj)]. Store the remaining value pairs.
  • Intra-attribute summaries
  • Derived from the inter-attribute summaries: two
    values a1, a2 of one attribute are similar if both
    are strongly connected with a common value of
    another attribute.
  • A very fast operation, hence only computed when
    needed.
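The phase above can be sketched in one pass over a toy dataset; the data, the threshold, and the dictionary-of-sets representation are assumptions for illustration.

```python
from collections import Counter, defaultdict

# One-pass sketch of the summarisation phase above: count value
# pairs during a single scan, keep only counters above the expected
# support, then derive the intra-attribute summary on demand.
# Dataset and threshold are illustrative.
D = [("a", "x"), ("a", "x"), ("b", "x"),
     ("b", "x"), ("b", "y"), ("b", "y")]
expected = 1.0

counters = Counter((t[0], t[1]) for t in D)            # one scan over D
inter = {p: s for p, s in counters.items() if s > expected}

# Intra-attribute summary for A1: a1 and a2 are similar if some value
# x of A2 is strongly connected with both.
partners = defaultdict(set)
for (a, x) in inter:
    partners[a].add(x)
intra = {(a1, a2): len(partners[a1] & partners[a2])
         for a1 in partners for a2 in partners
         if a1 < a2 and partners[a1] & partners[a2]}

print(inter[("a", "x")])  # 2
print(intra)              # {('a', 'b'): 1}: 'x' connects both values
```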

13
Clustering Phase
  • A two-step operation
  • Step 1: analyse each attribute to compute all
    cluster-projections on it
  • Step 2: synthesise candidate clusters on sets of
    attributes from the cluster-projections on
    individual attributes

14
Clustering Phase continued
  • Step 1: compute cluster-projections on attributes
  • Step A: find all cluster-projections on Ai of
    clusters over (Ai, Aj).
  • Step B: compute all the cluster-projections on Ai
    of clusters over {A1, …, An} by intersecting sets
    of cluster-projections from Step A.
  • Step A is NP-hard! Solution: use distinguishing
    sets.
  • Distinguishing sets identify different
    cluster-projections.
  • Construct candidate distinguishing sets on Ai and
    extend some of them w.r.t. Aj.
  • The detailed steps are too long for this
    presentation, sorry!
  • Step B: intersection of cluster-projections
  • Intersection join: S1 ∩* S2 = {s : there exist
    s1 ∈ S1 and s2 ∈ S2 such that s = s1 ∩ s2 and
    |s| > 1}
  • Apply the intersection join to all sets of
    attribute values on Ai.
  • Step 2: try to augment ⟨c1, …, ck⟩ with a
    cluster-projection ck+1 on attribute Ak+1. If each
    new pair ⟨ci, ck+1⟩ is a sub-cluster on (Ai, Ak+1),
    i ∈ {1, …, k}, then add ⟨c1, …, ck+1⟩ to the
    candidate clusters.
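The intersection join used in Step B can be sketched directly; the candidate sets below are invented for illustration.

```python
# Sketch of the intersection join defined above: keep the pairwise
# intersections with more than one element (the candidate sets are
# invented for illustration).
def intersection_join(S1, S2):
    out = []
    for s1 in S1:
        for s2 in S2:
            s = s1 & s2
            if len(s) > 1 and s not in out:
                out.append(s)
    return out

S1 = [{"a", "b", "c"}, {"d", "e"}]
S2 = [{"a", "b"}, {"e"}]
print(intersection_join(S1, S2))  # only {a, b} survives; {e} has size 1
```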

15
Validation Phase
  • Use a required threshold to recognise false
    candidates: a candidate cluster may lack sufficient
    support because some of the 2-clusters combined to
    form it may be due to different sets of tuples.
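A minimal sketch of this check: a candidate survives only if enough tuples actually support it. The dataset, candidate and threshold are illustrative assumptions.

```python
# Minimal sketch of the validation check described above: a candidate
# survives only if enough tuples actually support it. Dataset,
# candidate and threshold are illustrative.
D = [("a", "x"), ("a", "x"), ("b", "y"), ("a", "y")]

def validate(D, candidate, threshold):
    support = sum(1 for t in D
                  if all(t[i] in Ci for i, Ci in enumerate(candidate)))
    return support >= threshold

print(validate(D, [{"a"}, {"x"}], threshold=2))  # 2 supporting tuples -> True
print(validate(D, [{"b"}, {"x"}], threshold=2))  # no (b, x) tuples -> False
```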

16
Experiments and Results
  • Compared against STIRR
  • Uses 1 million tuples, 10 attributes and 100
    attribute values per attribute.
  • CACTUS discovers a broader class of clusters than
    STIRR.

17
Experiments and Results
18
Conclusion
  • The authors formalised the definition of a cluster
    over categorical data.
  • CACTUS is a fast and efficient algorithm for
    clustering categorical data.
  • I am sorry that I could not show some parts of the
    algorithm due to time constraints.

19
Question Time