
1
CACTUS Clustering Categorical Data Using
Summaries
  • By Venkatesh Ganti, Johannes Gehrke and Raghu
    Ramakrishnan
  • RongEn Li
  • School of Informatics, Edinburgh

2
Overview
  • Introduction and motivation
  • Existing tools for clustering categorical data
    STIRR and ROCK
  • Definition of a cluster over categorical data
  • The algorithm CACTUS
  • Experiments and results
  • Summary

3
Introduction and motivation
  • Numeric data: 1, 2, 3, 4, 5, …
  • Categorical data: LFD, PMR, DME, …
  • Categorical attributes usually have a small number
    of attribute values in their domains; large domains
    make it hard to infer useful information.
  • Use relations! Relations contain different
    attributes, but the cross product of the attribute
    domains can be large.
  • CACTUS: a fast summarisation-based algorithm
    which uses summary information to find well-defined
    clusters.

4
Existing tools for clustering categorical data
  • STIRR
  • Each attribute value is represented as a weighted
    vertex in a graph.
  • Multiple copies b1, …, bm (basins) of the weighted
    vertices are maintained; the same vertex can have
    different weights in different basins.
  • Starting step: assign a set of weights to all
    vertices in all basins.
  • Iterative step: increment the weight in basin bi
    on vertex tj, for each tuple t = ⟨t1, …, tn⟩
    containing tj, using a function that combines the
    weights of the vertices other than tj in bi.
  • At the fixed point, the large positive weights and
    small negative weights across the basins isolate
    two groups of attribute values on each attribute.
  • ROCK
  • Starts with each tuple in its own cluster.
  • Merges close clusters until a required number
    (user-specified) of clusters remains; closeness is
    defined by a similarity function.
  • STIRR is used for comparison with CACTUS.
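The iterative step above can be sketched roughly as follows. This is an illustrative, much-simplified single-basin weight update; the dataset, attribute labels and the additive combiner are assumptions (STIRR supports several combiner functions and multiple basins).

```python
# Rough sketch of one STIRR-style weight update on a single basin.
# Dataset, attribute labels and the additive combiner are invented
# for illustration; the real algorithm iterates to a fixed point.
D = [("a", "x"), ("a", "y"), ("b", "y")]

# One weight per (attribute, value) vertex, all starting at 1.0.
w = {("A1", "a"): 1.0, ("A1", "b"): 1.0,
     ("A2", "x"): 1.0, ("A2", "y"): 1.0}

def stirr_step(D, w):
    new = {k: 0.0 for k in w}
    for t in D:
        vertices = [("A1", t[0]), ("A2", t[1])]
        for v in vertices:
            # combine the weights of the other vertices in the tuple
            new[v] += sum(w[u] for u in vertices if u != v)
    # normalise so the weights stay bounded across iterations
    norm = max(abs(x) for x in new.values()) or 1.0
    return {k: x / norm for k, x in new.items()}

w = stirr_step(D, w)
print(w[("A1", "a")])  # "a" appears in two tuples, so it ends up heaviest
```

After one step the frequently co-occurring values ("a" and "y") carry the largest weights, which is the effect the fixed point amplifies.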

5
Definitions: interval region, support and
belonging
  • {A1, …, An} is a set of categorical attributes with
    domains D1, …, Dn respectively. D is a set of
    tuples where each tuple t ∈ D1 × … × Dn.
  • Interval region: S = S1 × … × Sn where Si ⊆ Di
    for all i ∈ {1, …, n}. Equivalent to intervals in
    numeric data.
  • Support: the support of a value pair is
    σD(ai, aj) = |{t ∈ D : t.Ai = ai and t.Aj = aj}|.
    The support σD(S) of a region S is the number of
    tuples in D contained in S.
  • Belonging: a tuple t = ⟨t.A1, …, t.An⟩ ∈ D belongs
    to a region S if for all i ∈ {1, …, n}, t.Ai ∈ Si.
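These two support definitions can be sketched directly in code; the tiny dataset and attribute values below are invented for illustration.

```python
# Tiny sketch of the support definitions above; the dataset and
# attribute values are invented for illustration.
D = [("a", "x"), ("a", "y"), ("b", "x"), ("a", "x")]

def support_pair(D, i, j, ai, aj):
    """Number of tuples t with t.Ai == ai and t.Aj == aj."""
    return sum(1 for t in D if t[i] == ai and t[j] == aj)

def support_region(D, S):
    """Number of tuples contained in the region S = S1 x ... x Sn."""
    return sum(1 for t in D if all(t[k] in Sk for k, Sk in enumerate(S)))

print(support_pair(D, 0, 1, "a", "x"))         # 2 tuples have (a, x)
print(support_region(D, [{"a"}, {"x", "y"}]))  # 3 tuples fall in {a} x {x, y}
```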

6
Definitions: expected support, strongly connected
  • The expected support under the
    attribute-independence assumption:
  • Of a region: E[σD(S)] =
    |D| · (|S1| × … × |Sn|) / (|D1| × … × |Dn|)
  • Of a pair ai and aj: E[σD(ai, aj)] =
    α · |D| / (|Di| × |Dj|)
  • α is normally set to 2 or 3
  • Strongly connected:
  • ai and aj are strongly connected if
    σD(ai, aj) > E[σD(ai, aj)]; in that case
    σ*D(ai, aj) = σD(ai, aj), otherwise
    σ*D(ai, aj) = 0.
  • ai ∈ Di is strongly connected with Sj ⊆ Dj if for
    all x ∈ Sj, ai and x are strongly connected.
  • Si and Sj are strongly connected if each ai ∈ Si is
    strongly connected with each aj ∈ Sj and each
    aj ∈ Sj is strongly connected with each ai ∈ Si.
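The expected-support test for a value pair can be sketched as below, assuming the count-based support from the previous slide; the numbers (taken from the experiment scale later in the talk) are illustrative.

```python
# Sketch of the expected-support threshold for a value pair,
# assuming count-based support; the numbers are illustrative.
def expected_support_pair(n_tuples, d_i, d_j, alpha=2):
    """alpha * |D| / (|Di| * |Dj|), per the formula above."""
    return alpha * n_tuples / (d_i * d_j)

def sigma_star(sigma, expected):
    """sigma* keeps the support of a strongly connected pair, else 0."""
    return sigma if sigma > expected else 0

# 1,000,000 tuples and 100-value domains give E = 2 * 10**6 / 10**4 = 200.
e = expected_support_pair(1_000_000, 100, 100, alpha=2)
print(sigma_star(350, e))  # 350 > 200: strongly connected -> 350
print(sigma_star(120, e))  # 120 <= 200 -> 0
```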

7
Definitions: cluster, cluster-projection,
sub-cluster and subspace cluster
  • C = ⟨C1, …, Cn⟩ is a cluster over {A1, …, An} if:
  • 1. Ci and Cj are strongly connected, for all i ≠ j
  • 2. There exists no Ci′ such that Ci′ is a proper
    superset of Ci and, for all j ≠ i, Ci′ and Cj are
    strongly connected
  • 3. The support σD(C) of C is greater than α times
    the expected support of C under the
    attribute-independence assumption
  • Ci is a cluster-projection of C on Ai.
  • C is a sub-cluster if it only satisfies 1 and 3.
  • A cluster C over a proper subset of the attributes
    S ⊂ {A1, …, An} is a subspace cluster on S.

8
Definitions: similarity, inter-attribute
summaries, intra-attribute summaries
  • Similarity: γj(a1, a2) = |{x ∈ Dj : σD(a1, x) > 0
    and σD(a2, x) > 0}|
  • Inter-attribute summary:
  • Σij = {(ai, aj, σ*D(ai, aj)) : ai ∈ Di, aj ∈ Dj,
    and σ*D(ai, aj) > 0}
  • The strongly connected attribute value pairs, where
    each pair has attribute values from different
    attributes
  • Intra-attribute summary:
  • Σii = {(a1, a2, γj(a1, a2)) : a1, a2 ∈ Di, and
    γj(a1, a2) > 0}
  • Similarities between attribute values of the same
    attribute
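A toy sketch of the two summaries, assuming the count-based support from the earlier slides; the dataset and threshold are invented.

```python
from collections import Counter

# Toy sketch of the inter-attribute summary and the similarity
# function on this slide; dataset and threshold are invented.
D = [("a", "x"), ("a", "y"), ("b", "y"), ("b", "y")]

def inter_summary(D, i, j, expected):
    """Keep value pairs whose support exceeds the expected support."""
    counts = Counter((t[i], t[j]) for t in D)
    return {pair: s for pair, s in counts.items() if s > expected}

def similarity(D, i, j, a1, a2):
    """gamma_j(a1, a2): values of Aj co-occurring with both a1 and a2."""
    co = lambda a: {t[j] for t in D if t[i] == a}
    return len(co(a1) & co(a2))

print(inter_summary(D, 0, 1, expected=1))   # {('b', 'y'): 2}
print(similarity(D, 0, 1, "a", "b"))        # both co-occur with 'y' -> 1
```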

9
CACTUS vs STIRR: clusters found by CACTUS
10
CACTUS vs STIRR: clusters found by STIRR
11
CACTUS: CAtegorical ClusTering Using Summaries
  • Central idea: a summary of the data (the inter-
    and intra-attribute summaries) is sufficient to
    find candidate clusters, which can then be
    validated.
  • A three-phase clustering algorithm
  • Summarisation
  • Clustering
  • Validation

12
Summarisation Phase
  • Assumption: the inter- and intra-attribute
    summaries of any pair of attributes fit easily into
    main memory.
  • Inter-attribute summaries
  • Use a counter, initially set to 0, for each pair
    (ai, aj) ∈ Di × Dj.
  • Scan the dataset, incrementing the counter for each
    pair encountered.
  • After the scan, compute σD(ai, aj) and reset the
    counters of those pairs whose support is less than
    E[σD(ai, aj)]. Store the remaining value pairs.
  • Intra-attribute summaries
  • Derived from the inter-attribute summaries: two
    values a1, a2 of one attribute are similar if both
    are strongly connected with a common value of
    another attribute.
  • A very fast operation, hence only computed when
    needed.
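The phase above can be sketched in one pass over a toy dataset; the data, the threshold, and the dictionary-of-sets representation are assumptions for illustration.

```python
from collections import Counter, defaultdict

# One-pass sketch of the summarisation phase above: count value
# pairs during a single scan, keep only counters above the expected
# support, then derive the intra-attribute summary on demand.
# Dataset and threshold are illustrative.
D = [("a", "x"), ("a", "x"), ("b", "x"),
     ("b", "x"), ("b", "y"), ("b", "y")]
expected = 1.0

counters = Counter((t[0], t[1]) for t in D)            # one scan over D
inter = {p: s for p, s in counters.items() if s > expected}

# Intra-attribute summary for A1: a1 and a2 are similar if some value
# x of A2 is strongly connected with both.
partners = defaultdict(set)
for (a, x) in inter:
    partners[a].add(x)
intra = {(a1, a2): len(partners[a1] & partners[a2])
         for a1 in partners for a2 in partners
         if a1 < a2 and partners[a1] & partners[a2]}

print(inter[("a", "x")])  # 2
print(intra)              # {('a', 'b'): 1}: 'x' connects both values
```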

13
Clustering Phase
  • A two-step operation
  • Step 1: analyse each attribute to compute all
    cluster-projections on it
  • Step 2: synthesise candidate clusters on sets of
    attributes from the cluster-projections on
    individual attributes

14
Clustering Phase continued
  • Step 1: compute cluster-projections on attributes
  • Step A: find all cluster-projections on Ai of
    clusters over (Ai, Aj).
  • Step B: compute all the cluster-projections on Ai
    of clusters over {A1, …, An} by intersecting sets
    of cluster-projections from Step A.
  • Step A is NP-hard! Solution: use distinguishing
    sets.
  • Distinguishing sets identify different
    cluster-projections.
  • Construct candidate distinguishing sets on Ai and
    extend some of them w.r.t. Aj.
  • The detailed steps are too long for this
    presentation, sorry!
  • Step B: intersection of cluster-projections
  • Intersection join: S1 ∩* S2 = {s : there exist
    s1 ∈ S1 and s2 ∈ S2 such that s = s1 ∩ s2 and
    |s| > 1}
  • Apply the intersection join to all sets of
    attribute values on Ai.
  • Step 2: try to augment ⟨c1, …, ck⟩ with a
    cluster-projection ck+1 on attribute Ak+1. If each
    new pair ⟨ci, ck+1⟩ is a sub-cluster on (Ai, Ak+1),
    i ∈ {1, …, k}, then add ⟨c1, …, ck+1⟩ to the
    candidate clusters.
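The intersection join used in Step B can be sketched directly; the candidate sets below are invented for illustration.

```python
# Sketch of the intersection join defined above: keep the pairwise
# intersections with more than one element (the candidate sets are
# invented for illustration).
def intersection_join(S1, S2):
    out = []
    for s1 in S1:
        for s2 in S2:
            s = s1 & s2
            if len(s) > 1 and s not in out:
                out.append(s)
    return out

S1 = [{"a", "b", "c"}, {"d", "e"}]
S2 = [{"a", "b"}, {"e"}]
print(intersection_join(S1, S2))  # only {a, b} survives; {e} has size 1
```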

15
Validation Phase
  • Use a required threshold to recognise false
    candidates: a candidate cluster may lack sufficient
    support because some of the 2-clusters combined to
    form it may be due to different sets of tuples.
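A minimal sketch of this check: a candidate survives only if enough tuples actually support it. The dataset, candidate and threshold are illustrative assumptions.

```python
# Minimal sketch of the validation check described above: a candidate
# survives only if enough tuples actually support it. Dataset,
# candidate and threshold are illustrative.
D = [("a", "x"), ("a", "x"), ("b", "y"), ("a", "y")]

def validate(D, candidate, threshold):
    support = sum(1 for t in D
                  if all(t[i] in Ci for i, Ci in enumerate(candidate)))
    return support >= threshold

print(validate(D, [{"a"}, {"x"}], threshold=2))  # 2 supporting tuples -> True
print(validate(D, [{"b"}, {"x"}], threshold=2))  # no (b, x) tuples -> False
```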

16
Experiments and Results
  • Compared against STIRR
  • Uses 1 million tuples, 10 attributes and 100
    attribute values per attribute.
  • CACTUS discovers a broader class of clusters than
    STIRR.

17
Experiments and Results
18
Conclusion
  • The authors formalised the definition of a cluster
    over categorical data.
  • CACTUS is a fast and efficient algorithm for
    clustering categorical data.
  • I am sorry that I could not show some parts of the
    algorithm due to time constraints.

19
Question Time