1
Clustering Analysis (of Spatial Data, using Peano Count Trees)
(P-tree technology is patented by NDSU)
Note: over 100 slides; we are not going to go through each in detail.
2
Clustering Methods
  • A Categorization of Major Clustering Methods
  • Partitioning methods
  • Hierarchical methods
  • Density-based methods
  • Grid-based methods
  • Model-based methods

3
Clustering Methods based on Partitioning
  • Partitioning method: construct a partition of a
    database D of n objects into a set of k clusters
  • Given k, find a partition into k clusters that
    optimizes the chosen partitioning criterion
  • k-means (MacQueen, 1967): each cluster is
    represented by the center (mean) of the cluster
  • k-medoids or PAM (Partitioning Around Medoids)
    (Kaufman & Rousseeuw, 1987): each cluster is
    represented by one object in the cluster (the
    middle object or median-like object)

4
The K-Means Clustering Method
  • Given k, the k-means algorithm is implemented in
    4 steps (it assumes the partitioning criterion is to
    maximize intra-cluster similarity and minimize
    inter-cluster similarity; of course a heuristic is
    used, so the method isn't really an optimization).
  • Partition objects into k nonempty subsets (or
    pick k initial means).
  • Compute the mean (center), or centroid, of each
    cluster of the current partition (if one started
    with k means, this step is already done).
  • Centroid: the point that minimizes the sum of
    dissimilarities, or the sum of the squared errors,
    from the points of the cluster.
  • Assign each object to the cluster with the most
    similar (closest) center.
  • Go back to Step 2.
  • Stop when the new set of means doesn't change
    (or some other stopping condition holds). A minimal
    sketch of these steps follows below.
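For concreteness, here is a minimal sketch of these four steps in plain Python. The toy 2-D data, the random initialization, and the exact convergence test are illustrative assumptions, not part of the slides:

```python
import random

def kmeans(points, k, max_iter=100):
    """Plain k-means: pick k initial means, then alternate assign / recompute."""
    means = random.sample(points, k)                 # step 1: pick k initial means
    for _ in range(max_iter):
        # step 3: assign each object to the closest center
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k),
                    key=lambda c: sum((a - b) ** 2 for a, b in zip(p, means[c])))
            clusters[i].append(p)
        # step 2: recompute the mean (centroid) of each cluster
        new_means = [
            tuple(sum(col) / len(cl) for col in zip(*cl)) if cl else means[i]
            for i, cl in enumerate(clusters)
        ]
        if new_means == means:                       # step 4: stop when the means don't change
            return new_means, clusters
        means = new_means
    return means, clusters

# Example: two obvious groups in 2-D
data = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
centers, clusters = kmeans(data, k=2)
```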

5
k-Means
[Figure: k-means iterations (Steps 1-4) on a 10 x 10 grid of 2-D points]
6
The K-Means Clustering Method
  • Strength
  • Relatively efficient: O(tkn), where
  • n is the number of objects,
  • k is the number of clusters,
  • t is the number of iterations. Normally, k, t << n.
  • Weakness
  • Applicable only when a mean is defined (e.g., a
    vector space)
  • Need to specify k, the number of clusters, in
    advance.
  • Sensitive to noisy data and outliers, since a
    small number of such points can substantially
    influence the mean value.

7
The K-Medoids Clustering Method
  • Find representative objects, called medoids
    (a medoid must be an actual object in the cluster,
    whereas the mean seldom is).
  • PAM (Partitioning Around Medoids, 1987)
  • starts from an initial set of medoids
  • iteratively replaces one of the medoids by a
    non-medoid
  • if the swap improves the aggregate similarity
    measure, retain it; do this over all
    medoid-nonmedoid pairs
  • PAM works for small data sets, but does not
    scale to large data sets
  • CLARA (Clustering LARge Applications)
    (Kaufmann & Rousseeuw, 1990): sub-sample, then
    apply PAM
  • CLARANS (Clustering Large Applications based on
    RANdomized Search) (Ng & Han, 1994): randomizes
    the sampling

8
PAM (Partitioning Around Medoids) (1987)
  • Use a real object to represent each cluster
  • Select k representative objects arbitrarily
  • For each pair of a non-selected object h and a
    selected object i, calculate the total swapping
    cost TC(i,h)
  • For each pair of i and h,
  • if TC(i,h) < 0, i is replaced by h
  • Then assign each non-selected object to the most
    similar representative object
  • Repeat steps 2-3 until there is no change
    (a sketch of this loop follows below)
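A minimal sketch of the PAM loop, assuming squared Euclidean distance and using the total distance to the nearest medoid as the cost; the data and helper names are illustrative:

```python
def total_cost(points, medoids, dist):
    """Sum over all objects of the distance to the nearest medoid."""
    return sum(min(dist(p, m) for m in medoids) for p in points)

def pam(points, k, dist):
    medoids = list(points[:k])                      # step 1: arbitrary initial medoids
    improved = True
    while improved:                                 # repeat until no swap helps
        improved = False
        for i in range(k):
            for h in points:
                if h in medoids:
                    continue
                trial = medoids[:i] + [h] + medoids[i + 1:]
                # TC(i,h) < 0  <=>  the cost after the swap is lower than before
                if total_cost(points, trial, dist) < total_cost(points, medoids, dist):
                    medoids, improved = trial, True
    # finally, assign each object to its most similar medoid
    assignment = {p: min(medoids, key=lambda m: dist(p, m)) for p in points}
    return medoids, assignment

euclid2 = lambda a, b: sum((x - y) ** 2 for x, y in zip(a, b))
meds, assignment = pam([(1, 1), (2, 1), (1, 2), (9, 9), (8, 9), (9, 8)], 2, euclid2)
```

Each accepted swap strictly lowers the cost, so the loop terminates.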

9
CLARA (Clustering Large Applications) (1990)
  • CLARA (Kaufmann and Rousseeuw in 1990)
  • It draws multiple samples of the data set,
    applies PAM on each sample, and gives the best
    clustering as the output
  • Strength: deals with larger data sets than PAM
  • Weakness
  • Efficiency depends on the sample size
  • A good clustering based on samples will not
    necessarily represent a good clustering of the
    whole data set if the sample is biased

10
CLARANS (Randomized CLARA) (1994)
  • CLARANS (A Clustering Algorithm based on
    Randomized Search) (Ng and Han, 1994)
  • CLARANS draws a sample of neighbors dynamically
  • The clustering process can be presented as
    searching a graph where every node is a potential
    solution, that is, a set of k medoids
  • If a local optimum is found, CLARANS starts
    with a new randomly selected node in search of a
    new local optimum (Genetic-Algorithm-like)
  • Finally the best local optimum is chosen after
    some stopping condition.
  • It is more efficient and scalable than both PAM
    and CLARA

11
Distance-based partitioning has drawbacks
  • Simple and fast: O(N)
  • The number of clusters, K, has to be chosen
    arbitrarily, before the correct number of
    clusters is known.
  • Produces round-shaped clusters, not arbitrary
    shapes (Chameleon data set below)
  • Sensitive to the selection of the initial
    partition, and may converge to a local minimum of
    the criterion function if the initial partition
    is not well chosen.

Correct result
K-means result
12
Distance-based partitioning (Cont.)
  • If we start with A, B, and C as the initial
    centroids around which the three clusters are
    built, then we end up with the partition of
    A, B, C, D, E, F, G shown by the ellipses.
  • Whereas the correct three-cluster solution is
    obtained by choosing, for example, A, D, and F as
    the initial cluster means (rectangular clusters).

13
A Vertical Data Approach
  • Partition the data set using rectangle P-trees (a
    gridding)
  • These P-trees can be viewed as a grouping
    (partition) of the data
  • Prune out outliers by disregarding sparse cells
  • Input: total number of objects (N), percentage
    of outliers (t)
  • Output: grid P-trees after pruning
  • (1) Choose the grid P-tree with the smallest root
    count (Pgc)
  • (2) outliers = outliers OR Pgc
  • (3) if (|outliers| / N < t) then remove Pgc and
    repeat (1)-(2); a sketch of this loop follows below
  • Find clusters using the PAM method, where each
    grid P-tree is an object
  • Note: when we have a P-tree mask for each
    cluster, the mean is just the vector sum of the
    basic P-trees ANDed with the cluster P-tree,
    divided by the root count of the cluster P-tree
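The pruning loop can be sketched roughly as follows. Plain Python sets of row ids stand in for the grid P-trees and their root counts (the actual patented P-tree structures and operations are not reproduced here), so treat this purely as an illustration of the control flow:

```python
def prune_outliers(grid_cells, N, t):
    """grid_cells: list of sets of row ids, one per grid cell (standing in for grid P-trees).
    Repeatedly OR the sparsest cell into the outlier set while it stays under t * N."""
    cells = sorted(grid_cells, key=len)          # (1) smallest "root count" first
    outliers = set()
    kept = list(cells)
    for cell in cells:
        if len(outliers | cell) / N < t:         # (2)-(3): outliers := outliers OR Pgc while under t
            outliers |= cell
            kept.remove(cell)
        else:
            break
    return kept, outliers

# Illustrative cells over 10 rows, with a 20% outlier budget
cells = [{0, 1, 2, 3}, {4, 5, 6}, {7, 8}, {9}]
kept, outliers = prune_outliers(cells, N=10, t=0.2)   # prunes only the sparsest cell {9}
```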

14
Distance Function
  • Data Matrix: n objects × p variables
  • Dissimilarity Matrix: n objects × n objects
    (a small sketch follows below)
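For concreteness, a small sketch that builds both matrices, assuming Euclidean distance as the dissimilarity (any metric could be substituted):

```python
import math

# Data matrix: n objects x p variables
X = [[1.0, 2.0], [2.0, 4.0], [5.0, 1.0]]

# Dissimilarity matrix: n objects x n objects, D[i][j] = distance between object i and object j
n = len(X)
D = [[math.dist(X[i], X[j]) for j in range(n)] for i in range(n)]
```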

15
AGNES (Agglomerative Nesting)
  • Introduced in Kaufmann and Rousseeuw (1990)
  • Use the Single-Link (distance between two sets is
    the minimum pairwise distance) method
  • Merge the nodes that are most similar
  • Eventually all nodes belong to the same cluster

16
DIANA (Divisive Analysis)
  • Introduced in Kaufmann and Rousseeuw (1990)
  • Inverse order of AGNES (initially all objects
    are in one cluster, which is then split according
    to some criterion, e.g., maximizing some aggregate
    measure of pairwise dissimilarity)
  • Eventually each node forms a cluster on its own

17
Contrasting Clustering Techniques
  • Partitioning algorithms: partition a dataset into
    k clusters, e.g., k = 3
  • Hierarchical algorithms: create a hierarchical
    decomposition of ever-finer partitions,
  • e.g., top down (divisively) or

bottom up (agglomerative)
18
Hierarchical Clustering
19
Hierarchical Clustering (top down)
  • In either case, one gets a nice dendrogram in
    which any maximal anti-chain (no two nodes are
    linked) is a clustering (partition).

20
Hierarchical Clustering (Cont.)
Recall that any maximal anti-chain (a maximal set
of nodes in which no two are chained) is a
clustering (a dendrogram offers many).
21
Hierarchical Clustering (Cont.)
But the horizontal anti-chains are the
clusterings resulting from the top down (or
bottom up) method(s).
22
Hierarchical Clustering (Cont.)
  • Most hierarchical clustering algorithms are
    variants of the single-link, complete-link, or
    average-link methods.
  • Of these, single-link and complete-link are the
    most popular.
  • In the single-link method, the distance between
    two clusters is the minimum of the distances
    between all pairs of patterns drawn one from each
    cluster.
  • In the complete-link algorithm, the distance
    between two clusters is the maximum of all
    pairwise distances between pairs of patterns
    drawn one from each cluster.
  • In the average-link algorithm, the distance
    between two clusters is the average of all
    pairwise distances between pairs of patterns
    drawn one from each cluster (which is the same as
    the distance between the means in the vector
    space case easier to calculate).

23
Distance Between Clusters
  • Single Link: smallest distance between any pair
    of points from the two clusters
  • Complete Link: largest distance between any pair
    of points from the two clusters

24
Distance between Clusters (Cont.)
  • Average Link: average distance between points
    from the two clusters
  • Centroid: distance between the centroids of the
    two clusters (sketches of all four follow below)
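A small sketch of the four inter-cluster distances just defined (Euclidean distance and the tiny 2-D clusters are assumptions used only for illustration):

```python
import math
from itertools import product

def single_link(A, B):    # smallest pairwise distance
    return min(math.dist(a, b) for a, b in product(A, B))

def complete_link(A, B):  # largest pairwise distance
    return max(math.dist(a, b) for a, b in product(A, B))

def average_link(A, B):   # average pairwise distance
    return sum(math.dist(a, b) for a, b in product(A, B)) / (len(A) * len(B))

def centroid_link(A, B):  # distance between the two centroids
    cA = tuple(sum(c) / len(A) for c in zip(*A))
    cB = tuple(sum(c) / len(B) for c in zip(*B))
    return math.dist(cA, cB)

A, B = [(0, 0), (1, 0)], [(4, 0), (5, 0)]
# single = 3, complete = 5, average = 4, centroid = 4
```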

25
Single Link vs. Complete Link (Cont.)
Single link works but not complete link
Complete link works but not single link
Complete link works but not single link
26
Single Link vs. Complete Link (Cont.)
Single link works
Complete link doesn't
27
Single Link vs. Complete Link (Cont.)
Single link doesn't work
Complete link does
1-cluster noise 2-cluster
28
Hierarchical vs. Partitional
  • Hierarchical algorithms are more versatile than
    partitional algorithms.
  • For example, the single-link clustering algorithm
    works well on data sets containing non-isotropic
    (non-roundish) clusters including well-separated,
    chain-like, and concentric clusters, whereas a
    typical partitional algorithm such as the k-means
    algorithm works well only on data sets having
    isotropic clusters.
  • On the other hand, the time and space
    complexities of the partitional algorithms are
    typically lower than those of the hierarchical
    algorithms.

29
More on Hierarchical Clustering Methods
  • Major weakness of agglomerative clustering
    methods
  • do not scale well: time complexity of at least
    O(n²), where n is the number of total objects
  • can never undo what was done previously (greedy
    algorithm)
  • Integration of hierarchical with distance-based
    clustering
  • BIRCH (1996) uses Clustering Feature tree
    (CF-tree) and incrementally adjusts the quality
    of sub-clusters
  • CURE (1998) selects well-scattered points from
    the cluster and then shrinks them towards the
    center of the cluster by a specified fraction
  • CHAMELEON (1999) hierarchical clustering using
    dynamic modeling

30
Density-Based Clustering Methods
  • Clustering based on density (local cluster
    criterion), such as density-connected points
  • Major features
  • Discover clusters of arbitrary shape
  • Handle noise
  • One scan
  • Need density parameters as termination condition
  • Several interesting studies
  • DBSCAN: Ester et al. (KDD'96)
  • OPTICS: Ankerst et al. (SIGMOD'99)
  • DENCLUE: Hinneburg & Keim (KDD'98)
  • CLIQUE: Agrawal et al. (SIGMOD'98)

31
Density-Based Clustering Background
  • Two parameters:
  • Eps: maximum radius of the neighbourhood
  • MinPts: minimum number of points in an
    Eps-neighbourhood of that point
  • N_Eps(p) = {q in D | dist(p,q) <= Eps}
  • Directly (density) reachable: a point p is
    directly density-reachable from a point q wrt Eps,
    MinPts if
  • 1) p belongs to N_Eps(q)
  • 2) q is a core point:
  • |N_Eps(q)| >= MinPts

32
Density-Based Clustering Background (II)
  • Density-reachable
  • A point p is density-reachable from a point q
    wrt Eps, MinPts if there is a chain of points
    p1, ..., pn with p1 = q, pn = p such that p_{i+1}
    is directly density-reachable from p_i
  • For every q, q is density-reachable from q.
  • Density-reachability is reflexive and transitive,
    but not symmetric, since only core objects can
    be density-reachable from each other.
  • Density-connected
  • A point p is density-connected to a point q wrt
    Eps, MinPts if there is a point o such that both p
    and q are density-reachable from o wrt Eps, MinPts.
  • Density-reachability is not symmetric; density-
    connectivity inherits the reflexivity and
    transitivity and adds the symmetry. Thus,
    density-connectivity is an equivalence relation
    and therefore gives a partition (clustering).

[Figure: a density-reachability chain from q through p1 to p, and two points p and q density-connected through o]
33
DBSCAN: Density-Based Spatial Clustering of
Applications with Noise
  • Relies on a density-based notion of cluster: a
    cluster is defined as an equivalence class of
    density-connected points.
  • This gives the transitive property for the
    density-connectivity binary relation; it is
    therefore an equivalence relation whose classes
    form a partition (clustering), by the
    equivalence-relation / partition duality.
  • Discovers clusters of arbitrary shape in spatial
    databases with noise

[Figure: core, border, and outlier points; Eps = 1 cm, MinPts = 3]
34
DBSCAN: The Algorithm
  • Arbitrarily select a point p.
  • Retrieve all points density-reachable from p wrt
    Eps, MinPts.
  • If p is a core point, a cluster is formed (note
    it doesn't matter which of the core points within
    a cluster you start at, since density-reachability
    is symmetric on core points).
  • If p is a border point or an outlier, no points
    are density-reachable from p and DBSCAN visits
    the next point of the database. Keep track of
    such points; if they don't get scooped up by a
    later core point, then they are outliers.
  • Continue the process until all of the points have
    been processed.
  • What about a simpler version of DBSCAN?
  • Define core points and core neighborhoods the
    same way.
  • Define an (undirected graph) edge between two
    points if they co-inhabit a core neighborhood.
  • The connected-component partition is the
    clustering (a sketch of this version follows below).
  • Other related methods? How does vertical
    technology help here? Gridding?
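A sketch of the simpler version just described: link two points when they co-inhabit some core point's neighborhood, and take the connected components of that graph as the clusters. The Euclidean distance and the Eps/MinPts values are illustrative:

```python
import math

def simple_dbscan(points, eps, min_pts):
    """Cluster = connected component of the 'share a core neighborhood' graph.
    Returns a cluster label per point; None marks an outlier."""
    n = len(points)
    nbrs = [{j for j in range(n) if math.dist(points[i], points[j]) <= eps}
            for i in range(n)]
    cores = [i for i in range(n) if len(nbrs[i]) >= min_pts]

    # union-find over points that co-inhabit a core point's neighborhood
    parent = list(range(n))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    for c in cores:
        for j in nbrs[c]:
            parent[find(j)] = find(c)

    labels = [find(i) for i in range(n)]
    core_roots = {find(c) for c in cores}
    # a component containing no core point is just an outlier
    return [l if l in core_roots else None for l in labels]

pts = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
labels = simple_dbscan(pts, eps=1.5, min_pts=3)   # (10, 10) comes out as an outlier
```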

35
OPTICS
  • Ordering Points To Identify Clustering Structure
  • Ankerst, Breunig, Kriegel, and Sander (SIGMOD'99)
  • http://portal.acm.org/citation.cfm?id=304187
  • Addresses the shortcoming of DBSCAN, namely
    choosing parameters.
  • Develops a special order of the database wrt its
    density-based clustering structure
  • This cluster-ordering contains info equivalent to
    the density-based clusterings corresponding to a
    broad range of parameter settings
  • Good for both automatic and interactive cluster
    analysis, including finding intrinsic clustering
    structure

36
OPTICS
Does this order resemble the Total Variation
order?
37
OPTICS Some Extension from DBSCAN
  • Index-based
  • k = number of dimensions
  • N = number of points (20)
  • p = 75%
  • M = N(1-p) = 5
  • Complexity: O(kN²)
  • Core Distance
  • Reachability Distance: r(p, o) = max(core-distance(o), d(o, p))
    (a sketch of both follows below)

[Figure: with MinPts = 5 and Eps = 3 cm, r(p1, o) = 2.8 cm and r(p2, o) = 4 cm]
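Both distances can be sketched straight from their definitions; Euclidean distance is assumed, and conventions about whether o counts as its own neighbor vary, so treat this as illustrative:

```python
import math

def core_distance(o, points, eps, min_pts):
    """Distance from o to its MinPts-th nearest neighbor within Eps,
    or None if o is not a core point."""
    dists = sorted(math.dist(o, p) for p in points if math.dist(o, p) <= eps)
    return dists[min_pts - 1] if len(dists) >= min_pts else None

def reachability_distance(p, o, points, eps, min_pts):
    """r(p, o) = max(core-distance(o), d(o, p)); undefined (None) if o is not core."""
    cd = core_distance(o, points, eps, min_pts)
    return None if cd is None else max(cd, math.dist(o, p))

pts = [(0, 0), (1, 0), (0, 1), (1, 1), (2, 2), (10, 10)]
reachability_distance((2, 2), (0, 0), pts, eps=3.0, min_pts=5)   # about 2.83
```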
38
[Figure: reachability plot - reachability-distance (with undefined values) plotted against the cluster-order of the objects]
39
DENCLUE using density functions
  • DENsity-based CLUstEring, by Hinneburg & Keim
    (KDD'98)
  • Major features
  • Solid mathematical foundation
  • Good for data sets with large amounts of noise
  • Allows a compact mathematical description of
    arbitrarily shaped clusters in high-dimensional
    data sets
  • Significantly faster than existing algorithms
    (faster than DBSCAN by a factor of up to 45,
    as claimed by the authors)
  • But needs a large number of parameters

40
Denclue Technical Essence
  • Uses grid cells, but only keeps information about
    grid cells that actually contain data points,
    and manages these cells in a tree-based access
    structure.
  • Influence function: describes the impact of a
    data point within its neighborhood.
  • F(x,y) measures the influence that y has on x.
  • A very good influence function is the Gaussian,
    F(x,y) = e^(-d²(x,y) / (2σ²))
  • Others include functions similar to the squashing
    functions used in neural networks.
  • One can think of the influence function as a
    measure of the contribution to the density at x
    made by y.
  • The overall density of the data space can be
    calculated as the sum of the influence functions
    of all data points (a sketch follows below).
  • Clusters can be determined mathematically by
    identifying density attractors.
  • Density attractors are local maxima of the
    overall density function.
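A sketch of the Gaussian influence function and the overall density it induces (the sample data and σ are illustrative):

```python
import math

def gaussian_influence(x, y, sigma):
    """Influence of data point y on location x: e^(-d^2(x,y) / (2*sigma^2))."""
    d2 = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-d2 / (2 * sigma ** 2))

def density(x, data, sigma):
    """Overall density at x = sum of the influences of all data points."""
    return sum(gaussian_influence(x, y, sigma) for y in data)

data = [(1.0, 1.0), (1.2, 0.9), (5.0, 5.0)]
density((1.1, 1.0), data, sigma=0.5)   # high near the first two points, low far away
```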

41
DENCLUE(D, σ, ξc, ξ)
  1. Grid the data set (use r = σ, the std. dev.)
  2. Find (highly) populated cells (use a
    threshold ξc) (shown in blue)
  3. Identify populated cells (nonempty cells)
  4. Find density attractor points, C, using hill
    climbing:
  5. Randomly pick a point, pi.
  6. Compute its local density (use r = 4σ)
  7. Pick another point, pi+1, close to pi, and compute
    the local density at pi+1
  8. If LocDen(pi) < LocDen(pi+1), climb
  9. Put all points within distance σ/2 of the path pi,
    pi+1, ..., C into a density attractor cluster
    called C
  10. Connect the density attractor clusters, using a
    threshold, ξ, on the local densities of the
    attractors.

A. Hinneburg and D. A. Keim. An Efficient
Approach to Clustering in Multimedia Databases
with Noise. In Proc. 4th Int. Conf. on Knowledge
Discovery and Data Mining. AAAI Press, 1998.
KDD 99 Workshop.
42
Comparison DENCLUE Vs DBSCAN
43
(No Transcript)
44
BIRCH (1996)
  • BIRCH: Balanced Iterative Reducing and Clustering
    using Hierarchies, by Zhang, Ramakrishnan, and Livny
    (SIGMOD'96)
  • http://portal.acm.org/citation.cfm?id=235968.233324
  • Incrementally construct a CF (Clustering Feature)
    tree, a hierarchical data structure for
    multiphase clustering
  • Phase 1: scan the DB to build an initial in-memory CF
    tree (a multi-level compression of the data that
    tries to preserve the inherent clustering
    structure of the data)
  • Phase 2: use an arbitrary clustering algorithm to
    cluster the leaf nodes of the CF-tree
  • Scales linearly: finds a good clustering with a
    single scan and improves the quality with a few
    additional scans
  • Weakness: handles only numeric data, and is
    sensitive to the order of the data records.

45
BIRCH
  • ABSTRACT
  • Finding useful patterns in large datasets has
    attracted considerable interest recently, and one
    of the most widely studied problems in this area
    is the identification of clusters, or densely
    populated regions, in a multi-dimensional
    dataset.
  • Prior work does not adequately address the
    problem of large datasets and minimization of I/O
    costs.
  • This paper presents a data clustering method
    named BIRCH (Balanced Iterative Reducing and
    Clustering using Hierarchies), and demonstrates
    that it is especially suitable for very large
    databases.
  • BIRCH incrementally and dynamically clusters
    incoming multi-dimensional metric data points to
    try to produce the best quality clustering with
    the available resources (i.e., available memory
    and time constraints).
  • BIRCH can typically find a good clustering with a
    single scan of the data, and improve the quality
    further with a few additional scans.
  • BIRCH is also the first clustering algorithm
    proposed in the database area to handle "noise"
    (data points that are not part of the underlying
    pattern) effectively.
  • We evaluate BIRCH's time/space efficiency, data
    input order sensitivity, and clustering quality
    through several experiments.

46
Clustering Feature Vector
CF = (N, LS, SS) = (5, (16, 30), (54, 190))
for the five points (3,4), (2,6), (4,5), (4,7), (3,8); a sketch of the computation follows below.
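A sketch of the CF triple for the five points above, together with the additive merge that lets parent nodes summarize their subtrees:

```python
def cf(points):
    """Clustering Feature CF = (N, LS, SS): count, linear sum, and sum of squares per dimension."""
    dims = range(len(points[0]))
    N = len(points)
    LS = tuple(sum(p[d] for p in points) for d in dims)
    SS = tuple(sum(p[d] ** 2 for p in points) for d in dims)
    return N, LS, SS

def merge(cf1, cf2):
    """CFs are additive, so a parent entry's CF is just the sum of its children's CFs."""
    return (cf1[0] + cf2[0],
            tuple(a + b for a, b in zip(cf1[1], cf2[1])),
            tuple(a + b for a, b in zip(cf1[2], cf2[2])))

pts = [(3, 4), (2, 6), (4, 5), (4, 7), (3, 8)]
cf(pts)   # (5, (16, 30), (54, 190)), as on the slide
```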
47
Birch
Iteratively put each point into the closest leaf until
the threshold is exceeded, then split the leaf. Internal
nodes summarize their subtrees and get split when the
threshold is exceeded. Once the in-memory CF tree is
built, use another method to cluster the leaves together.
Branching factor B = 6; threshold L = 7
48
CURE (Clustering Using REpresentatives )
  • CURE: proposed by Guha, Rastogi & Shim, 1998
  • http://portal.acm.org/citation.cfm?id=276312
  • Stops the creation of a cluster hierarchy if a
    level consists of k clusters
  • Uses multiple representative points to evaluate
    the distance between clusters
  • adjusts well to arbitrarily shaped clusters (not
    necessarily distance-based)
  • avoids the single-link effect

49
Drawbacks of Distance-Based Method
  • Drawbacks of square-error-based clustering methods:
  • they consider only one point as the representative
    of a cluster
  • good only for convex-shaped clusters of similar
    size and density, and only if k can be reasonably
    estimated

50
Cure The Algorithm
  • Very much a hybrid method (involves pieces from
    many others).
  • Draw a random sample of size s.
  • Partition the sample into p partitions of size s/p.
  • Partially cluster each partition into s/(pq) clusters.
  • Eliminate outliers:
  • by random sampling
  • if a cluster grows too slowly, eliminate it.
  • Cluster the partial clusters.
  • Label the data on disk.

51
Cure
  • ABSTRACT
  • Clustering, in data mining, is useful for
    discovering groups and identifying interesting
    distributions in the underlying data. Traditional
    clustering algorithms either favor clusters with
    spherical shapes and similar sizes, or are very
    fragile in the presence of outliers.
  • We propose a new clustering algorithm called CURE
    that is more robust to outliers, and identifies
    clusters having non-spherical shapes and wide
    variances in size.
  • CURE achieves this by representing each cluster
    by a certain fixed number of points that are
    generated by selecting well scattered points from
    the cluster and then shrinking them toward the
    center of the cluster by a specified fraction.
  • Having more than one representative point per
    cluster allows CURE to adjust well to the
    geometry of non-spherical shapes and the
    shrinking helps to dampen the effects of
    outliers.
  • To handle large databases, CURE employs a
    combination of random sampling and partitioning.
    A random sample drawn from the data set is first
    partitioned and each partition is partially
    clustered. The partial clusters are then
    clustered in a second pass to yield the desired
    clusters.
  • Our experimental results confirm that the quality
    of clusters produced by CURE is much better than
    those found by existing algorithms.
  • Furthermore, they demonstrate that random
    sampling and partitioning enable CURE to not only
    outperform existing algorithms but also to scale
    well for large databases without sacrificing
    clustering quality.

52
Data Partitioning and Clustering
  • s = 50
  • p = 2
  • s/p = 25
  • s/pq = 5

[Figure: the sample partitioned into two halves, each partially clustered]
53
Cure Shrinking Representative Points
  • Shrink the multiple representative points towards
    the gravity center by a fraction α (a sketch
    follows below).
  • Multiple representatives capture the shape of the
    cluster.
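A sketch of the shrinking step: each representative moves a fraction α of the way toward the cluster's gravity center (the α value and sample points are illustrative):

```python
def shrink_representatives(reps, alpha):
    """Move each representative point a fraction alpha of the way toward the cluster centroid."""
    centroid = tuple(sum(c) / len(reps) for c in zip(*reps))
    return [tuple(r[d] + alpha * (centroid[d] - r[d]) for d in range(len(r)))
            for r in reps]

reps = [(0.0, 0.0), (4.0, 0.0), (2.0, 3.0)]
shrink_representatives(reps, alpha=0.2)   # each point slides 20% of the way to the centroid (2, 1)
```

Shrinking the representatives this way dampens the effect of outliers, since an outlier representative is pulled back toward the bulk of the cluster.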

54
Clustering Categorical Data: ROCK
(http://portal.acm.org/citation.cfm?id=351745)
  • ROCK: Robust Clustering using linKs, by S. Guha,
    R. Rastogi, and K. Shim (ICDE'99).
  • Agglomerative, hierarchical
  • Uses links to measure similarity/proximity
  • Not distance-based
  • Computational complexity
  • Basic ideas:
  • Similarity function and neighbors
  • Let T1 = {1, 2, 3}, T2 = {3, 4, 5}

55
ROCK
  • Abstract
  • Clustering, in data mining, is useful to discover
    distribution patterns in the underlying data.
  • Clustering algorithms usually employ a distance
    metric based (e.g., euclidean) similarity measure
    in order to partition the database such that data
    points in the same partition are more similar
    than points in different partitions.
  • In this paper, we study clustering algorithms for
    data with boolean and categorical attributes.
  • We show that traditional clustering algorithms
    that use distances between points for clustering
    are not appropriate for boolean and categorical
    attributes. Instead, we propose a novel concept
    of links to measure the similarity/proximity
    between a pair of data points.
  • We develop a robust hierarchical clustering
    algorithm ROCK that employs links and not
    distances when merging clusters.
  • Our methods naturally extend to non-metric
    similarity measures that are relevant in
    situations where a domain expert/similarity table
    is the only source of knowledge.
  • In addition to presenting detailed complexity
    results for ROCK, we also conduct an experimental
    study with real-life as well as synthetic data
    sets to demonstrate the effectiveness of our
    techniques.
  • For data with categorical attributes, our
    findings indicate that ROCK not only generates
    better quality clusters than traditional
    algorithms, but it also exhibits good scalability
    properties.

56
Rock Algorithm
  • Links: the number of common neighbors of the
    two points
  • Algorithm:
  • Draw a random sample
  • Cluster with links
  • Label the data on disk

{1,2,3}, {1,2,4}, {1,2,5}, {1,3,4},
{1,3,5}, {1,4,5}, {2,3,4}, {2,3,5}, {2,4,5},
{3,4,5}
3
{1,2,3} {1,2,4}
A sketch of the neighbor and link computation follows below.
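A sketch of both ingredients: a Jaccard-style similarity to decide who is a neighbor, then link(i, j) = number of common neighbors. The threshold θ = 0.4 is an illustrative choice, not taken from the slides:

```python
from itertools import combinations

def jaccard(a, b):
    """Similarity of two transactions = |intersection| / |union|."""
    return len(a & b) / len(a | b)

def links(transactions, theta):
    """links[(i, j)] = number of common neighbors of transactions i and j,
    where x is a neighbor of y when jaccard(x, y) >= theta."""
    n = len(transactions)
    nbr = [{j for j in range(n)
            if i != j and jaccard(transactions[i], transactions[j]) >= theta}
           for i in range(n)]
    return {(i, j): len(nbr[i] & nbr[j]) for i, j in combinations(range(n), 2)}

T = [frozenset(s) for s in ({1, 2, 3}, {1, 2, 4}, {1, 2, 5}, {3, 4, 5})]
jaccard(frozenset({1, 2, 3}), frozenset({3, 4, 5}))   # 1/5 = 0.2, the T1, T2 example above
links(T, theta=0.4)
```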
57
CHAMELEON
  • CHAMELEON: hierarchical clustering using dynamic
    modeling, by G. Karypis, E.H. Han, and V. Kumar (1999)
  • http://portal.acm.org/citation.cfm?id=621303
  • Measures the similarity based on a dynamic model
  • Two clusters are merged only if the
    interconnectivity and closeness (proximity)
    between two clusters are high relative to the
    internal interconnectivity of the clusters and
    closeness of items within the clusters
  • A two phase algorithm
  • 1. Use a graph-partitioning algorithm to cluster
    objects into a large number of relatively small
    sub-clusters
  • 2. Use an agglomerative hierarchical clustering
    algorithm to find the genuine clusters by
    repeatedly combining these sub-clusters

58
CHAMELEON
  • ABSTRACT
  • Many advanced algorithms have difficulty dealing
    with highly variable clusters that do not follow
    a preconceived model.
  • By basing its selections on both
    interconnectivity and closeness, the Chameleon
    algorithm yields accurate results for these
    highly variable clusters.
  • Existing algorithms use a static model of the
    clusters and do not use information about the
    nature of individual clusters as they are merged.
  • Furthermore, one set of schemes (the CURE
    algorithm and related schemes) ignores the
    information about the aggregate interconnectivity
    of items in two clusters.
  • Another set of schemes (the Rock algorithm, group
    averaging method, and related schemes) ignores
    information about the closeness of two clusters
    as defined by the similarity of the closest items
    across two clusters.
  • By considering either interconnectivity or
    closeness only, these algorithms can select and
    merge the wrong pair of clusters.
  • Chameleon's key feature is that it accounts for
    both interconnectivity and closeness in
    identifying the most similar pair of clusters.
  • Chameleon finds the clusters in the data set by
    using a two-phase algorithm.
  • During the first phase, Chameleon uses a
    graph-partitioning algorithm to cluster the data
    items into several relatively small subclusters.
  • During the second phase, it uses an algorithm to
    find the genuine clusters by repeatedly combining
    these sub-clusters.

59
Overall Framework of CHAMELEON
Data Set → Construct Sparse Graph → Partition the Graph → Merge Partitions → Final Clusters
60
Grid-Based Clustering Method
  • Using multi-resolution grid data structure
  • Several interesting methods
  • STING (a STatistical INformation Grid approach),
    by Wang, Yang, and Muntz (1997)
  • WaveCluster, by Sheikholeslami, Chatterjee, and
    Zhang (VLDB'98)
  • A multi-resolution clustering approach using a
    wavelet method
  • CLIQUE: Agrawal et al. (SIGMOD'98)

61
Vertical gridding
We can observe that almost all methods discussed so far
suffer from the curse of cardinality (for very
large-cardinality data sets, the algorithms are too slow to
finish in an average lifetime!) and/or the curse of
dimensionality (points are all at nearly the same distance).
The work-arounds employed to address these curses:
  • Sampling: throw out most of the points so that what
    remains is of low enough cardinality for the algorithm to
    finish, and in such a way that the remaining sample
    contains all the information of the original data set
    (therein lies the problem - that is impossible to do in
    general).
  • Gridding: agglomerate all points in a grid cell and treat
    them as one point (smooth the data set to this gridding
    level). The problem with gridding, often, is that
    information is lost and the data structure that holds the
    grid-cell information is very complex. With vertical
    methods (e.g., P-trees), all the information can be
    retained and griddings can be constructed very efficiently
    on demand; horizontal data structures can't do this.
  • Subspace restrictions (e.g., Principal Components,
    Subspace Clustering).
  • Gradient-based methods (e.g., the gradient tangent vector
    field of a response surface reduces the calculations to
    the number of dimensions, not the number of combinations
    of dimensions).
Two griddings (a bit-split sketch follows below):
  • j-hi gridding: the j high-order bits identify a grid cell
    and the rest identify points within a particular cell.
    Thus, j-hi cells are not necessarily cubical (unless all
    attribute bit-widths are the same).
  • j-lo gridding: the j low-order bits identify points within
    a particular cell and the rest identify a grid cell.
    Thus, j-lo cells always have a nice uniform shape
    (cubical).
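A sketch of the two bit-splits for a single attribute value (the bit-width b = 6 and j = 2 are illustrative assumptions):

```python
def j_hi_grid(value, bit_width, j):
    """j-hi gridding: the j high-order bits are the grid-cell id;
    the remaining low-order bits locate the point inside that cell."""
    cell_id = value >> (bit_width - j)
    in_cell = value & ((1 << (bit_width - j)) - 1)
    return cell_id, in_cell

def j_lo_grid(value, bit_width, j):
    """j-lo gridding: the high-order (bit_width - j) bits are the grid-cell id;
    the j low-order bits locate the point inside that cell (cells are always cubical)."""
    cell_id = value >> j
    in_cell = value & ((1 << j) - 1)
    return cell_id, in_cell

v = 0b101101          # a 6-bit attribute value
j_hi_grid(v, 6, 2)    # (0b10, 0b1101)
j_lo_grid(v, 6, 2)    # (0b1011, 0b01)
```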
62
1-hi gridding of the vector space R(A1, A2, A3) in
which all bit-widths are the same (3), so each
grid cell contains 2^2 · 2^2 · 2^2 = 64 potential
points. Grid cells are identified by their
Peano id (Pid); internally, the points' cell
coordinates are shown - called the grid cell id (gci) -
and cell points (gcp) are identified by coordinates
within their cell.
[Figure: the grid cell with Pid = 001 (gci = 001) and its 64 potential points, each labeled by its in-cell coordinates gcp = (x1, x2, x3); the axes show the hi-bits of A1, A2, and A3]
63
2-hi gridding of the vector space R(A1, A2, A3) in
which all bit-widths are the same (3), so each
grid cell contains 2^1 · 2^1 · 2^1 = 8 points.
[Figure: the grid cell with Pid = 001.001 (gci = 00,00,11) and its 8 points, labeled gcp = (x1, x2, x3); the axes show the two hi-bits of A1, A2, and A3]
64
1-hi gridding of R(A1, A2, A3), with bit-widths of
3, 2, 3.
[Figure: the grid cell with Pid = 001 (gci = 001) and its 2^2 · 2^1 · 2^2 = 32 potential points, labeled gcp = (x1, x2, x3); the axes show the hi-bits of A1, A2, and A3]
65
2-hi gridding of R(A1, A2, A3), with bit-widths of
3, 2, 3 (each grid cell contains 2^1 · 2^0 · 2^1 = 4
potential points).
[Figure: one grid cell (Pid = 3.1.3) and its 4 potential points, labeled gcp = (x1, x2, x3); the axes show the two hi-bits of A1, A2, and A3]
66
HOBBit disks and rings (HOBBit = High Order Bifurcation
Bit). 4-lo grid where A1, A2, A3 have bit-widths
b1+1, b2+1, b3+1. HOBBit grid centers are points of the
form (exactly one per grid cell)
x = (x1,b1..x1,4 1010, x2,b2..x2,4 1010, x3,b3..x3,4 1010),
where the xi,j range over all binary patterns. The HOBBit
disk about x of radius 2^0 is H(x, 2^0). Note: we have
switched the direction of A3.
[Figure: the 8 grid-cell points of H(x, 2^0), whose low-order bits run over 1010/1011 in each dimension]
67
H(x, 2^1): the HOBBit disk of radius 2^1 about a HOBBit grid
center point x = (x1,b1..x1,4 1010, x2,b2..x2,4 1010, x3,b3..x3,4 1010).
[Figure: the disk H(x, 2^1) on axes A1, A2, A3, with its corner points labeled; the low-order bit patterns range over 1000..1011 in each dimension]
68
The regions of H(x, 2^1) are as follows:
69
These REGIONS are labeled with the dimensions in
which the length is increased (e.g., all three
dimensions are increased below).
[Figure: the 123-REGION on axes A1, A2, A3]
70
[Figure: the 13-REGION on axes A1, A2, A3]
71
[Figure: the 23-REGION on axes A1, A2, A3]
72
[Figure: the 12-REGION on axes A1, A2, A3]
73
[Figure: the 3-REGION on axes A1, A2, A3]
74
[Figure: the 2-REGION on axes A1, A2, A3]
75
[Figure: the 1-REGION on axes A1, A2, A3]
76
[Figure: H(x, 2^0) and its 123-REGION, on axes A1, A2, A3]
77
(No Transcript)
78
  • Algorithm (for computing gradients)
  • Select an outlier threshold, ot (points without
    neighbors in their ot L∞-disk are outliers; that
    is, there is no gradient at these outlier points -
    the instantaneous rate of response change is zero).
  • Create a j-lo grid with j = ot (see the
    previous slides, where HOBBit disks are built
    out from HOBBit centers
  • x = ( x1,b1..x1,ot+1 1010, ..., xn,bn..xn,ot+1 1010 ),
    with the xi,j ranging over all binary patterns).
  • Pick a point x in R. Build out alternating
    one-sided rings centered at x until a neighbor is
    found or the radius ot is exceeded (in which case x
    is declared an outlier). If a neighbor is found
    at a radius ri < ot (2^j), then ∂f/∂xk(x) is
    estimated by the formula below.
  • Note: one can use L∞-HOBBit or ordinary L∞
    distance.
  • Note: one-sided means that each successive build-out
    increases alternately only in the positive
    direction in all dimensions, then only in the
    negative direction in all dimensions.
  • Note: building out HOBBit disks from a HOBBit
    center automatically gives one-sided rings (a
    built-out ring is defined to be the built-out
    disk minus the previous built-out disk), as shown
    in the next few slides.
  • Estimate: ( RootCount[D(x,ri)] - RootCount[D(x,ri)_k] ) / Δxk,
    where D(x,ri)_k is D(x,ri-1) expanded
    in all dimensions except k.
  • Alternatively, in step 3, actually calculate the mean
    (or median?) of the new points encountered in
    D(x,ri) (we have a P-tree mask for the set, so
    this is trivial) and measure the xk-distance.
  • NOTE: we might want to go one more ring out to see
    whether we get the same or a similar gradient.
    A schematic sketch of the estimate follows below.
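A schematic rendering of the estimate in the formula above; plain point sets and their sizes stand in for P-tree masks and root counts, and the coordinates are illustrative:

```python
def root_count(disk_points):
    """Stand-in for the root count of a P-tree mask: just the number of covered points."""
    return len(disk_points)

def partial_estimate(disk, disk_except_k, delta_xk):
    """( RootCount[D(x,ri)] - RootCount[D(x,ri)_k] ) / delta_xk, where D(x,ri)_k is the
    previous disk built out in every dimension except k."""
    return (root_count(disk) - root_count(disk_except_k)) / delta_xk

# Worked example matching slide 80: the build-out picks up 1 new point along dimension 2
D_ri = {(10, 10, 10), (9, 11, 10)}      # 2 points in the new disk
D_ri_2 = {(10, 10, 10)}                 # 1 point when dimension 2 is not expanded
partial_estimate(D_ri, D_ri_2, delta_xk=-1)   # (2 - 1) / (-1) = -1
```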

79
[Figure: first build-out; disks H(x,2^1) and H(x,2^1)_1 around the HOBBit center x = (x1,b1..x1,4 1010, x2,b2..x2,4 1010, x3,b3..x3,4 1010), with the first new point and the estimated gradient]
80
Estimate ∂f/∂x2(x) = ( RootCount[D(x,ri)] - RootCount[D(x,ri)_2] ) / Δx2 = (2-1)/(-1) = -1
[Figure: disks H(x,2^1) and H(x,2^1)_2, with the estimated gradient]
81
Estimate ∂f/∂x3(x) = ( RootCount[D(x,ri)] - RootCount[D(x,ri)_3] ) / Δx3 = (2-1)/(-1) = -1
[Figure: disks H(x,2^1) and H(x,2^1)_3, with the estimated gradient]
82
[Figure: a second example; the disk H(x,2^1) around the HOBBit center x, with the first new point and the estimated gradient]
83
Estimate ∂f/∂x1(x) = ( RootCount[D(x,ri)] - RootCount[D(x,ri)_1] ) / Δx1 = (2-2)/(-1) = 0
[Figure: disks H(x,2^1) and H(x,2^1)_1 around the HOBBit center x, with the estimated gradient]
84
[Figure: disks H(x,2^1) and H(x,2^1)_2, with the estimated gradient]
85
[Figure: disks H(x,2^1) and H(x,2^1)_3, with the estimated gradient]
86
[Figure: the resulting estimated gradient for H(x,2^1)]
87
[Figure: a third example; disks H(x,2^1) and H(x,2^1)_1 around the HOBBit center x, with the first new point]
88
Estimate ∂f/∂x2(x) = ( RootCount[D(x,ri)] - RootCount[D(x,ri)_2] ) / Δx2 = (2-2)/(-1) = 0
[Figure: disks H(x,2^1) and H(x,2^1)_2]
89
Estimate ∂f/∂x3(x) = ( RootCount[D(x,ri)] - RootCount[D(x,ri)_3] ) / Δx3 = (2-2)/(-1) = 0
[Figure: disks H(x,2^1) and H(x,2^1)_3]
90
Intuitively, this gradient estimation method
seems to work. Next we consider a potential
accuracy improvement, in which we take the medoid
of all new points as the gradient (or, more
accurately, as the point to which we climb in any
response-surface hill-climbing technique).
[Figure: the disk H(x,2^1)]
91
Estimate the gradient arrowhead as being at the
medoid of the new point set (or, more correctly,
estimate the next hill-climb step). Note: if the
original points are truly part of a strong
cluster, the hill climb will be excellent.
[Figure: the disk H(x,2^1), the new points, and the new points' centroid]
92
Estimate the gradient arrowhead as being at the
medoid of the new point set (or, more correctly,
estimate the next hill-climb step). Note: if the
original points are not truly part of a strong
cluster, the weak hill climb will indicate that.
[Figure: the disk H(x,2^1), the new points, and the new points' centroid]
93
[Figure: disks H(x,2^2) and H(x,2^2)_1, with the first new point]
94
To evaluate how well the formula estimates
the gradient, it is important to consider all
cases of the new point appearing in one of these
regions (if more than one point appears, the
gradient components are additive, so it suffices
to consider one point at a time).
[Figure: the disk H(x,2) and its regions]
95
To evaluate how well the formula estimates the
gradient, it is important to consider all cases
of the new point appearing in one of these regions
(if more than one point appears, the gradient
components add).
96
(No Transcript)
97
H(x, 2^3)
Notice that the HOBBit center moves more and more
toward the true center as the grid size increases.
98
Grid-based Gradients and Hill Climbing
  • If we are using gridding to produce the gradient
    vector field of a response surface, might we
    always vary Δxi in the positive direction only?
    How can that be done most efficiently?
  • j-lo gridding, building out HOBBit rings from
    HOBBit grid centers (see the previous slides,
    where this approach was used), or j-lo
    gridding, building out HOBBit rings from lo-value
    grid points (ending in j 0-bits):
  • x = ( x1,b1..x1,j+1 0..0, ..., xn,bn..xn,j+1 0..0 )
  • Ordinary j-lo gridding, building out rings from
    lo-value ids (ending in j zero bits)
  • Ordinary j-lo gridding, building out rings from
    true centers.
  • Other? (There are many other possibilities, but
    we will first explore option 2.)
  • Using j-lo gridding with j = 3 and lo-value cell
    identifiers is shown on the next slide.
  • Of course, we need not use HOBBit build-out.
  • With ordinary unit-radius build-out, the results
    are more exact, but the calculations may be
    more complex.

99
HOBBit j-lo rings using lo-value cell ids:
x = ( x1,b1..x1,j+1 0..0, ..., xn,bn..xn,j+1 0..0 )
100
Ordinary j-lo rings using lo-value cell ids:
x = ( x1,b1..x1,j+1 0..0, ..., xn,bn..xn,j+1 0..0 )
PDisk(x,3) - PDisk(x,2),
PDisk(x,2) - PDisk(x,1), where PDisk(x,i) =
P_{x,b} .. P_{x,j+1} P_j .. P_{i+1}