
Incremental Clustering and Dynamic Information Retrieval

- By Moses Charikar, Chandra Chekuri, Tomas Feder, and Rajeev Motwani
- Presented by Sarah Hegab

Outline

- Motivation
- Main Problem
- Hierarchical Agglomerative Clustering
- A Model for Incremental Clustering
- Different incremental algorithms
- Lower Bounds for incremental algorithms
- Dual Problem

I. Main Problem

- The clustering problem is as follows: given n points in a metric space M, partition the points into k clusters so as to minimize the maximum cluster diameter.
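As a concrete reading of the objective, here is a minimal sketch (the function and variable names are illustrative, not from the paper):

```python
from itertools import combinations

def diameter(cluster, d):
    # Largest pairwise distance within a single cluster.
    return max((d(p, q) for p, q in combinations(cluster, 2)), default=0.0)

def clustering_cost(clusters, d):
    # The objective minimized by the k-clustering problem:
    # the maximum diameter over all clusters.
    return max(diameter(c, d) for c in clusters)
```

For points on the line with d(p, q) = |p − q|, the partition {0, 1}, {10, 12, 13} has cost 3.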

1. Greedy Incremental Clustering

- Center-Greedy
- Diameter-greedy

a) Center-Greedy

The center-greedy algorithm associates a center with each cluster and merges the two clusters whose centers are closest; the center of the old cluster with the larger radius becomes the new center.

Theorem: The center-greedy algorithm's performance ratio has a lower bound of 2k − 1.
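The merge rule above can be sketched as runnable code, under the assumption that clusters store their points explicitly so radii can be recomputed (names and data layout are my own):

```python
def center_greedy(points, k, d):
    # Incremental center-greedy sketch: keep at most k clusters;
    # when a (k+1)-th cluster appears, merge the two whose centers
    # are closest, keeping the center of the larger-radius cluster.
    def radius(c):
        return max(d(c["center"], q) for q in c["points"])

    clusters = []
    for p in points:
        clusters.append({"center": p, "points": [p]})
        if len(clusters) > k:
            pairs = [(i, j) for i in range(len(clusters))
                     for j in range(i + 1, len(clusters))]
            i, j = min(pairs, key=lambda ij: d(clusters[ij[0]]["center"],
                                               clusters[ij[1]]["center"]))
            a, b = clusters[i], clusters[j]
            center = a["center"] if radius(a) >= radius(b) else b["center"]
            merged = {"center": center, "points": a["points"] + b["points"]}
            clusters = [c for t, c in enumerate(clusters) if t not in (i, j)]
            clusters.append(merged)
    return clusters
```

On the line with d(p, q) = |p − q|, feeding 0, 1, 10, 11 with k = 2 yields the clusters {0, 1} and {10, 11}.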

a) Center-Greedy cont.

- Proof
- 1. Tree construction

[Figure: the tree construction (K2).]

a) Center-Greedy cont.

- 2. Tree → Graph
- Set A_i (in our example A_i = {v1, v2, v3, v4})

[Figure: graph on vertices v1, . . . , v5 with edge weights 1, 1 − ε1, 1 − ε2, 1 − ε3 and sets S0, S1, S2, S3; the vertices are given in post-order traversal.]

a) Center-Greedy cont.

- Claims
- For 1 ≤ i ≤ 2k − 1, A_i is the set of clusters of center-greedy which contain more than one vertex after the k + i vertices v1, . . . , v_{k+i} are given.
- There is a k-clustering of G of diameter 1. The clustering which achieves this diameter is S0 ∪ S1, . . . , S_{2k−2} ∪ S_{2k−1}.


Competitiveness of Center-Greedy

- Theorem: The center-greedy algorithm has a performance ratio of 2k − 1 in any metric space.

b) Diameter-Greedy

The diameter-greedy algorithm always merges the two clusters which minimize the diameter of the resulting merged cluster.

Theorem: The diameter-greedy algorithm's performance ratio is Ω(log k), even on the line.
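The merge rule can be sketched as follows (illustrative names; clusters are kept as plain point lists):

```python
from itertools import combinations

def diameter_greedy(points, k, d):
    # Incremental diameter-greedy sketch: when k+1 clusters exist,
    # merge the pair of clusters whose union has the smallest diameter.
    clusters = []
    for p in points:
        clusters.append([p])
        if len(clusters) > k:
            def merged_diameter(pair):
                union = clusters[pair[0]] + clusters[pair[1]]
                return max(d(a, b) for a, b in combinations(union, 2))
            i, j = min(combinations(range(len(clusters)), 2),
                       key=merged_diameter)
            merged = clusters[i] + clusters[j]
            clusters = [c for t, c in enumerate(clusters)
                        if t not in (i, j)]
            clusters.append(merged)
    return clusters
```

On easy inputs it behaves like center-greedy; the Ω(log k) lower bound comes from adversarial sequences on the line such as the construction that follows.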

b) Diameter-Greedy cont.

- Proof
- 1) Assumptions:
- U_i = ∪_{j=1}^{F_i} {p_{ij}, q_{ij}, r_{ij}, s_{ij}},
- V_i = ∪_{j=1}^{F_i} {q_{ij}, r_{ij}},
- W_i = ∪_{j=1}^{F_i} {p_{ij}, q_{ij}, r_{ij}},
- X_i = ∪_{j=1}^{F_i} {p_{ij}, q_{ij}, r_{ij}, s_{ij}},
- Y_i = ∪_{j=1}^{F_i} {p_{ij}, q_{ij}, r_{ij}, s_{ij}},
- Z_i = ∪_{j=1}^{F_i} {p_{ij}, q_{ij}, r_{ij}, s_{ij}}.

b) Diameter-Greedy cont.

- Proof
- 2) Invariant: when the last element of K_t is received, diameter-greedy's k + 1 clusters are (∪_{i=1}^{t−2} Z_i) ∪ Y_{t−1} ∪ X_t ∪ (∪_{i≥t+1} V_i).
- Since there are k + 1 clusters, two of them have to be merged, and the algorithm merges two clusters in V_{t+1} to form a cluster of diameter δ(t + 1). Without loss of generality, we may assume that the clusters merged are q_{(t+1)1} and r_{(t+1)1}.

Competitiveness of Diameter-Greedy

- Theorem: For k = 2, the diameter-greedy algorithm has a performance ratio of 3 in any metric space.

2. Doubling Algorithm

- Deterministic
- Randomized
- Oblivious
- Randomized Oblivious

a) Deterministic doubling algorithm

- The algorithm works in phases.
- At the start of phase i it has k + 1 clusters.
- It uses parameters α and β such that α/(α − 1) ≤ β.
- At the start of phase i the following invariants are assumed:
- 1. for each cluster C_j, the radius of C_j, defined as max_{p ∈ C_j} d(c_j, p), is at most α·d_i;
- 2. for each pair of clusters C_j and C_l, the inter-center distance d(c_j, c_l) is greater than d_i;
- 3. d_i ≤ OPT.

a) Deterministic doubling algorithm

- Each phase has two stages:
- 1. Merging stage, in which the algorithm reduces the number of clusters by merging certain pairs.
- 2. Update stage, in which the algorithm accepts new updates and tries to maintain at most k clusters without increasing the radius of the clusters or violating the invariants.
- A phase ends when the number of clusters exceeds k.

a) Deterministic doubling algorithm

- Definition: The t-threshold graph on a set of points P = {p1, p2, . . . , pn} is the graph G(P, E) such that (pi, pj) ∈ E if and only if d(pi, pj) ≤ t.
- The merging stage defines d_{i+1} = β·d_i and builds the d_{i+1}-threshold graph on the centers c1, . . . , c_{k+1}.
- The new clusters are C1, . . . , Cm. If m = k + 1, this ends phase i.
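A sketch of the threshold graph, plus one way to realize a merging pass over it. The greedy absorption below is an illustrative simplification, not necessarily the paper's exact procedure, but it guarantees the property the lemmas need: surviving centers are pairwise more than t apart.

```python
def threshold_graph(points, t, d):
    # t-threshold graph: edge (i, j) iff d(p_i, p_j) <= t.
    n = len(points)
    return {(i, j) for i in range(n) for j in range(i + 1, n)
            if d(points[i], points[j]) <= t}

def merge_pass(centers, t, d):
    # Greedy pass: each surviving center absorbs every remaining
    # center within distance t; survivors end up pairwise > t apart.
    remaining = list(range(len(centers)))
    groups = []
    while remaining:
        c = remaining.pop(0)
        near = [j for j in remaining if d(centers[c], centers[j]) <= t]
        remaining = [j for j in remaining if j not in near]
        groups.append([c] + near)
    return groups
```

For example, centers 0, 0.5, 3 with t = 1 yield one group {0, 0.5} and the singleton {3}.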

a) Deterministic doubling algorithm

- Lemma: The pairwise distance between cluster centers after the merging stage of phase i is at least d_{i+1}.
- Lemma: The radius of the clusters after the merging stage of phase i is at most d_{i+1} + α·d_i ≤ α·d_{i+1}.
- The update stage continues while the number of clusters is at most k; it is restricted by the radius bound α·d_{i+1}. Then phase i ends.

a) Deterministic doubling algorithm

- Initialization: the algorithm waits until k + 1 points have arrived, then enters phase 1 with each point as the center of a cluster containing just itself, and with d_1 set to the distance between the closest pair of points.

a) Deterministic doubling algorithm

- Lemma: The k + 1 clusters at the end of the ith phase satisfy the following conditions:
- 1. The radius of the clusters is at most α·d_{i+1}.
- 2. The pairwise distance between the cluster centers is at least d_{i+1}.
- 3. d_{i+1} ≤ OPT, where OPT is the diameter of the optimal clustering for the current set of points.
- Theorem: The doubling algorithm has performance ratio 8 in any metric space.

a) Deterministic doubling algorithm

- Example to show the analysis is tight:
- k > 3.
- The input consists of k + 3 points p1, . . . , p_{k+3}.
- The points p1, . . . , p_{k+1} have pairwise distance 1; p_{k+2} and p_{k+3} have distance 4 from the others and distance 8 from each other.

b) Randomized doubling algorithm

- Choose a random value r from [1/e, 1] according to the probability density function 1/r.
- Let x be the minimum pairwise distance among the first k + 1 points, and set d_1 = r·x.
- β = e, α = e/(e − 1).
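Drawing r from [1/e, 1] with density 1/r can be done by inverse-transform sampling: the CDF is F(r) = 1 + ln r, so r = e^(u−1) for u uniform on [0, 1]. A minimal sketch:

```python
import math
import random

def sample_r():
    # Inverse-transform sampling for density 1/r on [1/e, 1]:
    # F(r) = 1 + ln(r), so F^{-1}(u) = exp(u - 1).
    u = random.random()
    return math.exp(u - 1)
```

The expected value of r is the integral of r·(1/r) over [1/e, 1], i.e. 1 − 1/e ≈ 0.632.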

b) Randomized doubling algorithm

- Theorem: The randomized doubling algorithm has an expected performance ratio of 2e in any metric space. The same bound is also achieved for the radius measure.

c) Oblivious clustering algorithm

- Does not need to know k.
- Assume we have an upper bound of 1 on the maximum distance between points.
- Points are maintained in a tree.

c) Oblivious clustering algorithm cont.

- The root is at depth 0.
- A vertex at depth i > 0 is within distance 1/2^(i−1) of its parent and at distance greater than 1/2^i from every other vertex at depth i.

[Figure: illustration of the oblivious clustering algorithm's tree.]

c) Oblivious clustering algorithm cont.

- How do we obtain the k clusters from the tree?
- If k is given, let i be the greatest depth containing at most k vertices.
- These vertices are the k cluster centers; the subtrees rooted at the vertices at depth i are the clusters.
- As points are added, the number of vertices at depth i increases; if it goes beyond k, then we change i to i − 1, collapsing certain clusters; otherwise, the new point is inserted into one of the existing clusters.
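Point insertion maintaining the two tree invariants can be sketched as follows, with levels[i] holding the vertices at depth i (the data layout and names are my own, assuming all pairwise distances are at most 1):

```python
def insert(levels, p, d):
    # levels[0] holds the root; levels[i] holds the depth-i vertices.
    # Descend until p is more than 1/2**i away from every depth-i
    # vertex, then place p at that depth. Both invariants then hold:
    # p is within 1/2**(i-1) of some depth-(i-1) vertex (a parent),
    # and more than 1/2**i from the other depth-i vertices.
    i = 1
    while i < len(levels) and any(d(p, q) <= 0.5 ** i for q in levels[i]):
        i += 1
    if i == len(levels):
        levels.append([])
    levels[i].append(p)
```

For example, on the line (distances scaled into [0, 1]) with root 0.0: inserting 0.9 and 0.1 places both at depth 1, since they are more than 1/2 apart; inserting 0.92 then lands at depth 2, within 1/2 of its parent 0.9.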

c) Oblivious clustering algorithm cont.

- Theorem: The algorithm that outputs the k clusters obtained from the tree construction has performance ratio 8 for both the diameter measure and the radius measure.
- Proof sketch: suppose the optimal diameter d satisfies 1/2^(i+1) ≤ d < 1/2^i.
- Then the points at depth i are in different optimal clusters, so there are at most k of them.
- Let j ≥ i be the greatest depth containing at most k points.
- Each point in a subtree is within distance 1/2^j + 1/2^(j+1) + 1/2^(j+2) + · · · ≤ 1/2^(j−1) ≤ 4d of the subtree's root.

d) Randomized Oblivious

- The distance threshold for depth i is r/e^i.
- r is chosen once at random from [1, e] according to the PDF 1/r.
- The expected diameter is at most 2e·OPT.

Lower Bounds

- Theorem 1: For k ≥ 2, there is a lower bound of 2 and of 2 − (1/2)^(k/2) on the performance ratio of deterministic and randomized algorithms, respectively, for incremental clustering on the line.

Lower Bounds cont.

- Theorem 2: There is a lower bound of 1 + √2 on the performance ratio of any deterministic incremental clustering algorithm for arbitrary metric spaces.

Lower Bounds cont.

- Theorem 3: For any ε > 0 and k > 2, there is a lower bound of 2 − ε on the performance ratio of any randomized incremental algorithm.

Lower Bounds cont.

- Theorem 4: For the radius measure, no deterministic incremental clustering algorithm has a performance ratio better than 3, and no randomized algorithm has a ratio better than 3 − ε, for any fixed ε > 0.

II. Dual Problem

- For a sequence of points p1, p2, . . . , pn ∈ R^d, cover each point with a unit ball in R^d as it arrives, so as to minimize the total number of balls used.
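To make the online constraint concrete, here is a naive incremental baseline (my own illustration, not the paper's algorithm): whenever a point arrives uncovered, open a new unit ball centered on it; balls are never moved or removed, as the incremental model requires.

```python
import math

def incremental_cover(points, radius=1.0):
    # Naive online covering: when a point arrives uncovered,
    # open a new ball of the given radius centered on it.
    centers = []
    for p in points:
        if not any(math.dist(p, c) <= radius for c in centers):
            centers.append(p)
    return centers
```

For instance, the arrivals (0, 0), (0.5, 0), (3, 0) open two balls, centered at (0, 0) and (3, 0). The paper's O(2^d · d log d)-competitive algorithm instead places balls from a fixed cover of R^d, which is where Rogers' theorem enters.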

II. Dual Problem

- Rogers' Theorem: R^d can be covered by translates of any convex body with covering density O(d log d).
- Theorem: For the dual clustering problem in R^d, there is an incremental algorithm with performance ratio O(2^d · d log d).
- Theorem: For the dual clustering problem in R^d, any incremental algorithm must have performance ratio Ω(log d / log log log d).

Thank You