1
Incremental Clustering And Dynamic Information
Retrieval
  • By
  • MOSES CHARIKAR, CHANDRA CHEKURI,
  • TOMAS FEDER, AND
  • RAJEEV MOTWANI
  • Presented By Sarah Hegab

2
Outline
  • Motivation
  • Main Problem
  • Hierarchical Agglomerative Clustering
  • A Model: Incremental Clustering
  • Different Incremental Algorithms
  • Lower Bounds for Incremental Algorithms
  • Dual Problem

3
I. Main Problem
  • The clustering problem is as follows: given n
    points in a metric space M, partition the points
    into k clusters so as to minimize the maximum
    cluster diameter (a sketch of this objective
    follows).
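To make the objective concrete, here is a minimal sketch (not from the paper) that evaluates a clustering under the diameter measure; the Euclidean metric and the helper names are illustrative assumptions, since the problem is stated for an arbitrary metric space.

```python
import math

def dist(p, q):
    # Illustrative metric; the problem allows any metric space M.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def max_cluster_diameter(clusters):
    """Cost of a clustering: the largest intra-cluster distance."""
    worst = 0.0
    for cluster in clusters:
        for i, p in enumerate(cluster):
            for q in cluster[i + 1:]:
                worst = max(worst, dist(p, q))
    return worst

# Two clusters of points on the line: cost is the wider one's diameter.
print(max_cluster_diameter([[(0.0,), (1.0,)], [(5.0,), (5.5,)]]))  # 1.0
```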

4
1. Greedy Incremental Clustering
  • Center-Greedy
  • Diameter-Greedy

5
a) Center-Greedy
The center-greedy algorithm associates a center
with each cluster and merges the two clusters
whose centers are closest. The center of the old
cluster with the larger radius becomes the new
center (a code sketch follows).
Theorem: The center-greedy algorithm's
performance ratio has a lower bound of 2k - 1.
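A minimal sketch of one center-greedy step, following the rule above; the `Cluster` class and helper names are hypothetical, and the Euclidean metric again stands in for a general metric.

```python
import math
from itertools import combinations

def dist(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

class Cluster:
    def __init__(self, center):
        self.center, self.points = center, [center]

    def radius(self):
        return max(dist(self.center, p) for p in self.points)

def center_greedy_insert(clusters, point, k):
    """Add the point as a singleton; if that gives more than k clusters,
    merge the two whose centers are closest, keeping as new center the
    center of the old cluster with the larger radius."""
    clusters.append(Cluster(point))
    if len(clusters) <= k:
        return
    i, j = min(combinations(range(len(clusters)), 2),
               key=lambda ij: dist(clusters[ij[0]].center,
                                   clusters[ij[1]].center))
    a, b = clusters[i], clusters[j]
    merged = Cluster((a if a.radius() >= b.radius() else b).center)
    merged.points = a.points + b.points
    clusters[:] = [c for t, c in enumerate(clusters) if t not in (i, j)]
    clusters.append(merged)
```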
6
a) Center-Greedy cont.
  • Proof
  • 1. Tree construction

(Figure: the tree construction; labels 0 and K2.)
7
a) Center-Greedy cont.
  • 2. Tree → graph
  • Set Ai (in our example Ai = {v1, v2, v3, v4})

(Figure: graph on vertices v1, . . . , v5 with
sets S0, . . . , S3 and edge weights 1, 1 - ε1,
1 - ε2, 1 - ε3.)
8
Post-Order Traversal
9
a) Center-Greedy cont.
  • Claims
  • For 1 ≤ i ≤ 2k - 1, Ai is the set of clusters
    of center-greedy which contain more than one
    vertex after the k + i vertices v1, . . . ,
    v_{k+i} are given.
  • There is a k-clustering of G of diameter 1. The
    clustering which achieves this diameter is
    S0 ∪ S1, . . . , S_{2k-2} ∪ S_{2k-1}.

10
(Figure: the graph K4.)
11
Competitiveness of Center-Greedy
  • Theorem: The center-greedy algorithm has a
    performance ratio of 2k - 1 in any metric space.

12
b) Diameter-Greedy
The diameter-greedy algorithm always merges the
two clusters which minimize the diameter of the
resulting merged cluster (a sketch follows).
Theorem: The diameter-greedy algorithm's
performance ratio is Ω(log k), even on the line.
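The analogous sketch for the diameter-greedy rule, merging the pair whose union has the smallest diameter; the list-of-lists representation is an illustrative assumption.

```python
import math
from itertools import combinations

def dist(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def diameter(points):
    # Diameter of a cluster: its largest pairwise distance.
    return max((dist(p, q) for p, q in combinations(points, 2)),
               default=0.0)

def diameter_greedy_insert(clusters, point, k):
    """Add the point as a singleton; if that gives more than k clusters,
    merge the two clusters whose union has minimum diameter."""
    clusters.append([point])
    if len(clusters) <= k:
        return
    i, j = min(combinations(range(len(clusters)), 2),
               key=lambda ij: diameter(clusters[ij[0]] + clusters[ij[1]]))
    merged = clusters[i] + clusters[j]
    clusters[:] = [c for t, c in enumerate(clusters) if t not in (i, j)]
    clusters.append(merged)
```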
13
b) Diameter-Greedy cont.
  • Proof
  • 1) Assumptions
  • Ui = ∪_{j=1}^{Fi} {pij, qij, rij, sij},
  • Vi = ∪_{j=1}^{Fi} {qij, rij},
  • Wi = ∪_{j=1}^{Fi} {pij, qij, rij},
  • Xi = ∪_{j=1}^{Fi} {pij, qij, rij, sij},
  • Yi = ∪_{j=1}^{Fi} {pij, qij, rij, sij},
  • Zi = ∪_{j=1}^{Fi} {pij, qij, rij, sij}.

14
b) Diameter-Greedy cont.
  • Proof
  • 2) Invariant: When the last element of Kt is
    received, diameter-greedy's k + 1 clusters are
  • (∪_{i=1}^{t-2} Zi) ∪ Y_{t-1} ∪ Xt ∪ (∪_{i≥t+1} Vi).
  • Since there are k + 1 clusters, two of the
    clusters have to be merged, and the algorithm
    merges two clusters in V_{t+1} to form a cluster
    of diameter t + 1. Without loss of generality,
    we may assume that the clusters merged are
    q_{(t+1)1} and r_{(t+1)1}.

15
Competitiveness of Diameter-Greedy
  • Theorem: For k = 2, the diameter-greedy
    algorithm has a performance ratio of 3 in any
    metric space.

16
2. Doubling Algorithm
  1. Deterministic
  2. Randomized
  3. Oblivious
  4. Randomized Oblivious

17
a) Deterministic doubling algorithm
  • The algorithm works in phases.
  • At the start of phase i it has k + 1 clusters.
  • It uses constants α and β with α/(α - 1) ≤ β.
  • At the start of phase i the following is
    assumed:
  • 1. for each cluster Cj, the radius of Cj,
    defined as max_{p ∈ Cj} d(cj, p), is at most
    α·di;
  • 2. for each pair of clusters Cj and Cl, the
    inter-center distance d(cj, cl) > di;
  • 3. di ≤ OPT.

18
a) Deterministic doubling algorithm
  • Each phase has two stages
  • 1- Merging stage, in which the algorithm reduces
    the number of clusters by merging certain pairs
  • 2-Update stage, in which the algorithm accepts
    new updates and tries to maintain at most k
    clusters without increasing the radius of the
    clusters or violating the invariants
  • A phase ends when number of clusters exceeds k

19
a) Deterministic doubling algorithm
  • Definition: The t-threshold graph on a set of
    points P = {p1, p2, . . . , pn} is the graph
    G(P, E) such that (pi, pj) ∈ E if and only if
    d(pi, pj) ≤ t (sketched below).
  • The merging stage sets d_{i+1} = β·di and forms
    the d_{i+1}-threshold graph G on the centers
    c1, . . . , c_{k+1}.
  • The new clusters are C1, . . . , Cm. If
    m = k + 1, this ends phase i.
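The threshold graph itself is easy to compute; a short sketch with edges as index pairs and the Euclidean metric assumed:

```python
import math

def dist(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def threshold_graph(points, t):
    """Edges (i, j) of the t-threshold graph: d(points[i], points[j]) <= t."""
    n = len(points)
    return [(i, j) for i in range(n) for j in range(i + 1, n)
            if dist(points[i], points[j]) <= t]
```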

20
a) Deterministic doubling algorithm
  • Lemma: The pairwise distance between cluster
    centers after the merging stage of phase i is
    at least d_{i+1}.
  • Lemma: The radius of the clusters after the
    merging stage of phase i is at most
    d_{i+1} + α·di ≤ α·d_{i+1}.
  • The update stage continues while the number of
    clusters is at most k. It is restricted by the
    radius bound α·d_{i+1}. Then phase i ends.

21
a) Deterministic doubling algorithm
  • Initialization: the algorithm waits until
    k + 1 points have arrived, then enters phase 1
    with each point as the center of a cluster
    containing just itself, and d1 set to the
    distance between the closest pair of points.
    (A sketch of the whole algorithm follows.)
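Putting the last few slides together, here is a sketch of the deterministic doubling algorithm. The merging stage is realized as a greedy maximal independent set in the d_{i+1}-threshold graph on the centers, which is one standard way to satisfy the two lemmas above; that detail, and all names, are assumptions of this sketch.

```python
import math

ALPHA, BETA = 2.0, 2.0   # satisfy alpha/(alpha - 1) <= beta; give ratio 8

def dist(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

class DoublingClusterer:
    def __init__(self, k):
        self.k = k
        self.pending = []     # first k + 1 points, before phase 1
        self.clusters = None  # list of (center, points) pairs
        self.d = None         # current lower-bound estimate d_i

    def insert(self, p):
        if self.clusters is None:
            # Initialization: wait for k + 1 points; d1 = closest pair.
            self.pending.append(p)
            if len(self.pending) == self.k + 1:
                self.d = min(dist(a, b)
                             for i, a in enumerate(self.pending)
                             for b in self.pending[i + 1:])
                self.clusters = [(q, [q]) for q in self.pending]
            return
        # Update stage: place p in a cluster if the radius bound allows.
        for center, pts in self.clusters:
            if dist(center, p) <= ALPHA * self.d:
                pts.append(p)
                return
        # Otherwise p opens a new cluster; too many clusters end the phase.
        self.clusters.append((p, [p]))
        while len(self.clusters) > self.k:
            self._merging_stage()

    def _merging_stage(self):
        # Merge along a greedy maximal independent set of centers in the
        # d_{i+1}-threshold graph: surviving centers stay more than
        # d_{i+1} apart, and radii stay at most d_{i+1} + ALPHA * d_i.
        d_next = BETA * self.d
        merged, used = [], [False] * len(self.clusters)
        for i, (ci, pi) in enumerate(self.clusters):
            if used[i]:
                continue
            group = list(pi)
            for j in range(i + 1, len(self.clusters)):
                cj, pj = self.clusters[j]
                if not used[j] and dist(ci, cj) <= d_next:
                    used[j] = True
                    group.extend(pj)
            merged.append((ci, group))
        self.clusters, self.d = merged, d_next
```

With α = β = 2, the update stage's radius bound is 2·di and each merging stage doubles di, which is what the ratio-8 analysis on the next slide tracks.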

22
a) Deterministic doubling algorithm
  • Lemma: The k + 1 clusters at the end of the
    ith phase satisfy the following conditions:
  • 1. The radius of the clusters is at most
    α·d_{i+1}.
  • 2. The pairwise distance between the cluster
    centers is at least d_{i+1}.
  • 3. d_{i+1} ≤ OPT, where OPT is the diameter of
    the optimal clustering for the current set of
    points.
  • Theorem: The doubling algorithm has performance
    ratio 8 in any metric space.

23
a) Deterministic doubling algorithm
  • Example to show the analysis is tight (k > 3):
  • The input consists of k + 3 points
    p1, . . . , p_{k+3}.
  • The points p1, . . . , p_{k+1} have pairwise
    distance 1; p_{k+2} and p_{k+3} have distance 4
    from the others, and 8 from each other.

24
b) Randomized doubling algorithm
  • Choose a random value r from [1/e, 1] according
    to the probability density function 1/r (see
    the sampling sketch below).
  • Let x be the minimum pairwise distance of the
    first k + 1 points, and set d1 = r·x.
  • β = e, α = e/(e - 1).
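Drawing r from [1/e, 1] with density 1/r is a one-liner by inverse-transform sampling: on this interval the CDF is 1 + ln r, so r = e^{u-1} for u uniform on [0, 1]. A sketch:

```python
import math
import random

def sample_r():
    """r in [1/e, 1] with density 1/r (CDF 1 + ln r on this interval)."""
    u = random.random()       # uniform on [0, 1)
    return math.exp(u - 1.0)  # inverse of the CDF
```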

25
b) Randomized doubling algorithm
  • Theorem: The randomized doubling algorithm has
    expected performance ratio 2e in any metric
    space. The same bound is also achieved for the
    radius measure.

26
c) Oblivious clustering algorithm
  • Does not need to know k.
  • Assume we have an upper bound of 1 on the
    maximum distance between points.
  • Points are maintained in a tree.

27
c) Oblivious clustering algorithm cont.
The root is at depth 0.
Each vertex at depth i > 0 is within distance
1/2^{i-1} of its parent and at distance greater
than 1/2^i from every other vertex at depth i
(a sketch of the insertion rule follows).
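A sketch of how a new point can be inserted while preserving these invariants: descend from the root, and whenever some vertex at the current depth is within 1/2^depth of the point (so the separation rule blocks that depth), that vertex can serve as the parent one level lower. All pairwise distances are assumed to be at most 1 and points distinct; the class layout is hypothetical.

```python
import math

def dist(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

class Node:
    def __init__(self, point, depth):
        self.point, self.depth, self.children = point, depth, []

class ObliviousTree:
    def __init__(self, first_point):
        self.root = Node(first_point, 0)
        self.levels = {0: [self.root]}   # depth -> vertices at that depth

    def insert(self, p):
        parent, depth = self.root, 1
        while True:
            # p may live at this depth only if it is more than 1/2^depth
            # from every vertex already placed there.
            blocker = next((v for v in self.levels.get(depth, [])
                            if dist(p, v.point) <= 0.5 ** depth), None)
            if blocker is None:
                node = Node(p, depth)
                parent.children.append(node)
                self.levels.setdefault(depth, []).append(node)
                return node
            # The blocker is within 1/2^depth = 1/2^((depth+1)-1) of p,
            # so it is a valid parent for p one level further down.
            parent, depth = blocker, depth + 1
```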
28
Illustration of the Oblivious Clustering Algorithm
29
c) Oblivious clustering algorithm cont.
  • How do we obtain the k clusters from the tree?
  • If k is given, let i be the greatest depth
    containing at most k points.
  • These points are the k cluster centers. The
    subtrees of the vertices at depth i are the
    clusters.
  • As points are added, the number of vertices at
    depth i increases; if it goes beyond k, then we
    change i to i - 1, collapsing certain clusters;
    otherwise, the new point is inserted into one
    of the existing clusters (a sketch follows).
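A sketch of the extraction rule, continuing the ObliviousTree sketch above. It reads the slide with the nesting convention that a vertex also counts at every depth below its own (in the style of cover trees), so the number of candidate centers is nondecreasing in the depth: pick the greatest depth i with at most k centers, and assign each point to its deepest ancestor-or-self of depth at most i. This reading, and the helpers, are assumptions.

```python
def all_nodes(node):
    yield node
    for child in node.children:
        yield from all_nodes(child)

def k_clusters(tree, k):
    """Return at most k clusters of points from an ObliviousTree."""
    nodes = list(all_nodes(tree.root))
    depths = sorted(n.depth for n in nodes)
    # Greatest depth i such that at most k vertices have depth <= i.
    i = depths[-1] if len(nodes) <= k else depths[k] - 1
    clusters = {}
    def assign(node, center):
        if node.depth <= i:
            center = node            # deeper centers take over subtrees
        clusters.setdefault(id(center), []).append(node.point)
        for child in node.children:
            assign(child, center)
    assign(tree.root, tree.root)
    return list(clusters.values())
```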

30
c) Oblivious clustering algorithm cont.
  • Theorem: The algorithm that outputs the k
    clusters obtained from the tree construction
    has performance ratio 8 for the diameter
    measure and the radius measure.
  • Suppose the optimal diameter d satisfies
    1/2^{i+1} ≤ d < 1/2^i.
  • Then points at depth i are in different optimal
    clusters, so there are at most k of them.
  • Let j ≥ i be the greatest depth containing at
    most k points.
  • Every point in a subtree rooted at depth j is
    within distance 1/2^j + 1/2^{j+1} + 1/2^{j+2} +
    · · · ≤ 1/2^{j-1} ≤ 4d of the subtree's root.

31
d) Randomized Oblivious
  • The distance threshold for depth i is r/e^i.
  • r is chosen once at random from [1, e],
    according to the PDF 1/r.
  • The expected diameter is at most 2e times the
    optimal diameter.

32
Lower Bounds
  • Theorem 1: For k ≥ 2, there are lower bounds
    of 2 and 2 - 1/2^{k/2} on the performance ratio
    of deterministic and randomized algorithms,
    respectively, for incremental clustering on the
    line.

33
Lower Bounds cont.
  • Theorem 2: There is a lower bound of 1 + √2 on
    the performance ratio of any deterministic
    incremental clustering algorithm for arbitrary
    metric spaces.

34
Lower Bounds cont.
35
Lower Bounds cont.
  • Theorem 3: For any ε > 0 and k > 2, there is a
    lower bound of 2 - ε on the performance ratio
    of any randomized incremental algorithm.

36
Lower Bounds cont.
  • Theorem 4: For the radius measure, no
    deterministic incremental clustering algorithm
    has a performance ratio better than 3, and no
    randomized algorithm has a ratio better than
    3 - ε for any fixed ε > 0.

37
II. Dual Problem
  • For a sequence of points p1, p2, . . . , pn ∈
    R^d, cover each point with a unit ball in R^d
    as it arrives, so as to minimize the total
    number of balls used (a sketch follows).
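A sketch of the flavor of an incremental covering strategy, not the paper's exact construction (which uses a covering of density O(d log d) via Rogers' theorem, next slide): fix in advance a covering of space by unit balls, and when a point arrives, mark the pre-chosen ball containing it as used. In the plane, unit balls centered on a square grid of spacing √2 cover everything, since every point is within 1 of its nearest grid center; all names here are assumptions.

```python
import math

SPACING = math.sqrt(2.0)  # in the plane, every point lies within 1 of
                          # the nearest center of this grid

def covering_ball_center(p):
    """Center of the fixed grid ball assigned to point p (2-D sketch)."""
    return (round(p[0] / SPACING) * SPACING,
            round(p[1] / SPACING) * SPACING)

def dual_clustering(stream):
    """Incrementally cover arriving points with unit balls drawn from a
    fixed grid covering; returns the set of ball centers used."""
    used = set()
    for p in stream:
        used.add(covering_ball_center(p))
    return used

# Points arriving online in the plane; two grid balls suffice here.
print(len(dual_clustering([(0.1, 0.2), (0.3, -0.1), (3.0, 3.0)])))  # 2
```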

38
II. Dual Problem
  • Rogers' Theorem: R^d can be covered by any
    convex shape with covering density O(d log d).
  • Theorem: For the dual clustering problem in
    R^d, there is an incremental algorithm with
    performance ratio O(2^d d log d).
  • Theorem: For the dual clustering problem in
    R^d, any incremental algorithm must have
    performance ratio Ω(log d / log log log d).

39
Thank You