1
Clustering Data Streams
  • A presentation by George Toderici

2
Talk Outline
  1. Goals of the paper
  2. Notation reminder
  3. Clustering With Little Memory
  4. Data Stream Model
  5. Clustering with Data Streams
  6. Lower Bounds and Deterministic Algorithms
  7. Conclusion

3
Goals of the paper
  • Since the k-median problem is NP-hard, this paper
    develops approximation algorithms under the
    following constraints:
  • Minimize memory usage
  • Minimize CPU usage
  • Work both on general metric spaces and on the
    special case of Euclidean space

4
Talk Outline
  1. Goals of the paper
  2. Notation reminder
  3. Clustering With Little Memory
  4. Data Stream Model
  5. Clustering with Data Streams
  6. Lower Bounds and Deterministic Algorithms
  7. Conclusion

5
Notation Reminder
  • O(g(n)): running time is upper bounded by g(n)
  • Ω(g(n)): running time is lower bounded by g(n)
  • o(g(n)): running time is asymptotically negligible
    compared to g(n)
  • Õ(g(n)) ("soft-Oh", not commonly used): memory
    usage is upper bounded by g(n), ignoring
    logarithmic factors

6
Paper-specific Notation
  • c_ij is the distance between points i and j
  • d_i is the number of points assigned to median i
  • NOTE: Do not confuse c and d. Presumably the
    distance is called c_ij because a distance can be
    treated as a cost; it would have been more
    intuitive to call it d, from the word "distance".

7
Talk Outline
  1. Goals of the paper
  2. Notation reminder
  3. Clustering With Little Memory
  4. Data Stream Model
  5. Clustering with Data Streams
  6. Lower Bounds and Deterministic Algorithms
  7. Conclusion

8
Clustering with little memory
  • Algorithm SmallSpace(S) (a Python sketch follows
    below):
  1. Divide S into l disjoint pieces X_1, ..., X_l
  2. For each X_i, find O(k) centers in it. Assign
    each point to its closest center.
  3. Let X be the set of O(lk) centers obtained, where
    each center is weighted by the number of points
    assigned to it
  4. Cluster X to find the final k centers
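A minimal Python sketch of SmallSpace under stated assumptions: the function k_centers below is a stand-in farthest-point heuristic, not the paper's actual O(k)-median subroutine, and all names (dist, k_centers, small_space) are illustrative.

import random
from typing import Dict, List, Tuple

Point = Tuple[float, ...]

def dist(a: Point, b: Point) -> float:
    # Euclidean distance; any metric works for the analysis.
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def k_centers(points: List[Point], k: int) -> List[Point]:
    # Stand-in for step 2's clustering subroutine: a simple
    # farthest-point heuristic, chosen only for brevity.
    centers = [points[0]]
    while len(centers) < min(k, len(points)):
        centers.append(max(points, key=lambda p: min(dist(p, c) for c in centers)))
    return centers

def small_space(S: List[Point], l: int, k: int) -> List[Point]:
    # Step 1: divide S into l disjoint pieces X_1..X_l.
    pieces = [S[i::l] for i in range(l)]
    weighted: List[Tuple[Point, int]] = []
    for X_i in pieces:
        if not X_i:
            continue
        # Step 2: find O(k) centers in the piece, assign each point.
        centers = k_centers(X_i, k)
        counts: Dict[Point, int] = {c: 0 for c in centers}
        for p in X_i:
            counts[min(centers, key=lambda c: dist(p, c))] += 1
        # Step 3: weight each center by its assigned points.
        weighted.extend(counts.items())
    # Step 4: cluster X down to k final centers (the toy subroutine
    # ignores the weights; the real algorithm must not).
    return k_centers([c for c, _ in weighted], k)

if __name__ == "__main__":
    random.seed(0)
    data = [(random.gauss(cx, 1.0), random.gauss(cy, 1.0))
            for cx, cy in [(0, 0), (10, 0), (0, 10)] for _ in range(100)]
    print(small_space(data, l=4, k=3))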

9
SmallSpace (2)
(Figure: the pieces X_1, ..., X_l are each reduced to
O(k) weighted centers, which are then clustered into
the final k centers.)
10
SmallSpace analysis
  • Since we want to use as little memory as
    possible, l must be chosen so that each partition
    of S, and the set X itself, fit in main memory.
    However, no such l may exist if S is very large.
  • We will use this algorithm as a starting point
    and improve it until it satisfies all the
    requirements.

11
Theorem 1
  • Given an instance of the k-median problem with a
    solution of cost C, where the medians need not
    belong to the set of input points, there exists a
    solution of cost 2C in which all the medians
    belong to the set of input points (the
    metric-space requirement).

12
Theorem 1 Proof
  • Consider the figure: let m be the true
    (unconstrained) median, and let point 4 be the
    input point closest to m
  • The distance from point 4 to any other point i in
    the data is bounded by c_im + c_m4 (triangle
    inequality), and c_m4 <= c_im since 4 is the
    closest point to m
  • Therefore, the cost of the best median drawn from
    the input points is at most twice the cost of the
    unconstrained median clustering (worst case); the
    chain of inequalities is written out below
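Written out (using the slide's notation, with c_im the distance from point i to the unconstrained median m; naming the closest point 4 follows the figure):

\begin{aligned}
c_{i4} &\le c_{im} + c_{m4} && \text{(triangle inequality)} \\
       &\le 2\,c_{im}       && \text{(point 4 is the input point closest to } m\text{)} \\
\Rightarrow\quad \sum_i c_{i4} &\le 2 \sum_i c_{im} = 2C
\end{aligned}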

13
Theorem 2
  • Consider a set of n points partitioned into
    disjoint sets X_1, ..., X_l. The sum of the
    optimum solution values for the k-median problem
    on the l sets of points is at most twice the cost
    of the optimum k-median solution for all n
    points.

14
Theorem 2 Proof
  • This is Theorem 1 applied across the l sets.
  • Apply Theorem 1 l times and obtain a total cost
    which is at most two times the cost of the case
    where medians are allowed to lie outside the data

15
Theorem 3 (SmallSpace Step 2)
  • If the sum of the costs of the l optimum k-median
    solutions for X_1, ..., X_l is C, and if C* is the
    cost of the optimum k-median solution on the
    entire set S, then there exists a solution of
    cost at most 2(C + C*) to the new weighted
    instance X.

16
Theorem 3 Proof (1)
  • Let i be a point in X (a median obtained by
    SmallSpace)
  • Let π(i) be the point to which i is assigned in
    the optimum continuous solution, and let d_i be
    the number of points assigned to i
  • Then the cost of X is
    cost(X) = Σ_{i in X} d_i · c_{i,π(i)}

17
Theorem 3 Proof (2)
  • Let i be a point in the set S, and let i'(i) be
    the median in X to which it was assigned by
    SmallSpace.
  • Then the cost of X can be written as
    cost(X) = Σ_{i in S} c_{i'(i),π(i'(i))}
    (expanding each median's weight d into one term
    per original point)
  • Let ψ(i) be the median assigned to i in the
    optimal continuous solution on S

18
Theorem 3 Proof (3)
  • Because π is optimal for X, the cost is no more
    than that of assigning each i'(i) to ψ(i):
    cost(X) ≤ Σ_{i in S} c_{i'(i),ψ(i)}
           ≤ Σ_{i in S} (c_{i,i'(i)} + c_{i,ψ(i)})
    by the triangle inequality
  • The last sum evaluates to C + C* in the
    continuous case, or 2(C + C*) in the metric-space
    case
  • Reminder: C is the sum of the costs of the l
    optimum k-median solutions for X_1, ..., X_l, and
    C* is the cost of the optimum k-median solution
    on the entire set S
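The whole argument compresses into one chain of (in)equalities (notation as in the proof; the final doubling for the metric-space case comes from Theorem 1):

\begin{aligned}
\mathrm{cost}(X) &= \sum_{i \in S} c_{i'(i)\,\pi(i'(i))}
  \;\le\; \sum_{i \in S} c_{i'(i)\,\psi(i)} \\
 &\le \sum_{i \in S} \bigl( c_{i\,i'(i)} + c_{i\,\psi(i)} \bigr)
  \;=\; C + C^{*}
\end{aligned}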

19
Theorem 4 (SmallSpace Steps 2, 4)
  • If we modify step 2 to use an (a, b) bicriteria
    approximation algorithm, where at most ak medians
    are output with a cost of at most b times the
    optimal k-median solution, and
  • Modify step 4 to run a c-approximation algorithm,
    then
  • Theorem 4: the algorithm SmallSpace has an
    approximation factor of 2c(1 + 2b) + 2b (not
    proven here)

20
SmallerSpace
  • Algorithm SmallerSpace(S, i):
  1. Divide S into l disjoint pieces X_1, ..., X_l
  2. For each X_i, find O(k) centers in it. Assign
    each point to its closest center.
  3. Let X be the set of O(lk) centers obtained in
    (2), where each center is weighted by the number
    of points assigned to it
  4. Call SmallerSpace(X, i-1)

21
SmallerSpace (2)
A small factor is lost in the approximation with
each level of divide and conquer
  • In general, if memory = n^ε, we need 1/ε levels,
    giving an approximation factor of 2^O(1/ε) (see
    the snippet below)
  • If n = 10^12 and M = 10^6, the regular 2-level
    algorithm suffices
  • If n = 10^12 and M = 10^3, we need 4 levels, for
    an approximation factor of 2^4
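To make the arithmetic in the last two bullets concrete, a tiny Python helper (the constant inside 2^O(1/ε) is taken to be 1, purely as an illustrative assumption):

import math

def levels_and_factor(n: float, M: float):
    # Memory M = n^eps  =>  eps = log M / log n.
    # We need about 1/eps levels and pay roughly a
    # factor of 2 per level (constants omitted).
    eps = math.log(M) / math.log(n)
    levels = round(1 / eps)
    return levels, 2 ** levels

print(levels_and_factor(1e12, 1e6))  # -> (2, 4): the regular 2-level algorithm
print(levels_and_factor(1e12, 1e3))  # -> (4, 16): 4 levels, factor 2^4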

22
SmallerSpace Analysis
  • Theorem 5: For a constant i, SmallerSpace(S, i)
    gives a constant-factor approximation to the
    k-median problem.
  • Proof: The approximation factor at level j
    satisfies A_j = 2A_{j-1}(2b + 1) + 2b (Theorems 2
    and 4), which has the solution
    A_j ≈ c(2(b + 1))^j; this is O(1) if j is
    constant.

23
SmallerSpace Analysis (2)
  • Since all intermediate medians X must be stored
    in memory, the number of subsets l that we
    partition S into is limited.
  • In fact, we need lk < M (where M is the memory
    size), and such an l may not exist

24
Talk Outline
  1. Goals of the paper
  2. Notation reminder
  3. Clustering With Little Memory
  4. Data Stream Model
  5. Clustering with Data Streams
  6. Lower Bounds and Deterministic Algorithms
  7. Conclusion

25
Data Stream Model
  • Data stream: an ordered set of points x_1, ...,
    x_i, ..., x_n
  • Algorithm performance is measured by the number
    of passes over the data, given the constraints on
    available memory
  • Usually the number of points is extremely large,
    so it is impossible to fit all of them in memory
  • Usually, once a point has been read, it is very
    expensive to read it again; most algorithms
    assume the data will not be available for a
    second pass.

26
Data Stream Algorithm
  1. Input the first m points and use a bicriteria
    algorithm to reduce them to O(k) (e.g., 2k)
    points. Weight each intermediate median by the
    number of points assigned to it. (Depending on
    the algorithm used, this takes O(m^2) or O(mk)
    time.)
  2. Repeat (1) until we have seen m^2/(2k) of the
    original data points; at that point we hold m
    first-level medians.
  3. Cluster these m first-level medians into 2k
    second-level medians

27
Data Stream Algorithm (2)
  • Maintain at most m level-i medians; on seeing m
    of them, generate 2k level-(i+1) medians, with
    the weight of each new median the sum of the
    weights of the intermediate medians assigned to
    it (a Python sketch of the whole cascade follows
    below)
  • When we have seen all the data points, or when we
    decide to stop, we cluster all intermediate
    medians into the k final medians
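A minimal, self-contained Python sketch of this cascade, assuming a stand-in reduce_to_2k subroutine (a farthest-point heuristic; the paper uses a bicriteria/local-search algorithm here, and all names below are illustrative):

import random
from typing import Dict, List, Tuple

Point = Tuple[float, ...]
Weighted = Tuple[Point, int]  # (median, weight)

def dist(a: Point, b: Point) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def reduce_to_2k(batch: List[Weighted], k: int) -> List[Weighted]:
    # Stand-in for the bicriteria step: pick 2k medians with a
    # farthest-point heuristic and fold each weight into the
    # nearest pick.
    centers = [batch[0][0]]
    while len(centers) < min(2 * k, len(batch)):
        centers.append(max((p for p, _ in batch),
                           key=lambda p: min(dist(p, c) for c in centers)))
    weight: Dict[Point, int] = {c: 0 for c in centers}
    for p, w in batch:
        weight[min(centers, key=lambda c: dist(p, c))] += w
    return list(weight.items())

def stream_k_medians(stream, k: int, m: int) -> List[Weighted]:
    # levels[i] buffers at most m level-i weighted medians.
    levels: List[List[Weighted]] = [[]]
    for x in stream:
        levels[0].append((x, 1))
        i = 0
        # Whenever a level fills up, compress it into 2k medians
        # one level higher; the compression can cascade.
        while len(levels[i]) >= m:
            reduced = reduce_to_2k(levels[i], k)
            levels[i] = []
            if i + 1 == len(levels):
                levels.append([])
            levels[i + 1].extend(reduced)
            i += 1
    # Final step: cluster everything still buffered into k medians
    # (the real algorithm runs a proper k-median subroutine here).
    leftovers = [wp for level in levels for wp in level]
    return reduce_to_2k(leftovers, k)[:k]

if __name__ == "__main__":
    random.seed(1)
    pts = [(random.gauss(c, 1.0),) for c in (0.0, 50.0, 100.0) for _ in range(2000)]
    random.shuffle(pts)
    print(stream_k_medians(iter(pts), k=3, m=60))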

28
Data Stream Algorithm (3)
(Figure: a cascade of reductions. At each level, a
buffer of m weighted medians is clustered down to 2k
medians of the next level; the outputs of level 2 feed
level 3, and so on up to level i, after which the last
2k medians are clustered into the final k.)

29
Data Stream Algorithm Analysis
  • The algorithm requires O(log(n/m) / log(m/k))
    levels
  • If k is much smaller than m, and m = O(n^ε) for
    ε < 1, the algorithm uses
  • Õ(n^ε) space
  • O(n^(1+ε)) running time
  • up to a 2^O(1/ε) approximation factor (a constant
    factor approximation)

30
Talk Outline
  1. Goals of the paper
  2. Notation reminder
  3. Clustering With Little Memory
  4. Data Stream Model
  5. Clustering with Data Streams
  6. Lower Bounds and Deterministic Algorithms
  7. Conclusion

31
Randomized Algorithm
  1. Draw a sample of size s = (nk)^(1/2)
  2. Find k medians from these s points using a
    primal-dual algorithm
  3. Assign each of the original points to its
    closest median
  4. Collect the n/s points with the largest
    assignment distances
  5. Find k medians from among these n/s points
  6. At this point we have 2k medians (a sketch
    follows below)
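A short Python sketch of these six steps, with a farthest-point heuristic standing in for the primal-dual subroutine (all function names are illustrative assumptions):

import random
from typing import List, Tuple

Point = Tuple[float, ...]

def dist(a: Point, b: Point) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def k_medians_heuristic(points: List[Point], k: int) -> List[Point]:
    # Stand-in for the primal-dual algorithm of steps 2 and 5.
    centers = [points[0]]
    while len(centers) < min(k, len(points)):
        centers.append(max(points, key=lambda p: min(dist(p, c) for c in centers)))
    return centers

def randomized_2k(points: List[Point], k: int) -> List[Point]:
    n = len(points)
    s = max(k, int((n * k) ** 0.5))              # step 1: sample size (nk)^(1/2)
    sample = random.sample(points, min(s, n))
    medians = k_medians_heuristic(sample, k)     # step 2
    # Steps 3-4: rank points by distance to their closest median
    # and keep the n/s farthest ones.
    by_distance = sorted(points, reverse=True,
                         key=lambda p: min(dist(p, c) for c in medians))
    outliers = by_distance[: max(1, n // s)]
    medians += k_medians_heuristic(outliers, k)  # step 5
    return medians                               # step 6: 2k medians

if __name__ == "__main__":
    random.seed(2)
    data = [(random.gauss(c, 1.0), random.gauss(c, 1.0))
            for c in (0, 20, 40) for _ in range(400)]
    print(randomized_2k(data, k=3))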

32
Randomized Algorithm Analysis
  • The algorithm gives an O(1) approximation with 2k
    medians, with constant probability.
  • O(log n) passes are needed for high-probability
    results
  • Õ(nk) time and space
  • The space can be improved to O((nk)^(1/2))

33
Full Algorithm
  1. Input the first O(M/k) points, then use the
    randomized algorithm to find 2k intermediate
    median points
  2. Use a local search algorithm to cluster O(M)
    intermediate median points of level i into 2k
    medians of level i+1
  3. Use the primal-dual algorithm to cluster the
    final O(k) medians into k medians

34
Full Algorithm (2)
  • The full algorithm is still one pass (we call the
    randomized algorithm only once per input batch)
  • Step 1 takes Õ(nk) time in total (Õ(M) per batch
    of O(M/k) points, over nk/M batches)
  • Step 2 is O(nk)
  • Therefore, the final cost is Õ(nk)

35
Talk Outline
  1. Goals of the paper
  2. Notation reminder
  3. Clustering With Little Memory
  4. Data Stream Model
  5. Clustering with Data Streams
  6. Lower Bounds and Deterministic Algorithms
  7. Conclusion

36
Lower Bounds
  • Consider a clustering instance where the distance
    between two points is 0 if they belong to the
    same cluster and 1 otherwise
  • An algorithm is not constant-factor if it fails
    to discover a clustering of cost 0 (any nonzero
    cost is unboundedly worse than the optimum)
  • Finding such a clustering is equivalent to the
    following: in a complete k-partite graph G, for
    some k, find the k-partition of the vertices of G
    into independent sets.
  • The best algorithm for this requires Ω(nk)
    queries, which therefore lower-bounds any
    constant-factor clustering algorithm

37
Deterministic Algorithms A1
  1. Partition the n original points into p_1 subsets
  2. Apply the primal-dual algorithm to each subset
    (time quadratic in the subset size)
  3. Apply it again to the p_1·k weighted medians
    obtained in (2) to get the final k medians

38
A1 Details
  • If we choose the number of subsets to be
    p_1 = (n/k)^(2/3), we have
  • O(n^(4/3) k^(2/3)) runtime and space (a worked
    check follows below)
  • a 4c^2 + 4c approximation factor by Theorem 4,
    where c is the approximation factor of the
    primal-dual algorithm
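The exponents can be checked directly: each of the p_1 subsets has n/p_1 points, the primal-dual step is quadratic in its input size, and step 3 runs on p_1·k weighted points:

\begin{aligned}
p_1 \left(\tfrac{n}{p_1}\right)^2 &= \frac{n^2}{(n/k)^{2/3}} = n^{4/3} k^{2/3}, \\
(p_1 k)^2 &= \left((n/k)^{2/3}\,k\right)^2 = n^{4/3} k^{2/3}
\end{aligned}

Both phases balance at O(n^(4/3) k^(2/3)), which is why this choice of p_1 minimizes the total.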

39
Deterministic Algorithms A2
  1. Split the dataset into p_2 partitions
  2. Apply A1 to each of them
  3. Apply A1 to all the intermediate medians from (2)

40
A2 Details
  • If we choose the number of subsets to be
    p_2 = (n/k)^(4/5) in order to minimize the
    running time, we have
  • O(n^(16/15) k^(14/15)) runtime and space
  • We can see a trend!

41
Deterministic Algorithm
  • Create algorithm A_i that calls A_{i-1} on p_i
    partitions
  • Then the complexity in both time and space of
    this algorithm will be
    O(n^(1 + 1/(2^(2^i) - 1)) · k^(1 - 1/(2^(2^i) - 1)))
    (matching A1 for i = 1 and A2 for i = 2)

42
Deterministic Algorithm (2)
  • The approximation factor grows with i; however,
  • We can set i = Θ(log log log n) in order to get
    the exponent of n in the running time to be 1.

43
Deterministic Algorithm (3)
  • This gives an algorithm running in Õ(nk) space
    and time.

44
Talk Outline
  1. Goals of the paper
  2. Notation reminder
  3. Clustering With Little Memory
  4. Data Stream Model
  5. Clustering with Data Streams
  6. Lower Bounds and Deterministic Algorithms
  7. Conclusion

45
Conclusion
  • We have presented a variety of algorithms
    designed for clustering in systems where the
    amount of data is huge
  • All the algorithms presented are approximations
    to the k-median problem

46
References
  1. Eric W. Weisstein. "Complete k-Partite Graph."
    From MathWorld - A Wolfram Web Resource.
    http://mathworld.wolfram.com/Completek-PartiteGraph.html
  2. http://theory.stanford.edu/~nmishra/CS361-2002/lecture9-nina.ppt