1
Clustering Data Streams
  • A presentation by George Toderici

2
Talk Outline
  1. Goals of the paper
  2. Notation reminder
  3. Clustering With Little Memory
  4. Data Stream Model
  5. Clustering with Data Streams
  6. Lower Bounds and Deterministic Algorithms
  7. Conclusion

3
Goals of the paper
  • Since the k-median problem is NP-hard, this paper
    develops approximation algorithms under the
    following constraints:
  • Minimize memory usage
  • Minimize CPU usage
  • Work both on general metric spaces and on the
    special case of Euclidean space

4
Talk Outline
  1. Goals of the paper
  2. Notation reminder
  3. Clustering With Little Memory
  4. Data Stream Model
  5. Clustering with Data Streams
  6. Lower Bounds and Deterministic Algorithms
  7. Conclusion

5
Notation Reminder
  • O(g(n)): running time is upper bounded by g(n)
  • Ω(g(n)): running time is lower bounded by g(n)
  • o(g(n)): running time is asymptotically negligible
    compared to g(n)
  • Õ(g(n)) ("soft-Oh", not commonly used): memory
    usage is upper bounded by g(n), ignoring
    logarithmic factors

6
Paper-specific Notation
  • c_ij is the distance between points i and j
  • d_i is the number of points assigned to median i
  • NOTE: Do not confuse c and d. Presumably the
    distance is called c_ij because a distance can be
    treated as a cost; it would have been more
    intuitive to call it d, from the word "distance".

7
Talk Outline
  1. Goals of the paper
  2. Notation reminder
  3. Clustering With Little Memory
  4. Data Stream Model
  5. Clustering with Data Streams
  6. Lower Bounds and Deterministic Algorithms
  7. Conclusion

8
Clustering with little memory
  • Algorithm SmallSpace(S) (a Python sketch follows
    below):
  1. Divide S into l disjoint pieces X_1, ..., X_l
  2. For each X_i, find O(k) centers in it. Assign
    each point to its closest center.
  3. Let X be the set of O(lk) centers obtained, where
    each center is weighted by the number of points
    assigned to it
  4. Cluster X to find the final k centers
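A minimal Python sketch of SmallSpace under stated assumptions: the function k_centers below is a stand-in farthest-point heuristic, not the paper's actual O(k)-median subroutine, and all names (dist, k_centers, small_space) are illustrative.

import random
from typing import Dict, List, Tuple

Point = Tuple[float, ...]

def dist(a: Point, b: Point) -> float:
    # Euclidean distance; any metric works for the analysis.
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def k_centers(points: List[Point], k: int) -> List[Point]:
    # Stand-in for step 2's clustering subroutine: a simple
    # farthest-point heuristic, chosen only for brevity.
    centers = [points[0]]
    while len(centers) < min(k, len(points)):
        centers.append(max(points, key=lambda p: min(dist(p, c) for c in centers)))
    return centers

def small_space(S: List[Point], l: int, k: int) -> List[Point]:
    # Step 1: divide S into l disjoint pieces X_1..X_l.
    pieces = [S[i::l] for i in range(l)]
    weighted: List[Tuple[Point, int]] = []
    for X_i in pieces:
        if not X_i:
            continue
        # Step 2: find O(k) centers in the piece, assign each point.
        centers = k_centers(X_i, k)
        counts: Dict[Point, int] = {c: 0 for c in centers}
        for p in X_i:
            counts[min(centers, key=lambda c: dist(p, c))] += 1
        # Step 3: weight each center by its assigned points.
        weighted.extend(counts.items())
    # Step 4: cluster X down to k final centers (the toy subroutine
    # ignores the weights; the real algorithm must not).
    return k_centers([c for c, _ in weighted], k)

if __name__ == "__main__":
    random.seed(0)
    data = [(random.gauss(cx, 1.0), random.gauss(cy, 1.0))
            for cx, cy in [(0, 0), (10, 0), (0, 10)] for _ in range(100)]
    print(small_space(data, l=4, k=3))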

9
SmallSpace (2)
(Figure: the pieces X_1, ..., X_l are each reduced to
O(k) weighted centers, which are then clustered into
the final k centers.)
10
SmallSpace analysis
  • Since we want to use as little memory as
    possible, l must be chosen so that each partition
    of S, and the set X itself, fit in main memory.
    However, no such l may exist if S is very large.
  • We will use this algorithm as a starting point
    and improve it until it satisfies all the
    requirements.

11
Theorem 1
  • Given an instance of the k-median problem with a
    solution of cost C, where the medians need not
    belong to the set of input points, there exists a
    solution of cost 2C in which all the medians
    belong to the set of input points (the
    metric-space requirement).

12
Theorem 1 Proof
  • Consider the figure: let m be the true
    (unconstrained) median, and let point 4 be the
    input point closest to m
  • The distance from point 4 to any other point i in
    the data is bounded by c_im + c_m4 (triangle
    inequality), and c_m4 <= c_im since 4 is the
    closest point to m
  • Therefore, the cost of the best median drawn from
    the input points is at most twice the cost of the
    unconstrained median clustering (worst case); the
    chain of inequalities is written out below
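Written out (using the slide's notation, with c_im the distance from point i to the unconstrained median m; naming the closest point 4 follows the figure):

\begin{aligned}
c_{i4} &\le c_{im} + c_{m4} && \text{(triangle inequality)} \\
       &\le 2\,c_{im}       && \text{(point 4 is the input point closest to } m\text{)} \\
\Rightarrow\quad \sum_i c_{i4} &\le 2 \sum_i c_{im} = 2C
\end{aligned}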

13
Theorem 2
  • Consider a set of n points partitioned into
    disjoint sets X_1, ..., X_l. The sum of the
    optimum solution values for the k-median problem
    on the l sets of points is at most twice the cost
    of the optimum k-median solution for all n
    points.

14
Theorem 2 Proof
  • This is Theorem 1 applied across the l sets.
  • Apply Theorem 1 l times and obtain a total cost
    which is at most two times the cost of the case
    where medians are allowed to lie outside the data

15
Theorem 3 (SmallSpace Step 2)
  • If the sum of the costs of the l optimum k-median
    solutions for X_1, ..., X_l is C, and if C* is the
    cost of the optimum k-median solution on the
    entire set S, then there exists a solution of
    cost at most 2(C + C*) to the new weighted
    instance X.

16
Theorem 3 Proof (1)
  • Let i be a point in X (a median obtained by
    SmallSpace)
  • Let π(i) be the point to which i is assigned in
    the optimum continuous solution, and let d_i be
    the number of points assigned to i
  • Then the cost of X is
    cost(X) = Σ_{i in X} d_i · c_{i,π(i)}

17
Theorem 3 Proof (2)
  • Let i be a point in the set S, and let i'(i) be
    the median in X to which it was assigned by
    SmallSpace.
  • Then the cost of X can be written as
    cost(X) = Σ_{i in S} c_{i'(i),π(i'(i))}
    (expanding each median's weight d into one term
    per original point)
  • Let ψ(i) be the median assigned to i in the
    optimal continuous solution on S

18
Theorem 3 Proof (3)
  • Because π is optimal for X, the cost is no more
    than that of assigning each i'(i) to ψ(i):
    cost(X) ≤ Σ_{i in S} c_{i'(i),ψ(i)}
           ≤ Σ_{i in S} (c_{i,i'(i)} + c_{i,ψ(i)})
    by the triangle inequality
  • The last sum evaluates to C + C* in the
    continuous case, or 2(C + C*) in the metric-space
    case
  • Reminder: C is the sum of the costs of the l
    optimum k-median solutions for X_1, ..., X_l, and
    C* is the cost of the optimum k-median solution
    on the entire set S
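The whole argument compresses into one chain of (in)equalities (notation as in the proof; the final doubling for the metric-space case comes from Theorem 1):

\begin{aligned}
\mathrm{cost}(X) &= \sum_{i \in S} c_{i'(i)\,\pi(i'(i))}
  \;\le\; \sum_{i \in S} c_{i'(i)\,\psi(i)} \\
 &\le \sum_{i \in S} \bigl( c_{i\,i'(i)} + c_{i\,\psi(i)} \bigr)
  \;=\; C + C^{*}
\end{aligned}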

19
Theorem 4 (SmallSpace Steps 2, 4)
  • If we modify step 2 to use an (a, b) bicriteria
    approximation algorithm, where at most ak medians
    are output with a cost of at most b times the
    optimal k-median solution, and
  • Modify step 4 to run a c-approximation algorithm,
    then
  • Theorem 4: the algorithm SmallSpace has an
    approximation factor of 2c(1 + 2b) + 2b (not
    proven here)

20
SmallerSpace
  • Algorithm SmallerSpace(S, i):
  1. Divide S into l disjoint pieces X_1, ..., X_l
  2. For each X_i, find O(k) centers in it. Assign
    each point to its closest center.
  3. Let X be the set of O(lk) centers obtained in
    (2), where each center is weighted by the number
    of points assigned to it
  4. Call SmallerSpace(X, i-1)

21
SmallerSpace (2)
A small factor is lost in the approximation with
each level of divide and conquer
  • In general, if memory = n^ε, we need 1/ε levels,
    giving an approximation factor of 2^O(1/ε) (see
    the snippet below)
  • If n = 10^12 and M = 10^6, the regular 2-level
    algorithm suffices
  • If n = 10^12 and M = 10^3, we need 4 levels, for
    an approximation factor of 2^4
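To make the arithmetic in the last two bullets concrete, a tiny Python helper (the constant inside 2^O(1/ε) is taken to be 1, purely as an illustrative assumption):

import math

def levels_and_factor(n: float, M: float):
    # Memory M = n^eps  =>  eps = log M / log n.
    # We need about 1/eps levels and pay roughly a
    # factor of 2 per level (constants omitted).
    eps = math.log(M) / math.log(n)
    levels = round(1 / eps)
    return levels, 2 ** levels

print(levels_and_factor(1e12, 1e6))  # -> (2, 4): the regular 2-level algorithm
print(levels_and_factor(1e12, 1e3))  # -> (4, 16): 4 levels, factor 2^4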

22
SmallerSpace Analysis
  • Theorem 5: For a constant i, SmallerSpace(S, i)
    gives a constant-factor approximation to the
    k-median problem.
  • Proof: The approximation factor at level j
    satisfies A_j = 2A_{j-1}(2b + 1) + 2b (Theorems 2
    and 4), which has the solution
    A_j ≈ c(2(b + 1))^j; this is O(1) if j is
    constant.

23
SmallerSpace Analysis (2)
  • Since all intermediate medians X must be stored
    in memory, the number of subsets l that we
    partition S into is limited.
  • In fact, we need lk < M (where M is the memory
    size), and such an l may not exist

24
Talk Outline
  1. Goals of the paper
  2. Notation reminder
  3. Clustering With Little Memory
  4. Data Stream Model
  5. Clustering with Data Streams
  6. Lower Bounds and Deterministic Algorithms
  7. Conclusion

25
Data Stream Model
  • Data stream: an ordered set of points x_1, ...,
    x_i, ..., x_n
  • Algorithm performance is measured by the number
    of passes over the data, given the constraints on
    available memory
  • Usually the number of points is extremely large,
    so it is impossible to fit all of them in memory
  • Usually, once a point has been read, it is very
    expensive to read it again; most algorithms
    assume the data will not be available for a
    second pass.

26
Data Stream Algorithm
  1. Input the first m points and use a bicriteria
    algorithm to reduce them to O(k) (e.g., 2k)
    points. Weight each intermediate median by the
    number of points assigned to it. (Depending on
    the algorithm used, this takes O(m^2) or O(mk)
    time.)
  2. Repeat (1) until we have seen m^2/(2k) of the
    original data points; at that point we hold m
    first-level medians.
  3. Cluster these m first-level medians into 2k
    second-level medians

27
Data Stream Algorithm (2)
  • Maintain at most m level-i medians; on seeing m
    of them, generate 2k level-(i+1) medians, with
    the weight of each new median the sum of the
    weights of the intermediate medians assigned to
    it (a Python sketch of the whole cascade follows
    below)
  • When we have seen all the data points, or when we
    decide to stop, we cluster all intermediate
    medians into the k final medians
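A minimal, self-contained Python sketch of this cascade, assuming a stand-in reduce_to_2k subroutine (a farthest-point heuristic; the paper uses a bicriteria/local-search algorithm here, and all names below are illustrative):

import random
from typing import Dict, List, Tuple

Point = Tuple[float, ...]
Weighted = Tuple[Point, int]  # (median, weight)

def dist(a: Point, b: Point) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def reduce_to_2k(batch: List[Weighted], k: int) -> List[Weighted]:
    # Stand-in for the bicriteria step: pick 2k medians with a
    # farthest-point heuristic and fold each weight into the
    # nearest pick.
    centers = [batch[0][0]]
    while len(centers) < min(2 * k, len(batch)):
        centers.append(max((p for p, _ in batch),
                           key=lambda p: min(dist(p, c) for c in centers)))
    weight: Dict[Point, int] = {c: 0 for c in centers}
    for p, w in batch:
        weight[min(centers, key=lambda c: dist(p, c))] += w
    return list(weight.items())

def stream_k_medians(stream, k: int, m: int) -> List[Weighted]:
    # levels[i] buffers at most m level-i weighted medians.
    levels: List[List[Weighted]] = [[]]
    for x in stream:
        levels[0].append((x, 1))
        i = 0
        # Whenever a level fills up, compress it into 2k medians
        # one level higher; the compression can cascade.
        while len(levels[i]) >= m:
            reduced = reduce_to_2k(levels[i], k)
            levels[i] = []
            if i + 1 == len(levels):
                levels.append([])
            levels[i + 1].extend(reduced)
            i += 1
    # Final step: cluster everything still buffered into k medians
    # (the real algorithm runs a proper k-median subroutine here).
    leftovers = [wp for level in levels for wp in level]
    return reduce_to_2k(leftovers, k)[:k]

if __name__ == "__main__":
    random.seed(1)
    pts = [(random.gauss(c, 1.0),) for c in (0.0, 50.0, 100.0) for _ in range(2000)]
    random.shuffle(pts)
    print(stream_k_medians(iter(pts), k=3, m=60))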

28
Data Stream Algorithm (3)
(Figure: a cascade of reductions. At each level, a
buffer of m weighted medians is clustered down to 2k
medians of the next level; the outputs of level 2 feed
level 3, and so on up to level i, after which the last
2k medians are clustered into the final k.)

29
Data Stream Algorithm Analysis
  • The algorithm requires O(log(n/m) / log(m/k))
    levels
  • If k is much smaller than m, and m = O(n^ε) for
    ε < 1, the algorithm uses
  • Õ(n^ε) space
  • O(n^(1+ε)) running time
  • up to a 2^O(1/ε) approximation factor (a constant
    factor approximation)

30
Talk Outline
  1. Goals of the paper
  2. Notation reminder
  3. Clustering With Little Memory
  4. Data Stream Model
  5. Clustering with Data Streams
  6. Lower Bounds and Deterministic Algorithms
  7. Conclusion

31
Randomized Algorithm
  1. Draw a sample of size s = (nk)^(1/2)
  2. Find k medians from these s points using a
    primal-dual algorithm
  3. Assign each of the original points to its
    closest median
  4. Collect the n/s points with the largest
    assignment distances
  5. Find k medians from among these n/s points
  6. At this point we have 2k medians (a sketch
    follows below)
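A short Python sketch of these six steps, with a farthest-point heuristic standing in for the primal-dual subroutine (all function names are illustrative assumptions):

import random
from typing import List, Tuple

Point = Tuple[float, ...]

def dist(a: Point, b: Point) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def k_medians_heuristic(points: List[Point], k: int) -> List[Point]:
    # Stand-in for the primal-dual algorithm of steps 2 and 5.
    centers = [points[0]]
    while len(centers) < min(k, len(points)):
        centers.append(max(points, key=lambda p: min(dist(p, c) for c in centers)))
    return centers

def randomized_2k(points: List[Point], k: int) -> List[Point]:
    n = len(points)
    s = max(k, int((n * k) ** 0.5))              # step 1: sample size (nk)^(1/2)
    sample = random.sample(points, min(s, n))
    medians = k_medians_heuristic(sample, k)     # step 2
    # Steps 3-4: rank points by distance to their closest median
    # and keep the n/s farthest ones.
    by_distance = sorted(points, reverse=True,
                         key=lambda p: min(dist(p, c) for c in medians))
    outliers = by_distance[: max(1, n // s)]
    medians += k_medians_heuristic(outliers, k)  # step 5
    return medians                               # step 6: 2k medians

if __name__ == "__main__":
    random.seed(2)
    data = [(random.gauss(c, 1.0), random.gauss(c, 1.0))
            for c in (0, 20, 40) for _ in range(400)]
    print(randomized_2k(data, k=3))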

32
Randomized Algorithm Analysis
  • The algorithm gives an O(1) approximation with 2k
    medians, with constant probability.
  • O(log n) passes are needed for high-probability
    results
  • Õ(nk) time and space
  • The space can be improved to O((nk)^(1/2))

33
Full Algorithm
  1. Input the first O(M/k) points, then use the
    randomized algorithm to find 2k intermediate
    median points
  2. Use a local search algorithm to cluster O(M)
    intermediate median points of level i into 2k
    medians of level i+1
  3. Use the primal-dual algorithm to cluster the
    final O(k) medians into k medians

34
Full Algorithm (2)
  • The full algorithm is still one pass (we call the
    randomized algorithm only once per input batch)
  • Step 1 takes Õ(nk) time in total (Õ(M) per batch
    of O(M/k) points, over nk/M batches)
  • Step 2 is O(nk)
  • Therefore, the final cost is Õ(nk)

35
Talk Outline
  1. Goals of the paper
  2. Notation reminder
  3. Clustering With Little Memory
  4. Data Stream Model
  5. Clustering with Data Streams
  6. Lower Bounds and Deterministic Algorithms
  7. Conclusion

36
Lower Bounds
  • Consider a clustering instance where the distance
    between two points is 0 if they belong to the
    same cluster and 1 otherwise
  • An algorithm is not constant-factor if it fails
    to discover a clustering of cost 0 (any nonzero
    cost is unboundedly worse than the optimum)
  • Finding such a clustering is equivalent to the
    following: in a complete k-partite graph G, for
    some k, find the k-partition of the vertices of G
    into independent sets.
  • The best algorithm for this requires Ω(nk)
    queries, which therefore lower-bounds any
    constant-factor clustering algorithm

37
Deterministic Algorithms A1
  1. Partition the n original points into p_1 subsets
  2. Apply the primal-dual algorithm to each subset
    (time quadratic in the subset size)
  3. Apply it again to the p_1·k weighted medians
    obtained in (2) to get the final k medians

38
A1 Details
  • If we choose the number of subsets to be
    p_1 = (n/k)^(2/3), we have
  • O(n^(4/3) k^(2/3)) runtime and space (a worked
    check follows below)
  • a 4c^2 + 4c approximation factor by Theorem 4,
    where c is the approximation factor of the
    primal-dual algorithm
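The exponents can be checked directly: each of the p_1 subsets has n/p_1 points, the primal-dual step is quadratic in its input size, and step 3 runs on p_1·k weighted points:

\begin{aligned}
p_1 \left(\tfrac{n}{p_1}\right)^2 &= \frac{n^2}{(n/k)^{2/3}} = n^{4/3} k^{2/3}, \\
(p_1 k)^2 &= \left((n/k)^{2/3}\,k\right)^2 = n^{4/3} k^{2/3}
\end{aligned}

Both phases balance at O(n^(4/3) k^(2/3)), which is why this choice of p_1 minimizes the total.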

39
Deterministic Algorithms A2
  1. Split the dataset into p_2 partitions
  2. Apply A1 to each of them
  3. Apply A1 to all the intermediate medians from (2)

40
A2 Details
  • If we choose the number of subsets to be
    p_2 = (n/k)^(4/5) in order to minimize the
    running time, we have
  • O(n^(16/15) k^(14/15)) runtime and space
  • We can see a trend!

41
Deterministic Algorithm
  • Create algorithm A_i that calls A_{i-1} on p_i
    partitions
  • Then the complexity in both time and space of
    this algorithm will be
    O(n^(1 + 1/(2^(2^i) - 1)) · k^(1 - 1/(2^(2^i) - 1)))
    (matching A1 for i = 1 and A2 for i = 2)

42
Deterministic Algorithm (2)
  • The approximation factor grows with i; however,
  • We can set i = Θ(log log log n) in order to get
    the exponent of n in the running time to be 1.

43
Deterministic Algorithm (3)
  • This gives an algorithm running in Õ(nk) space
    and time.

44
Talk Outline
  1. Goals of the paper
  2. Notation reminder
  3. Clustering With Little Memory
  4. Data Stream Model
  5. Clustering with Data Streams
  6. Lower Bounds and Deterministic Algorithms
  7. Conclusion

45
Conclusion
  • We have presented a variety of algorithms
    designed for clustering in systems where the
    amount of data is huge
  • All the algorithms presented are approximations
    to the k-median problem

46
References
  1. Eric W. Weisstein. "Complete k-Partite Graph."
    From MathWorld - A Wolfram Web Resource.
    http://mathworld.wolfram.com/Completek-PartiteGraph.html
  2. http://theory.stanford.edu/~nmishra/CS361-2002/lecture9-nina.ppt