Title: On the Lower Bound of Local Optimum in K-Means Algorithm
- Zhang Zhenjie, Dai Bing Tian, and Anthony K.H. Tung
2. Outline
- Introduction
- Maximal Region
- Algorithms
- Experiments
- Conclusion and Future Work
3. Introduction
- K-Means Algorithm
- Pick k centers randomly
- K-Means Iterations
- Assign every point to the closest center
- Compute the center of every cluster to replace
the old one
- Stop the algorithm if the centers are stable
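The steps on this slide can be sketched in plain Python as follows. This is a minimal illustration, not the authors' implementation: the function names, the `seed` and `max_iter` parameters, and the rule that an empty cluster keeps its old center are all assumptions made for the example.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two points given as sequences."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def assign(points, centers):
    """Index of the closest center for every point."""
    return [
        min(range(len(centers)), key=lambda j: squared_distance(p, centers[j]))
        for p in points
    ]


def recompute_centers(points, labels, centers):
    """Mean of every cluster; an empty cluster keeps its old center (assumption)."""
    new_centers = []
    for j, old in enumerate(centers):
        members = [p for p, lab in zip(points, labels) if lab == j]
        if members:
            dim = len(old)
            new_centers.append(
                tuple(sum(p[i] for p in members) / len(members) for i in range(dim))
            )
        else:
            new_centers.append(tuple(old))
    return new_centers


def kmeans(points, k, seed=0, max_iter=100):
    rng = random.Random(seed)
    centers = [tuple(p) for p in rng.sample(points, k)]  # pick k centers randomly
    for _ in range(max_iter):
        labels = assign(points, centers)                  # assign to closest center
        new_centers = recompute_centers(points, labels, centers)
        if new_centers == centers:                        # centers are stable
            break
        centers = new_centers
    return centers, assign(points, centers)
```

Calling `kmeans(points, 2)` on a list of coordinate tuples returns the final centers and one cluster index per point.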
4. Introduction (cont.)
- Cost
- Sum of the squared distance from every point to
its closest center
- Cost decreases after every k-means iteration
- Global Optimum
- Centers minimizing the cost
- Local Optimum
- Centers returned by k-means from a given choice of initial
centers
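The cost definition, and the claim that the cost decreases after every iteration, can be checked on a toy example. The sketch below is illustrative only: the six points and the deliberately poor starting centers are made up for the example, and `one_iteration` is a hypothetical helper name.

```python
def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def cost(points, centers):
    """Sum over all points of the squared distance to the closest center."""
    return sum(min(squared_distance(p, c) for c in centers) for p in points)


def one_iteration(points, centers):
    """One k-means iteration: reassign, then replace each center by its cluster mean."""
    clusters = [[] for _ in centers]
    for p in points:
        j = min(range(len(centers)), key=lambda i: squared_distance(p, centers[i]))
        clusters[j].append(p)
    return [
        tuple(sum(xs) / len(members) for xs in zip(*members)) if members else tuple(c)
        for members, c in zip(clusters, centers)
    ]


# Made-up data: two well-separated groups of three points each.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centers = [(0.0, 0.0), (0.0, 1.0)]  # poor start: both centers in the same group

costs = [cost(points, centers)]
for _ in range(5):
    centers = one_iteration(points, centers)
    costs.append(cost(points, centers))
# costs never increases from one entry to the next
```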
5. Introduction (cont.)
- Disadvantages of Local Optimum
- Can be much worse than the global optimum
- Requires re-running the algorithm with different initial
centers
- Wastes computation resources
- Solution?
- Find center set leading to global optimum?
- Detect local optimum with large cost as early as
possible? (the target of our paper)
6. Introduction (cont.)
- A simple solution for early detection
- Stop when the decrease in cost after one iteration is small
[Figure: cost versus iteration]
7. Introduction (cont.)
- A simple solution for early detection
- A much better local optimum is missed
[Figure: cost versus iteration]
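The simple stopping rule, and the way it can miss a better solution, can be sketched as below. The function name `stop_index`, the tolerance `tol`, and the cost trace are hypothetical; the numbers are invented to show a plateau followed by a late drop, and are not measurements.

```python
def stop_index(costs, tol):
    """First iteration t >= 1 whose decrease costs[t-1] - costs[t] is below tol.

    Returns None if no iteration in the trace triggers the rule.
    """
    for t in range(1, len(costs)):
        if costs[t - 1] - costs[t] < tol:
            return t
    return None


# Invented cost trace: a plateau around 59, then a large drop later on.
trace = [100.0, 60.0, 59.5, 59.2, 30.0, 29.9]
t = stop_index(trace, tol=1.0)
stopped_cost = trace[t]    # cost at the point where the rule stops
reachable_cost = min(trace)  # cost the run would have reached if continued
```

On this trace the rule stops on the plateau, well above the cost reached a few iterations later.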
8. Introduction (cont.)
- A lower bound is derived and used to estimate
the potential of the current clustering
[Figure: cost versus iteration]
9. Introduction (cont.)
- If the yellow curve represents the current
best solution, we can stop the computation here
[Figure: cost versus iteration]
11. Solution Space
- Given a d-dimensional problem space, we define
the solution space as a kd-dimensional space
[Figure: centers c1 and c2, and the center set M1]
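One way to picture the kd-dimensional solution space is to concatenate the k centers, each with d coordinates, into a single vector. The helper names below are hypothetical and the concatenation order is an assumption; the slides do not fix a particular encoding.

```python
def to_solution_vector(centers):
    """Concatenate k centers of dimension d into one vector of length k*d."""
    return [x for c in centers for x in c]


def from_solution_vector(vec, d):
    """Split a vector of length k*d back into k centers of dimension d."""
    return [tuple(vec[i:i + d]) for i in range(0, len(vec), d)]


# k = 3 centers in a d = 2 dimensional problem space
centers = [(1.0, 2.0), (3.0, 4.0), (5.0, 6.0)]
point_in_solution_space = to_solution_vector(centers)  # length k*d = 6
```

Each k-means iteration then moves a single point in this space.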
12. Solution Space
- Over the iterations, the center set jumps in the
solution space
[Figure: centers c1 and c2, and the successive center sets M1, M2, M3]
13. Definition of Maximal Region
- A maximal region is a region in the solution space
that covers the local optimum reached by future
iterations
- Two problems
- How to find such a maximal region
- How to lower bound the cost of any solution in
the maximal region
14. Maximal Region
- The cost of center sets in the solution space
is represented by contour lines, with lighter color
meaning smaller cost
- Any solution between M1 and M2 must have
smaller cost than M1
[Figure: cost contours in the solution space, with labels c1, c2, M1, M2]
15. Maximal Region
- Maximal region of the local optimum: the region in which
the local optimum must lie
[Figure: solution space with labels c1, c2, M1, M2]
16. Maximal Region
- A region is a maximal region for center set M if
- It contains M
- Any solution on the boundary of the region has
cost equal to that of M
17. A Special Maximal Region
- Every center moves no more than Delta
[Figure: solution space with labels c1, c2, M1, M2]
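Membership in this special region can be tested directly: a candidate center set belongs to it when no center has moved farther than Delta from its counterpart. In the sketch below the function names are hypothetical, and matching centers by list position is an assumption.

```python
import math


def max_center_shift(m, m_prime):
    """Largest Euclidean distance between corresponding centers of two center sets."""
    return max(math.dist(a, b) for a, b in zip(m, m_prime))


def in_region(m, m_prime, delta):
    """True if every center of m_prime lies within delta of the matching center of m."""
    return max_center_shift(m, m_prime) <= delta


m1 = [(0.0, 0.0), (10.0, 10.0)]
m2 = [(3.0, 4.0), (10.0, 11.0)]   # first center moved by 5, second by 1
shift = max_center_shift(m1, m2)
```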
18. Maximal Region
[Figure: labels M1, m1, m2]
19. Costs in Maximal Region
- Bounding Theorem
- Any solution in R(M1,Delta) must have
cost no less than C(M1)-DeltaN, where C(M1) is
the cost of M1 and N is the size of the data set
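The sketch below does not reproduce the bounding theorem itself. As an assumption made for illustration, it uses a bound that follows from the triangle inequality alone: if every center moves by at most Delta, each point's distance to its closest center changes by at most Delta, so sqrt(C(M')) >= sqrt(C(M1)) - Delta*sqrt(N). The code checks this numerically on made-up random data. The function names, the data, and the perturbation scheme are all hypothetical.

```python
import math
import random


def cost(points, centers):
    """Sum of squared distances from every point to its closest center."""
    return sum(min(math.dist(p, c) ** 2 for c in centers) for p in points)


def triangle_lower_bound(cost_m, n, delta):
    """(sqrt(C(M)) - delta*sqrt(N))^2, clipped to 0 when delta is large.

    Assumption: this is a triangle-inequality bound, not the slide's theorem.
    """
    root = math.sqrt(cost_m) - delta * math.sqrt(n)
    return root * root if root > 0 else 0.0


rng = random.Random(0)
points = [(rng.uniform(0, 10), rng.uniform(0, 10)) for _ in range(200)]
m1 = [(2.0, 2.0), (8.0, 8.0), (5.0, 5.0)]
delta = 0.5
bound = triangle_lower_bound(cost(points, m1), len(points), delta)


def perturb(center):
    """Move a 2-D center by a random vector of length at most delta."""
    angle = rng.uniform(0, 2 * math.pi)
    radius = rng.uniform(0, delta)
    return (center[0] + radius * math.cos(angle), center[1] + radius * math.sin(angle))


# Smallest cost seen among 1000 random center sets inside the region.
worst = min(cost(points, [perturb(c) for c in m1]) for _ in range(1000))
```

No sampled center set falls below `bound`, as the triangle-inequality argument predicts.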
21. Algorithm
- New Algorithm
- Same Initial Centers Selection
- New Iteration
- Reassignment
- Computing new centers, M
- Finding the smallest R(M,Delta)
- Computing the lower bound in maximal region
- Check the stopping criteria or prune the current
procedure
22. Finding Smallest Delta
- The value of Delta can be any real value
- Divide the search range into N+1 segments,
[0,a(1)), [a(1),a(2)), ..., [a(N),infinity)
- Search the segments in order, starting from [0,a(1))
- On every segment, solve a quadratic equation
- If a plausible quadratic root is found, return
it as the smallest Delta
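The segment scan can be sketched generically as follows. The slide does not say how the breakpoints a(i) or the quadratic on each segment are obtained, so both are passed in as plain inputs here. The coefficient triples in the example are invented, and treating a root as "plausible" when it falls inside its own segment is an assumption.

```python
import math


def quadratic_roots(a, b, c):
    """Real roots of a*x^2 + b*x + c = 0, in increasing order."""
    if a == 0:
        return [] if b == 0 else [-c / b]
    disc = b * b - 4 * a * c
    if disc < 0:
        return []
    s = math.sqrt(disc)
    return sorted({(-b - s) / (2 * a), (-b + s) / (2 * a)})


def smallest_delta(breakpoints, coefficients):
    """Scan [0,a(1)), [a(1),a(2)), ..., [a(N),inf) in order.

    breakpoints: sorted values a(1..N); coefficients: N+1 triples (a, b, c),
    one quadratic per segment (hypothetical inputs). Returns the first root
    that lies inside its own segment, or None if no segment has one.
    """
    lows = [0.0] + list(breakpoints)
    highs = list(breakpoints) + [math.inf]
    for lo, hi, (a, b, c) in zip(lows, highs, coefficients):
        for root in quadratic_roots(a, b, c):
            if lo <= root < hi:
                return root
    return None


# Invented example with N = 2 breakpoints, hence 3 segments.
delta = smallest_delta([1.0, 2.0], [(1, -5, 6), (1, 0, -2.25), (1, -6, 8)])
```

Because segments are visited from the smallest upward and roots within a segment are sorted, the first accepted root is the smallest one overall.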
23. Algorithm
- Finding the smallest Delta to bound the local
optimum in the Maximal Region
- Sorting and Scan Algorithm
- Complexity is O(N log N), where N is the size of the
data
- Lower bounding the cost of local optimum
- Simple computation
- Done in O(1) time
25. Experiments
- Data Set
- Synthetic data sets and KDD99 data set
- Original K-Means Algorithm (OKM)
- Accelerated K-means Algorithm (AKM)
- Run k-means clustering several times
- The best result of the previous runs is used to
prune the following runs
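The pruning idea behind AKM can be sketched independently of how the lower bound is computed. In the sketch below each run is represented as a sequence of (cost, lower_bound) pairs, one per iteration. This representation, the function name `best_of_runs`, and all numbers in the example traces are hypothetical and serve only to show the control flow.

```python
import math


def best_of_runs(runs):
    """Run several k-means restarts, abandoning a run once it cannot win.

    runs: iterable of per-run sequences of (cost, lower_bound) pairs, where
    lower_bound is a bound on the cost of the local optimum the run can reach.
    Returns (best final cost, total iterations executed, number of pruned runs).
    """
    best = math.inf
    iterations = 0
    pruned = 0
    for run in runs:
        final_cost = math.inf
        for cost, lower_bound in run:
            iterations += 1
            if lower_bound >= best:   # this run cannot beat the best result so far
                pruned += 1
                break
            final_cost = cost
        else:                          # run finished without being pruned
            best = min(best, final_cost)
    return best, iterations, pruned


# Invented traces for three restarts.
run_a = [(100.0, 0.0), (50.0, 20.0), (40.0, 38.0), (40.0, 40.0)]
run_b = [(120.0, 10.0), (90.0, 45.0), (80.0, 70.0)]
run_c = [(60.0, 5.0), (35.0, 30.0), (33.0, 33.0)]
result = best_of_runs([run_a, run_b, run_c])  # → (33.0, 9, 1)
```

Here the second restart is cut off after two iterations because its lower bound already exceeds the best cost from the first restart.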
26. Experiments
- Measurement
- We use the same random seeds for OKM and AKM
- Number of iterations (I/O cost) and computation time (CPU
cost)
27. Experiments (cont.)
- Varying dimensionality on synthetic data sets
28. Experiments (cont.)
- Varying k on synthetic data sets
29. Experiments (cont.)
- Varying k on KDD99 data set
30. Conclusion and Future Work
- Contribution
- Lower bound of Local Optimum in K-Means
Algorithm
- The concept of Maximal Region
- Algorithm for finding Maximal Region
- Accelerate K-Means Algorithm
31. Conclusion and Future Work
- Additional Applications
- Data stream clustering
- Real-time cluster analysis over moving objects
- Improvement
- A tighter bound
- Extension to general clustering algorithms
32. Q & A