Clustering Uncertain Data - PowerPoint PPT Presentation

Slides: 26

Provided by: non8106


Data clustering discovers cluster patterns in a
data set. For example, the data set may be
partitioned into several groups, or clusters,
such that the data within the same cluster are
closer to each other, or more similar under some
distance function, than to the data in any other
cluster.
There are many methods for clustering data, and
K-means clustering is a common one.

K-means clustering gives each cluster a
representative, which is the mean of the data in
the cluster.
As an example, consider a database of location
data reported from moving vehicles in a tracking
system.

K location points (e.g. the US White House, a
school, etc.) around the data are given as an
initial guess of the representatives of the
clusters expected to exist in the data.
K-means clustering assigns each vehicle to the
one of the K clusters whose representative is
closest to the vehicle's location in Euclidean
distance.

Then the representative of each cluster is
updated to the mean of the locations of the
vehicles in the cluster, and each vehicle is
re-assigned among the K clusters using the new
representatives.
This process repeats until some objective is
met, e.g. no vehicle changes cluster between two
successive iterations.
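
The assign/update loop described above can be sketched as follows. This is a minimal illustration, not the presenter's code: the function name `kmeans`, the `max_iters` safety cap, and the choice to leave an empty cluster's representative unchanged are all assumptions made for the example.

```python
import math

def kmeans(points, representatives, max_iters=100):
    """Plain K-means on 2-D points; returns (assignments, representatives)."""
    reps = [tuple(r) for r in representatives]
    assignments = None
    for _ in range(max_iters):  # max_iters is a hypothetical safety cap
        # Assignment step: each point goes to its nearest representative.
        new_assignments = [
            min(range(len(reps)), key=lambda k: math.dist(p, reps[k]))
            for p in points
        ]
        # Stop when no point changed cluster between two successive passes.
        if new_assignments == assignments:
            break
        assignments = new_assignments
        # Update step: move each representative to the mean of its members.
        for k in range(len(reps)):
            members = [p for p, a in zip(points, assignments) if a == k]
            if members:  # assumption: an empty cluster keeps its old representative
                reps[k] = tuple(sum(c) / len(members) for c in zip(*members))
    return assignments, reps
```

For instance, `kmeans([(0, 0), (0, 1), (10, 10), (10, 11)], [(0, 0), (10, 10)])` places the first two vehicles in cluster 0 and the last two in cluster 1.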

After the clustering, a cluster could be empty. A
non-empty cluster may have some meaning, e.g. if
its representative point is very close to the US
White House, the vehicles in the cluster may be
classified as spies.
Note that if the vehicles are constantly moving,
their actual locations may have changed by the
time their reported location data is received.

In that case, the data in the database is not
exact. Each data item has an "uncertainty region"
around it, within which its corresponding
vehicle's actual location lies.
The uncertainty region could be of arbitrary
shape, or simply a circle centered at the
reported location with a radius equal to the
vehicle's maximum speed times the time elapsed
since the location was reported.
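
A small sketch of the circular uncertainty region just described. The function names and the unit choice (speed in metres per second, time in seconds) are assumptions for illustration only.

```python
import math

def uncertainty_radius(max_speed, elapsed):
    """Radius of the circular uncertainty region: max speed times elapsed time."""
    return max_speed * elapsed

def may_contain(reported, max_speed, elapsed, candidate):
    """True if `candidate` lies inside the circle centred on `reported`."""
    return math.dist(reported, candidate) <= uncertainty_radius(max_speed, elapsed)
```

A vehicle with a maximum speed of 30 m/s that reported its location 10 s ago could be anywhere within 300 m of the reported point.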

The uncertainty region could also be associated
with an arbitrary probability density function
(pdf) giving the probability of the vehicle's
actual location being at a particular point of
the region.
For example, part of the region may be sea, so
that part may carry a total probability of 0.1,
uniformly distributed over its points (for the
vehicle crashing into any one of them). The
remaining probability of 0.9 is for the vehicle
being at any point in the rest of the region.

Data with such uncertainty is called uncertain
data.
There are only a few methods for clustering
uncertain data. UK-means clustering is a common
one.
UK-means clustering is the same as K-means
clustering except that its distance function is
the "expected distance" from a data item's
uncertainty region to the representative of the
candidate cluster to which it may be assigned.

For the representative c of a cluster, an
uncertainty region R with a pdf f, and the
Euclidean distance D(p,c) between any two points
p and c, the expected distance, called ed, is

$ed(R, c) = \int_{p \in R} f(p)\, D(p, c)\, dp$

An uncertainty region could have an arbitrary
shape, so the minimum bounding rectangle (MBR) of
the region is used for R.
UK-means clustering needs O(nK) time to compute
the ed values for each of the n data items and
each of the K candidate clusters.
This is costly, especially for an uncertainty
region with an arbitrary pdf f, which has to be
sampled with a large number of samples in Monte
Carlo methods to obtain the f values at the
points.
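
To make the cost concrete, here is a minimal Monte Carlo sketch of an ed estimate: it averages D(p, c) over points p drawn according to f. The helper names, the default sample count, the fixed seed, and the uniform-over-rectangle sampler are assumptions chosen for illustration; an arbitrary pdf would need its own sampler.

```python
import math
import random

def expected_distance(sample_point, c, num_samples=10_000, rng=None):
    """Monte Carlo estimate of ed: average D(p, c) over points p drawn from f."""
    rng = rng or random.Random(0)  # fixed seed (an assumption) for repeatability
    total = 0.0
    for _ in range(num_samples):
        total += math.dist(sample_point(rng), c)
    return total / num_samples

def uniform_over_mbr(x_min, y_min, x_max, y_max):
    """Sampler for a uniform pdf over a rectangular region (illustrative only)."""
    def sample(rng):
        return (rng.uniform(x_min, x_max), rng.uniform(y_min, y_max))
    return sample
```

Each estimate costs `num_samples` distance evaluations, and UK-means needs one estimate for every pair of data item and candidate cluster, which is why avoiding ed computations pays off.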

A method called Global-minmax Pruning is
developed in my research to prune out some
candidate clusters for a data item, saving some
computations of ed values.
Define lmin(R,M) to be a lower bound on the ed
values computed between an uncertainty region R
and the representative points of the candidate
clusters inside an MBR M.
Likewise, define lmax(R,M) to be an upper bound
on the ed values computed between an uncertainty
region R and the representative points of the
candidate clusters inside an MBR M.

Define minmax to be the minimum of the lmax
values among all candidate clusters for a data
item.
Global-minmax Pruning then works as follows.
A KD-tree of a given height h is built to index
the representative points of the candidate
clusters, using p entries in each non-leaf node.
Hence there are $(p^h - p)/(p - 1)$ non-leaf
entries.
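
The entry count can be checked with a geometric series. This assumes, as the slide does not spell out, that the non-leaf levels 1 to h-1 hold p, p^2, ..., p^(h-1) entries respectively:

```latex
\sum_{i=1}^{h-1} p^{i} = \frac{p^{h} - p}{p - 1},
\qquad \text{e.g. } p = 2,\ h = 4:\quad 2 + 4 + 8 = 14 = \frac{2^{4} - 2}{2 - 1}.
```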

The entries in each level, except the leaf level,
are sorted in increasing order along one
dimension, alternating the dimension from level
to level. This takes O((h-1) K log(K)) time.
The KD-tree is usually small and can be stored in
the CPU cache, so its access time is small enough
to be ignored.
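
The slides that walk through the pruning steps have no transcript, so the following is only a sketch of the usual min-max test implied by the definitions of lmin, lmax and minmax: a candidate whose lower bound exceeds minmax cannot have the smallest ed and can be dropped. The function name and the flat list inputs (one bound per candidate, no KD-tree) are assumptions for illustration.

```python
def minmax_prune(lmin, lmax):
    """Return indices of candidate clusters that survive min-max pruning.

    lmin[k] and lmax[k] are lower and upper bounds on ed for candidate k.
    """
    minmax = min(lmax)  # smallest upper bound among all candidates
    # A candidate whose lower bound exceeds minmax cannot have the smallest ed.
    return [k for k in range(len(lmin)) if lmin[k] <= minmax]
```

Only the surviving candidates need an actual ed computation; if a single candidate survives, the data item can be assigned to it directly.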

The time complexity for computing the ed values
is then $O(n(K - P))$ if P candidate clusters are
pruned out, on average, for each object.

If the KD-tree uses a capacity p for its non-leaf
nodes, the worst-case time complexity of the
pruning process is
$O((h-1) K \log K + n(2K + (p^h - p)/(p - 1)))$,
which occurs when no non-leaf entries are pruned
out.
Note that this is no better than the O(nK) time
for computing the ed values in UK-means
clustering without any pruning when computing an
ed value is less than twice as slow as computing
an lmax or lmin value (e.g. for uncertainty
regions with a uniform pdf).
So another pruning method, called Local-minmax
Pruning, is developed in my research to address
that special case.

The worst-case time complexity of the pruning
process is then
$O((h-1) K \log K + n(2K - 2Q + 2(p^h - p)/(p - 1)))$
when Q leaf entries are pruned out before their
nodes are visited.
This is better than the
$O((h-1) K \log K + n(2K - Q + (p^h - p)/(p - 1)))$
of Global-minmax Pruning if
$Q > (p^h - p)/(p - 1)$.
It could even be better than the O(nK) of the
non-pruning method in the earlier special case,
where computing an ed value is as fast as
computing an lmin or lmax value, if
$Q > K/2 + (p^h - p)/(p - 1) + (h-1) K \log K / (2n)$.
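
The last condition can be derived in a few steps. The derivation assumes that the Local-minmax worst case costs $(h-1) K \log K + n(2K - 2Q + 2E)$ bound computations, with $E = (p^h - p)/(p - 1)$, and that one ed value costs the same as one lmin or lmax value, so the non-pruning method costs $nK$. Both are readings of the text above rather than independently verified results.

```latex
\begin{aligned}
(h-1) K \log K + n\,(2K - 2Q + 2E) &< nK \\
2nQ &> nK + 2nE + (h-1) K \log K \\
Q &> \frac{K}{2} + E + \frac{(h-1) K \log K}{2n}
\end{aligned}
```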

But Local-minmax Pruning is not as effective in
pruning as Global-minmax Pruning, and hence could
be less efficient when computing ed values is a
large overhead, as its $O(n(K - P))$ cost
increases.
A simple way to compute lmin(R,M) is to use the
Euclidean distance between the two nearest points
in R and M.
A simple way to compute lmax(R,M) is to use the
Euclidean distance between the two farthest
points in R and M.
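
For axis-aligned rectangles, these two bounds can be computed per dimension. The sketch below assumes 2-D MBRs stored as `(x_min, y_min, x_max, y_max)` tuples, which is a representation chosen for illustration; a single representative point can be passed as a degenerate rectangle `(x, y, x, y)`.

```python
import math

def lmin(r, m):
    """Smallest Euclidean distance between any point of r and any point of m.

    Rectangles are (x_min, y_min, x_max, y_max); overlapping ones give 0.
    """
    dx = max(m[0] - r[2], r[0] - m[2], 0.0)  # horizontal gap, 0 if they overlap
    dy = max(m[1] - r[3], r[1] - m[3], 0.0)  # vertical gap, 0 if they overlap
    return math.hypot(dx, dy)

def lmax(r, m):
    """Largest Euclidean distance between any point of r and any point of m."""
    dx = max(m[2] - r[0], r[2] - m[0])  # widest horizontal separation
    dy = max(m[3] - r[1], r[3] - m[1])  # widest vertical separation
    return math.hypot(dx, dy)
```

Since ed is an average of distances between points of R and a representative inside M, it always lies between these two values, which is what makes them valid bounds.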