Statistical approach to numerical databases: clustering using normalised Minkowski metrics

Slides: 11

Provided by: sce67


1
Statistical approach to numerical
databases: clustering using normalised Minkowski
metrics
D.T. Pham, Y.I. Prostov, Maria Suarez Alvarez
2
Clustering

Clustering is the unsupervised classification
of data points into groups (clusters) based
on some similarity or distance measure. Each
group consists of data points that are similar
to one another and dissimilar to the data
points of other groups.
Typical Applications
As a stand-alone tool to get insight into data
distribution, to observe the characteristics of
each cluster and to focus on a particular set of
clusters for analysis.
As a preprocessing step for other
algorithms such as classification, which will
operate on the selected clusters.

3
Similarity

It is very common to calculate the dissimilarity
between two feature vectors using a distance measure.
The most popular metric for continuous features
is the Euclidean distance.
It is intuitive to employ because it is the
notion of distance in everyday use.

4
Similarity Measures

Euclidean metric (distance) (or L2 metric)

The Tchebysheff (Chebyshev) or maximum norm
The Minkowski distance (or metric)
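
For reference, the standard textbook definitions of these three measures can be written as follows. The notation is an assumption made here for illustration: x and y are two feature vectors with n components, and p is the order of the Minkowski metric.

```latex
% Assumed notation: x = (x_1, ..., x_n), y = (y_1, ..., y_n), p >= 1.
\begin{align}
  d_2(x, y)      &= \Bigl( \sum_{j=1}^{n} (x_j - y_j)^2 \Bigr)^{1/2}
                    && \text{(Euclidean, } L_2\text{)} \\
  d_\infty(x, y) &= \max_{1 \le j \le n} \lvert x_j - y_j \rvert
                    && \text{(Chebyshev, maximum norm)} \\
  d_p(x, y)      &= \Bigl( \sum_{j=1}^{n} \lvert x_j - y_j \rvert^{p} \Bigr)^{1/p}
                    && \text{(Minkowski)}
\end{align}
```

The Euclidean metric is the case p = 2 of the Minkowski metric, and the Chebyshev norm is its limit as p tends to infinity.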
5
Statistical approach to normalisation of feature
vectors

Each record (row) of a database may be regarded
as a random sample of the population under
consideration, i.e. one has a database of N
observations (samples) and each sample (record)
is a realisation of possible values of the
feature vector A.
It is reasonable to expect that the mean
contributions of individual features to the
overall similarity measure should be equal.
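
To illustrate why this expectation matters, the minimal sketch below averages the per-attribute term |x_j - y_j|^p over all pairs of records. The function name `mean_contributions` and the three-record database are hypothetical, chosen only so that the second attribute is the first one multiplied by 10.

```python
import itertools

def mean_contributions(records, p=2.0):
    """Average |x_j - y_j|**p over all distinct record pairs, per attribute."""
    pairs = list(itertools.combinations(records, 2))
    dims = len(records[0])
    return [
        sum(abs(x[j] - y[j]) ** p for x, y in pairs) / len(pairs)
        for j in range(dims)
    ]

# Hypothetical database: attribute 1 is attribute 0 on a scale 10 times larger.
records = [(0.0, 0.0), (1.0, 10.0), (2.0, 20.0)]
contrib = mean_contributions(records, p=2.0)
print(contrib)  # [2.0, 200.0]
```

Without normalisation the second attribute contributes 100 times more to the squared Euclidean distance, purely because of its scale.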

6
Statistical characteristics of an attribute

Let the j-th feature component be distributed
randomly with a given density function, which
determines the variance and the mean of the
distribution.
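
In an assumed notation (the symbols f_j, mu_j and sigma_j are illustrative choices, not necessarily those of the authors), these characteristics are the usual moments of the density:

```latex
% Assumed symbols: f_j = density, \mu_j = mean, \sigma_j^2 = variance
% of the j-th feature component.
\begin{align}
  \mu_j      &= \int_{-\infty}^{\infty} x \, f_j(x) \, dx, \\
  \sigma_j^2 &= \int_{-\infty}^{\infty} (x - \mu_j)^2 \, f_j(x) \, dx .
\end{align}
```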

7
Normalisation of feature vectors for Minkowski
metrics

The expectation of the contribution of the j-th
attribute to the Minkowski metric is
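
One way to write such an expectation, assuming that the j-th components X_j and Y_j of two records are independent draws from the same density f_j (both the independence and the symbols are assumptions made here for illustration), is:

```latex
% Assumption: X_j and Y_j are independent, each with density f_j.
\begin{equation}
  \mathbb{E}\,\lvert X_j - Y_j \rvert^{p}
  = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty}
    \lvert x - y \rvert^{p} \, f_j(x) \, f_j(y) \, dx \, dy .
\end{equation}
```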

8
Euclidean distance (p = 2)

For suitably normalised variables, the mean
contributions of all normalised components to the
squared Euclidean distance are the same.
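
A short worked derivation shows why scaling by the standard deviation has this effect for p = 2. It assumes that X_j and Y_j are independent with common mean mu_j and variance sigma_j^2; the symbols and the particular scaling x_j / sigma_j are assumptions for illustration.

```latex
% Assumption: X_j, Y_j independent, mean \mu_j, variance \sigma_j^2.
\begin{align}
  \mathbb{E}\,(X_j - Y_j)^2
    &= \mathbb{E}\,X_j^2 - 2\,\mathbb{E}\,X_j \,\mathbb{E}\,Y_j + \mathbb{E}\,Y_j^2 \\
    &= (\sigma_j^2 + \mu_j^2) - 2\mu_j^2 + (\sigma_j^2 + \mu_j^2)
     = 2\sigma_j^2 , \\
  \mathbb{E}\,\Bigl( \frac{X_j}{\sigma_j} - \frac{Y_j}{\sigma_j} \Bigr)^{2}
    &= \frac{2\sigma_j^2}{\sigma_j^2} = 2 \quad \text{for every } j .
\end{align}
```

After this scaling, every attribute contributes the same mean amount, 2, to the squared Euclidean distance, whatever its original variance.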

9
Normalised Minkowski metrics

A constant is introduced for each normally
distributed attribute. The modified Minkowski
metric for a database is then obtained by scaling
each attribute with its constant.
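
A minimal Python sketch of one such scaled metric follows. It assumes that each attribute difference is divided by the sample standard deviation of that attribute; this choice of constant, the function names and the synthetic two-attribute database are assumptions for illustration, not necessarily the constants used by the authors. For normally distributed attributes the mean of |x_j - y_j|^p is proportional to the p-th power of the standard deviation, which is why this scaling equalises the contributions.

```python
import itertools
import random
import statistics

def attribute_scales(records):
    """Sample standard deviation of each attribute (column) of the database."""
    return [statistics.stdev(column) for column in zip(*records)]

def normalised_minkowski(x, y, scales, p=2.0):
    """Minkowski distance with each attribute difference divided by its scale."""
    total = sum(abs((a - b) / s) ** p for a, b, s in zip(x, y, scales))
    return total ** (1.0 / p)

def mean_scaled_contributions(records, scales, p=2.0):
    """Average of |(x_j - y_j) / s_j|**p over all record pairs, per attribute."""
    pairs = list(itertools.combinations(records, 2))
    return [
        sum(abs((x[j] - y[j]) / s) ** p for x, y in pairs) / len(pairs)
        for j, s in enumerate(scales)
    ]

# Synthetic database: two normal attributes with very different spreads.
rng = random.Random(1)
records = [(rng.gauss(0.0, 1.0), rng.gauss(50.0, 10.0)) for _ in range(200)]
scales = attribute_scales(records)
contributions = mean_scaled_contributions(records, scales, p=2.0)
print(contributions)
```

With p = 2 and the sample standard deviation as the scale, both mean contributions equal 2 up to rounding error. For other values of p they are only approximately equal on a finite sample.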

10
Conclusion

A numerical database is regarded as a random
sample of objects for the domain under
consideration.
A statistical approach is applied to
normalisation of all attributes of the feature
vectors of data sets.
Using the Minkowski distance as an example of a
similarity metric, new normalised metrics are
introduced such that the means of contributions
of all attributes are the same. Contributions of
the features to similarity measures are
approximately equalised.
Such a normalisation is achieved by scaling
the numerical attributes.