Statistical approach to numerical databases: clustering using normalised Minkowski metrics

1
Statistical approach to numerical databases:
clustering using normalised Minkowski metrics
D.T. Pham, Y.I. Prostov, Maria Suarez Alvarez
2
Clustering
  • Clustering is the unsupervised classification
    of data points into groups (clusters) based
    on some similarity or distance measure. Each
    group consists of data points that are similar
    to one another and dissimilar to data points
    of other groups.
  • Typical applications:
      • As a stand-alone tool to get insight into
        the data distribution, to observe the
        characteristics of each cluster and to focus
        on a particular set of clusters for analysis.
      • As a preprocessing step for other
        algorithms, such as classification, which
        will operate on the selected clusters.

3
Similarity
  • It is very common to calculate the dissimilarity
    between two feature vectors using a distance
    measure. The most popular metric for continuous
    features is the Euclidean distance.
  • It is intuitive to employ because it is the
    distance measure of everyday use.

4
Similarity Measures
  • The Euclidean metric (distance), or L2 metric
  • The Tchebysheff (Chebyshev) metric, or maximum norm
  • The Minkowski distance (or metric)
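
The three measures listed above have standard textbook definitions, which can be sketched in a few lines of Python. The function names and the sample vectors are illustrative choices, not taken from the slides.

```python
import math
from typing import Sequence


def minkowski(x: Sequence[float], y: Sequence[float], p: float) -> float:
    """Minkowski distance: (sum_j |x_j - y_j|^p)^(1/p), for p >= 1."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    if p < 1:
        raise ValueError("p must be at least 1 for a metric")
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)


def euclidean(x: Sequence[float], y: Sequence[float]) -> float:
    """Euclidean (L2) distance: the Minkowski distance with p = 2."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def chebyshev(x: Sequence[float], y: Sequence[float]) -> float:
    """Chebyshev (maximum) distance: max_j |x_j - y_j|, the limit p -> infinity."""
    return max(abs(a - b) for a, b in zip(x, y))


a = [0.0, 0.0]
b = [3.0, 4.0]
print(euclidean(a, b))     # 5.0
print(chebyshev(a, b))     # 4.0
print(minkowski(a, b, 1))  # 7.0 (city-block distance)
```

The Euclidean and Chebyshev distances are the p = 2 and p → ∞ members of the Minkowski family, which is why the later slides can treat the Minkowski metric as the general case.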
5
Statistical approach to normalisation of feature
vectors
  • Each record (row) of a database may be regarded
    as a random sample of the population under
    consideration, i.e. one has a database of N
    observations (samples) and each sample (record)
    is a realisation of possible values of the
    feature vector A.
  • It is reasonable to expect that the mean
    contributions of individual features to the
    overall similarity measure should be equal.
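
To see why equal mean contributions matter, the sketch below averages each attribute's contribution |x_j - y_j|^p over all pairs of records in a tiny database. The records (age and income) and the helper name `mean_contributions` are hypothetical and serve only as an illustration.

```python
from itertools import combinations


def mean_contributions(records, p=2):
    """Average of |x_j - y_j|^p over all pairs of records, per attribute j."""
    pairs = list(combinations(records, 2))
    n_attrs = len(records[0])
    return [
        sum(abs(x[j] - y[j]) ** p for x, y in pairs) / len(pairs)
        for j in range(n_attrs)
    ]


# Hypothetical records: (age in years, income in currency units).
records = [(25, 30000), (40, 52000), (31, 41000), (58, 75000)]
age_mean, income_mean = mean_contributions(records, p=2)
print(age_mean, income_mean)
print(income_mean / age_mean)  # the raw income attribute dominates by a large factor
```

Without normalisation, the attribute measured on the larger scale determines the distance almost entirely, so a clustering would effectively ignore the other attribute.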

6
Statistical characteristics of an attribute
  • Let the j-th feature component be distributed
    randomly with a given density function.
  • The distribution is characterised by its
    variance and its mean.
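
In conventional notation (an assumption here, since the slide's own symbols are not available), with f_j the density function of the j-th component, the mean and the variance would be written as:

```latex
\mu_j = \int_{-\infty}^{\infty} x \, f_j(x) \, dx ,
\qquad
\sigma_j^2 = \int_{-\infty}^{\infty} (x - \mu_j)^2 \, f_j(x) \, dx .
```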

7
Normalisation of feature vectors for Minkowski
metrics
  • The expectation of the contribution of the j-th
    attribute to the Minkowski metric follows from
    the density function of that attribute.
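
A sketch of what such an expectation looks like, under the assumption (not stated in the transcript) that the two compared records are independent draws. Here x and y stand for their j-th components, f_j for the density of that component and p for the Minkowski exponent; all of these symbols are illustrative.

```latex
E\,|x - y|^p = \int_{-\infty}^{\infty}\!\int_{-\infty}^{\infty}
  |x - y|^p \, f_j(x) \, f_j(y) \, dx \, dy
```

For p = 2 the integral can be evaluated without knowing the shape of the density, provided the variance \sigma_j^2 is finite:

```latex
E\,(x - y)^2 = E\,x^2 - 2\,E\,x\;E\,y + E\,y^2
             = 2\left(E\,x^2 - \mu_j^2\right)
             = 2\,\sigma_j^2
```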

8
Euclidean distance (p = 2)
  • For variables normalised by scaling each
    attribute, the mean contributions of all
    normalised components to the squared Euclidean
    distance are the same.
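
The equalisation for p = 2 can be checked numerically. The sketch below assumes that the normalisation divides each attribute by its standard deviation; the exact normalised variables on the slide are not preserved, so this is one plausible reading. The data and function names are hypothetical.

```python
from statistics import pstdev


def scale_by_std(column):
    """Divide every value of one attribute by that attribute's standard deviation."""
    sigma = pstdev(column)  # population standard deviation; must be non-zero
    return [v / sigma for v in column]


def mean_squared_difference(column):
    """Average of (x - y)^2 over all ordered pairs drawn from the column."""
    n = len(column)
    return sum((x - y) ** 2 for x in column for y in column) / (n * n)


ages = [25, 40, 31, 58]                 # hypothetical attribute, in years
incomes = [30000, 52000, 41000, 75000]  # hypothetical attribute, in currency units

for column in (ages, incomes):
    print(mean_squared_difference(scale_by_std(column)))  # close to 2 for both
```

The average of (x - y)^2 over all ordered pairs equals twice the population variance, so after scaling every attribute contributes a mean of 2 regardless of its original units. Subtracting the mean as well would not change the result, because differences are unaffected by a shift.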

9
Normalised Minkowski metrics
  • A constant is introduced for each normal
    distribution.
  • The modified Minkowski metric for a database
    is built from these constants.
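
The slide's own formulas are not available, so the following is only a sketch of one way such a metric could be built, not the authors' definition. It assumes that every attribute is normally distributed, in which case the mean contribution E|x - y|^p of an attribute with standard deviation sigma has the closed form 2^p sigma^p Γ((p+1)/2) / √π, and it divides each term of the Minkowski sum by that value. The function names and the sample records are hypothetical.

```python
import math
from statistics import pstdev


def normal_mean_contribution(sigma, p):
    """E|x - y|^p for independent x, y ~ N(mu, sigma^2)."""
    return (2.0 * sigma) ** p * math.gamma((p + 1) / 2.0) / math.sqrt(math.pi)


def normalised_minkowski(x, y, sigmas, p):
    """Minkowski distance in which each term is divided by its expected value.

    Assumption: every attribute is treated as normally distributed, with
    standard deviation sigmas[j] estimated from the database.
    """
    total = sum(
        abs(a - b) ** p / normal_mean_contribution(s, p)
        for a, b, s in zip(x, y, sigmas)
    )
    return total ** (1.0 / p)


# Hypothetical records: (age in years, income in currency units).
records = [(25, 30000), (40, 52000), (31, 41000), (58, 75000)]
sigmas = [pstdev(column) for column in zip(*records)]
print(normalised_minkowski(records[0], records[1], sigmas, p=2))
print(normalised_minkowski(records[0], records[1], sigmas, p=1))
```

Because each term is divided by a quantity proportional to sigma^p, rescaling an attribute (for example, changing its units) leaves the distance unchanged. For p = 2 the constant reduces to 2 sigma^2, in agreement with the Euclidean case.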

10
Conclusion
  • A numerical database is regarded as a random
    sample of objects from the domain under
    consideration.
  • A statistical approach is applied to
    normalisation of all attributes of the feature
    vectors of data sets.
  • Using the Minkowski distance as an example of a
    similarity metric, new normalised metrics are
    introduced such that the means of contributions
    of all attributes are the same. Contributions of
    the features to similarity measures are
    approximately equalised.
  • Such a normalisation is achieved by scaling of
    the numerical attributes.