Transcript and Presenter's Notes

Title: Measurement of Similarity and Clustering


1
Measurement of Similarity and Clustering
Dr. Eamonn Keogh
Computer Science & Engineering Department
University of California, Riverside
Riverside, CA 92521
eamonn_at_cs.ucr.edu
2
Outline of Talk
  • What is Similarity?
  • Some nomenclature
  • A useful tool (dendrogram)
  • Why Measure Similarity?
  • Classification
  • Clustering
  • Indexing
  • Desirable Properties of Similarity Measures
  • Mathematical properties
  • Intuitiveness
  • Time and space complexity
  • Two Approaches
  • Feature Projection
  • Transformation (Edit Distance)
  • Hierarchical Clustering

3
What is Similarity?
"The quality or state of being similar; likeness; resemblance; as, a similarity of features." (Webster's Dictionary)
Similarity is hard to define, but "we know it when we see it". The real meaning of similarity is a philosophical question. We will take a more pragmatic approach.
4
Some Nomenclature I
  • We shall talk of measuring similarity; however, we are usually measuring dissimilarity.
  • Similarity: the larger the number, the more alike two objects are.
  • Dissimilarity: the larger the number, the less alike two objects are.
  • Distance is a common synonym for dissimilarity, so we may speak of "distance measure" and "dissimilarity measure" interchangeably.
  • However, a distance measure is not the same thing as a distance metric. We will see why later.

5
Some Nomenclature II
Similarity Queries are often expressed as Nearest
Neighbor Queries or Range Queries.
What is the nearest item to the green item?
What items are within R of the blue item?
(Figure: two example point sets with items labeled a, b, c, one illustrating each query type)
R is given by the user
Can be generalized to the K nearest Neighbors
6
A Useful Tool for Summarizing Similarity
Measurements
In order to better appreciate and evaluate the
examples given in the early part of this talk, we
will now introduce the dendrogram. (We will have
much more to say about dendrograms later)
The similarity between two objects in a
dendrogram is represented as the height of the
lowest internal node they share.
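A minimal sketch of how such a dendrogram can be produced, assuming Python with SciPy and matplotlib are available (the data matrix X below is invented purely for illustration):

import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

# Five illustrative objects described by two features each (hypothetical data).
X = np.array([[1.0, 1.1], [1.2, 0.9], [5.0, 5.2], [5.1, 4.8], [9.0, 0.5]])

Z = linkage(X, method='average')             # agglomerative clustering, group average linkage
dendrogram(Z, labels=['a', 'b', 'c', 'd', 'e'])
plt.show()                                   # the height of the lowest internal node two
                                             # leaves share reflects their dissimilarity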
7
(Bovine:0.69395,(Gibbon:0.36079,(Orangutan:0.33636,(Gorilla:0.17147,(Chimp:0.19268,Human:0.11927):0.08386):0.06124):0.15057):0.54939)
8
Swoopogram
Curvogram
Eurogram
Phenogram
Cladogram
Tree Diagram
9
Why Measure Similarity?
  • Classification: Given an unlabeled item Q, assign it to one of two or more predefined classes. (We can do classification without measuring similarity, but similarity-based methods (e.g. nearest neighbor) are very competitive.)
  • Clustering: Find natural groupings of items under some similarity measure.
  • Indexing (Query by Content): Given a query object Q and some similarity measure, find the nearest matching item in the database without having to examine every item.

10
Defining Distance Measures
Definition: Let O1 and O2 be two objects from the universe of possible objects. The distance (dissimilarity) between O1 and O2 is a real number denoted by D(O1, O2).
(Figure: Peter and Piotr fed into several black-box distance measures, which return 0.23, 3, and 342.7 respectively)
11
Peter, Piotr
When we peek inside one of these black boxes, we see some function of two variables. These functions might be very simple or very complex. In either case it is natural to ask: what properties should these functions have?

d('', '') = 0
d(s, '')  = d('', s) = |s|    -- i.e. length of s
d(s1+ch1, s2+ch2) = min( d(s1, s2) + (if ch1 = ch2 then 0 else 1 fi),
                         d(s1+ch1, s2) + 1,
                         d(s1, s2+ch2) + 1 )

D(Peter, Piotr) = 3
12
Intuitions behind desirable distance measure
properties
D(A,B) = D(B,A)                    Symmetry
Otherwise you could claim "Alex looks like Bob, but Bob looks nothing like Alex."

D(A,A) = 0                         Constancy of Self-Similarity
Otherwise you could claim "Alex looks more like Bob, than Bob does."

D(A,B) = 0 iff A = B               Positivity (Separation)
Otherwise there are objects in your world that are different, but you cannot tell apart.

D(A,B) ≤ D(A,C) + D(B,C)           Triangular Inequality
Otherwise you could claim "Alex is very like Bob, and Alex is very like Carl, but Bob is very unlike Carl."
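As a quick illustration, here is a hedged sketch (not from the original slides) that checks these four properties on a precomputed distance matrix D, where D[i][j] is the distance between objects i and j:

def is_metric(D, tol=1e-9):
    n = len(D)
    for i in range(n):
        if abs(D[i][i]) > tol:                          # constancy of self-similarity
            return False
        for j in range(n):
            if abs(D[i][j] - D[j][i]) > tol:            # symmetry
                return False
            if i != j and D[i][j] <= 0:                 # positivity (separation)
                return False
            for k in range(n):
                if D[i][j] > D[i][k] + D[k][j] + tol:   # triangular inequality
                    return False
    return True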
13
Why is the Triangular Inequality so Important?
Virtually all techniques to index data require
the triangular inequality to hold.
Suppose I have a database of 3 objects. Further suppose that the triangular inequality holds, and that we have precomputed a table of distances between all the items in the database.
14
Why is the Triangular Inequality so Important?
Virtually all techniques to index data require
the triangular inequality to hold.
Suppose I am looking for the closest point to Q, in a database of 3 objects. Further suppose that the triangular inequality holds, and that we have precomputed a table of distances between all the items in the database.
I find a and calculate that it is 2 units from Q; it becomes my best-so-far. I find b and calculate that it is 7.81 units away from Q. I don't have to calculate the distance from Q to c! I know:
    D(Q,b) ≤ D(Q,c) + D(b,c)
    D(Q,b) - D(b,c) ≤ D(Q,c)
    7.81 - 2.30 ≤ D(Q,c)
    5.51 ≤ D(Q,c)
So I know that c is at least 5.51 units away, but my best-so-far is only 2 units away.
(Figure: the query Q and the database objects a, b, c)
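The pruning idea on this slide can be sketched in a few lines of Python. This is only an illustrative reading of the slide, with dist (the expensive distance function) and pairwise (the precomputed table of distances between database items) as assumed inputs:

def nearest_neighbor(query, database, dist, pairwise):
    best_so_far, best_index = float('inf'), None
    known = {}                                  # distances from the query computed so far
    for i, item in enumerate(database):
        # Triangular inequality: D(Q, j) <= D(Q, i) + D(i, j), so
        # D(Q, i) >= D(Q, j) - D(i, j) for every already-examined item j.
        bound = max((d_qj - pairwise[j][i] for j, d_qj in known.items()), default=0.0)
        if bound >= best_so_far:
            continue                            # this item cannot beat the best-so-far
        d = dist(query, item)                   # only now pay for the real distance
        known[i] = d
        if d < best_so_far:
            best_so_far, best_index = d, i
    return best_index, best_so_far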
15
Thoughts on the Triangular Inequality I
Sometimes the triangular inequality requirement
maps nicely onto human intuitions. Consider the
similarity between the horse, the zebra and the
lion.
The horse and the zebra are very similar, and
both are very unlike the lion.
16
Thoughts on the Triangular Inequality II
Sometimes the triangular inequality requirement
fails to map onto human intuition. Consider the
similarity between the horse, a man and the
centaur.
The horse and the man are very different, but
both share many features with the centaur. This
relationship does not obey the triangular
inequality.
The centaur example is due to Remco Veltkamp.
17
What other properties should we require of a distance measure?
  • It should really measure similarity!! ("Whatever that means")
  • It should be fast to compute
  • Euclidean distance and Hamming distance are O(n); Dynamic Time Warping and String Edit distance are O(n²)
  • It should be space efficient
  • This is usually not as important as time efficiency
  • It should allow indexing
  • If the measure is a metric, this is automatically true; otherwise it depends
  • A fast lower bound measure is desirable (we will see why on the next slide)
  • ∀A, B: lower_bound_distance(A, B) ≤ true_distance(A, B)
18
If not fast to compute, a fast lower bound
measure is desirable
Assume that true_distance(A, B) is the correct distance function but is very expensive to compute, and that lower_bound_distance(A, B) is a cheap lower-bounding estimate of true_distance(A, B). An algorithm that exploits this bound will allow faster sequential searching (a sketch follows below).
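The algorithm itself is shown as a figure on the original slide; what follows is only a sketch of the usual lower-bounding sequential scan, under the stated assumption that lower_bound_distance(A, B) ≤ true_distance(A, B) for all A, B:

def lower_bounding_sequential_scan(query, database, true_distance, lower_bound_distance):
    best_so_far, best_item = float('inf'), None
    for item in database:
        if lower_bound_distance(query, item) >= best_so_far:
            continue                              # the cheap test rules this item out
        d = true_distance(query, item)            # the expensive test, only when needed
        if d < best_so_far:
            best_so_far, best_item = d, item
    return best_item, best_so_far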
19
  • If we want to measure the similarity between
    items, we will have to measure some features
  • Scalar
  • Binary: Only two possible states.
  • True/False, Jew/Gentile, Married/Unmarried
  • Nominal: Generalization of Binary to 3 or more states
  • Jew/Catholic/Protestant, Married/Divorced/Widower
  • In basketball, jersey numbers are nominal
  • You cannot order, or do any mathematical operations on, nominal data

20
  • Scalar (continued)
  • Ordinal: Same as nominal, but order matters. However, the distance between two values is not meaningful.
  • For example, we might have a coded survey: 0 = no high school, 1 = some high school, 2 = high school diploma, 3 = some college
  • While we can clearly rank these attributes, the distance between a 1 and a 2 is not the same as the distance between a 2 and a 3.
  • Interval: Distance between attributes is meaningful. In this case we can measure intervals and take averages, but we cannot form ratios (i.e. we cannot say 10 is twice as large as 5)
  • For example, consider temperature in Fahrenheit or Celsius
  • Ratio: You can meaningfully form ratios.
  • For example, weight, height, number of children

21
  • Scalar (continued)
  • Note that both Interval and Ratio data can be
    either discrete or continuous
  • For example consider the following two examples
    of ratio data
  • Number of Children (for a given person)
  • Average Number of Children (For women in
    different countries)
  • Some algorithms work better (or only work) for either discrete or continuous data.
  • We can convert from continuous to discrete

22
In addition to scalar values, much of the data we are interested in is non-scalar: vectors or matrices of Binary/Nominal/Ordinal/Interval/Ratio values; bitmaps, time series, strings, trees, graphs.
23
Consider color: what kind of feature is this?
Nominal Scalar: Blue, Red, Yellow, etc.
Ordinal Discrete: Red, Orange, Yellow, Green, Blue, Indigo, Violet.
Ordinal Continuous: 780-622 nm, 622-597 nm, 597-577 nm
Vector Continuous: 0.95, 0.01, 0.21 (Red/Green/Blue, or Hue/Saturation/Luminosity)
We sometimes have a choice of representation. Often making the right choice can be very important.
24
The similarity between two items depends on the
features we measure (and the distance measure
itself)
(Figure: the same people compared under Last Name Similarity and under Skin Color Similarity)
25
  • Sometimes we are given the perfect features to measure similarity...
  • ...sometimes we need to:
  • Generate Features: Suppose we hope to find similar people with regard to their medical conditions; knowing both their height and weight is not helpful, but knowing their BMI is. (BMI = weight in kilos / (height in meters)²)
  • Clean Features: Our features may contain noise or outliers.
  • Normalize Features: We may need to transform features.
  • Reduce Features: We may have too many features to do efficient similarity measurement, so dimensionality reduction may be necessary.

26
There is no single magic black box for measuring similarity
  • However, there are two useful and general tricks:
  • Project the data into feature space; the distance in feature space (appropriately measured) becomes the similarity.
  • Transform one object into the other; the cost of this transformation becomes the similarity.

Feature Projection
Edit Distance
27
Feature Projection Example I
(Figure: the six birds projected into feature space; x-axis: Body Mass (1-10), y-axis: Ratio of beak length over body length (0.1-1.0). From left to right: Bee Hummingbird, Costas Hummingbird, Ruby Topaz Hummingbird, Kestrel, Gyrfalcon, Bald Eagle.)
Use the features to project the items into feature space. The distance between two objects in this space (appropriately measured) is the measure of similarity.
28
Feature Projection Example II
R. A. Fisher's Iris Dataset: 3 variations of the Iris flower, 50 of each.
29
A generic technique for measuring similarity
To measure the similarity between two objects,
transform one of the objects into the other, and
measure how much effort it took. The measure of
effort becomes the distance measure.
The distance between Patty and Selma:
  Change dress color, 1 point
  Change earring shape, 1 point
  Change hair part, 1 point
  D(Patty, Selma) = 3

The distance between Marge and Selma:
  Change dress color, 1 point
  Add earrings, 1 point
  Decrease height, 1 point
  Take up smoking, 1 point
  Lose weight, 1 point
  D(Marge, Selma) = 5
30
Edit Distance Example I
How similar are the names "Peter" and "Piotr"?
Assume the following cost function:
  Substitution  1 Unit
  Insertion     1 Unit
  Deletion      1 Unit
D(Peter, Piotr) is 3

It is possible to transform any string Q into string C, using only Substitution, Insertion and Deletion. Assume that each of these operators has a cost associated with it. The similarity between two strings can be defined as the cost of the cheapest transformation from Q to C. Note that for now we have ignored the issue of how we can find this cheapest transformation.

Peter → Piter → Pioter → Piotr
  Substitution (i for e)
  Insertion (o)
  Deletion (e)
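A minimal dynamic-programming sketch of this edit distance, assuming the unit costs given above (the standard Levenshtein formulation):

def edit_distance(q, c):
    m, n = len(q), len(c)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i                                # delete all of q[:i]
    for j in range(n + 1):
        d[0][j] = j                                # insert all of c[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = 0 if q[i - 1] == c[j - 1] else 1
            d[i][j] = min(d[i - 1][j - 1] + sub,   # substitution (or free match)
                          d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1)         # insertion
    return d[m][n]

print(edit_distance('Peter', 'Piotr'))             # 3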
31
Edit Distance Example II
We can make two time series appear more similar by letting one point in one map onto two (or more) points in the other. For example, suppose we have Q = 5, 6, 8, 8, 7 and C = 5, 6, 6, 8, 7.
A one-to-one measure would have to match an 8 in Q to a 6 in C. However, if we allowed nonlinear alignments, every number could match with itself. Another way of looking at it is as an attempt to make the two sequences more similar by inserting values.
This is called Dynamic Time Warping.
32
Dynamic Time Warping
Fixed Time Axis: Sequences are aligned one to one.
Warped Time Axis: Nonlinear alignments are possible.
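A minimal sketch of the classic O(nm) Dynamic Time Warping recurrence, assuming the absolute difference as the point-wise cost (the talk does not fix a particular cost):

def dtw(q, c):
    n, m = len(q), len(c)
    INF = float('inf')
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(q[i - 1] - c[j - 1])
            # one point may map onto one or more points in the other series
            D[i][j] = cost + min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    return D[n][m]

Q = [5, 6, 8, 8, 7]
C = [5, 6, 6, 8, 7]
print(dtw(Q, C))    # 0.0: with warping, every value can be matched exactly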
33
The Minkowski Metric
So, we have projected our objects into feature
space. How do we measure the distance between
points?
Assume Q and C are vectors of features measured from the objects of interest.
  p = 1: Manhattan (Rectilinear, City Block)
  p = 2: Euclidean
  p = ∞: Max (Supremum, sup)
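The Minkowski formula itself appears as an image on the slide; assuming the standard definition D_p(Q, C) = (sum_i |q_i - c_i|^p)^(1/p), a minimal sketch is:

def minkowski(q, c, p=2):
    if p == float('inf'):                              # max (supremum) norm
        return max(abs(qi - ci) for qi, ci in zip(q, c))
    return sum(abs(qi - ci) ** p for qi, ci in zip(q, c)) ** (1.0 / p)

# p = 1 gives Manhattan, p = 2 Euclidean, p = float('inf') the Max metric.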
34
The Minkowski Metric, a Weakness
Suppose we have a database of 3 items, with 2
features, number of children and temperature. We
want to know who is most similar to Mr Red
under the Euclidean distance.
(Figure: the same three people plotted twice, with temperature measured in Fahrenheit in one plot and in Celsius in the other. Fahrenheit points: (5, 96.8), (1, 96.8), (5, 102.2); Celsius points: (5, 36), (1, 36), (5, 38). Under Euclidean distance, Green is closest to Red in one plot, but Blue is closest to Red in the other.)
The Minkowski metric is sensitive to the units used to measure features, a very undesirable property since the units are usually arbitrary. Two solutions suggest themselves: normalize the features, or use a weighted version of the Minkowski metric.
35
Normalizing Features
Let C be a database of items, with the ith feature denoted by ci. To normalize the database:

  for each feature ci
      ci = (ci - mean(ci)) / std(ci)
  end

After normalization each feature will have a mean of zero and a standard deviation of one.
Note that in both these images the axes are square (there is the same number of pixels per unit in both the X and Y direction).
Before normalization, the Y-axis dominates; after normalization, both axes are equally important.
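A minimal NumPy sketch of the same normalization, operating on a matrix C whose rows are items and whose columns are features (this framing is an assumption; the slide does not show code):

import numpy as np

def normalize(C):
    C = np.asarray(C, dtype=float)
    return (C - C.mean(axis=0)) / C.std(axis=0)   # each column: mean 0, std 1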
36
The Weighted Minkowski Metric
Assume Q and C are vectors of features measured from the objects of interest. Further assume that W is a vector containing the relative importance of the features.
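The weighted formula is also an image on the slide; one common form, assumed here, scales each feature difference by its weight w_i before summing:

def weighted_minkowski(q, c, w, p=2):
    return sum(wi * abs(qi - ci) ** p
               for qi, ci, wi in zip(q, c, w)) ** (1.0 / p)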
But how do we know the weights?
37
The Minkowski Metrics have Simple Geometric
Interpretations
Euclidean
Weighted Euclidean
Manhattan
Max
38
(No Transcript)
39
What is Clustering?
Also called unsupervised learning; sometimes called "classification" by statisticians, "sorting" by psychologists, and "segmentation" by people in marketing.
  • Organizing data into classes such that there is
  • high intra-class similarity
  • low inter-class similarity
  • Finding the class labels and the number of
    classes directly from the data (in contrast to
    classification).
  • More informally, finding natural groupings among
    objects.

40
What is a natural grouping among these objects?
41
What is a natural grouping among these objects?
Clustering is subjective
School Employees
Simpson's Family
Males
Females
42
Even if we know in advance the number of clusters
we expect to see, the clustering obtained may be
subjective.
43
Two Types of Clustering
  • Partitional algorithms: Construct various partitions and then evaluate them by some criterion
  • Hierarchical algorithms: Create a hierarchical decomposition of the set of objects using some criterion

Partitional
Hierarchical
44
Desirable Properties of a Clustering Algorithm
  • Scalability (in terms of both time and space)
  • Ability to deal with different data types
  • Discovery of clusters with arbitrary shape
  • Minimal requirements for domain knowledge to
    determine input parameters
  • Able to deal with noise and outliers
  • Insensitive to order of input records
  • Ability to handle high dimensionality
  • Incorporation of user-specified constraints
  • Interpretability and usability

45
Hierarchical Clustering
Since we cannot test all possible trees, we will have to do a heuristic search over the space of possible trees. We could do this:
Bottom-Up (agglomerative): Starting with each item in its own cluster, find the best pair to merge into a new cluster. Repeat until all clusters are fused together.
Top-Down (divisive): Starting with all the data in a single cluster, consider every possible way to divide the cluster into two. Choose the best division and recursively operate on both sides.
  • The number of dendrograms with n leaves is (2n - 3)! / (2^(n-2) (n - 2)!)  (checked in the short sketch after this list)
  • Number of Leaves    Number of Possible Dendrograms
  •   2                     1
  •   3                     3
  •   4                    15
  •   5                   105
  •   ...
  •  10            34,459,425
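A minimal check of the formula above:

from math import factorial

def num_dendrograms(n):
    return factorial(2 * n - 3) // (2 ** (n - 2) * factorial(n - 2))

print([num_dendrograms(n) for n in range(2, 6)])   # [1, 3, 15, 105]
print(num_dendrograms(10))                         # 34459425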

46
We begin with a distance matrix which contains
the distances between every pair of objects in
our database.
D( , ) = 8        D( , ) = 1
47
Bottom-Up (agglomerative) Starting with each
item in its own cluster, find the best pair to
merge into a new cluster. Repeat until all
clusters are fused together.
Consider all possible merges
Choose the best

48
Bottom-Up (agglomerative) Starting with each
item in its own cluster, find the best pair to
merge into a new cluster. Repeat until all
clusters are fused together.
Consider all possible merges
Choose the best

Consider all possible merges
Choose the best

49
Bottom-Up (agglomerative) Starting with each
item in its own cluster, find the best pair to
merge into a new cluster. Repeat until all
clusters are fused together.
Consider all possible merges
Choose the best

Consider all possible merges
Choose the best

Consider all possible merges
Choose the best

50
Bottom-Up (agglomerative) Starting with each
item in its own cluster, find the best pair to
merge into a new cluster. Repeat until all
clusters are fused together.
Consider all possible merges
Choose the best

Consider all possible merges
Choose the best

Consider all possible merges
Choose the best

51
In the first iteration of agglomerative clustering we merged the two closest items, so we need to remove them from the matrix.
We now need to add the single merged cluster to our new, smaller matrix.
But what values do we fill in? What is D( , )? D( , )?
52
We know how to measure the distance between two objects, but defining the distance between an object and a cluster, or defining the distance between two clusters, is non-obvious.
  • Single linkage (nearest neighbor) In this
    method the distance between two clusters is
    determined by the distance of the two closest
    objects (nearest neighbors) in the different
    clusters.
  • Complete linkage (furthest neighbor) In this
    method, the distances between clusters are
    determined by the greatest distance between any
    two objects in the different clusters (i.e., by
    the "furthest neighbors").
  • Group average In this method, the distance
    between two clusters is calculated as the average
    distance between all pairs of objects in the two
    different clusters.

53
Using Single linkage (nearest neighbor)
D( , ) = min( D( , ), D( , ) ) = 4
D( , ) = min( D( , ), D( , ) ) = 7
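A minimal sketch of bottom-up clustering with single linkage, working directly on a pairwise distance lookup dist[a][b] (the item labels and data structure are assumptions for illustration):

def agglomerate_single_linkage(items, dist):
    clusters = [{i} for i in items]                    # start: each item in its own cluster
    merges = []
    while len(clusters) > 1:
        best = None
        # consider all possible merges and choose the best (closest) pair
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(dist[a][b] for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((set(clusters[i]), set(clusters[j]), d))
        clusters[j] = clusters[i] | clusters[j]        # fuse the two clusters
        del clusters[i]
    return merges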
54
  • Summary of Hierarchical Clustering Methods
  • No need to specify the number of clusters in advance.
  • Hierarchical nature maps nicely onto human intuition for some domains.
  • They do not scale well: time complexity of at least O(n²), where n is the number of total objects.
  • Like any heuristic search algorithm, local optima are a problem.
  • Interpretation of results is subjective.