David Corne and Nick Taylor, Heriot-Watt University - dwcorne@gmail.com
1
Data Mining (and machine learning)
  • DM Lecture 6: Similarity and Distance

2
Today
  • Similarity / Distance between data records is
    used in:
  • Clustering
  • Many machine learning methods
  • Many, many, many practical applications
  • More fundamentally:
  • Sometimes data records are entirely unstructured,
    e.g. free-text answers in a questionnaire, news
    articles, etc.
  • To do DM/ML they need to be structured somehow
  • Then we can cluster them, etc.
  • Plus:
  • Notes about validation and overfitting

3
k-nearest neighbour
The simplest machine learning method of all!
[scatter plot of training points labelled 'up' and 'down']
4
k-nearest neighbour
A new point: should it be classed as Up or Down?
[the same scatter plot, with the new point added]
5
k-nearest neighbour
A 1-NN classifier says UP
6
k-nearest neighbour
A 3-NN classifier says DOWN
7
k-nearest neighbour
A 5-NN classifier says DOWN
8
k-nearest neighbour
What might 3-NN say in this case, and would it be
correct?
9
k-nearest neighbour
What might 3-NN say in this case, and would it be
correct?
10
K-NN
  • Extremely simple
  • Often very good performance
  • Most suitable for datasets where there are clear
    geographic clusters
  • Even on complex data, provides a good guess
  • Like almost all DM/ML techniques, it relies
    exclusively on a distance measure; different
    ways to measure distance will give different
    results.
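As an illustration (not from the slides), here is a minimal k-NN classifier in Python; the toy 'up'/'down' points are made up to mirror the scatter-plot examples above.

```python
import math
from collections import Counter

def knn_classify(train, query, k=3):
    """train: list of (point, label) pairs; query: a point (tuple of numbers)."""
    # sort training points by Euclidean distance to the query
    nearest = sorted((math.dist(point, query), label) for point, label in train)
    # majority vote among the k nearest labels
    votes = Counter(label for _, label in nearest[:k])
    return votes.most_common(1)[0][0]

# toy 'up'/'down' data in the spirit of the scatter-plot slides
train = [((1, 1), "up"), ((2, 1), "up"), ((8, 7), "down"), ((9, 8), "down")]
print(knn_classify(train, (2, 2), k=3))   # -> up
```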

11
k-nearest neighbour
By 3-NN, is the red car a Ford or a Chevrolet?
[scatter plot of Fords and Chevrolets, axes: cost vs. miles per gallon]
12
k-nearest neighbour
By 3-NN, is the red car a Ford or a Chevrolet?
[the same scatter plot, but with the cost axis measured in cents]
13
Distance measures
Euclidean distance
Point 1 is (x1, x2, ..., xn); Point 2 is (y1, y2, ..., yn); the Euclidean distance is
  sqrt( (x1 - y1)^2 + (x2 - y2)^2 + ... + (xn - yn)^2 )
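A one-function sketch of the same calculation in Python (the example points are my own):

```python
import math

def euclidean(p, q):
    # square root of the sum of squared per-field differences
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

print(euclidean((0, 0), (3, 4)))   # 5.0
```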
14
Distance measures
Manhattan distance (aka city-block distance)
Point 1 is (x1, x2, ..., xn); Point 2 is (y1, y2, ..., yn); the Manhattan distance is
  |x1 - y1| + |x2 - y2| + ... + |xn - yn|
(in case you don't know, |x| is the absolute value of x)
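The corresponding sketch in Python (same example points as above):

```python
def manhattan(p, q):
    # sum of the absolute per-field differences
    return sum(abs(a - b) for a, b in zip(p, q))

print(manhattan((0, 0), (3, 4)))   # 7
```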
15
Distance measures
Chebychev distance
Point 1 is (x1, x2, ..., xn); Point 2 is (y1, y2, ..., yn); the Chebychev distance is
  max( |x1 - y1|, |x2 - y2|, ..., |xn - yn| )
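And the same idea in Python:

```python
def chebychev(p, q):
    # the largest absolute per-field difference
    return max(abs(a - b) for a, b in zip(p, q))

print(chebychev((0, 0), (3, 4)))   # 4
```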
16
Distance measures
(red, male, big, hot)     (green, male, small, hot)
Proportion different
Point 1 is (red, male, big, hot); Point 2 is (green, male, small, hot); the proportion different is
  (number of fields that differ) / (number of fields)  =  2/4  =  0.5
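A sketch in Python, using the example records from the slide:

```python
def proportion_different(p, q):
    # fraction of fields whose values differ
    return sum(a != b for a, b in zip(p, q)) / len(p)

print(proportion_different(("red", "male", "big", "hot"),
                           ("green", "male", "small", "hot")))   # 0.5
```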
17
Distance measures
(bread, cheese, milk, nappies)     (batteries, cheese)
Jaccard coefficient
Point 1 is a set A; Point 2 is a set B; the Jaccard coefficient is
  |A ∩ B| / |A ∪ B|
(the number of things that appear in both (1: cheese), divided by the total number of different things (5), giving 1/5 = 0.2)
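A sketch in Python, using the two shopping baskets from the slide:

```python
def jaccard(a, b):
    # size of the intersection divided by size of the union
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

print(jaccard({"bread", "cheese", "milk", "nappies"},
              {"batteries", "cheese"}))   # 0.2
```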
18
Using common sense
Data vectors are (colour, manufacturer, top-speed), e.g.
  (red, ford, 180)
  (yellow, toyota, 160)
  (silver, bugatti, 300)
What distance measure will you use?
19
Using common sense
Data vectors are (colour, manufacturer, top-speed), e.g.
  (dark, ford, high)
  (medium, toyota, high)
  (light, bugatti, very-high)
What distance measure will you use?
20
Using common sense
With different types of fields, e.g.
  p1 = (red, high, 0.5, UK, 12)
  p2 = (blue, high, 0.6, France, 15)
you could simply define a distance measure for each field
individually, and add them up. Similarly, you could divide the
vectors into ordinal and numeric parts:
  p1a = (red, high, UK)       p1b = (0.5, 12)
  p2a = (blue, high, France)  p2b = (0.6, 15)
and say that dist(p1, p2) = dist(p1a, p2a) + dist(p1b, p2b),
using appropriate measures for the two kinds of vector.
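A sketch of one way to implement this in Python; the particular field split and the simple sum are assumptions, since the slide leaves the per-field measures open:

```python
def mixed_distance(p, q, numeric_fields):
    # numeric fields contribute absolute differences (Manhattan-style),
    # everything else contributes a 0/1 mismatch (proportion-different-style),
    # and the parts are simply added together
    total = 0.0
    for i, (a, b) in enumerate(zip(p, q)):
        if i in numeric_fields:
            total += abs(a - b)
        else:
            total += 0 if a == b else 1
    return total

p1 = ("red",  "high", 0.5, "UK",     12)
p2 = ("blue", "high", 0.6, "France", 15)
print(mixed_distance(p1, p2, numeric_fields={2, 4}))   # 1 + 0 + 0.1 + 1 + 3 = 5.1
```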
21
Notes
Suppose one field varies hugely (standard deviation 100) and
one field varies a tiny amount (standard deviation 0.001).
Why is Euclidean distance a bad idea? What can you do?
What is the distance between these two?
  "Star Trek Voyager"      "Satr Trek Voyagger"
Normalising fields individually is often a good idea; when a
numerical field is normalised, that means you scale it so that
the mean is 0 and the standard deviation is 1.
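A sketch of that normalisation step in Python (the example cost values are made up):

```python
import statistics

def normalise(values):
    # rescale a numeric column so the mean is 0 and the standard deviation is 1
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)
    return [(v - mean) / sd for v in values]

costs = [12000, 15000, 9000, 30000]
print(normalise(costs))
```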
22
Text: a prime example of unstructured data

23
How did I get these vectors from these two
documents?

<h1>Compilers</h1> <p>The Guardian uses several compilers for
its daily cryptic crosswords. One of the most frequently used
is Araucaria, and one of the most difficult is Bunthorne.</p>

<h1>Compilers lecture 1</h1> <p>This lecture will introduce the
concept of lexical analysis, in which the source code is scanned
to reveal the basic tokens it contains. For this, we will need
the concept of regular expressions (r.e.s).</p>

(26, 2, 2)
(35, 2, 0)
24
What about these two vectors?

<h1>Compilers</h1> <p>The Guardian uses several compilers for
its daily cryptic crosswords. One of the most frequently used
is Araucaria, and one of the most difficult is Bunthorne.</p>

<h1>Compilers lecture 1</h1> <p>This lecture will introduce the
concept of lexical analysis, in which the source code is scanned
to reveal the basic tokens it contains. For this, we will need
the concept of regular expressions (r.e.s).</p>

(1, 1, 1, 0, 0, 0)
(0, 0, 0, 1, 1, 1)
25
An unfair question, but I got that by using the following word
vector: (Crossword, Cryptic, Difficult, Expression, Lexical,
Token). If a document contains the word 'crossword', it gets a
1 in position 1 of the vector, otherwise 0. If it contains
'lexical', it gets a 1 in position 5, otherwise 0, and so on.
How similar would the vectors be for two pages about crossword
compilers?
The key to measuring document similarity is turning documents
into vectors based on specific words and their frequencies.
26
Turning a document into a vector
We start with a template for the vector, which needs a master
list of terms. A term can be a word, or a number, or anything
that appears frequently in documents.
There are almost 200,000 words in English; it would take much
too long to process document vectors of that length. Commonly,
vectors are made from a small number (50-1000) of the most
frequently-occurring words. However, the master list usually
does not include words from a stoplist, which contains words
such as 'the', 'and', 'there', 'which', etc. Why?
27
Turning a document into a vector
Suppose our Master List is (banana, cat, dog, fish, read).
Suppose documents 1, 2, 3 are:
  "Bananas are grown in hot countries, and cats like bananas."
  "It is raining cats and dogs today"
  "cats like seafood"
Assuming I first do stemming, or equivalent, the vector
encodings of these documents could be:
  1. (2, 1, 0, 0, 0)
  2. (0, 1, 1, 0, 0)
  3. (0, 1, 0, 0, 0)
What distance measure would you use? Does it make any sense?
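A sketch of the encoding step in Python; the crude "chop a trailing s" stemmer here stands in for the "stemming, or equivalent" the slide assumes:

```python
def stem(word):
    # extremely crude stemming: just drop a trailing 's'
    return word.rstrip("s")

def to_vector(text, master_list):
    # strip punctuation, lowercase, stem, then count each master-list term
    words = [stem(w.strip(".,").lower()) for w in text.split()]
    return [words.count(term) for term in master_list]

master = ["banana", "cat", "dog", "fish", "read"]
doc1 = "Bananas are grown in hot countries, and cats like bananas."
doc2 = "It is raining cats and dogs today"
doc3 = "cats like seafood"
print([to_vector(d, master) for d in (doc1, doc2, doc3)])
# [[2, 1, 0, 0, 0], [0, 1, 1, 0, 0], [0, 1, 0, 0, 0]]
```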
28
Text mining
Encoding documents as vectors is a hot topic, and there are
many important and valuable applications, e.g.:
  Predicting sentiment: if a document describes a movie, how
  much does it rate that movie, on a scale of 1 to 10?
  Which document(s) are the most appropriate for a search
  engine to retrieve for a search query?
  Is document A plagiarised from document B?
And there are better standard ways to encode documents, such as
TFIDF; I cover that in the Web Intelligence module.
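TFIDF itself is left to the Web Intelligence module; purely as a rough sketch, one common variant of the weighting looks like this (the function and its details are my assumptions, not the module's definition):

```python
import math

def tfidf(doc_vectors):
    # doc_vectors: term-count vectors over the same master list of terms
    n_docs = len(doc_vectors)
    n_terms = len(doc_vectors[0])
    # document frequency: how many documents contain each term at least once
    df = [sum(1 for d in doc_vectors if d[t] > 0) for t in range(n_terms)]
    weighted = []
    for d in doc_vectors:
        total = sum(d) or 1
        weighted.append([(d[t] / total) * math.log(n_docs / df[t]) if df[t] else 0.0
                         for t in range(n_terms)])
    return weighted

print(tfidf([[2, 1, 0, 0, 0], [0, 1, 1, 0, 0], [0, 1, 0, 0, 0]]))
```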
29
Overfitting
Suppose we train a classifier to tell the difference between
handwritten 't' and 'c', using only these examples:
[images: the handwritten t's in the training set]
The classifier will learn easily. It will probably give 100%
correct predictions on these cases.
[images: the handwritten c's in the training set]
30
Overfitting
BUT this classifier will probably generalise very poorly; it
will perform very badly on a test set. E.g. here is potential
(very likely) performance on certain unseen cases:
[unseen example] It will probably predict that this is a 'c'. Why?
[unseen example] It will probably predict that this is a 't'.
31
Avoiding Overfitting
It can be avoided by using as much training data as possible,
ensuring as much diversity as possible in the data. This cuts
down on the potential existence of features that might be
discriminative in the training data, but are otherwise spurious.
It can be avoided by jittering (adding noise). During training,
every time an input pattern is presented, it is randomly
perturbed. The idea of this is that spurious features will be
washed out by the noise, but valid discriminatory features will
remain. The problem with this approach is how to correctly
choose the level of noise.
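A sketch of jittering in Python; the Gaussian noise and the sigma value are assumptions, since the slide only says the pattern is randomly perturbed:

```python
import random

def jitter(pattern, sigma=0.05):
    # add small Gaussian noise to each numeric field of the input pattern
    return [x + random.gauss(0, sigma) for x in pattern]

pattern = [0.5, 1.2, 3.0]
for presentation in range(3):
    print(jitter(pattern))   # a slightly different version each time it is presented
```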
32
Avoiding Overfitting II
A typical curve showing performance during training, and
performance on unseen data not in the training set.
[figure: error (y-axis) against training time (x-axis), for
methods like neural networks, with one curve for the training
data and one for unseen data]
33
Avoiding Overfitting III
Another approach is early stopping. During training, keep track
of the network's performance on a separate validation set of
data. At the point where error continues to improve on the
training set, but starts to get worse on the validation set,
that is when training should be stopped, since it is starting
to overfit on the training data. The problem here is that this
point is far from always clear-cut.
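A sketch of early stopping in Python; the "patience" rule for deciding that validation error has really started to get worse is my addition, since the slide notes the stopping point is rarely clear-cut:

```python
def train_with_early_stopping(train_step, validation_error, max_epochs=1000, patience=5):
    # train_step(): one pass over the training data
    # validation_error(): current error on the held-out validation set
    best_err, best_epoch, bad_epochs = float("inf"), 0, 0
    for epoch in range(max_epochs):
        train_step()
        err = validation_error()
        if err < best_err:
            best_err, best_epoch, bad_epochs = err, epoch, 0
        else:
            bad_epochs += 1              # validation error is no longer improving
            if bad_epochs >= patience:
                break                    # assume overfitting has started
    return best_epoch, best_err

# toy usage with a fake validation curve that improves and then gets worse
errs = iter([0.9, 0.7, 0.5, 0.45, 0.46, 0.5, 0.6, 0.7, 0.8, 0.9])
print(train_with_early_stopping(lambda: None, lambda: next(errs),
                                max_epochs=10, patience=3))   # (3, 0.45)
```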
34
  • See you next week, for the last time.