Title: David Corne and Nick Taylor, Heriot-Watt University - dwcorne@gmail.com
1 Data Mining (and machine learning)
- DM Lecture 7: Feature Selection
2 Today
- Finishing correlation / regression
- Feature Selection
- Coursework 2
3 Remember how to calculate r
If we have pairs of (x, y) values, Pearson's r is given by the formula below.
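With \bar{x} and \bar{y} denoting the sample means, the standard form is:

$$ r \;=\; \frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i-\bar{x})^2}\;\sqrt{\sum_{i=1}^{n}(y_i-\bar{y})^2}} $$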
Interpretation of this should be obvious (?)
4 Equivalently, you can do it like this
Looking at it another way, after z-normalisation: X is the z-normalised x value in the sample, indicating how many standard deviations it lies from the mean. Same for Y. The formula for r on the last slide is equivalent to this:
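That is, with X and Y the z-normalised values (using the population standard deviation), r is simply the mean product of the z-scores:

$$ r \;=\; \frac{1}{n}\sum_{i=1}^{n} X_i Y_i, \qquad X_i=\frac{x_i-\bar{x}}{\sigma_x}, \quad Y_i=\frac{y_i-\bar{y}}{\sigma_y} $$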
5 The names file in the CC dataset has correlation values (with the class/target) for each field
field min max mean std correlation median mode
Population 0 1 0.06 0.13 0.37 0.02 0.01
Householdsize 0 1 0.46 0.16 -0.03 0.44 0.41
Racepctblack 0 1 0.18 0.25 0.63 0.06 0.01
racePctWhite 0 1 0.75 0.24 -0.68 0.85 0.98
racePctAsian 0 1 0.15 0.21 0.04 0.07 0.02
racePctHisp 0 1 0.14 0.23 0.29 0.04 0.01
agePct12t21 0 1 0.42 0.16 0.06 0.4 0.38
agePct12t29 0 1 0.49 0.14 0.15 0.48 0.49
agePct16t24 0 1 0.34 0.17 0.1 0.29 0.29
agePct65up 0 1 0.42 0.18 0.07 0.42 0.47
numbUrban 0 1 0.06 0.13 0.36 0.03 0
pctUrban 0 1 0.7 0.44 0.08 1 1
medIncome 0 1 0.36 0.21 -0.42 0.32 0.23
pctWWage 0 1 0.56 0.18 -0.31 0.56 0.58
pctWFarmSelf 0 1 0.29 0.2 -0.15 0.23 0.16
pctWInvInc 0 1 0.5 0.18 -0.58 0.48 0.41
pctWSocSec 0 1 0.47 0.17 0.12 0.475 0.56
pctWPubAsst 0 1 0.32 0.22 0.57 0.26 0.1
pctWRetire 0 1 0.48 0.17 -0.1 0.47 0.44
medFamInc 0 1 0.38 0.2 -0.44 0.33 0.25
perCapInc 0 1 0.35 0.19 -0.35 0.3 0.23
whitePerCap 0 1 0.37 0.19 -0.21 0.32 0.3
blackPerCap 0 1 0.29 0.17 -0.28 0.25 0.18
indianPerCap 0 1 0.2 0.16 -0.09 0.17 0
AsianPerCap 0 1 0.32 0.2 -0.16 0.28 0.18
OtherPerCap 0 1 0.28 0.19 -0.13 0.25 0
HispPerCap 0 1 0.39 0.18 -0.24 0.345 0.3
NumUnderPov 0 1 0.06 0.13 0.45 0.02 0.01
PctPopUnderPov 0 1 0.3 0.23 0.52 0.25 0.08
PctLess9thGrade 0 1 0.32 0.21 0.41 0.27 0.19
6 here (the same table as on the previous slide)
7 Here are the top 20 fields by absolute correlation with the class (although the first doesn't count, since it is the target itself). This hints at how we might use correlation for feature selection.
ViolentCrimesPerPop 0 1 0.24 0.23 1 0.15 0.03 0
PctIlleg 0 1 0.25 0.23 0.74 0.17 0.09 0
PctKids2Par 0 1 0.62 0.21 -0.74 0.64 0.72 0
PctFam2Par 0 1 0.61 0.2 -0.71 0.63 0.7 0
racePctWhite 0 1 0.75 0.24 -0.68 0.85 0.98 0
PctYoungKids2Par 0 1 0.66 0.22 -0.67 0.7 0.91 0
PctTeen2Par 0 1 0.58 0.19 -0.66 0.61 0.6 0
racepctblack 0 1 0.18 0.25 0.63 0.06 0.01 0
pctWInvInc 0 1 0.5 0.18 -0.58 0.48 0.41 0
pctWPubAsst 0 1 0.32 0.22 0.57 0.26 0.1 0
FemalePctDiv 0 1 0.49 0.18 0.56 0.5 0.54 0
TotalPctDiv 0 1 0.49 0.18 0.55 0.5 0.57 0
PctPolicBlack 0 1 0.22 0.24 0.54 0.12 0 1675
MalePctDivorce 0 1 0.46 0.18 0.53 0.47 0.56 0
PctPersOwnOccup 0 1 0.56 0.2 -0.53 0.56 0.54 0
PctPopUnderPov 0 1 0.3 0.23 0.52 0.25 0.08 0
PctUnemployed 0 1 0.36 0.2 0.5 0.32 0.24 0
PctHousNoPhone 0 1 0.26 0.24 0.49 0.185 0.01 0
PctPolicMinor 0 1 0.26 0.23 0.49 0.2 0.07 1675
PctNotHSGrad 0 1 0.38 0.2 0.48 0.36 0.39 0
8 Can anyone see a potential problem with choosing only (for example) the 20 features that correlate best with the target class?
9-12 Feature Selection: What
You have some data, and you want to use it to build a classifier, so that you can predict something (e.g. likelihood of cancer).
The data has 10,000 fields (features).
You need to cut it down to 1,000 fields before you try machine learning. Which 1,000?
The process of choosing the 1,000 fields to use is called Feature Selection.
13 Datasets with many features
- Gene expression datasets (10,000 features)
- http://www.ncbi.nlm.nih.gov/sites/entrez?db=gds
- Proteomics data (20,000 features)
- http://www.ebi.ac.uk/pride/
14-16 Feature Selection: Why?
(These slides show results from published experiments in which classification accuracy falls as more features are used. Source: http://elpub.scix.net/data/works/att/02-28.content.pdf)
17 - It is quite easy to find many more cases in papers where experiments show that accuracy falls when you use more features.
18 - Why does accuracy reduce with more features?
- How does it depend on the specific choice of features?
- What else changes if we use more features?
- So, how do we choose the right features?
19 Why accuracy reduces
- Note: suppose the best feature set has 20 features. If you add another 5 features, the accuracy of machine learning will typically fall. But you still have the original 20 features! Why does this happen?
20 Noise / Explosion
- The additional features typically add noise. Machine learning will pick up on spurious correlations that may hold in the training set but not in the test set.
- For some ML methods, more features mean more parameters to learn (more NN weights, more decision tree nodes, etc.); the increased space of possibilities is more difficult to search.
21-24 Feature selection methods
- A big research area!
- These slides show a diagram of FS methods from (Dash & Liu, 1997).
- We'll look briefly at parts of it.
25 Correlation-based feature ranking
- This is what you will use in CW 2 (a minimal sketch of the idea follows below).
- It is indeed often used by practitioners (who perhaps don't understand the issues involved in FS).
- It is actually fine for certain datasets.
- It is not even considered in Dash & Liu's survey.
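As a rough illustration of what such ranking involves, here is a minimal sketch in Python. The file name, column layout and the hand-rolled Pearson function are assumptions for the example; the CW2 handout defines the actual requirements.

```python
import csv
import math

def pearson_r(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant column has no linear relationship with anything
    return sxy / math.sqrt(sxx * syy)

def rank_features(rows, class_index, k):
    """Return the indices of the k columns with the largest |r| against the class column."""
    target = [row[class_index] for row in rows]
    scores = []
    for j in range(len(rows[0])):
        if j == class_index:
            continue
        column = [row[j] for row in rows]
        scores.append((abs(pearson_r(column, target)), j))
    scores.sort(reverse=True)
    return [j for _, j in scores[:k]]

# Hypothetical usage, assuming a purely numeric CSV with the class in the last column:
# with open("communities.data") as f:
#     rows = [[float(v) for v in line] for line in csv.reader(f)]
# top20 = rank_features(rows, class_index=len(rows[0]) - 1, k=20)
```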
26 A made-up dataset
f1 f2 f3 f4 class
0.4 0.6 0.4 0.6 1
0.2 0.4 1.6 -0.6 1
0.5 0.7 1.8 -0.8 1
0.7 0.8 0.2 0.9 2
0.9 0.8 1.8 -0.7 2
0.5 0.5 0.6 0.5 2
27 Correlated with the class
(The same table; f1 and f2 are the columns that correlate with the class.)
28 Uncorrelated with the class / seemingly random
(The same table; f3 and f4 look uncorrelated with the class.)
29 Correlation-based FS reduces the dataset to this.
f1 f2 class
0.4 0.6 1
0.2 0.4 1
0.5 0.7 1
0.7 0.8 2
0.9 0.8 2
0.5 0.5 2
30 But column 5 shows us f3 + f4, which is perfectly correlated with the class!
f1 f2 f3 f4 f3+f4 class
0.4 0.6 0.4 0.6 1 1
0.2 0.4 1.6 -0.6 1 1
0.5 0.7 1.8 -0.8 1 1
0.7 0.8 0.2 0.9 1.1 2
0.9 0.8 1.8 -0.7 1.1 2
0.5 0.5 0.6 0.5 1.1 2
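This is easy to check numerically. The snippet below uses the standard library's statistics.correlation (Python 3.10+) on the made-up data above:

```python
from statistics import correlation  # Pearson's r

f3  = [0.4, 1.6, 1.8, 0.2, 1.8, 0.6]
f4  = [0.6, -0.6, -0.8, 0.9, -0.7, 0.5]
cls = [1, 1, 1, 2, 2, 2]

print(correlation(f3, cls))   # weak (roughly -0.3)
print(correlation(f4, cls))   # weak (roughly 0.4)
print(correlation([a + b for a, b in zip(f3, f4)], cls))
# 1.0: f3 + f4 is 1.0 for every class-1 row and 1.1 for every class-2 row
```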
31 Good FS methods therefore
- Need to consider how well features work together.
- As we have noted before, if you take 100 features that are each well correlated with the class, they may simply be strongly correlated with each other, and so provide no more information than just one of them.
32-35 Complete methods
- Original dataset has N features.
- You want to use a subset of k features.
- A complete FS method means: try every subset of k features, and choose the best!
- The number of subsets is N! / (k! (N-k)!).
- What is this when N is 100 and k is 5?
- 75,287,520 -- almost nothing.
- What is this when N is 10,000 and k is 100?
- 5,000,000,000,000,000,000,000,000,000,
36-38 - 000,000,000, ... (three more slides of zeros)
39 - continued for another 114 slides.
- (In fact C(10,000, 100) works out to roughly 6.5 × 10^241, a 242-digit number; for comparison, there are only around 10^80 atoms in the observable universe. Either way, far too many subsets to try.)
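A quick way to check these counts exactly is Python's math.comb:

```python
import math

print(math.comb(100, 5))    # 75287520: enumerable in seconds
big = math.comb(10000, 100)
print(len(str(big)))        # 242 digits: hopeless to enumerate
```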
40 Can you see a problem with complete methods?
41 Forward methods
- These methods grow a set S of features.
1. S starts empty.
2. Find the best feature to add (by checking which one gives the best performance on a test set when combined with S).
3. If overall performance has improved, return to step 2; else stop.
(A sketch of this loop follows below.)
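A minimal sketch of greedy forward selection, assuming an evaluate(subset) function (hypothetical here) that returns, say, held-out accuracy of some classifier trained on just those features:

```python
def forward_selection(all_features, evaluate):
    """Grow a feature set greedily while the evaluation score keeps improving.

    all_features: list of feature identifiers (e.g. column indices).
    evaluate(subset) -> score, where higher is better.
    """
    selected = []
    best_score = float("-inf")
    while True:
        remaining = [f for f in all_features if f not in selected]
        if not remaining:
            return selected
        # Step 2: find the best single feature to add to the current set S.
        score, feature = max(
            ((evaluate(selected + [f]), f) for f in remaining),
            key=lambda pair: pair[0],
        )
        # Step 3: keep it only if overall performance improved.
        if score > best_score:
            best_score = score
            selected.append(feature)
        else:
            return selected
```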
42 Backward methods
- These methods remove features one by one.
1. S starts with the full feature set.
2. Find the best feature to remove (by checking which removal from S gives the best performance on a test set).
3. If overall performance has improved, return to step 2; else stop.
43 - When might you choose forward instead of backward?
44 Random(ised) methods, aka stochastic methods
Suppose you have 1,000 features. There are 2^1000 possible subsets of features. One way to try to find a good subset is to run a stochastic search algorithm, e.g. hill-climbing, simulated annealing, a genetic algorithm, particle swarm optimisation, ...
45 One-slide introduction to (most) stochastic search algorithms
A search algorithm:
BEGIN
1. Initialise a random population P of N candidate solutions (maybe just 1) (e.g. each solution is a random subset of features).
2. Evaluate each solution in P (e.g. accuracy of 3-NN using only the features in that solution).
ITERATE
1. Generate a set C of new solutions, using the good ones in P (e.g. choose a good one and mutate it, combine bits of two or more solutions, etc.).
2. Evaluate each of the new solutions in C.
3. Update P, e.g. by choosing the best N from all of P and C.
4. If we have iterated a certain number of times, or accuracy is good enough, stop.
46 (The same outline, annotated: each step is either GENERATE, TEST, or UPDATE; a stochastic search is a generate-test-update loop.)
47 Why randomised / search methods are good for FS
Usually you have a large number of features (e.g. 1,000). You can give each feature a score (e.g. correlation with the target, Relief weight (see end slides), etc.) and choose the best-scoring features. This is very fast. However, it does not evaluate how well features work with other features. You could give combinations of features a score, but there are too many combinations of multiple features. Search algorithms are the only suitable approach that gets to grips with evaluating combinations of features (a sketch follows below).
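As one concrete (and deliberately minimal) instance of the generate-test-update loop from slide 45, here is a stochastic hill-climber over feature subsets. The evaluate function is an assumption: it would wrap whatever classifier and validation split you are using.

```python
import random

def hill_climb_features(n_features, evaluate, iterations=1000, seed=0):
    """Stochastic hill-climbing over subsets of feature indices.

    evaluate(subset) -> score (e.g. held-out accuracy of 3-NN restricted
    to those features); higher is better.
    """
    rng = random.Random(seed)
    # GENERATE an initial random subset (each feature in with probability 0.5).
    current = {f for f in range(n_features) if rng.random() < 0.5}
    current_score = evaluate(current)  # TEST it
    for _ in range(iterations):
        # GENERATE a neighbour by flipping one randomly chosen feature in or out.
        neighbour = set(current)
        neighbour.symmetric_difference_update({rng.randrange(n_features)})
        score = evaluate(neighbour)    # TEST it
        if score >= current_score:     # UPDATE: keep equal-or-better neighbours
            current, current_score = neighbour, score
    return current, current_score
```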
48 CW2
49 CW2
- Involves:
- Some basic dataset processing on the CandC dataset
- Applying a DMML technique called Naïve Bayes (NB; already implemented by me)
- Implementing your own script/code that can work out the correlation (Pearson's r) between any two fields
- Running experiments to compare the results of NB when using the top 5, top 10 and top 20 fields according to correlation with the class field.
50 CW2
- Naïve Bayes
- Described in the next (and last) lecture.
- It only works on discretized data, and predicts the class value of a target field.
- It uses Bayesian probability in a simple way to come up with a best guess for the class value, based on the proportions in exactly the type of histograms you are doing for CW1.
- My NB awk script builds its probability model on the first 80% of the data, and then outputs its average accuracy when applying this model to the remaining 20% of the data.
- It also outputs a confusion matrix.
51 CW2
53 If time: the classic example of an instance-based heuristic method
54 The Relief method
An instance-based, heuristic method: it works out weight values for each feature, based on how important they seem to be in discriminating between near neighbours.
55 The Relief method
There are two features here: the x and the y co-ordinate. Initially they each have zero weight: wx = 0, wy = 0.
56-57 The Relief method
wx = 0, wy = 0. Choose an instance at random; call it R.
58-59 The Relief method
wx = 0, wy = 0. Find H (the 'hit': the nearest instance to R of the same class) and M (the 'miss': the nearest instance to R of a different class).
60 The Relief method
wx = 0, wy = 0. Now we update the weights, based on the distances between R and H and between R and M. This happens one feature at a time.
61 The Relief method
To change wx, we add to it (MR - HR)/n, where MR and HR are the distances from R to the miss M and to the hit H in the x direction. So, the further away the miss is in the x direction, the higher the weight of x: the more important x is in terms of discriminating the classes.
62 The Relief method
To change wy, we add to it (MR - HR)/n again, but this time calculated in the y dimension. Clearly the difference is smaller: differences in this feature don't seem important in terms of class value.
63 The Relief method
Maybe now we have wx = 0.07, wy = 0.002.
64 The Relief method
wx = 0.07, wy = 0.002. Pick another instance at random, and do the same again.
65 The Relief method
wx = 0.07, wy = 0.002. Identify H and M.
66 The Relief method
wx = 0.07, wy = 0.002. Add the MR and HR differences, divided by n, for each feature, again.
67The Relief method
- In the end, we have a weight value for each
feature. The higher the value, the more relevant
the feature. - We can use these weights for feature selection,
simply by choosing the features with the S
highest weights (if we want to use S features) - NOTE
- It is important to use Relief F only on min-max
normalised data in 0,1. However it is fine if
category attibutes are involved, in which case
use Hamming distance for those attributes, - Why divide by n? Then, the weight values can be
interpreted as a difference in probabilities.
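For concreteness, here is a compact sketch of basic Relief as just described, for a two-class problem with numeric features already min-max normalised to [0, 1]. This is an illustration only, not the pseudocode from the original paper.

```python
import random

def relief_weights(X, y, n_samples=100, seed=0):
    """Basic Relief weights for a two-class dataset.

    X: list of instances, each a list of feature values in [0, 1].
    y: list of class labels. Returns one weight per feature; higher = more relevant.
    """
    rng = random.Random(seed)
    n_features = len(X[0])
    w = [0.0] * n_features

    def nearest(r, same_class):
        """Index of the nearest other instance that is / is not in R's class."""
        best, best_d = None, float("inf")
        for i in range(len(X)):
            if i == r or (y[i] == y[r]) != same_class:
                continue
            d = sum((a - b) ** 2 for a, b in zip(X[i], X[r]))
            if d < best_d:
                best, best_d = i, d
        return best

    for _ in range(n_samples):
        r = rng.randrange(len(X))          # choose an instance R at random
        h = nearest(r, same_class=True)    # nearest hit H
        m = nearest(r, same_class=False)   # nearest miss M
        for f in range(n_features):
            # Further from the miss and closer to the hit => larger weight.
            w[f] += (abs(X[m][f] - X[r][f]) - abs(X[h][f] - X[r][f])) / n_samples
    return w
```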
68 The Relief method, plucked directly from the original paper (Kira and Rendell, 1992)
69 Some recommended reading, if you are interested, is on the website.