Title: David Corne and Nick Taylor, Heriot-Watt University - dwcorne@gmail.com
1 Data Mining (and machine learning)
- DM Lecture 7: Feature Selection
2 Today
- Finishing correlation / regression
- Feature Selection
- Coursework 2
3 Remember how to calculate r
If we have pairs of (x, y) values, Pearson's r is given by the formula below.
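With \bar{x} and \bar{y} denoting the sample means, the standard form is:

$$ r \;=\; \frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i-\bar{x})^2}\;\sqrt{\sum_{i=1}^{n}(y_i-\bar{y})^2}} $$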
Interpretation of this should be obvious (?)
4 Equivalently, you can do it like this
Looking at it another way, after z-normalisation: X is the z-normalised x value in the sample, indicating how many standard deviations it lies from the mean. Same for Y. The formula for r on the last slide is equivalent to this:
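That is, with X and Y the z-normalised values (using the population standard deviation), r is simply the mean product of the z-scores:

$$ r \;=\; \frac{1}{n}\sum_{i=1}^{n} X_i Y_i, \qquad X_i=\frac{x_i-\bar{x}}{\sigma_x}, \quad Y_i=\frac{y_i-\bar{y}}{\sigma_y} $$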
5 The names file in the CC dataset has correlation values (with the class/target) for each field
field min max mean std correlation median mode
Population 0 1 0.06 0.13 0.37 0.02 0.01
Householdsize 0 1 0.46 0.16 -0.03 0.44 0.41
Racepctblack 0 1 0.18 0.25 0.63 0.06 0.01
racePctWhite 0 1 0.75 0.24 -0.68 0.85 0.98
racePctAsian 0 1 0.15 0.21 0.04 0.07 0.02
racePctHisp 0 1 0.14 0.23 0.29 0.04 0.01
agePct12t21 0 1 0.42 0.16 0.06 0.4 0.38
agePct12t29 0 1 0.49 0.14 0.15 0.48 0.49
agePct16t24 0 1 0.34 0.17 0.1 0.29 0.29
agePct65up 0 1 0.42 0.18 0.07 0.42 0.47
numbUrban 0 1 0.06 0.13 0.36 0.03 0
pctUrban 0 1 0.7 0.44 0.08 1 1
medIncome 0 1 0.36 0.21 -0.42 0.32 0.23
pctWWage 0 1 0.56 0.18 -0.31 0.56 0.58
pctWFarmSelf 0 1 0.29 0.2 -0.15 0.23 0.16
pctWInvInc 0 1 0.5 0.18 -0.58 0.48 0.41
pctWSocSec 0 1 0.47 0.17 0.12 0.475 0.56
pctWPubAsst 0 1 0.32 0.22 0.57 0.26 0.1
pctWRetire 0 1 0.48 0.17 -0.1 0.47 0.44
medFamInc 0 1 0.38 0.2 -0.44 0.33 0.25
perCapInc 0 1 0.35 0.19 -0.35 0.3 0.23
whitePerCap 0 1 0.37 0.19 -0.21 0.32 0.3
blackPerCap 0 1 0.29 0.17 -0.28 0.25 0.18
indianPerCap 0 1 0.2 0.16 -0.09 0.17 0
AsianPerCap 0 1 0.32 0.2 -0.16 0.28 0.18
OtherPerCap 0 1 0.28 0.19 -0.13 0.25 0
HispPerCap 0 1 0.39 0.18 -0.24 0.345 0.3
NumUnderPov 0 1 0.06 0.13 0.45 0.02 0.01
PctPopUnderPov 0 1 0.3 0.23 0.52 0.25 0.08
PctLess9thGrade 0 1 0.32 0.21 0.41 0.27 0.19
6 here (the same table as on the previous slide)
7 Here are the top 20 fields by absolute correlation with the class (although the first doesn't count, since it is the target itself). This hints at how we might use correlation for feature selection.
ViolentCrimesPerPop 0 1 0.24 0.23 1 0.15 0.03 0
PctIlleg 0 1 0.25 0.23 0.74 0.17 0.09 0
PctKids2Par 0 1 0.62 0.21 -0.74 0.64 0.72 0
PctFam2Par 0 1 0.61 0.2 -0.71 0.63 0.7 0
racePctWhite 0 1 0.75 0.24 -0.68 0.85 0.98 0
PctYoungKids2Par 0 1 0.66 0.22 -0.67 0.7 0.91 0
PctTeen2Par 0 1 0.58 0.19 -0.66 0.61 0.6 0
racepctblack 0 1 0.18 0.25 0.63 0.06 0.01 0
pctWInvInc 0 1 0.5 0.18 -0.58 0.48 0.41 0
pctWPubAsst 0 1 0.32 0.22 0.57 0.26 0.1 0
FemalePctDiv 0 1 0.49 0.18 0.56 0.5 0.54 0
TotalPctDiv 0 1 0.49 0.18 0.55 0.5 0.57 0
PctPolicBlack 0 1 0.22 0.24 0.54 0.12 0 1675
MalePctDivorce 0 1 0.46 0.18 0.53 0.47 0.56 0
PctPersOwnOccup 0 1 0.56 0.2 -0.53 0.56 0.54 0
PctPopUnderPov 0 1 0.3 0.23 0.52 0.25 0.08 0
PctUnemployed 0 1 0.36 0.2 0.5 0.32 0.24 0
PctHousNoPhone 0 1 0.26 0.24 0.49 0.185 0.01 0
PctPolicMinor 0 1 0.26 0.23 0.49 0.2 0.07 1675
PctNotHSGrad 0 1 0.38 0.2 0.48 0.36 0.39 0
8 Can anyone see a potential problem with choosing only (for example) the 20 features that correlate best with the target class?
9-12 Feature Selection: What
You have some data, and you want to use it to build a classifier, so that you can predict something (e.g. likelihood of cancer).
The data has 10,000 fields (features).
You need to cut it down to 1,000 fields before you try machine learning. Which 1,000?
The process of choosing the 1,000 fields to use is called Feature Selection.
13 Datasets with many features
- Gene expression datasets (10,000 features)
- http://www.ncbi.nlm.nih.gov/sites/entrez?db=gds
- Proteomics data (20,000 features)
- http://www.ebi.ac.uk/pride/
14-16 Feature Selection: Why?
(These slides show results from published experiments in which classification accuracy falls as more features are used. Source: http://elpub.scix.net/data/works/att/02-28.content.pdf)
17 - It is quite easy to find many more cases in papers where experiments show that accuracy falls when you use more features.
18 - Why does accuracy reduce with more features?
- How does it depend on the specific choice of features?
- What else changes if we use more features?
- So, how do we choose the right features?
19 Why accuracy reduces
- Note: suppose the best feature set has 20 features. If you add another 5 features, the accuracy of machine learning will typically fall. But you still have the original 20 features! Why does this happen?
20 Noise / Explosion
- The additional features typically add noise. Machine learning will pick up on spurious correlations that may hold in the training set but not in the test set.
- For some ML methods, more features mean more parameters to learn (more NN weights, more decision tree nodes, etc.); the increased space of possibilities is more difficult to search.
21-24 Feature selection methods
- A big research area!
- These slides show a diagram of FS methods from (Dash & Liu, 1997).
- We'll look briefly at parts of it.
25 Correlation-based feature ranking
- This is what you will use in CW 2 (a minimal sketch of the idea follows below).
- It is indeed often used by practitioners (who perhaps don't understand the issues involved in FS).
- It is actually fine for certain datasets.
- It is not even considered in Dash & Liu's survey.
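As a rough illustration of what such ranking involves, here is a minimal sketch in Python. The file name, column layout and the hand-rolled Pearson function are assumptions for the example; the CW2 handout defines the actual requirements.

```python
import csv
import math

def pearson_r(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant column has no linear relationship with anything
    return sxy / math.sqrt(sxx * syy)

def rank_features(rows, class_index, k):
    """Return the indices of the k columns with the largest |r| against the class column."""
    target = [row[class_index] for row in rows]
    scores = []
    for j in range(len(rows[0])):
        if j == class_index:
            continue
        column = [row[j] for row in rows]
        scores.append((abs(pearson_r(column, target)), j))
    scores.sort(reverse=True)
    return [j for _, j in scores[:k]]

# Hypothetical usage, assuming a purely numeric CSV with the class in the last column:
# with open("communities.data") as f:
#     rows = [[float(v) for v in line] for line in csv.reader(f)]
# top20 = rank_features(rows, class_index=len(rows[0]) - 1, k=20)
```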
26 A made-up dataset
f1 f2 f3 f4 class
0.4 0.6 0.4 0.6 1
0.2 0.4 1.6 -0.6 1
0.5 0.7 1.8 -0.8 1
0.7 0.8 0.2 0.9 2
0.9 0.8 1.8 -0.7 2
0.5 0.5 0.6 0.5 2
27 Correlated with the class
(The same table; f1 and f2 are the columns that correlate with the class.)
28 Uncorrelated with the class / seemingly random
(The same table; f3 and f4 look uncorrelated with the class.)
29 Correlation-based FS reduces the dataset to this.
f1 f2 class
0.4 0.6 1
0.2 0.4 1
0.5 0.7 1
0.7 0.8 2
0.9 0.8 2
0.5 0.5 2
30 But column 5 shows us f3 + f4, which is perfectly correlated with the class!
f1 f2 f3 f4 f3+f4 class
0.4 0.6 0.4 0.6 1 1
0.2 0.4 1.6 -0.6 1 1
0.5 0.7 1.8 -0.8 1 1
0.7 0.8 0.2 0.9 1.1 2
0.9 0.8 1.8 -0.7 1.1 2
0.5 0.5 0.6 0.5 1.1 2
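This is easy to check numerically. The snippet below uses the standard library's statistics.correlation (Python 3.10+) on the made-up data above:

```python
from statistics import correlation  # Pearson's r

f3  = [0.4, 1.6, 1.8, 0.2, 1.8, 0.6]
f4  = [0.6, -0.6, -0.8, 0.9, -0.7, 0.5]
cls = [1, 1, 1, 2, 2, 2]

print(correlation(f3, cls))   # weak (roughly -0.3)
print(correlation(f4, cls))   # weak (roughly 0.4)
print(correlation([a + b for a, b in zip(f3, f4)], cls))
# 1.0: f3 + f4 is 1.0 for every class-1 row and 1.1 for every class-2 row
```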
31 Good FS methods therefore
- Need to consider how well features work together.
- As we have noted before, if you take 100 features that are each well correlated with the class, they may simply be strongly correlated with each other, and so provide no more information than just one of them.
32-35 Complete methods
- Original dataset has N features.
- You want to use a subset of k features.
- A complete FS method means: try every subset of k features, and choose the best!
- The number of subsets is N! / (k! (N-k)!).
- What is this when N is 100 and k is 5?
- 75,287,520 -- almost nothing.
- What is this when N is 10,000 and k is 100?
- 5,000,000,000,000,000,000,000,000,000,
36-38 - 000,000,000, ... (three more slides of zeros)
39 - continued for another 114 slides.
- (In fact C(10,000, 100) works out to roughly 6.5 × 10^241, a 242-digit number; for comparison, there are only around 10^80 atoms in the observable universe. Either way, far too many subsets to try.)
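A quick way to check these counts exactly is Python's math.comb:

```python
import math

print(math.comb(100, 5))    # 75287520: enumerable in seconds
big = math.comb(10000, 100)
print(len(str(big)))        # 242 digits: hopeless to enumerate
```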
40 Can you see a problem with complete methods?
41 Forward methods
- These methods grow a set S of features.
1. S starts empty.
2. Find the best feature to add (by checking which one gives the best performance on a test set when combined with S).
3. If overall performance has improved, return to step 2; else stop.
(A sketch of this loop follows below.)
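A minimal sketch of greedy forward selection, assuming an evaluate(subset) function (hypothetical here) that returns, say, held-out accuracy of some classifier trained on just those features:

```python
def forward_selection(all_features, evaluate):
    """Grow a feature set greedily while the evaluation score keeps improving.

    all_features: list of feature identifiers (e.g. column indices).
    evaluate(subset) -> score, where higher is better.
    """
    selected = []
    best_score = float("-inf")
    while True:
        remaining = [f for f in all_features if f not in selected]
        if not remaining:
            return selected
        # Step 2: find the best single feature to add to the current set S.
        score, feature = max(
            ((evaluate(selected + [f]), f) for f in remaining),
            key=lambda pair: pair[0],
        )
        # Step 3: keep it only if overall performance improved.
        if score > best_score:
            best_score = score
            selected.append(feature)
        else:
            return selected
```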
42 Backward methods
- These methods remove features one by one.
1. S starts with the full feature set.
2. Find the best feature to remove (by checking which removal from S gives the best performance on a test set).
3. If overall performance has improved, return to step 2; else stop.
43 - When might you choose forward instead of backward?
44 Random(ised) methods, aka stochastic methods
Suppose you have 1,000 features. There are 2^1000 possible subsets of features. One way to try to find a good subset is to run a stochastic search algorithm, e.g. hill-climbing, simulated annealing, a genetic algorithm, particle swarm optimisation, ...
45 One-slide introduction to (most) stochastic search algorithms
A search algorithm:
BEGIN
1. Initialise a random population P of N candidate solutions (maybe just 1) (e.g. each solution is a random subset of features).
2. Evaluate each solution in P (e.g. accuracy of 3-NN using only the features in that solution).
ITERATE
1. Generate a set C of new solutions, using the good ones in P (e.g. choose a good one and mutate it, combine bits of two or more solutions, etc.).
2. Evaluate each of the new solutions in C.
3. Update P, e.g. by choosing the best N from all of P and C.
4. If we have iterated a certain number of times, or accuracy is good enough, stop.
46 (The same outline, annotated: each step is either GENERATE, TEST, or UPDATE; a stochastic search is a generate-test-update loop.)
47 Why randomised / search methods are good for FS
Usually you have a large number of features (e.g. 1,000). You can give each feature a score (e.g. correlation with the target, Relief weight (see end slides), etc.) and choose the best-scoring features. This is very fast. However, it does not evaluate how well features work with other features. You could give combinations of features a score, but there are too many combinations of multiple features. Search algorithms are the only suitable approach that gets to grips with evaluating combinations of features (a sketch follows below).
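As one concrete (and deliberately minimal) instance of the generate-test-update loop from slide 45, here is a stochastic hill-climber over feature subsets. The evaluate function is an assumption: it would wrap whatever classifier and validation split you are using.

```python
import random

def hill_climb_features(n_features, evaluate, iterations=1000, seed=0):
    """Stochastic hill-climbing over subsets of feature indices.

    evaluate(subset) -> score (e.g. held-out accuracy of 3-NN restricted
    to those features); higher is better.
    """
    rng = random.Random(seed)
    # GENERATE an initial random subset (each feature in with probability 0.5).
    current = {f for f in range(n_features) if rng.random() < 0.5}
    current_score = evaluate(current)  # TEST it
    for _ in range(iterations):
        # GENERATE a neighbour by flipping one randomly chosen feature in or out.
        neighbour = set(current)
        neighbour.symmetric_difference_update({rng.randrange(n_features)})
        score = evaluate(neighbour)    # TEST it
        if score >= current_score:     # UPDATE: keep equal-or-better neighbours
            current, current_score = neighbour, score
    return current, current_score
```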
48 CW2
49 CW2
- Involves:
- Some basic dataset processing on the CandC dataset
- Applying a DMML technique called Naïve Bayes (NB; already implemented by me)
- Implementing your own script/code that can work out the correlation (Pearson's r) between any two fields
- Running experiments to compare the results of NB when using the top 5, top 10 and top 20 fields according to correlation with the class field.
50 CW2
- Naïve Bayes
- Described in the next (and last) lecture.
- It only works on discretized data, and predicts the class value of a target field.
- It uses Bayesian probability in a simple way to come up with a best guess for the class value, based on the proportions in exactly the type of histograms you are doing for CW1.
- My NB awk script builds its probability model on the first 80% of the data, and then outputs its average accuracy when applying this model to the remaining 20% of the data.
- It also outputs a confusion matrix.
51 CW2
53 If time: the classic example of an instance-based heuristic method
54 The Relief method
An instance-based, heuristic method: it works out weight values for each feature, based on how important they seem to be in discriminating between near neighbours.
55 The Relief method
There are two features here: the x and the y co-ordinate. Initially they each have zero weight: wx = 0, wy = 0.
56-57 The Relief method
wx = 0, wy = 0. Choose an instance at random; call it R.
58-59 The Relief method
wx = 0, wy = 0. Find H (the 'hit': the nearest instance to R of the same class) and M (the 'miss': the nearest instance to R of a different class).
60 The Relief method
wx = 0, wy = 0. Now we update the weights, based on the distances between R and H and between R and M. This happens one feature at a time.
61 The Relief method
To change wx, we add to it (MR - HR)/n, where MR and HR are the distances from R to the miss M and to the hit H in the x direction. So, the further away the miss is in the x direction, the higher the weight of x: the more important x is in terms of discriminating the classes.
62 The Relief method
To change wy, we add to it (MR - HR)/n again, but this time calculated in the y dimension. Clearly the difference is smaller: differences in this feature don't seem important in terms of class value.
63 The Relief method
Maybe now we have wx = 0.07, wy = 0.002.
64 The Relief method
wx = 0.07, wy = 0.002. Pick another instance at random, and do the same again.
65 The Relief method
wx = 0.07, wy = 0.002. Identify H and M.
66 The Relief method
wx = 0.07, wy = 0.002. Add the MR and HR differences, divided by n, for each feature, again.
67The Relief method
- In the end, we have a weight value for each
feature. The higher the value, the more relevant
the feature. - We can use these weights for feature selection,
simply by choosing the features with the S
highest weights (if we want to use S features) - NOTE
- It is important to use Relief F only on min-max
normalised data in 0,1. However it is fine if
category attibutes are involved, in which case
use Hamming distance for those attributes, - Why divide by n? Then, the weight values can be
interpreted as a difference in probabilities.
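For concreteness, here is a compact sketch of basic Relief as just described, for a two-class problem with numeric features already min-max normalised to [0, 1]. This is an illustration only, not the pseudocode from the original paper.

```python
import random

def relief_weights(X, y, n_samples=100, seed=0):
    """Basic Relief weights for a two-class dataset.

    X: list of instances, each a list of feature values in [0, 1].
    y: list of class labels. Returns one weight per feature; higher = more relevant.
    """
    rng = random.Random(seed)
    n_features = len(X[0])
    w = [0.0] * n_features

    def nearest(r, same_class):
        """Index of the nearest other instance that is / is not in R's class."""
        best, best_d = None, float("inf")
        for i in range(len(X)):
            if i == r or (y[i] == y[r]) != same_class:
                continue
            d = sum((a - b) ** 2 for a, b in zip(X[i], X[r]))
            if d < best_d:
                best, best_d = i, d
        return best

    for _ in range(n_samples):
        r = rng.randrange(len(X))          # choose an instance R at random
        h = nearest(r, same_class=True)    # nearest hit H
        m = nearest(r, same_class=False)   # nearest miss M
        for f in range(n_features):
            # Further from the miss and closer to the hit => larger weight.
            w[f] += (abs(X[m][f] - X[r][f]) - abs(X[h][f] - X[r][f])) / n_samples
    return w
```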
68 The Relief method, plucked directly from the original paper (Kira and Rendell, 1992)
69 Some recommended reading, if you are interested, is on the website.