Transcript and Presenter's Notes

Title: Data Mining (and machine learning) - DM Lecture 7: Feature Selection
by David Corne and Nick Taylor, Heriot-Watt University (dwcorne@gmail.com)


1
Data Mining (and machine learning)
  • DM Lecture 7: Feature Selection

2
Today
  • Finishing correlation/regression
  • Feature Selection
  • Coursework 2

3
Remember how to calculate r
If we have n pairs of (x,y) values, Pearson's r is

  r = sum_i (x_i - mean(x)) * (y_i - mean(y))
      / sqrt( sum_i (x_i - mean(x))^2 * sum_i (y_i - mean(y))^2 )

Interpretation of this should be obvious (?)
4
Equivalently, you can do it like this
Looking at it another way: after z-normalisation, X is the z-normalised x value
in the sample, indicating how many standard deviations it is away from the mean.
Same for Y. The formula for r on the last slide is then equivalent to

  r = (1/n) * sum_i X_i * Y_i

(taking the standard deviations as population standard deviations).

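To make the equivalence concrete, here is a minimal sketch in plain Python of both versions (the example numbers at the end are made up):

```python
import math

def pearson_r(xs, ys):
    """Pearson's r from raw (x, y) pairs: covariance over the product of the stds."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    ssx = sum((x - mx) ** 2 for x in xs)
    ssy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(ssx * ssy)

def pearson_r_znorm(xs, ys):
    """Equivalent form: z-normalise both variables, then average the products X_i * Y_i."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sdx = math.sqrt(sum((x - mx) ** 2 for x in xs) / n)  # population std
    sdy = math.sqrt(sum((y - my) ** 2 for y in ys) / n)
    X = [(x - mx) / sdx for x in xs]
    Y = [(y - my) / sdy for y in ys]
    return sum(a * b for a, b in zip(X, Y)) / n

xs = [1.0, 2.0, 3.0, 4.0, 5.0]          # made-up example values
ys = [2.1, 3.9, 6.2, 8.1, 9.8]
print(pearson_r(xs, ys), pearson_r_znorm(xs, ys))   # both print the same value
```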
5
The names file in the Communities and Crime (C&C) dataset has correlation values (with the class/target) for each field
min max mean std correlation median mode
Population 0 1 0.06 0.13 0.37 0.02 0.01
Householdsize 0 1 0.46 0.16 -0.03 0.44 0.41
Racepctblack 0 1 0.18 0.25 0.63 0.06 0.01
racePctWhite 0 1 0.75 0.24 -0.68 0.85 0.98
racePctAsian 0 1 0.15 0.21 0.04 0.07 0.02
racePctHisp 0 1 0.14 0.23 0.29 0.04 0.01
agePct12t21 0 1 0.42 0.16 0.06 0.4 0.38
agePct12t29 0 1 0.49 0.14 0.15 0.48 0.49
agePct16t24 0 1 0.34 0.17 0.1 0.29 0.29
agePct65up 0 1 0.42 0.18 0.07 0.42 0.47
numbUrban 0 1 0.06 0.13 0.36 0.03 0
pctUrban 0 1 0.7 0.44 0.08 1 1
medIncome 0 1 0.36 0.21 -0.42 0.32 0.23
pctWWage 0 1 0.56 0.18 -0.31 0.56 0.58
pctWFarmSelf 0 1 0.29 0.2 -0.15 0.23 0.16
pctWInvInc 0 1 0.5 0.18 -0.58 0.48 0.41
pctWSocSec 0 1 0.47 0.17 0.12 0.475 0.56
pctWPubAsst 0 1 0.32 0.22 0.57 0.26 0.1
pctWRetire 0 1 0.48 0.17 -0.1 0.47 0.44
medFamInc 0 1 0.38 0.2 -0.44 0.33 0.25
perCapInc 0 1 0.35 0.19 -0.35 0.3 0.23
whitePerCap 0 1 0.37 0.19 -0.21 0.32 0.3
blackPerCap 0 1 0.29 0.17 -0.28 0.25 0.18
indianPerCap 0 1 0.2 0.16 -0.09 0.17 0
AsianPerCap 0 1 0.32 0.2 -0.16 0.28 0.18
OtherPerCap 0 1 0.28 0.19 -0.13 0.25 0
HispPerCap 0 1 0.39 0.18 -0.24 0.345 0.3
NumUnderPov 0 1 0.06 0.13 0.45 0.02 0.01
PctPopUnderPov 0 1 0.3 0.23 0.52 0.25 0.08
PctLess9thGrade 0 1 0.32 0.21 0.41 0.27 0.19
6
here
(same table as on the previous slide)
7
Here are the top 20 fields ranked by the absolute value of their correlation with the class (although the first doesn't count, since it is the target field itself) - this hints at how we might use correlation for feature selection
min max mean std correlation median mode missing
ViolentCrimesPerPop 0 1 0.24 0.23 1 0.15 0.03 0
PctIlleg 0 1 0.25 0.23 0.74 0.17 0.09 0
PctKids2Par 0 1 0.62 0.21 -0.74 0.64 0.72 0
PctFam2Par 0 1 0.61 0.2 -0.71 0.63 0.7 0
racePctWhite 0 1 0.75 0.24 -0.68 0.85 0.98 0
PctYoungKids2Par 0 1 0.66 0.22 -0.67 0.7 0.91 0
PctTeen2Par 0 1 0.58 0.19 -0.66 0.61 0.6 0
racepctblack 0 1 0.18 0.25 0.63 0.06 0.01 0
pctWInvInc 0 1 0.5 0.18 -0.58 0.48 0.41 0
pctWPubAsst 0 1 0.32 0.22 0.57 0.26 0.1 0
FemalePctDiv 0 1 0.49 0.18 0.56 0.5 0.54 0
TotalPctDiv 0 1 0.49 0.18 0.55 0.5 0.57 0
PctPolicBlack 0 1 0.22 0.24 0.54 0.12 0 1675
MalePctDivorce 0 1 0.46 0.18 0.53 0.47 0.56 0
PctPersOwnOccup 0 1 0.56 0.2 -0.53 0.56 0.54 0
PctPopUnderPov 0 1 0.3 0.23 0.52 0.25 0.08 0
PctUnemployed 0 1 0.36 0.2 0.5 0.32 0.24 0
PctHousNoPhone 0 1 0.26 0.24 0.49 0.185 0.01 0
PctPolicMinor 0 1 0.26 0.23 0.49 0.2 0.07 1675
PctNotHSGrad 0 1 0.38 0.2 0.48 0.36 0.39 0
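As a sketch of the kind of filter this hints at (assuming, hypothetically, that the data has been loaded into a numeric NumPy array with the class as the last column), ranking by absolute correlation and keeping the top k might look like this:

```python
import numpy as np

def top_k_by_correlation(data, k):
    """Rank features by |Pearson r| with the class and keep the top k.

    data: 2-D numpy array, one row per instance, class/target in the last column.
    Returns the column indices of the k best-correlated features.
    """
    X, y = data[:, :-1], data[:, -1]
    scores = [abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(X.shape[1])]
    ranked = sorted(range(X.shape[1]), key=lambda j: scores[j], reverse=True)
    return ranked[:k]

# e.g. keep the 20 fields that correlate best with the class:
# best20 = top_k_by_correlation(data, 20)
```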
8
Can anyone see a potential problem with choosing only (for example) the 20 features that correlate best with the target class?

9
Feature Selection: What
You have some data, and you want to use it to
build a classifier, so that you can predict
something (e.g. likelihood of cancer)
10
Feature Selection: What
You have some data, and you want to use it to
build a classifier, so that you can predict
something (e.g. likelihood of cancer)
The data has 10,000 fields (features)
11
Feature Selection: What
You have some data, and you want to use it to
build a classifier, so that you can predict
something (e.g. likelihood of cancer)
The data has 10,000 fields (features)
you need to cut it down to 1,000 fields
before you try machine learning. Which 1,000?
12
Feature Selection: What
You have some data, and you want to use it to
build a classifier, so that you can predict
something (e.g. likelihood of cancer)
The data has 10,000 fields (features)
you need to cut it down to 1,000 fields
before you try machine learning. Which 1,000?
The process of choosing the 1,000 fields to use
is called Feature Selection
13
Datasets with many features
  • Gene expression datasets (10,000 features)
  • http://www.ncbi.nlm.nih.gov/sites/entrez?db=gds
  • Proteomics data (20,000 features)
  • http://www.ebi.ac.uk/pride/

14
Feature Selection: Why?
15
Feature Selection: Why?
16
Feature Selection: Why?
From http://elpub.scix.net/data/works/att/02-28.content.pdf
17
  • It is quite easy to find lots more cases in papers where experiments show that
    accuracy gets worse when you use more features

18
  • Why does accuracy reduce with more features?
  • How does it depend on the specific choice of
    features?
  • What else changes if we use more features?
  • So, how do we choose the right features?

19
Why accuracy reduces
  • Note: suppose the best feature set has 20 features. If you add another 5
    features, the accuracy of machine learning will typically get worse. But you
    still have the original 20 features!! Why does this happen?

20
Noise / Explosion
  • The additional features typically add noise. Machine learning will pick up on
    spurious correlations that might hold in the training set but not in the test set.
  • For some ML methods, more features means more parameters to learn (more NN
    weights, more decision tree nodes, etc.), and the increased space of
    possibilities is more difficult to search.

21
Feature selection methods
22
Feature selection methods
  • A big research area!
  • This diagram is from (Dash & Liu, 1997)
  • We'll look briefly at parts of it

23
Feature selection methods
24
Feature selection methods
25
Correlation-based feature ranking
  • This is what you will use in CW 2.
  • It is indeed used often by practitioners (who perhaps don't understand the
    issues involved in FS)
  • It is actually fine for certain datasets.
  • It is not even considered in Dash & Liu's survey.

26
A made-up dataset
f1 f2 f3 f4 class
0.4 0.6 0.4 0.6 1
0.2 0.4 1.6 -0.6 1
0.5 0.7 1.8 -0.8 1
0.7 0.8 0.2 0.9 2
0.9 0.8 1.8 -0.7 2
0.5 0.5 0.6 0.5 2
27
Correlated with the class: f1 and f2
f1 f2 f3 f4 class
0.4 0.6 0.4 0.6 1
0.2 0.4 1.6 -0.6 1
0.5 0.7 1.8 -0.8 1
0.7 0.8 0.2 0.9 2
0.9 0.8 1.8 -0.7 2
0.5 0.5 0.6 0.5 2
28
Uncorrelated with the class / seemingly random: f3 and f4
f1 f2 f3 f4 class
0.4 0.6 0.4 0.6 1
0.2 0.4 1.6 -0.6 1
0.5 0.7 1.8 -0.8 1
0.7 0.8 0.2 0.9 2
0.9 0.8 1.8 -0.7 2
0.5 0.5 0.6 0.5 2
29
Correlation-based FS reduces the dataset to this:
f1 f2 class
0.4 0.6 1
0.2 0.4 1
0.5 0.7 1
0.7 0.8 2
0.9 0.8 2
0.5 0.5 2
30
But column 5 shows us f3 + f4, which is perfectly correlated with the class!
f1 f2 f3 f4 f3+f4 class
0.4 0.6 0.4 0.6 1 1
0.2 0.4 1.6 -0.6 1 1
0.5 0.7 1.8 -0.8 1 1
0.7 0.8 0.2 0.9 1.1 2
0.9 0.8 1.8 -0.7 1.1 2
0.5 0.5 0.6 0.5 1.1 2
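To check the point concretely, a quick NumPy calculation on the numbers from the table above (the exact r values are not important, only the comparison):

```python
import numpy as np

f3  = [0.4, 1.6, 1.8, 0.2, 1.8, 0.6]
f4  = [0.6, -0.6, -0.8, 0.9, -0.7, 0.5]
cls = [1, 1, 1, 2, 2, 2]

def r(a, b):
    return np.corrcoef(a, b)[0, 1]

print(r(f3, cls), r(f4, cls))        # individually, weak correlation with the class
print(r(np.add(f3, f4), cls))        # the sum f3 + f4 gives r = 1 (perfect)
```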
31
Good FS methods therefore
  • Need to consider how well features work together
  • As we have noted before, if you take 100 features that are each well
    correlated with the class, they may simply be strongly correlated with each
    other, and so provide no more information than just one of them

32
Complete methods
  • Original dataset has N features
  • You want to use a subset of k features
  • A complete FS method means try every subset of
    k features, and choose the best!
  • the number of subsets is N! / (k! (N-k)!)
  • what is this when N is 100 and k is 5?

33
Complete methods
  • Original dataset has N features
  • You want to use a subset of k features
  • A complete FS method means try every subset of
    k features, and choose the best!
  • the number of subsets is N! / (k! (N-k)!)
  • what is this when N is 100 and k is 5?
  • 75,287,520 -- almost nothing

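The count is just the binomial coefficient C(N, k) = N! / (k! (N-k)!); Python's math.comb (available from Python 3.8) confirms the figure on the slide:

```python
import math

# number of distinct k-feature subsets of N features: N! / (k! (N-k)!)
print(math.comb(100, 5))   # 75287520 -- "almost nothing"
```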
34
Complete methods
  • Original dataset has N features
  • You want to use a subset of k features
  • A complete FS method means try every subset of
    k features, and choose the best!
  • the number of subsets is N! / (k! (N-k)!)
  • what is this when N is 10,000 and k is 100?

35
Complete methods
  • Original dataset has N features
  • You want to use a subset of k features
  • A complete FS method means try every subset of
    k features, and choose the best!
  • the number of subsets is N! / (k! (N-k)!)
  • what is this when N is 10,000 and k is 100?
  • 5,000,000,000,000,000,000,000,000,000,

36
  • 000,000,000,000,000,000,000,000,000,000,
  • 000,000,000,000,000,000,000,000,000,000,
  • 000,000,000,000,000,000,000,000,000,000,
  • 000,000,000,000,000,000,000,000,000,000,
  • 000,000,000,000,000,000,000,000,000,000,
  • 000,000,000,000,000,000,000,000,000,000,
  • 000,000,000,000,000,000,000,000,000,000,
  • 000,000,000,000,000,000,000,000,000,000,
  • 000,000,000,000,000,000,000,000,000,000,
  • 000,000,000,000,000,000,000,000,000,000,

37
  • 000,000,000,000,000,000,000,000,000,000,
  • 000,000,000,000,000,000,000,000,000,000,
  • 000,000,000,000,000,000,000,000,000,000,
  • 000,000,000,000,000,000,000,000,000,000,
  • 000,000,000,000,000,000,000,000,000,000,
  • 000,000,000,000,000,000,000,000,000,000,
  • 000,000,000,000,000,000,000,000,000,000,
  • 000,000,000,000,000,000,000,000,000,000,
  • 000,000,000,000,000,000,000,000,000,000,
  • 000,000,000,000,000,000,000,000,000,000,

38
  • 000,000,000,000,000,000,000,000,000,000,
  • 000,000,000,000,000,000,000,000,000,000,
  • 000,000,000,000,000,000,000,000,000,000,
  • 000,000,000,000,000,000,000,000,000,000,
  • 000,000,000,000,000,000,000,000,000,000,
  • 000,000,000,000,000,000,000,000,000,000,
  • 000,000,000,000,000,000,000,000,000,000,
  • 000,000,000,000,000,000,000,000,000,000,
  • 000,000,000,000,000,000,000,000,000,000,
  • 000,000,000,000,000,000,000,000,000,000,

39
  • continued for another 114 slides.
  • Actually it is around 5 × 10^35,101
  • (there are around 10^80 atoms in the universe)

40
Can you see a problem with complete methods?
41
Forward methods
  • These methods grow a set S of features
  • S starts empty
  • Find the best feature to add (by checking which
    one gives best performance on a test set when
    combined with S).
  • If overall performance has improved, return to
    step 2 else stop

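A minimal sketch of this greedy wrapper loop, assuming scikit-learn is available and using cross-validated 3-NN accuracy as the "performance on a test set" (any classifier and evaluation scheme could be swapped in):

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

def score(X, y, features):
    """Mean cross-validated accuracy of 3-NN using only the given feature columns."""
    clf = KNeighborsClassifier(n_neighbors=3)
    return cross_val_score(clf, X[:, features], y, cv=5).mean()

def forward_selection(X, y):
    selected, best_so_far = [], float("-inf")   # S starts empty
    remaining = list(range(X.shape[1]))
    while remaining:
        # find the best single feature to add to the current set S
        trials = {f: score(X, y, selected + [f]) for f in remaining}
        best_f = max(trials, key=trials.get)
        if trials[best_f] <= best_so_far:       # no improvement: stop
            break
        selected.append(best_f)                 # otherwise keep it and repeat
        remaining.remove(best_f)
        best_so_far = trials[best_f]
    return selected
```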
42
Backward methods
  • These methods remove features one by one.
  • S starts with the full feature set
  • Find the best feature to remove (by checking
    which removal from S gives best performance on a
    test set).
  • If overall performance has improved, return to
    step 2 else stop

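The backward version, under the same assumptions, simply starts from the full feature set and greedily removes features instead:

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

def score(X, y, features):
    clf = KNeighborsClassifier(n_neighbors=3)
    return cross_val_score(clf, X[:, features], y, cv=5).mean()

def backward_elimination(X, y):
    selected = list(range(X.shape[1]))      # S starts with the full feature set
    best_so_far = score(X, y, selected)
    while len(selected) > 1:
        # find the feature whose removal gives the best performance
        trials = {f: score(X, y, [g for g in selected if g != f]) for f in selected}
        best_f = max(trials, key=trials.get)
        if trials[best_f] <= best_so_far:   # removing anything makes it worse: stop
            break
        selected.remove(best_f)
        best_so_far = trials[best_f]
    return selected
```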
43
  • When might you choose forward instead of
    backward?

44
Random(ised) methods, aka stochastic methods
Suppose you have 1,000 features. There are 2^1000 possible subsets of features.
One way to try to find a good subset is to run a stochastic search algorithm,
e.g. hill-climbing, simulated annealing, a genetic algorithm, particle swarm
optimisation, ...
45
One slide introduction to (most) stochastic
search algorithms
A search algorithm:
BEGIN
  1. Initialise a random population P of N candidate solutions (maybe just 1),
     e.g. each solution is a random subset of features.
  2. Evaluate each solution in P, e.g. the accuracy of 3-NN using only the
     features in that solution.
ITERATE
  1. Generate a set C of new solutions, using the good ones in P (e.g. choose a
     good one and mutate it, combine bits of two or more solutions, etc.).
  2. Evaluate each of the new solutions in C.
  3. Update P, e.g. by choosing the best N from all of P and C.
  4. If we have iterated a certain number of times, or accuracy is good enough, stop.
46
One slide introduction to (most) stochastic
search algorithms
(This is the same algorithm as on the previous slide, with the phases labelled:
the steps that create new solutions are GENERATE, the steps that evaluate them
are TEST, and choosing the next population P is UPDATE.)
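A minimal sketch of the simplest such search: a mutation-only hill-climber over bit-mask subsets (population size 1). A genetic algorithm, simulated annealing, etc. differ only in the generate and update steps; the evaluate function here is a stand-in for "accuracy of 3-NN using only the selected features":

```python
import random

def stochastic_feature_search(n_features, evaluate, iterations=1000, p_include=0.1):
    """Hill-climb over feature subsets encoded as boolean masks.

    evaluate(mask) should return the accuracy of the learner using only the
    features where mask[j] is True.
    """
    # BEGIN: initialise one random candidate solution and evaluate it
    current = [random.random() < p_include for _ in range(n_features)]
    if not any(current):                       # make sure at least one feature is selected
        current[random.randrange(n_features)] = True
    current_score = evaluate(current)
    for _ in range(iterations):
        # GENERATE: mutate the good solution we have (flip a few feature bits)
        candidate = current[:]
        for j in random.sample(range(n_features), k=3):
            candidate[j] = not candidate[j]
        if not any(candidate):
            continue
        # TEST: evaluate the new solution
        candidate_score = evaluate(candidate)
        # UPDATE: keep the better of the two solutions
        if candidate_score >= current_score:
            current, current_score = candidate, candidate_score
    return current, current_score
```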
47
Why randomised/search methods are good for FS
Usually you have a large number of features (e.g. 1,000). You can give each
feature a score (e.g. correlation with the target, Relief weight (see the end
slides), etc.) and choose the best-scoring features. This is very fast, but it
does not evaluate how well features work with other features. You could give
combinations of features a score, but there are far too many combinations of
multiple features to score them all. Search algorithms are the only practical
approach that gets to grips with evaluating combinations of features.
48
CW2
49
CW2
  • Involves
  • Some basic dataset processing on the Communities and Crime (CandC) dataset
  • Applying a DMML technique called Naïve Bayes (NB, already implemented by me)
  • Implementing your own script/code that can work out the correlation
    (Pearson's r) between any two fields
  • Running experiments to compare the results of NB when using the top 5, top 10
    and top 20 fields, according to correlation with the class field.

50
CW2
  • Naïve Bayes
  • Described in the next lecture.
  • It only works on discretised data, and predicts the class value of a target field.
  • It uses Bayesian probability in a simple way to come up with a best guess for
    the class value, based on the proportions in exactly the type of histograms
    you are doing for CW1.
  • My NB awk script builds its probability model on the first 80% of the data,
    and then outputs its average accuracy when applying this model to the
    remaining 20% of the data.
  • It also outputs a confusion matrix.

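The awk script itself is provided for the coursework; purely to illustrate the idea it implements, here is a rough Python sketch of Naïve Bayes on discretised data with the same 80/20 split and a confusion matrix (the add-one smoothing is my own simplification, not necessarily what the script does):

```python
from collections import Counter, defaultdict

def train_nb(rows):
    """rows: list of tuples of discretised feature values, with the class label last."""
    class_counts = Counter(r[-1] for r in rows)
    value_counts = defaultdict(int)            # (feature index, value, class) -> count
    for r in rows:
        for j, v in enumerate(r[:-1]):
            value_counts[(j, v, r[-1])] += 1
    return class_counts, value_counts

def predict_nb(model, features):
    class_counts, value_counts = model
    total = sum(class_counts.values())
    best_c, best_p = None, -1.0
    for c, cc in class_counts.items():
        p = cc / total                                      # prior P(class)
        for j, v in enumerate(features):
            p *= (value_counts[(j, v, c)] + 1) / (cc + 1)   # P(value | class), crudely smoothed
        if p > best_p:
            best_c, best_p = c, p
    return best_c

def evaluate_nb(rows):
    split = int(0.8 * len(rows))               # model built on the first 80% of the data
    model = train_nb(rows[:split])
    confusion = Counter()                      # (actual, predicted) -> count
    test = rows[split:]
    for r in test:
        confusion[(r[-1], predict_nb(model, r[:-1]))] += 1
    accuracy = sum(n for (a, p), n in confusion.items() if a == p) / len(test)
    return accuracy, confusion
```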
51
CW2
52
(No Transcript)
53
If time: the classic example of an instance-based heuristic method
54
The Relief method
An instance-based, heuristic method: it works out a weight value for each
feature, based on how important that feature seems to be in discriminating
between near neighbours.
55
The Relief method
There are two features here: the x and the y co-ordinates. Initially they each
have zero weight: wx = 0, wy = 0.
56
The Relief method
wx = 0, wy = 0. Choose an instance at random.
57
The Relief method
wx = 0, wy = 0. Choose an instance at random; call it R.
58
The Relief method
wx = 0, wy = 0. Find H (the "hit": the nearest instance to R of the same class)
and M (the "miss": the nearest instance to R of a different class).
59
The Relief method
wx = 0, wy = 0. Find H (the "hit": the nearest instance to R of the same class)
and M (the "miss": the nearest instance to R of a different class).
(figure: scatter plot with R, H and M marked)
60
The Relief method
wx = 0, wy = 0. Now we update the weights, based on the distances between R and H
and between R and M. This happens one feature at a time.
61
The Relief method
To change wx, we add to it (MR - HR)/n, where MR and HR are the distances from R
to M and from R to H measured along the x axis. So the further away the miss is
in the x direction, the higher the weight of x: the more important x is in terms
of discriminating the classes.
(figure: the x-axis distances MR and HR marked on the plot)
62
The Relief method
To change wy, we add to it (MR - HR)/n again, but this time the distances are
measured in the y dimension. Clearly the difference is smaller here: differences
in this feature don't seem important in terms of class value.
63
The Relief method
Maybe now we have wx = 0.07, wy = 0.002.
64
The Relief method
wx = 0.07, wy = 0.002. Pick another instance at random, and do the same again.
65
The Relief method
wx = 0.07, wy = 0.002. Identify H and M.
66
The Relief method
wx = 0.07, wy = 0.002. For each feature, again add the (MR - HR) difference
divided by n.
67
The Relief method
  • In the end, we have a weight value for each feature. The higher the value,
    the more relevant the feature.
  • We can use these weights for feature selection, simply by choosing the
    features with the S highest weights (if we want to use S features).
  • NOTE:
  • It is important to use ReliefF only on min-max normalised data in [0,1].
    However, it is fine if categorical attributes are involved; in that case use
    Hamming distance for those attributes.
  • Why divide by n? Then the weight values can be interpreted as a difference
    in probabilities.

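A minimal sketch of the basic two-class Relief update described above, assuming the data is already min-max normalised to [0,1] (ReliefF proper also handles several classes, k nearest hits/misses and missing values):

```python
import random
import numpy as np

def relief_weights(X, y, n_samples=100):
    """Basic Relief. X: (instances x features), min-max scaled to [0,1]; y: two class labels."""
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(n_samples):
        i = random.randrange(n)                        # pick an instance R at random
        R, c = X[i], y[i]
        dists = np.abs(X - R).sum(axis=1)              # distance from R to every instance
        dists[i] = np.inf                              # ignore R itself
        hits = np.where(y == c)[0]
        misses = np.where(y != c)[0]
        H = X[hits[np.argmin(dists[hits])]]            # nearest hit
        M = X[misses[np.argmin(dists[misses])]]        # nearest miss
        # per-feature update: reward features on which the miss is far and the hit is close
        w += (np.abs(M - R) - np.abs(H - R)) / n_samples
    return w

# features with the highest weights are the most relevant:
# best = np.argsort(relief_weights(X, y))[::-1][:S]
```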
68
The Relief method, plucked directly from the
original paper (Kira and Rendell 1992)
69
Some recommended reading, if you are interested,
is on the website