Transcript and Presenter's Notes

Title: Machine Learning


1
Machine Learning
  • Reading: Chapters 18 and 20

2
Agenda
  • Naïve Bayes
  • Feature Selection
  • An example problem
  • Census Data
  • Weka Tutorial
  • The Homework

3
Classification Task
  • Input instance
  • Tuple of attribute values ⟨a1, a2, ..., an⟩
  • Output class
  • Any value hi from finite set H
  • Given a training set of instances, predict class
    of new instance

4
Examples
  • H = {setosa, virginica, versicolour}
  • Input instance
  • ⟨s-length=7, p-width=3, p-length=4⟩
  • Instance representation: ⟨7, 3, 4⟩
  • Data representation
  • 7 3 4
  • 10 2 1
  • 5 6 2
  • H = {< 50, between 50 and 100, > 100}
  • Input instance
  • ⟨author, abstract, title, journal⟩
  • Instance representation: ⟨Smith, John; We
    showed that ...; Brane New World; HEP1⟩
  • Note the difference between strings and categorical values
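
Purely for illustration, the two kinds of instances above might be held in memory as follows (Python; the class labels attached to the numeric rows are invented here, since the slide does not pair rows with classes):

    # Numeric instances: each row is <s-length, p-width, p-length> plus an (invented) class label.
    iris_data = [
        ((7, 3, 4), "setosa"),
        ((10, 2, 1), "virginica"),
        ((5, 6, 2), "versicolour"),
    ]

    # Text instance: strings (author, abstract, title) mixed with a categorical value (journal).
    article = {
        "author": "Smith, John",
        "abstract": "We showed that ...",
        "title": "Brane New World",
        "journal": "HEP1",
    }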

5
Bayesian Approach
  • Each observed training example can incrementally
    decrease or increase the probability of a hypothesis,
    instead of eliminating the hypothesis outright
  • Prior knowledge can be combined with observed
    data to determine the hypothesis
  • Bayesian methods can accommodate hypotheses that
    make probabilistic predictions
  • New instances can be classified by combining the
    predictions of multiple hypotheses, weighted by
    their probabilities

6
Applying Bayes Theorem
  • Best hypothesis = most probable hypothesis
  • Maximum a posteriori (MAP) hypothesis
  • Variables
  • h = hypothesis
  • D = data
  • Prior probability
  • of hypothesis h: P(h)
  • of observing the training data D: P(D)
  • P(D|h) = probability of observing data D given
    some world where hypothesis h holds
  • Bayes theorem
  • P(h|D) = P(D|h) P(h) / P(D)

7
Defining the MAP hypothesis
  • h_MAP = argmax_{h ∈ H} P(h|D)
  • h_MAP = argmax_{h ∈ H} P(D|h) P(h) / P(D)
    (using Bayes theorem)
  • h_MAP = argmax_{h ∈ H} P(D|h) P(h)
    (P(D) is a constant independent of h)
  • h_MAP = argmax_{h ∈ H} P(D|h)
    (when we can make the assumption that each
    hypothesis h is equally probable)
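
A rough sketch of this computation in Python (the priors and likelihoods below are invented numbers, not taken from the slides):

    # Invented priors P(h) and likelihoods P(D|h) for three hypotheses.
    prior = {"h1": 0.5, "h2": 0.3, "h3": 0.2}
    likelihood = {"h1": 0.01, "h2": 0.10, "h3": 0.05}   # P(D|h)

    # h_MAP = argmax over h of P(D|h) * P(h); P(D) is the same for every h, so it is dropped.
    h_map = max(prior, key=lambda h: likelihood[h] * prior[h])
    print(h_map)   # h2 (0.10 * 0.3 = 0.03 beats 0.005 and 0.01)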

8
Bayes Optimal Classifier
  • The most probable classification of the new
    instance is obtained by combining the predictions
    of all hypotheses, weighted by their posterior
    probabilities
  • Possible classifications: v_j ∈ V
  • argmax_{v_j ∈ V} Σ_{h_i ∈ H} P(v_j|h_i) P(h_i|D)

9
Example
  • V = {p, n}
  • P(h1|D) = .4   P(p|h1) = 0   P(n|h1) = 1
  • P(h2|D) = .3   P(p|h2) = 1   P(n|h2) = 0
  • P(h3|D) = .3   P(p|h3) = 1   P(n|h3) = 0
  • Σ_{h_i ∈ H} P(n|h_i) P(h_i|D) = .4
  • Σ_{h_i ∈ H} P(p|h_i) P(h_i|D) = .6
  • argmax_{v_j ∈ {p,n}} Σ_{h_i ∈ H} P(v_j|h_i) P(h_i|D) = p
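
The same computation, spelled out as a short Python sketch using the numbers above:

    # Posteriors P(h_i|D) and per-hypothesis predictions P(v_j|h_i) from the example.
    posterior = {"h1": 0.4, "h2": 0.3, "h3": 0.3}
    p_class_given_h = {
        "h1": {"p": 0.0, "n": 1.0},
        "h2": {"p": 1.0, "n": 0.0},
        "h3": {"p": 1.0, "n": 0.0},
    }

    # Bayes optimal classification: weight each hypothesis's vote by its posterior.
    def vote(v):
        return sum(p_class_given_h[h][v] * posterior[h] for h in posterior)

    print(vote("n"), vote("p"))        # 0.4 0.6
    print(max(["p", "n"], key=vote))   # p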

10
Properties of Bayesian Approach
  • Bayesian learning is optimal
  • Easy to estimate P(h) by counting in training
    data
  • Estimating P(D|h) is not feasible
  • Why?

11
P(D|h)
12
Naïve Bayes
  • Assume independence of attributes
  • D = ⟨a1, a2, ..., an⟩
  • P(a1, a2, ..., an | vj) = Π_i P(ai|vj)
  • Substitute into the v_MAP formula:
  • v_NB = argmax_{vj ∈ V} P(vj) Π_i P(ai|vj)

13
v_NB = argmax_{vj ∈ V} P(vj) Π_i P(ai|vj)
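
A minimal counting-based sketch of this rule in Python (the training data and attribute values are invented for illustration; no smoothing yet, which is exactly the issue the next slides address):

    from collections import Counter, defaultdict

    # Toy training set: each instance is (attribute tuple, class).
    train = [
        (("high", "wide"), "virginica"),
        (("high", "narrow"), "virginica"),
        (("low", "narrow"), "setosa"),
        (("low", "wide"), "setosa"),
    ]

    class_counts = Counter(c for _, c in train)
    # attr_counts[class][attribute index][value] = count
    attr_counts = defaultdict(lambda: defaultdict(Counter))
    for attrs, c in train:
        for i, a in enumerate(attrs):
            attr_counts[c][i][a] += 1

    def predict(attrs):
        # v_NB = argmax_v P(v) * prod_i P(a_i|v), with both factors estimated by counting.
        def score(v):
            p = class_counts[v] / len(train)
            for i, a in enumerate(attrs):
                p *= attr_counts[v][i][a] / class_counts[v]
            return p
        return max(class_counts, key=score)

    print(predict(("high", "wide")))   # virginica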
14
Estimating Probabilities
  • What happens when the number of data elements is
    small?
  • Suppose the true P(S-length=high | virginica) = .05
  • There are only 2 instances with C = virginica
  • We estimate the probability by n_c/n, i.e.,
    #(S-length=high ∧ C=virginica) / #(C=virginica)
  • #(S-length=high ∧ C=virginica) will likely be 0
  • Then, instead of .05, we use an estimated probability
    of 0
  • Two problems
  • Biased underestimate of the probability
  • This probability term will dominate (a single zero
    factor makes the whole product zero)

15
Instead
  • Use priors as well
  • Estimate the probability as (n_c + m·p) / (n + m)
  • where p = prior estimate of the probability
  • m is a constant called the equivalent sample size
  • m determines how heavily to weight p relative to the
    observed data
  • Typical method: assume a uniform prior
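
A short sketch of this estimate in Python (the counts and the choice of a uniform prior over three possible values are illustrative):

    def m_estimate(n_c, n, p, m):
        # (n_c + m*p) / (n + m): blends the observed fraction n_c/n with the prior p.
        return (n_c + m * p) / (n + m)

    # Only 2 virginica instances, none with S-length = high; uniform prior over, say, 3 values.
    print(m_estimate(n_c=0, n=2, p=1/3, m=3))   # 0.2 rather than 0.0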

16
Benefits of Naïve Bayes
  • Practical
  • As effective as, and in some cases more effective
    than, other machine learners

17
Feature Selection
  • Can be used with any machine learning algorithm
  • Experimentally determine which attribute(s) helps
    learning most
  • Just as in the KDD Cup example
  • Feature selection is required on your homework

18
Algorithm for feature selection
  • Forward selection
  • Incrementally add one attribute at a time
  • Train on the training data (1/3 of the data)
  • Test on the validation data (another 1/3 of the data)
    whether the current set of attributes gives an
    improvement
  • When done, test the full model on the held-out
    data (the remaining 1/3)
  • Backwards elimination

19
Forward Feature Selection
  • Start with no features and greedily add the
    feature that most improves performance
  • Given a set of n attributes, measure performance
    with each attribute alone
  • Note that this requires n separate learning
    models
  • Choose the best
  • Now pair the best with every remaining attribute
    and measure performance
  • Choose the best pair
  • Repeat with triples, quadruples, etc. until no
    improvement
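
A sketch of this greedy loop in Python; evaluate is a placeholder for "train on the training third with these attributes and return accuracy on the validation third":

    def forward_select(all_features, evaluate):
        """Greedily add the single feature that most improves validation performance."""
        selected, best_score = [], float("-inf")
        while True:
            candidates = [f for f in all_features if f not in selected]
            if not candidates:
                return selected
            # One model per candidate feature added to the current set (n models on the first pass).
            scored = [(evaluate(selected + [f]), f) for f in candidates]
            score, feature = max(scored)
            if score <= best_score:        # stop when no candidate improves performance
                return selected
            best_score = score
            selected.append(feature)

Given a suitable evaluate, something like forward_select(["author", "abstract", "in-degree", "out-degree"], evaluate) would return the greedily chosen attribute subset.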

20
Example: Systematic exploration of attribute
combinations
  • Journal article downloads
  • Started with rock bottom: downloads only
  • Tried abstract alone, then author alone
  • Abstract and author together yielded better
    results
  • Then added in-degree and out-degree

21
Feature selection
  • Can be done within WEKA
  • Can be done outside of WEKA
  • By creating different data sets