Transcript and Presenter's Notes

Title: Machine Learning


1
Machine Learning
  • Reading: Chapters 18 and 20

2
Agenda
  • Naïve Bayes
  • Feature Selection
  • An example problem
  • Census Data
  • Weka Tutorial
  • The Homework

3
Classification Task
  • Input instance
  • Tuple of attribute values ⟨a1, a2, ..., an⟩
  • Output class
  • Any value hi from finite set H
  • Given a training set of instances, predict class
    of new instance

4
Examples
  • H = {setosa, virginica, versicolour}
  • Input instance
  • ⟨s-length=7, p-width=3, p-length=4⟩
  • Instance representation: ⟨7, 3, 4⟩
  • Data representation
  • 7 3 4
  • 10 2 1
  • 5 6 2
  • H = {< 50, between 50 and 100, > 100}
  • Input instance
  • ⟨author, abstract, title, journal⟩
  • Instance representation: ⟨Smith, John; We
    showed that ...; Brane New World; HEP1⟩
  • Note the difference between strings and categorical values
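
Purely for illustration, the two kinds of instances above might be held in memory as follows (Python; the class labels attached to the numeric rows are invented here, since the slide does not pair rows with classes):

    # Numeric instances: each row is <s-length, p-width, p-length> plus an (invented) class label.
    iris_data = [
        ((7, 3, 4), "setosa"),
        ((10, 2, 1), "virginica"),
        ((5, 6, 2), "versicolour"),
    ]

    # Text instance: strings (author, abstract, title) mixed with a categorical value (journal).
    article = {
        "author": "Smith, John",
        "abstract": "We showed that ...",
        "title": "Brane New World",
        "journal": "HEP1",
    }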

5
Bayesian Approach
  • Each observed training example can incrementally
    decrease or increase the probability of a hypothesis,
    instead of eliminating the hypothesis outright
  • Prior knowledge can be combined with observed
    data to determine the hypothesis
  • Bayesian methods can accommodate hypotheses that
    make probabilistic predictions
  • New instances can be classified by combining the
    predictions of multiple hypotheses, weighted by
    their probabilities

6
Applying Bayes Theorem
  • Best hypothesis = most probable hypothesis
  • Maximum a posteriori (MAP) hypothesis
  • Variables
  • h = hypothesis
  • D = data
  • Prior probability
  • of hypothesis h: P(h)
  • of observing the training data D: P(D)
  • P(D|h) = probability of observing data D given
    some world where hypothesis h holds
  • Bayes theorem
  • P(h|D) = P(D|h) P(h) / P(D)

7
Defining the MAP hypothesis
  • h_MAP = argmax_{h ∈ H} P(h|D)
  • h_MAP = argmax_{h ∈ H} P(D|h) P(h) / P(D)
    (using Bayes theorem)
  • h_MAP = argmax_{h ∈ H} P(D|h) P(h)
    (P(D) is a constant independent of h)
  • h_MAP = argmax_{h ∈ H} P(D|h)
    (when we can make the assumption that each
    hypothesis h is equally probable)
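
A rough sketch of this computation in Python (the priors and likelihoods below are invented numbers, not taken from the slides):

    # Invented priors P(h) and likelihoods P(D|h) for three hypotheses.
    prior = {"h1": 0.5, "h2": 0.3, "h3": 0.2}
    likelihood = {"h1": 0.01, "h2": 0.10, "h3": 0.05}   # P(D|h)

    # h_MAP = argmax over h of P(D|h) * P(h); P(D) is the same for every h, so it is dropped.
    h_map = max(prior, key=lambda h: likelihood[h] * prior[h])
    print(h_map)   # h2 (0.10 * 0.3 = 0.03 beats 0.005 and 0.01)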

8
Bayes Optimal Classifier
  • The most probable classification of the new
    instance is obtained by combining the predictions
    of all hypotheses, weighted by their posterior
    probabilities
  • Possible classifications: v_j ∈ V
  • argmax_{v_j ∈ V} Σ_{h_i ∈ H} P(v_j|h_i) P(h_i|D)

9
Example
  • V = {p, n}
  • P(h1|D) = .4   P(p|h1) = 0   P(n|h1) = 1
  • P(h2|D) = .3   P(p|h2) = 1   P(n|h2) = 0
  • P(h3|D) = .3   P(p|h3) = 1   P(n|h3) = 0
  • Σ_{h_i ∈ H} P(n|h_i) P(h_i|D) = .4
  • Σ_{h_i ∈ H} P(p|h_i) P(h_i|D) = .6
  • argmax_{v_j ∈ {p,n}} Σ_{h_i ∈ H} P(v_j|h_i) P(h_i|D) = p
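
The same computation, spelled out as a short Python sketch using the numbers above:

    # Posteriors P(h_i|D) and per-hypothesis predictions P(v_j|h_i) from the example.
    posterior = {"h1": 0.4, "h2": 0.3, "h3": 0.3}
    p_class_given_h = {
        "h1": {"p": 0.0, "n": 1.0},
        "h2": {"p": 1.0, "n": 0.0},
        "h3": {"p": 1.0, "n": 0.0},
    }

    # Bayes optimal classification: weight each hypothesis's vote by its posterior.
    def vote(v):
        return sum(p_class_given_h[h][v] * posterior[h] for h in posterior)

    print(vote("n"), vote("p"))        # 0.4 0.6
    print(max(["p", "n"], key=vote))   # p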

10
Properties of Bayesian Approach
  • Bayesian learning is optimal
  • Easy to estimate P(h) by counting in training
    data
  • Estimating P(D|h) is not feasible
  • Why?

11
P(D|h)
12
Naïve Bayes
  • Assume independence of attributes
  • D = ⟨a1, a2, ..., an⟩
  • P(a1, a2, ..., an | vj) = Π_i P(ai|vj)
  • Substitute into the v_MAP formula:
  • v_NB = argmax_{vj ∈ V} P(vj) Π_i P(ai|vj)

13
v_NB = argmax_{vj ∈ V} P(vj) Π_i P(ai|vj)
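
A minimal counting-based sketch of this rule in Python (the training data and attribute values are invented for illustration; no smoothing yet, which is exactly the issue the next slides address):

    from collections import Counter, defaultdict

    # Toy training set: each instance is (attribute tuple, class).
    train = [
        (("high", "wide"), "virginica"),
        (("high", "narrow"), "virginica"),
        (("low", "narrow"), "setosa"),
        (("low", "wide"), "setosa"),
    ]

    class_counts = Counter(c for _, c in train)
    # attr_counts[class][attribute index][value] = count
    attr_counts = defaultdict(lambda: defaultdict(Counter))
    for attrs, c in train:
        for i, a in enumerate(attrs):
            attr_counts[c][i][a] += 1

    def predict(attrs):
        # v_NB = argmax_v P(v) * prod_i P(a_i|v), with both factors estimated by counting.
        def score(v):
            p = class_counts[v] / len(train)
            for i, a in enumerate(attrs):
                p *= attr_counts[v][i][a] / class_counts[v]
            return p
        return max(class_counts, key=score)

    print(predict(("high", "wide")))   # virginica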
14
Estimating Probabilities
  • What happens when the number of data elements is
    small?
  • Suppose the true P(S-length=high | virginica) = .05
  • There are only 2 instances with C = virginica
  • We estimate the probability by n_c/n, i.e.,
    #(S-length=high ∧ C=virginica) / #(C=virginica)
  • #(S-length=high ∧ C=virginica) will likely be 0
  • Then, instead of .05, we use an estimated probability
    of 0
  • Two problems
  • Biased underestimate of the probability
  • This probability term will dominate (a single zero
    factor makes the whole product zero)

15
Instead
  • Use priors as well
  • Estimate the probability as (n_c + m·p) / (n + m)
  • where p = prior estimate of the probability
  • m is a constant called the equivalent sample size
  • m determines how heavily to weight p relative to the
    observed data
  • Typical method: assume a uniform prior
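
A short sketch of this estimate in Python (the counts and the choice of a uniform prior over three possible values are illustrative):

    def m_estimate(n_c, n, p, m):
        # (n_c + m*p) / (n + m): blends the observed fraction n_c/n with the prior p.
        return (n_c + m * p) / (n + m)

    # Only 2 virginica instances, none with S-length = high; uniform prior over, say, 3 values.
    print(m_estimate(n_c=0, n=2, p=1/3, m=3))   # 0.2 rather than 0.0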

16
Benefits of Naïve Bayes
  • Practical
  • As effective as, and in some cases more effective
    than, other machine learners

17
Feature Selection
  • Can be used with any machine learning algorithm
  • Experimentally determine which attribute(s) helps
    learning most
  • Just as in the KDD Cup example
  • Feature selection is required on your homework

18
Algorithm for feature selection
  • Forward selection
  • Incrementally add one attribute at a time
  • Train on the training data (1/3 of the data)
  • Test on the validation data (another 1/3 of the data)
    whether the current set of attributes gives an
    improvement
  • When done, test the full model on the held-out
    data (the remaining 1/3)
  • Backwards elimination

19
Forward Feature Selection
  • Start with no features and greedily add the
    feature that most improves performance
  • Given a set of n attributes, measure performance
    with each attribute alone
  • Note that this requires n separate learning
    models
  • Choose the best
  • Now pair the best with every remaining attribute
    and measure performance
  • Choose the best pair
  • Repeat with triples, quadruples, etc. until no
    improvement
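
A sketch of this greedy loop in Python; evaluate is a placeholder for "train on the training third with these attributes and return accuracy on the validation third":

    def forward_select(all_features, evaluate):
        """Greedily add the single feature that most improves validation performance."""
        selected, best_score = [], float("-inf")
        while True:
            candidates = [f for f in all_features if f not in selected]
            if not candidates:
                return selected
            # One model per candidate feature added to the current set (n models on the first pass).
            scored = [(evaluate(selected + [f]), f) for f in candidates]
            score, feature = max(scored)
            if score <= best_score:        # stop when no candidate improves performance
                return selected
            best_score = score
            selected.append(feature)

Given a suitable evaluate, something like forward_select(["author", "abstract", "in-degree", "out-degree"], evaluate) would return the greedily chosen attribute subset.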

20
Example: Systematic exploration of attribute
combinations
  • Journal article downloads
  • Started with rock bottom: downloads only
  • Tried abstract alone, then author alone
  • Abstract and author together yielded better
    results
  • Then added in-degree and out-degree

21
Feature selection
  • Can be done within WEKA
  • Can be done outside of WEKA
  • By creating different data sets