An Incremental Decision List Learner

Transcript and Presenter's Notes
1
An Incremental Decision List Learner
  • Joshua Goodman
  • Machine Learning and Applied Statistics
  • Microsoft Research

2
Decision Lists
  • Fast to learn, fast to use, simple, small, can
    give probabilities.
  • I'll only talk about probabilistic decision lists
  • Used for many applications, e.g. word sense
    disambiguation, accent restoration, grammar
    checking.

3
Sample Decision List
  • Word Sense Disambiguation
  • Bank: river or financial?
  • If water nearby, output P(river) = .95
  • Else if money nearby, output P(river) = .1
  • Else if word before = "left", output P(river) = .8
  • Else output P(river) = .5 (a code sketch of this list follows)
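
A minimal sketch (not from the slides) of how such a probabilistic list might be
represented and applied in Python; the predicates, the "nearby"/"word_before"
feature names, and the probabilities simply mirror the example above.

  # A probabilistic decision list: ordered (condition, P(river)) rules plus a default.
  def make_bank_list():
      rules = [
          (lambda ctx: "water" in ctx["nearby"], 0.95),      # If water nearby
          (lambda ctx: "money" in ctx["nearby"], 0.10),      # Else if money nearby
          (lambda ctx: ctx["word_before"] == "left", 0.80),  # Else if word before = "left"
      ]
      return rules, 0.50                                     # Else (default): P(river) = .5

  def p_river(decision_list, ctx):
      rules, default = decision_list
      for condition, prob in rules:
          if condition(ctx):        # the first matching rule decides
              return prob
      return default

  # Example context for "...the muddy water near the bank...":
  print(p_river(make_bank_list(), {"nearby": {"muddy", "water"}, "word_before": "the"}))  # 0.95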

4
Overview
  • Introduction to decision lists (how to learn)
  • The Parable of the Two Weathermen
  • Improved algorithm (use the smart weatherman!)
  • Experimental results
  • Almost two orders of magnitude smaller lists
  • 50% lower entropies
  • 25% lower error rates
  • Conclusion: use this algorithm if you need small,
    simple probabilities

5
Learning a Decision List
  • Standard algorithm
  • Start with all rules
  • If money nearby, output P(river) = .1
  • If word before = "left", output P(river) = .8
  • If water nearby, output P(river) = .95

6
Learning a Decision List
  • Standard algorithm
  • Start with all rules
  • Sort in order of how predictive they are
    (entropy)
  • If water nearby, output P(river) = .95
  • Else if money nearby, output P(river) = .1
  • Else if word before = "left", output P(river) = .8

7
Learning a Decision List
  • Standard algorithm
  • Start with all rules
  • Sort in order of how predictive they are
    (entropy)
  • Add default rule
  • If water nearby, output P(river) = .95
  • Else if money nearby, output P(river) = .1
  • Else if word before = "left", output P(river) = .8
  • ELSE output P(river) = .5 (this learning procedure is sketched in code below)
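
A minimal sketch of this standard learner, under stated assumptions: binary labels,
candidate rules given as condition functions, and output probabilities estimated by
raw relative frequency (real systems smooth these estimates; smoothing is omitted
here). The helper names rule_entropy and learn_sorted are mine, not the paper's.

  from math import log2

  def rule_entropy(p):
      """Entropy in bits of a rule that outputs P(y=1) = p."""
      if p in (0.0, 1.0):
          return 0.0
      return -(p * log2(p) + (1 - p) * log2(1 - p))

  def learn_sorted(candidates, data):
      """Standard algorithm: estimate each rule, sort by entropy, add a default rule.
      candidates: condition functions over an instance's features.
      data: list of (features, label) pairs with label in {0, 1}."""
      rules = []
      for cond in candidates:
          matched = [y for x, y in data if cond(x)]
          if not matched:
              continue
          p = sum(matched) / len(matched)           # P(y=1 | rule fires), unsmoothed
          rules.append((cond, p))
      rules.sort(key=lambda r: rule_entropy(r[1]))  # most "certain" rules first
      default_p = sum(y for _, y in data) / len(data)
      return rules, default_p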

8
The Parable of the Two Seattle Weathermen
  • Consider two weathermen.
  • The LAZY weatherman says "If Seattle, 93% chance
    of rain today."
  • The SMART weatherman looks for wind to blow away
    the clouds. Most of the time he says "Rain
    today." Once a week he notices some wind,
    and says "50% chance of rain today."

9
Parable of the two weathermen, continued
  • Today, the smart weatherman felt some wind.
  • Who should you believe?
  • Lazy weatherman, who says 93% rain
  • Smart weatherman, who says 50% rain
  • Lazy weatherman is much more certain
  • Smart weatherman is very uncertain

10
What do weathermen have to do with EMNLP?
  • The standard decision list algorithm is like
    trusting the lazy weatherman
  • Trusts rules that claim to be more certain,
    rather than rules that reduce uncertainty
  • New algorithm is like trusting the smart
    weatherman
  • Trusts rules that actually reduce the uncertainty
    (entropy)

11
Uncertainty = entropy
  • Traditional algorithm
  • For each rule x, compute entropy of the rule
  • - P(y=0|x) × log2 P(y=0|x) - P(y=1|x) × log2 P(y=1|x)
  • Sort by entropy, add default rule
  • Entropy of each weatherman (checked numerically below)
  • -.93 × log2 .93 - .07 × log2 .07 ≈ .37 bits
  • -.5 × log2 .5 - .5 × log2 .5 = 1 bit
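
The same arithmetic, checked numerically (a trivial sketch, not from the slides):

  from math import log2

  def entropy(p):                    # entropy in bits of a {p, 1 - p} prediction
      return -(p * log2(p) + (1 - p) * log2(1 - p))

  print(round(entropy(0.93), 2))     # 0.37 bits: the lazy weatherman's rule
  print(round(entropy(0.50), 2))     # 1.0 bit:   the smart weatherman's "windy" rule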

12
Traditional algorithm
  • Two rules
  • If Seattle, predict 93% rain (entropy .37 bits)
  • If Seattle and windy, predict 50% rain (entropy
    1 bit)
  • Sort in order of entropy, never get to the second
    rule!
  • We trust the lazy weatherman!

13
New Algorithm
  • LIST ← DEFAULT RULE
  • while (1)
  •   for each rule X: compute entropy of (X, LIST)
  •   Find best X
  •   If gain from (X, LIST) < 3 bits, break
  •   LIST ← (X, LIST)
  • (A runnable sketch of this loop follows.)
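
A minimal runnable sketch of this loop, under the same assumptions as the earlier
sorted-learner sketch (binary labels, unsmoothed relative-frequency estimates).
"Gain" is measured as the drop in total log-loss, in bits, on the training data,
and the 3-bit stopping threshold follows the slide; the function names are mine.

  from math import log2

  def total_bits(dlist, default_p, data):
      """Total log-loss (bits) the decision list assigns to the training data."""
      bits = 0.0
      for x, y in data:
          p = next((pr for cond, pr in dlist if cond(x)), default_p)
          bits += -log2(p if y == 1 else 1 - p)
      return bits

  def learn_incremental(candidates, data, min_gain=3.0):
      """Greedily prepend the rule that most reduces training-set entropy."""
      default_p = sum(y for _, y in data) / len(data)
      dlist = []                                       # LIST = DEFAULT RULE
      current = total_bits(dlist, default_p, data)
      while True:
          best_gain, best_rule = 0.0, None
          for cond in candidates:
              matched = [y for x, y in data if cond(x)]
              if not matched:
                  continue
              p = sum(matched) / len(matched)          # rule's output, unsmoothed
              gain = current - total_bits([(cond, p)] + dlist, default_p, data)
              if gain > best_gain:
                  best_gain, best_rule = gain, (cond, p)
          if best_rule is None or best_gain < min_gain:
              break                                    # best gain is under 3 bits
          dlist = [best_rule] + dlist                  # LIST <- X, LIST
          current -= best_gain
      return dlist, default_p

Every candidate is re-scored on every pass because the list it would be prepended to
keeps changing; this is exactly the cost the speedup slides address.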

14
New Algorithm, Example
  • Consider training data
  • Start with the rule "If Seattle, 93% chance of rain"
  • Consider prepending the rule "If Seattle and wind,
    50% chance of rain"
  • Reduces uncertainty (entropy)

15
Speedups
  • Implemented naively, the new algorithm would be incredibly slow
  • Every candidate requires creating a new decision list, prepending
    the rule to it, and testing on the training data
  • Can speed it up considerably

16
Speedups
  • For each training instance, keep track of
    probabilities assigned by current LIST
  • Find changes to these probabilities as we prepend
    each rule
  • Quickly compute the incremental change (see the sketch below)
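
One plausible reading of this bookkeeping (my sketch, not necessarily the paper's
exact data structures): cache the bits each instance is currently charged, and when
scoring a candidate rule touch only the instances that rule matches.

  from math import log2

  def gain_of_prepending(cond, p_rule, cached_bits, data):
      """Bits saved if rule (cond, p_rule) is prepended to the current list.
      cached_bits[i] holds what the current list charges instance i; p_rule is
      assumed strictly between 0 and 1 (or estimated from the matched instances),
      so the log never sees zero."""
      gain = 0.0
      for i, (x, y) in enumerate(data):
          if cond(x):                               # the new rule would fire first here
              new_bits = -log2(p_rule if y == 1 else 1 - p_rule)
              gain += cached_bits[i] - new_bits     # unmatched instances are untouched
      return gain

  def commit_prepend(cond, p_rule, cached_bits, data):
      """After accepting a rule, update the per-instance cache in place."""
      for i, (x, y) in enumerate(data):
          if cond(x):
              cached_bits[i] = -log2(p_rule if y == 1 else 1 - p_rule)

In practice one would also precompute, for each candidate rule, the list of training
instances it matches, so each pass over the candidates touches only those instances.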

17
Experiments
  • Same train/test as Banko and Brill (2001)
  • Grammar checking problem: 10 confusable word
    pairs (e.g. there/their); guess which is right
  • Very similar to other problems used for decision
    lists
  • 3 training sizes (1M, 10M, 50M)

18
Results
  • Compared 7 learners
  • Sorted algorithm (traditional)
  • Sorted algorithm, requiring >0 bits improvement
  • Sorted algorithm, requiring >3 bits improvement
  • Avoids overfitting
  • Incremental (New) algorithm
  • Transformation-Based-Learner
  • Maximum Entropy (see my ACL paper)
  • Perceptron Algorithm (recent variation with
    margin)

19
Error Rate
  • Incremental and sorted (>3 bits improvement) are
    the best decision lists
  • TBL/Maxent/Perceptron are all better than decision
    lists

20
Entropy
  • Incremental and sorted (>3 bits improvement) are
    the best decision lists
  • Maxent is better than decision lists

21
Model Size
  • The new algorithm gives the smallest models of any
    probabilistic learner
  • TBL (which is not probabilistic) is smaller still

22
Related Work/Improvements
  • Lots of related work on TBL, e.g. Ramshaw and
    Marcus (this algorithm is a probabilistic version
    of various TBL algorithms)
  • Could do better smoothing (Yarowsky)
  • Could add splits at the top (Yarowsky)

23
Conclusion
  • If you need accuracy, use perceptron
  • If you need probabilities, use maxent
  • If you need accuracy and small size, use TBL
  • If you need probabilities and small size, use the
    new, incremental algorithm
  • It also gives the most understandable probabilistic output