An Incremental Decision List Learner

Transcript and Presenter's Notes
1
An Incremental Decision List Learner
  • Joshua Goodman
  • Machine Learning and Applied Statistics
  • Microsoft Research

2
Decision Lists
  • Fast to learn, fast to use, simple, small, can
    give probabilities.
  • I'll only talk about probabilistic decision lists
  • Used for many applications, e.g. word sense
    disambiguation, accent restoration, grammar
    checking.

3
Sample Decision List
  • Word Sense Disambiguation
  • Bank: river or financial?
  • If water nearby, output P(river) = .95
  • Else if money nearby, output P(river) = .1
  • Else if word before = "left", output P(river) = .8
  • Else output P(river) = .5 (a code sketch of this list follows)
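
A minimal sketch (not from the slides) of how such a probabilistic list might be
represented and applied in Python; the predicates, the "nearby"/"word_before"
feature names, and the probabilities simply mirror the example above.

  # A probabilistic decision list: ordered (condition, P(river)) rules plus a default.
  def make_bank_list():
      rules = [
          (lambda ctx: "water" in ctx["nearby"], 0.95),      # If water nearby
          (lambda ctx: "money" in ctx["nearby"], 0.10),      # Else if money nearby
          (lambda ctx: ctx["word_before"] == "left", 0.80),  # Else if word before = "left"
      ]
      return rules, 0.50                                     # Else (default): P(river) = .5

  def p_river(decision_list, ctx):
      rules, default = decision_list
      for condition, prob in rules:
          if condition(ctx):        # the first matching rule decides
              return prob
      return default

  # Example context for "...the muddy water near the bank...":
  print(p_river(make_bank_list(), {"nearby": {"muddy", "water"}, "word_before": "the"}))  # 0.95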

4
Overview
  • Introduction to decision lists (how to learn)
  • The Parable of the Two Weathermen
  • Improved algorithm (use the smart weatherman!)
  • Experimental results
  • Almost two orders of magnitude smaller lists
  • 50% lower entropies
  • 25% lower error rates
  • Conclusion: use this algorithm if you need small,
    simple probabilities

5
Learning a Decision List
  • Standard algorithm
  • Start with all rules
  • If money nearby, output P(river) = .1
  • If word before = "left", output P(river) = .8
  • If water nearby, output P(river) = .95

6
Learning a Decision List
  • Standard algorithm
  • Start with all rules
  • Sort in order of how predictive they are
    (entropy)
  • If water nearby, output P(river) = .95
  • Else if money nearby, output P(river) = .1
  • Else if word before = "left", output P(river) = .8

7
Learning a Decision List
  • Standard algorithm
  • Start with all rules
  • Sort in order of how predictive they are
    (entropy)
  • Add default rule
  • If water nearby, output P(river) = .95
  • Else if money nearby, output P(river) = .1
  • Else if word before = "left", output P(river) = .8
  • ELSE output P(river) = .5 (this learning procedure is sketched in code below)
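
A minimal sketch of this standard learner, under stated assumptions: binary labels,
candidate rules given as condition functions, and output probabilities estimated by
raw relative frequency (real systems smooth these estimates; smoothing is omitted
here). The helper names rule_entropy and learn_sorted are mine, not the paper's.

  from math import log2

  def rule_entropy(p):
      """Entropy in bits of a rule that outputs P(y=1) = p."""
      if p in (0.0, 1.0):
          return 0.0
      return -(p * log2(p) + (1 - p) * log2(1 - p))

  def learn_sorted(candidates, data):
      """Standard algorithm: estimate each rule, sort by entropy, add a default rule.
      candidates: condition functions over an instance's features.
      data: list of (features, label) pairs with label in {0, 1}."""
      rules = []
      for cond in candidates:
          matched = [y for x, y in data if cond(x)]
          if not matched:
              continue
          p = sum(matched) / len(matched)           # P(y=1 | rule fires), unsmoothed
          rules.append((cond, p))
      rules.sort(key=lambda r: rule_entropy(r[1]))  # most "certain" rules first
      default_p = sum(y for _, y in data) / len(data)
      return rules, default_p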

8
The Parable of the Two Seattle Weathermen
  • Consider two weathermen.
  • The LAZY weatherman says "If Seattle, 93% chance
    of rain today."
  • The SMART weatherman looks for wind to blow away
    the clouds. Most of the time he says "Rain
    today." Once a week he notices some wind,
    and says "50% chance of rain today."

9
Parable of the two weathermen, continued
  • Today, the smart weatherman felt some wind.
  • Who should you believe?
  • Lazy weatherman, who says 93% rain
  • Smart weatherman, who says 50% rain
  • Lazy weatherman is much more certain
  • Smart weatherman is very uncertain

10
What do weathermen have to do with EMNLP?
  • The standard decision list algorithm is like
    trusting the lazy weatherman
  • Trusts rules that claim to be more certain,
    rather than rules that reduce uncertainty
  • New algorithm is like trusting the smart
    weatherman
  • Trusts rules that actually reduce the uncertainty
    (entropy)

11
Uncertainty = entropy
  • Traditional algorithm
  • For each rule x, compute entropy of the rule
  • - P(y=0|x) × log2 P(y=0|x) - P(y=1|x) × log2 P(y=1|x)
  • Sort by entropy, add default rule
  • Entropy of each weatherman (checked numerically below)
  • -.93 × log2 .93 - .07 × log2 .07 ≈ .37 bits
  • -.5 × log2 .5 - .5 × log2 .5 = 1 bit
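
The same arithmetic, checked numerically (a trivial sketch, not from the slides):

  from math import log2

  def entropy(p):                    # entropy in bits of a {p, 1 - p} prediction
      return -(p * log2(p) + (1 - p) * log2(1 - p))

  print(round(entropy(0.93), 2))     # 0.37 bits: the lazy weatherman's rule
  print(round(entropy(0.50), 2))     # 1.0 bit:   the smart weatherman's "windy" rule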

12
Traditional algorithm
  • Two rules
  • If Seattle, predict 93% rain (entropy .37 bits)
  • If Seattle and windy, predict 50% rain (entropy
    1 bit)
  • Sort in order of entropy, never get to the second
    rule!
  • We trust the lazy weatherman!

13
New Algorithm
  • LIST ← DEFAULT RULE
  • while (1)
  •   for each rule X: compute entropy of (X, LIST)
  •   Find best X
  •   If gain from (X, LIST) < 3 bits, break
  •   LIST ← (X, LIST)
  • (A runnable sketch of this loop follows.)
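
A minimal runnable sketch of this loop, under the same assumptions as the earlier
sorted-learner sketch (binary labels, unsmoothed relative-frequency estimates).
"Gain" is measured as the drop in total log-loss, in bits, on the training data,
and the 3-bit stopping threshold follows the slide; the function names are mine.

  from math import log2

  def total_bits(dlist, default_p, data):
      """Total log-loss (bits) the decision list assigns to the training data."""
      bits = 0.0
      for x, y in data:
          p = next((pr for cond, pr in dlist if cond(x)), default_p)
          bits += -log2(p if y == 1 else 1 - p)
      return bits

  def learn_incremental(candidates, data, min_gain=3.0):
      """Greedily prepend the rule that most reduces training-set entropy."""
      default_p = sum(y for _, y in data) / len(data)
      dlist = []                                       # LIST = DEFAULT RULE
      current = total_bits(dlist, default_p, data)
      while True:
          best_gain, best_rule = 0.0, None
          for cond in candidates:
              matched = [y for x, y in data if cond(x)]
              if not matched:
                  continue
              p = sum(matched) / len(matched)          # rule's output, unsmoothed
              gain = current - total_bits([(cond, p)] + dlist, default_p, data)
              if gain > best_gain:
                  best_gain, best_rule = gain, (cond, p)
          if best_rule is None or best_gain < min_gain:
              break                                    # best gain is under 3 bits
          dlist = [best_rule] + dlist                  # LIST <- X, LIST
          current -= best_gain
      return dlist, default_p

Every candidate is re-scored on every pass because the list it would be prepended to
keeps changing; this is exactly the cost the speedup slides address.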

14
New Algorithm, Example
  • Consider training data
  • Start with the rule "If Seattle, 93% chance of rain"
  • Consider prepending the rule "If Seattle and wind,
    50% chance of rain"
  • Reduces uncertainty (entropy)

15
Speedups
  • Implemented naively, the new algorithm would be incredibly slow
  • Every candidate requires creating a new decision list, prepending
    the rule to it, and testing on the training data
  • Can speed it up considerably

16
Speedups
  • For each training instance, keep track of
    probabilities assigned by current LIST
  • Find changes to these probabilities as we prepend
    each rule
  • Quickly compute the incremental change (see the sketch below)
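
One plausible reading of this bookkeeping (my sketch, not necessarily the paper's
exact data structures): cache the bits each instance is currently charged, and when
scoring a candidate rule touch only the instances that rule matches.

  from math import log2

  def gain_of_prepending(cond, p_rule, cached_bits, data):
      """Bits saved if rule (cond, p_rule) is prepended to the current list.
      cached_bits[i] holds what the current list charges instance i; p_rule is
      assumed strictly between 0 and 1 (or estimated from the matched instances),
      so the log never sees zero."""
      gain = 0.0
      for i, (x, y) in enumerate(data):
          if cond(x):                               # the new rule would fire first here
              new_bits = -log2(p_rule if y == 1 else 1 - p_rule)
              gain += cached_bits[i] - new_bits     # unmatched instances are untouched
      return gain

  def commit_prepend(cond, p_rule, cached_bits, data):
      """After accepting a rule, update the per-instance cache in place."""
      for i, (x, y) in enumerate(data):
          if cond(x):
              cached_bits[i] = -log2(p_rule if y == 1 else 1 - p_rule)

In practice one would also precompute, for each candidate rule, the list of training
instances it matches, so each pass over the candidates touches only those instances.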

17
Experiments
  • Same train/test as Banko and Brill (2001)
  • Grammar checking problem: 10 confusable word
    pairs (e.g. there/their); guess which is right
  • Very similar to other problems used for decision
    lists
  • 3 training sizes (1M, 10M, 50M)

18
Results
  • Compared 7 learners
  • Sorted algorithm (traditional)
  • Sorted algorithm, requiring >0 bits improvement
  • Sorted algorithm, requiring >3 bits improvement
  • Avoids overfitting
  • Incremental (New) algorithm
  • Transformation-Based-Learner
  • Maximum Entropy (see my ACL paper)
  • Perceptron Algorithm (recent variation with
    margin)

19
Error Rate
  • Incremental and sorted (>3 bits improvement) are
    the best decision lists
  • TBL/Maxent/Perceptron are all better than decision
    lists

20
Entropy
  • Incremental and sorted (>3 bits improvement) are
    the best decision lists
  • Maxent is better than decision lists

21
Model Size
  • The new algorithm gives the smallest models of any
    probabilistic learner
  • TBL (which is not probabilistic) is smaller still

22
Related Work/Improvements
  • Lots of related work on TBL, e.g. Ramshaw and
    Marcus (this algorithm is a probabilistic version
    of various TBL algorithms)
  • Could do better smoothing (Yarowsky)
  • Could add splits at the top (Yarowsky)

23
Conclusion
  • If you need accuracy, use perceptron
  • If you need probabilities, use maxent
  • If you need accuracy and small size, use TBL
  • If you need probabilities and small size, use the
    new, incremental algorithm
  • It also gives the most understandable probabilistic output