Title: Statistische Methoden in der Computerlinguistik (Statistical Methods in Computational Linguistics)
1 Statistische Methoden in der Computerlinguistik (Statistical Methods in Computational Linguistics)
- 5. Smoothing Techniques
- Jonas Kuhn
- Universität Potsdam, 2007
2 Overview
- Evaluating language models
- Some basic information theory: entropy, perplexity
- Smoothing techniques
- Simple smoothing techniques
- Next time:
- More advanced smoothing techniques
- Backoff, linear interpolation
3 Evaluating language models
- We need evaluation metrics to determine how well our language models predict the next word
- Intuition: one should average over the probability of new words
4 Some basic information theory
- Evaluation metrics for language models
- Information theory: measures of information
- Entropy
- Perplexity
5 Entropy
- Average length of the most efficient coding for a random variable
- Binary encoding
- Example: betting on horses
6 Entropy
- Example: betting on horses
- 8 horses, each horse is equally likely to win
- (Binary) message required: 001, 010, 011, 100, 101, 110, 111, 000
- A 3-bit message is required
7 Entropy
- 8 horses, some horses are more likely to win (probability, optimal binary code):
- Horse 1: 1/2, code 0
- Horse 2: 1/4, code 10
- Horse 3: 1/8, code 110
- Horse 4: 1/16, code 1110
- Horses 5-8: 1/64 each, codes 111100, 111101, 111110, 111111
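As a quick check of these numbers, here is a minimal Python sketch (my own illustration, not from the slides) that computes the entropy of both horse distributions:

    import math

    def entropy(probs):
        # H(X) = -sum over x of p(x) * log2 p(x), measured in bits
        return -sum(p * math.log2(p) for p in probs if p > 0)

    # 8 equally likely horses
    print(entropy([1/8] * 8))                           # 3.0

    # the biased distribution from this slide
    print(entropy([1/2, 1/4, 1/8, 1/16] + [1/64] * 4))  # 2.0

In both cases the expected length of the optimal codes listed above equals the entropy, because all probabilities are negative powers of two.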
8 Perplexity
- Entropy H
- Perplexity = 2^H
- Intuitively: the weighted average number of choices a random variable has to make
- Equally likely horses: entropy 3, perplexity 2^3 = 8
- Biased horses: entropy 2, perplexity 2^2 = 4
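For a language model evaluated on a test sequence of N words, this corresponds to the standard formulation (supplied here from Jurafsky/Martin, not reconstructed from the slide itself):

\[
\mathrm{PP}(W) = 2^{H(W)} = P(w_1 w_2 \ldots w_N)^{-\frac{1}{N}}
\]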
9 Entropy of a sequence
- Finite sequences: strings from a language L
- Entropy rate (per-word entropy)
10 Entropy of a language
- Entropy rate of language L
- Shannon-McMillan-Breiman Theorem:
- if a language is stationary and ergodic,
- a single sequence, if it is long enough, is representative of the language
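The formulas themselves did not survive the export; the standard definitions (cf. Jurafsky/Martin, ch. 6) are

\[
H(L) = \lim_{n \to \infty} \frac{1}{n} H(w_1 w_2 \ldots w_n)
\]

and, by the Shannon-McMillan-Breiman theorem, for a stationary ergodic language

\[
H(L) = \lim_{n \to \infty} -\frac{1}{n} \log p(w_1 w_2 \ldots w_n)
\]

computed on a single sufficiently long sequence.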
11 Cross Entropy
- Often we compare the quality of various models m of some actual probability distribution p which we do not know
- Language model example
- The cross entropy H(p, m) is an upper bound on the entropy H(p)
- Comparing two models m1 and m2, the one with the lower cross-entropy is more accurate
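The defining formula is missing from the exported slide; the usual definition is

\[
H(p, m) = -\sum_{x} p(x) \log_2 m(x) \;\geq\; H(p),
\]

i.e. the expected number of bits needed to encode events drawn from p when using the code that is optimal for m.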
12 Cross Entropy for sequences
- With the Shannon-McMillan-Breiman Theorem
- For a stationary ergodic process
- The cross entropy H(p, m) is an upper bound on the entropy H(p) for any model m
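The lost formula here is presumably the per-word cross entropy of a sequence (again following Jurafsky/Martin):

\[
H(p, m) = \lim_{n \to \infty} -\frac{1}{n} \log_2 m(w_1 w_2 \ldots w_n),
\]

which in practice is approximated by \(-\frac{1}{N} \log_2 m(w_1 \ldots w_N)\) on a sufficiently long test sequence of length N.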
13 Entropy of English
- (compare Jurafsky/Martin, sec. 6.7)
- Per-letter entropy
- Shannon (1951): psychological experiments with human subjects guessing the next letter in a sequence
- 1.3 bits (for 27 characters: 26 letters plus space)
- Brown et al. (1992): training of a trigram grammar in several steps
- 1.75 bits per character (based on 95 ASCII characters)
14 Smoothing Techniques
- Simple Add-One Smoothing
- Two illustrations
- Based on Manning/Schütze
- Based on Jurafsky/Martin
- (Next time) More sophisticated smoothing techniques
15 Illustration of Simple Smoothing Techniques
- Example from Manning/Schütze, ch. 6 (based on slides by Jonathan Henke, UC Berkeley)
- Corpus: five Jane Austen novels
- N = 617,091 words
- V = 14,585 unique words
- Task: predict the next word of the trigram "inferior to ________"
- From the test data (Persuasion): "In person, she was inferior to both sisters."
16 Instances in the Training Corpus: inferior to ________
17 Maximum Likelihood Estimate
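The formula on this slide was lost in the export; for the trigram task above, the maximum likelihood estimate is simply the relative frequency in the training corpus:

\[
P_{\mathrm{MLE}}(w \mid \text{inferior to}) = \frac{C(\text{inferior to } w)}{C(\text{inferior to})}
\]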
18 Actual Probability Distribution
19 Actual Probability Distribution
20 Smoothing
- Develop a model which decreases the probability of seen events and allows the occurrence of previously unseen N-grams
- a.k.a. discounting methods
- Validation: smoothing methods which utilize a second batch of test data
21 Laplace's Law (Adding One)
22 Laplace's Law (Adding One)
23 Laplace's Law
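The formulas on these three slides were lost in the export; in Manning/Schütze's notation, Laplace's Law for an n-gram w_1...w_n is

\[
P_{\mathrm{Lap}}(w_1 \ldots w_n) = \frac{C(w_1 \ldots w_n) + 1}{N + B},
\]

where C(w_1...w_n) is the training count of the n-gram, N is the number of training instances, and B is the number of possible n-gram types (bins).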
24 Smoothing Techniques
- Add-One Smoothing (Laplace's Law)
- Let us look at unigrams first
- Unsmoothed maximum likelihood estimate (see the formulas below)
- With Add-One Smoothing
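A reconstruction of the two missing formulas, with c(w) the count of w, N the number of tokens, and V the vocabulary size:

\[
P_{\mathrm{MLE}}(w) = \frac{c(w)}{N}, \qquad
P_{\mathrm{Lap}}(w) = \frac{c(w) + 1}{N + V}
\]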
25 Smoothing Techniques
- Add-One Smoothing: alternative computation
- Adjusted count c*: add one to the count and multiply by a normalizing factor (where V is the vocabulary size)
- Adjusted count (formula below)
- Probabilities (formula below)
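The missing formulas, reconstructed from the surrounding description (adjusted counts as in Jurafsky/Martin):

\[
c^{*}_i = (c_i + 1)\,\frac{N}{N + V}, \qquad
p^{*}_i = \frac{c^{*}_i}{N} = \frac{c_i + 1}{N + V}
\]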
26 Add-One Smoothing for Bigrams
- Unsmoothed bigram probabilities
- Add-one-smoothed version
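Neither formula survived the export; the standard versions (cf. Jurafsky/Martin) are

\[
P(w_n \mid w_{n-1}) = \frac{C(w_{n-1} w_n)}{C(w_{n-1})}, \qquad
P_{\mathrm{Lap}}(w_n \mid w_{n-1}) = \frac{C(w_{n-1} w_n) + 1}{C(w_{n-1}) + V}.
\]

A minimal Python sketch of both estimators (the function and variable names are my own, not from the slides):

    from collections import Counter

    def bigram_estimators(tokens):
        # count unigrams and bigrams in the training sequence
        unigrams = Counter(tokens)
        bigrams = Counter(zip(tokens, tokens[1:]))
        vocab_size = len(unigrams)

        def p_mle(prev, word):
            # unsmoothed relative frequency (0.0 if prev is unseen)
            return bigrams[(prev, word)] / unigrams[prev] if unigrams[prev] else 0.0

        def p_addone(prev, word):
            # add one to every bigram count, add V to the denominator
            return (bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size)

        return p_mle, p_addone

    # toy usage
    toks = "i want to eat chinese food i want to eat lunch".split()
    p_mle, p_addone = bigram_estimators(toks)
    print(p_mle("want", "to"), p_addone("want", "to"))
    print(p_mle("eat", "food"), p_addone("eat", "food"))  # unseen bigram: 0 before, > 0 after smoothing

Because the denominator uses the unigram count of the preceding word plus V, the smoothed probabilities for a fixed history sum to one over the vocabulary (ignoring the final token, which has no successor).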
27 Original relative frequency example
- (Illustration from Jurafsky/Martin)
- Bigram counts from the Berkeley Restaurant Project (rows: first word, columns: second word):

              I     want   to    eat   Chinese  food  lunch
    I         8     1087   0     13    0        0     0
    want      3     0      786   0     6        8     6
    to        3     0      10    860   3        0     12
    eat       0     0      2     0     19       2     52
    Chinese   2     0      0     0     0        120   1
    food      19    0      17    0     0        0     0
    lunch     4     0      0     0     0        1     0
28 Original relative frequency example
- Unigram counts from the corpus:
- I 3437
- want 1215
- to 3256
- eat 938
- Chinese 213
- food 1506
- lunch 459
29 Original relative frequency example
- Bigram probabilities (after normalizing, i.e. dividing by the unigram counts):

              I        want     to       eat      Chinese  food     lunch
    I         .0023    .32      0        .0038    0        0        0
    want      .0025    0        .65      0        .0049    .0066    .0049
    to        .00092   0        .0031    .26      .00092   0        .0037
    eat       0        0        .0021    0        .020     .0021    .055
    Chinese   .0094    0        0        0        0        .56      .0047
    food      .013     0        .011     0        0        0        0
    lunch     .0087    0        0        0        0        .0022    0
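As one worked instance of this normalization (my own illustration, consistent with the tables above):

\[
P(\text{want} \mid \text{I}) = \frac{C(\text{I want})}{C(\text{I})} = \frac{1087}{3437} \approx .32
\]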
30 Effect of Adding One
- Bigram counts after adding one:

              I     want   to    eat   Chinese  food  lunch
    I         9     1088   1     14    1        1     1
    want      4     1      787   1     7        9     7
    to        4     1      11    861   4        1     13
    eat       1     1      3     1     20       3     53
    Chinese   3     1      1     1     1        121   2
    food      20    1      18    1     1        1     1
    lunch     5     1      1     1     1        2     1
31 Effect of Adding One
- Adjusted unigram counts (count plus vocabulary size V = 1616, used as the smoothed denominators):

              C(w)   V      C(w) + V
    I         3437   1616   5053
    want      1215   1616   2831
    to        3256   1616   4872
    eat       938    1616   2554
    Chinese   213    1616   1829
    food      1506   1616   3122
    lunch     459    1616   2075
32 Smoothed bigram probabilities
- Add-one-smoothed bigram probabilities:

              I        want     to       eat      Chinese  food     lunch
    I         .0018    .22      .00020   .0028    .00020   .00020   .00020
    want      .0014    .00035   .28      .00035   .0025    .0032    .0025
    to        .00082   .00021   .0023    .18      .00082   .00021   .0027
    eat       .00039   .00039   .0012    .00039   .0078    .0012    .021
    Chinese   .0016    .00055   .00055   .00055   .00055   .066     .0011
    food      .0064    .00032   .0058    .00032   .00032   .00032   .00032
    lunch     .0024    .00048   .00048   .00048   .00048   .00096   .00048
33 Smoothed bigram counts
- Adjusted bigram counts:

              I     want   to    eat   Chinese  food  lunch
    I         6     740    .68   10    .68      .68   .68
    want      2     .42    331   .42   3        4     3
    to        3     .69    8     594   3        .69   9
    eat       .37   .37    1     .37   7.4      1     20
    Chinese   .36   .12    .12   .12   .12      15    .24
    food      10    .48    9     .48   .48      .48   .48
    lunch     1.1   .22    .22   .22   .22      .44   .22

- C(I want): 1088 → 740; C(want to): 787 → 331
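The adjusted counts follow from the add-one formula (restated here following Jurafsky/Martin); for example:

\[
c^{*}(w_{n-1} w_n) = \bigl(C(w_{n-1} w_n) + 1\bigr)\,\frac{C(w_{n-1})}{C(w_{n-1}) + V},
\qquad
c^{*}(\text{I want}) = (1087 + 1)\cdot\frac{3437}{3437 + 1616} \approx 740
\]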
34 Discounts
- An alternative view of smoothing
- Discounting (lowering) some non-zero counts in order to get the probability mass that will be assigned to the zero counts
- Discount per first word:

    I        0.68
    want     0.42
    to       0.69
    eat      0.37
    Chinese  0.12
    food     0.48
    lunch    0.22
35 Problems with Adding-One
- Too much probability mass is taken away from the observed events
- Alternatives: add-one-half, add-one-thousandth
- Lidstone's Law
- Held-out estimation
- More sophisticated smoothing/discounting techniques
- Witten-Bell Discounting
- Good-Turing Smoothing
- Special methods for N-grams
- Backoff
- Linear interpolation
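For reference, Lidstone's Law generalizes add-one smoothing by adding a fractional count λ (formula supplied from Manning/Schütze, not from the slide):

\[
P_{\mathrm{Lid}}(w_1 \ldots w_n) = \frac{C(w_1 \ldots w_n) + \lambda}{N + B\lambda},
\]

with λ = 1 giving Laplace's Law and smaller values such as λ = 1/2 taking less probability mass away from the observed events.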