Title: Statistische Methoden in der Computerlinguistik (Statistical Methods in Computational Linguistics)
1 Statistische Methoden in der Computerlinguistik (Statistical Methods in Computational Linguistics)
- 5. Smoothing Techniques
- Jonas Kuhn
- Universität Potsdam, 2007
2 Overview
- Evaluating language models
- Some basic information theory: entropy, perplexity
- Smoothing techniques
- Simple smoothing techniques
- Next time:
- More advanced smoothing techniques
- Backoff, linear interpolation
3 Evaluating language models
- We need evaluation metrics to determine how well our language models predict the next word
- Intuition: one should average over the probability of new words
4 Some basic information theory
- Evaluation metrics for language models
- Information theory: measures of information
- Entropy
- Perplexity
5 Entropy
- Average length of the most efficient coding for a random variable
- Binary encoding
- Example: betting on horses
6 Entropy
- Example: betting on horses
- 8 horses, each horse is equally likely to win
- (Binary) message required: 001, 010, 011, 100, 101, 110, 111, 000
- A 3-bit message is required
7 Entropy
- 8 horses, some horses are more likely to win (probability, optimal binary code):
- Horse 1: 1/2, code 0
- Horse 2: 1/4, code 10
- Horse 3: 1/8, code 110
- Horse 4: 1/16, code 1110
- Horses 5-8: 1/64 each, codes 111100, 111101, 111110, 111111
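As a quick check of these numbers, here is a minimal Python sketch (my own illustration, not from the slides) that computes the entropy of both horse distributions:

    import math

    def entropy(probs):
        # H(X) = -sum over x of p(x) * log2 p(x), measured in bits
        return -sum(p * math.log2(p) for p in probs if p > 0)

    # 8 equally likely horses
    print(entropy([1/8] * 8))                           # 3.0

    # the biased distribution from this slide
    print(entropy([1/2, 1/4, 1/8, 1/16] + [1/64] * 4))  # 2.0

In both cases the expected length of the optimal codes listed above equals the entropy, because all probabilities are negative powers of two.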
8 Perplexity
- Entropy H
- Perplexity = 2^H
- Intuitively: the weighted average number of choices a random variable has to make
- Equally likely horses: entropy 3, perplexity 2^3 = 8
- Biased horses: entropy 2, perplexity 2^2 = 4
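For a language model evaluated on a test sequence of N words, this corresponds to the standard formulation (supplied here from Jurafsky/Martin, not reconstructed from the slide itself):

\[
\mathrm{PP}(W) = 2^{H(W)} = P(w_1 w_2 \ldots w_N)^{-\frac{1}{N}}
\]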
9 Entropy of a sequence
- Finite sequences: strings from a language L
- Entropy rate (per-word entropy)
10 Entropy of a language
- Entropy rate of language L
- Shannon-McMillan-Breiman Theorem:
- if a language is stationary and ergodic,
- a single sequence, if it is long enough, is representative of the language
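The formulas themselves did not survive the export; the standard definitions (cf. Jurafsky/Martin, ch. 6) are

\[
H(L) = \lim_{n \to \infty} \frac{1}{n} H(w_1 w_2 \ldots w_n)
\]

and, by the Shannon-McMillan-Breiman theorem, for a stationary ergodic language

\[
H(L) = \lim_{n \to \infty} -\frac{1}{n} \log p(w_1 w_2 \ldots w_n)
\]

computed on a single sufficiently long sequence.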
11 Cross Entropy
- Often we compare the quality of various models m of some actual probability distribution p which we do not know
- Language model example
- The cross entropy H(p, m) is an upper bound on the entropy H(p)
- Comparing two models m1 and m2, the one with the lower cross-entropy is more accurate
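The defining formula is missing from the exported slide; the usual definition is

\[
H(p, m) = -\sum_{x} p(x) \log_2 m(x) \;\geq\; H(p),
\]

i.e. the expected number of bits needed to encode events drawn from p when using the code that is optimal for m.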
12 Cross Entropy for sequences
- With the Shannon-McMillan-Breiman Theorem
- For a stationary ergodic process
- The cross entropy H(p, m) is an upper bound on the entropy H(p) for any model m
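The lost formula here is presumably the per-word cross entropy of a sequence (again following Jurafsky/Martin):

\[
H(p, m) = \lim_{n \to \infty} -\frac{1}{n} \log_2 m(w_1 w_2 \ldots w_n),
\]

which in practice is approximated by \(-\frac{1}{N} \log_2 m(w_1 \ldots w_N)\) on a sufficiently long test sequence of length N.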
13 Entropy of English
- (compare Jurafsky/Martin, sec. 6.7)
- Per-letter entropy
- Shannon (1951): psychological experiments with human subjects guessing the next letter in a sequence
- 1.3 bits (for 27 characters: 26 letters plus space)
- Brown et al. (1992): training of a trigram grammar in several steps
- 1.75 bits per character (based on 95 ASCII characters)
14 Smoothing Techniques
- Simple Add-One Smoothing
- Two illustrations
- Based on Manning/Schütze
- Based on Jurafsky/Martin
- (Next time) More sophisticated smoothing techniques
15 Illustration of Simple Smoothing Techniques
- Example from Manning/Schütze, ch. 6 (based on slides by Jonathan Henke, UC Berkeley)
- Corpus: five Jane Austen novels
- N = 617,091 words
- V = 14,585 unique words
- Task: predict the next word of the trigram "inferior to ________"
- From the test data (Persuasion): "In person, she was inferior to both sisters."
16 Instances in the Training Corpus: inferior to ________
17 Maximum Likelihood Estimate
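The formula on this slide was lost in the export; for the trigram task above, the maximum likelihood estimate is simply the relative frequency in the training corpus:

\[
P_{\mathrm{MLE}}(w \mid \text{inferior to}) = \frac{C(\text{inferior to } w)}{C(\text{inferior to})}
\]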
18 Actual Probability Distribution
19 Actual Probability Distribution
20 Smoothing
- Develop a model which decreases the probability of seen events and allows the occurrence of previously unseen N-grams
- a.k.a. discounting methods
- Validation: smoothing methods which utilize a second batch of test data
21 Laplace's Law (Adding One)
22 Laplace's Law (Adding One)
23 Laplace's Law
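The formulas on these three slides were lost in the export; in Manning/Schütze's notation, Laplace's Law for an n-gram w_1...w_n is

\[
P_{\mathrm{Lap}}(w_1 \ldots w_n) = \frac{C(w_1 \ldots w_n) + 1}{N + B},
\]

where C(w_1...w_n) is the training count of the n-gram, N is the number of training instances, and B is the number of possible n-gram types (bins).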
24 Smoothing Techniques
- Add-One Smoothing (Laplace's Law)
- Let us look at unigrams first
- Unsmoothed maximum likelihood estimate (see the formulas below)
- With Add-One Smoothing
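A reconstruction of the two missing formulas, with c(w) the count of w, N the number of tokens, and V the vocabulary size:

\[
P_{\mathrm{MLE}}(w) = \frac{c(w)}{N}, \qquad
P_{\mathrm{Lap}}(w) = \frac{c(w) + 1}{N + V}
\]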
25 Smoothing Techniques
- Add-One Smoothing: alternative computation
- Adjusted count c*: add one to the count and multiply by a normalizing factor (where V is the vocabulary size)
- Adjusted count (formula below)
- Probabilities (formula below)
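The missing formulas, reconstructed from the surrounding description (adjusted counts as in Jurafsky/Martin):

\[
c^{*}_i = (c_i + 1)\,\frac{N}{N + V}, \qquad
p^{*}_i = \frac{c^{*}_i}{N} = \frac{c_i + 1}{N + V}
\]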
26 Add-One Smoothing for Bigrams
- Unsmoothed bigram probabilities
- Add-one-smoothed version
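Neither formula survived the export; the standard versions (cf. Jurafsky/Martin) are

\[
P(w_n \mid w_{n-1}) = \frac{C(w_{n-1} w_n)}{C(w_{n-1})}, \qquad
P_{\mathrm{Lap}}(w_n \mid w_{n-1}) = \frac{C(w_{n-1} w_n) + 1}{C(w_{n-1}) + V}.
\]

A minimal Python sketch of both estimators (the function and variable names are my own, not from the slides):

    from collections import Counter

    def bigram_estimators(tokens):
        # count unigrams and bigrams in the training sequence
        unigrams = Counter(tokens)
        bigrams = Counter(zip(tokens, tokens[1:]))
        vocab_size = len(unigrams)

        def p_mle(prev, word):
            # unsmoothed relative frequency (0.0 if prev is unseen)
            return bigrams[(prev, word)] / unigrams[prev] if unigrams[prev] else 0.0

        def p_addone(prev, word):
            # add one to every bigram count, add V to the denominator
            return (bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size)

        return p_mle, p_addone

    # toy usage
    toks = "i want to eat chinese food i want to eat lunch".split()
    p_mle, p_addone = bigram_estimators(toks)
    print(p_mle("want", "to"), p_addone("want", "to"))
    print(p_mle("eat", "food"), p_addone("eat", "food"))  # unseen bigram: 0 before, > 0 after smoothing

Because the denominator uses the unigram count of the preceding word plus V, the smoothed probabilities for a fixed history sum to one over the vocabulary (ignoring the final token, which has no successor).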
27 Original relative frequency example
- (Illustration from Jurafsky/Martin)
- Bigram counts from the Berkeley Restaurant Project (rows: first word, columns: second word):

              I     want   to    eat   Chinese  food  lunch
    I         8     1087   0     13    0        0     0
    want      3     0      786   0     6        8     6
    to        3     0      10    860   3        0     12
    eat       0     0      2     0     19       2     52
    Chinese   2     0      0     0     0        120   1
    food      19    0      17    0     0        0     0
    lunch     4     0      0     0     0        1     0
28 Original relative frequency example
- Unigram counts from the corpus:
- I 3437
- want 1215
- to 3256
- eat 938
- Chinese 213
- food 1506
- lunch 459
29 Original relative frequency example
- Bigram probabilities (after normalizing, i.e. dividing by the unigram counts):

              I        want     to       eat      Chinese  food     lunch
    I         .0023    .32      0        .0038    0        0        0
    want      .0025    0        .65      0        .0049    .0066    .0049
    to        .00092   0        .0031    .26      .00092   0        .0037
    eat       0        0        .0021    0        .020     .0021    .055
    Chinese   .0094    0        0        0        0        .56      .0047
    food      .013     0        .011     0        0        0        0
    lunch     .0087    0        0        0        0        .0022    0
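As one worked instance of this normalization (my own illustration, consistent with the tables above):

\[
P(\text{want} \mid \text{I}) = \frac{C(\text{I want})}{C(\text{I})} = \frac{1087}{3437} \approx .32
\]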
30 Effect of Adding One
- Bigram counts after adding one:

              I     want   to    eat   Chinese  food  lunch
    I         9     1088   1     14    1        1     1
    want      4     1      787   1     7        9     7
    to        4     1      11    861   4        1     13
    eat       1     1      3     1     20       3     53
    Chinese   3     1      1     1     1        121   2
    food      20    1      18    1     1        1     1
    lunch     5     1      1     1     1        2     1
31 Effect of Adding One
- Adjusted unigram counts (count plus vocabulary size V = 1616, used as the smoothed denominators):

              C(w)   V      C(w) + V
    I         3437   1616   5053
    want      1215   1616   2831
    to        3256   1616   4872
    eat       938    1616   2554
    Chinese   213    1616   1829
    food      1506   1616   3122
    lunch     459    1616   2075
32 Smoothed bigram probabilities
- Add-one-smoothed bigram probabilities:

              I        want     to       eat      Chinese  food     lunch
    I         .0018    .22      .00020   .0028    .00020   .00020   .00020
    want      .0014    .00035   .28      .00035   .0025    .0032    .0025
    to        .00082   .00021   .0023    .18      .00082   .00021   .0027
    eat       .00039   .00039   .0012    .00039   .0078    .0012    .021
    Chinese   .0016    .00055   .00055   .00055   .00055   .066     .0011
    food      .0064    .00032   .0058    .00032   .00032   .00032   .00032
    lunch     .0024    .00048   .00048   .00048   .00048   .00096   .00048
33 Smoothed bigram counts
- Adjusted bigram counts:

              I     want   to    eat   Chinese  food  lunch
    I         6     740    .68   10    .68      .68   .68
    want      2     .42    331   .42   3        4     3
    to        3     .69    8     594   3        .69   9
    eat       .37   .37    1     .37   7.4      1     20
    Chinese   .36   .12    .12   .12   .12      15    .24
    food      10    .48    9     .48   .48      .48   .48
    lunch     1.1   .22    .22   .22   .22      .44   .22

- C(I want): 1088 → 740; C(want to): 787 → 331
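The adjusted counts follow from the add-one formula (restated here following Jurafsky/Martin); for example:

\[
c^{*}(w_{n-1} w_n) = \bigl(C(w_{n-1} w_n) + 1\bigr)\,\frac{C(w_{n-1})}{C(w_{n-1}) + V},
\qquad
c^{*}(\text{I want}) = (1087 + 1)\cdot\frac{3437}{3437 + 1616} \approx 740
\]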
34 Discounts
- An alternative view of smoothing
- Discounting (lowering) some non-zero counts in order to get the probability mass that will be assigned to the zero counts
- Discount per first word:

    I        0.68
    want     0.42
    to       0.69
    eat      0.37
    Chinese  0.12
    food     0.48
    lunch    0.22
35 Problems with Adding-One
- Too much probability mass is taken away from the observed events
- Alternatives: add-one-half, add-one-thousandth
- Lidstone's Law
- Held-out estimation
- More sophisticated smoothing/discounting techniques
- Witten-Bell Discounting
- Good-Turing Smoothing
- Special methods for N-grams
- Backoff
- Linear interpolation
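For reference, Lidstone's Law generalizes add-one smoothing by adding a fractional count λ (formula supplied from Manning/Schütze, not from the slide):

\[
P_{\mathrm{Lid}}(w_1 \ldots w_n) = \frac{C(w_1 \ldots w_n) + \lambda}{N + B\lambda},
\]

with λ = 1 giving Laplace's Law and smaller values such as λ = 1/2 taking less probability mass away from the observed events.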