1
Statistische Methoden in der Computerlinguistik / Statistical Methods in Computational Linguistics
  • 5. Smoothing Techniques
  • Jonas Kuhn
  • Universität Potsdam, 2007

2
Overview
  • Evaluating language models
  • Some basic information theory: Entropy, perplexity
  • Smoothing techniques
  • Simple smoothing techniques
  • Next time
  • More advanced smoothing techniques
  • Backoff, linear interpolation

3
Evaluating language models
  • We need evaluation metrics to determine how well
    our language models predict the next word
  • Intuition: one should average over the
    probability of new words

4
Some basic information theory
  • Evaluation metrics for language models
  • Information theory: measures of information
  • Entropy
  • Perplexity

5
Entropy
  • Average length of most efficient coding for a
    random variable
  • Binary encoding
  • Example: betting on horses

6
Entropy
  • Example: betting on horses
  • 8 horses, each horse is equally likely to win
  • (Binary) message required:
    001, 010, 011, 100, 101, 110, 111, 000
  • A 3-bit message is required (see the worked
    entropy calculation below)
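The entropy formula itself appeared as an image on the original slide and is missing from this transcript; a standard formulation consistent with the equal-probability example would be:

```latex
\begin{align*}
H(X) &= -\sum_{x} p(x)\,\log_2 p(x) \\
     &= -\sum_{i=1}^{8} \tfrac{1}{8}\,\log_2 \tfrac{1}{8}
      = \log_2 8 = 3 \text{ bits}
\end{align*}
```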

7
Entropy
  • 8 horses, some horses are more likely to win
  • Horse 1: 1/2, code 0
  • Horse 2: 1/4, code 10
  • Horse 3: 1/8, code 110
  • Horse 4: 1/16, code 1110
  • Horses 5-8: 1/64 each, codes 111100, 111101,
    111110, 111111
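The corresponding calculation for this biased distribution (a reconstruction, since the slide's equation is not in the transcript) shows that the entropy, and hence the expected code length of the encoding above, is 2 bits:

```latex
\begin{align*}
H(X) &= -\tfrac{1}{2}\log_2\tfrac{1}{2} - \tfrac{1}{4}\log_2\tfrac{1}{4}
        - \tfrac{1}{8}\log_2\tfrac{1}{8} - \tfrac{1}{16}\log_2\tfrac{1}{16}
        - 4 \cdot \tfrac{1}{64}\log_2\tfrac{1}{64} \\
     &= \tfrac{1}{2} + \tfrac{1}{2} + \tfrac{3}{8} + \tfrac{1}{4} + \tfrac{3}{8}
      = 2 \text{ bits}
\end{align*}
```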

8
Perplexity
  • Entropy: H
  • Perplexity: 2^H
  • Intuitively: the weighted average number of
    choices a random variable has to make
  • Equally likely horses: entropy = 3,
    perplexity = 2^3 = 8
  • Biased horses: entropy = 2, perplexity = 2^2 = 4
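The slide's formulas are not preserved in the transcript; a standard definition consistent with the 2^H values quoted above, including the per-word form usually used for language models over a test sequence W = w_1 ... w_N, would be:

```latex
\begin{align*}
PP(X) &= 2^{H(X)} \\
PP(W) &= P(w_1 w_2 \ldots w_N)^{-\frac{1}{N}}
       = \sqrt[N]{\frac{1}{P(w_1 w_2 \ldots w_N)}}
\end{align*}
```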

9
Entropy of a sequence
  • Finite sequence: strings from a language L
  • Entropy rate (per-word entropy)
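The equations for this slide are missing from the transcript; a common way to write the entropy of a sequence and its per-word version (entropy rate), using W_1^n for a sequence of n words from L (notation assumed here), is:

```latex
\begin{align*}
H(w_1, \ldots, w_n) &= -\sum_{W_1^n \in L} p(W_1^n)\,\log_2 p(W_1^n) \\
\frac{1}{n} H(w_1, \ldots, w_n) &= -\frac{1}{n}\sum_{W_1^n \in L} p(W_1^n)\,\log_2 p(W_1^n)
\end{align*}
```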

10
Entropy of a language
  • Entropy rate of language L
  • Shannon-McMillan-Breiman Theorem
  • If a language is stationary and ergodic, a
    single sequence, if it is long enough, is
    representative of the language
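The formulas for this slide are likewise not in the transcript; the standard statement (cf. Jurafsky/Martin, sec. 6.7) is that the entropy rate of the language is the limit of the per-word entropy, and that by the Shannon-McMillan-Breiman theorem it can be computed from a single long sequence:

```latex
\begin{align*}
H(L) &= \lim_{n \to \infty} \frac{1}{n} H(w_1, \ldots, w_n) \\
     &= \lim_{n \to \infty} -\frac{1}{n} \log_2 p(w_1 \ldots w_n)
        \quad \text{(for stationary, ergodic } L\text{)}
\end{align*}
```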

11
Cross Entropy
  • Often we compare the quality of various models
    of some actual probability distribution p, which
    we do not know
  • Language model example
  • The cross entropy H(p, m) is an upper bound on
    the entropy H(p)
  • Comparing two models m1 and m2, the one with the
    lower cross entropy is the more accurate one
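A reconstruction of the cross-entropy definition the slide presumably showed, for a model m of the true distribution p:

```latex
H(p, m) = -\sum_{x} p(x)\,\log_2 m(x) \;\geq\; H(p)
```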

12
Cross Entropy for sequences
  • With the Shannon-McMillan-Breiman Theorem
  • For a stationary ergodic process
  • The cross entropy H(p,m) is an upper bound on the
    entropy H(p)
  • For any model m
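A sketch of the sequence version (the slide's own equations are not in the transcript): by the Shannon-McMillan-Breiman theorem, for a stationary ergodic process the cross entropy can be estimated from a single long sample scored by the model m:

```latex
H(p, m) = \lim_{n \to \infty} -\frac{1}{n} \log_2 m(w_1 \ldots w_n)
        \;\approx\; -\frac{1}{N} \log_2 m(w_1 \ldots w_N)
```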

13
Entropy of English
  • (compare Jurafsky/Martin, sec. 6.7)
  • Per-letter entropy
  • Shannon (1951): psychological experiments with
    human subjects guessing the next letter in a
    sequence
  • 1.3 bits (for 27 characters: 26 letters plus
    space)
  • Brown et al. (1992): training of a trigram
    grammar in several steps
  • 1.75 bits per character (based on 95 ASCII
    characters)

14
Smoothing Techniques
  • Simple Add-One Smoothing
  • Two illustrations
  • Based on Manning/Schütze
  • Based on Jurafsky/Martin
  • (Next time) More sophisticated Smoothing
    Techniques

15
Illustration of Simple Smoothing Techniques
  • Example from Manning/Schütze, ch. 6
    (based on slides by Jonathan Henke, UC
    Berkeley)
  • Corpus: five Jane Austen novels
  • N = 617,091 words
  • V = 14,585 unique words
  • Task: predict the next word of the trigram
    inferior to ________
  • From the test data (Persuasion): "In person, she
    was inferior to both sisters."
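The maximum likelihood estimate used on the following slides (reconstructed here, since the formula images are missing) conditions on the bigram context inferior to:

```latex
P_{\mathit{MLE}}(w \mid \text{inferior to})
  = \frac{C(\text{inferior to } w)}{C(\text{inferior to})}
```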

16
Instances in the Training Corpus: inferior to ________
17
Maximum Likelihood Estimate
18
Actual Probability Distribution
19
Actual Probability Distribution
20
Smoothing
  • Develop a model which decreases the probability
    of seen events and allows the occurrence of
    previously unseen N-grams
  • a.k.a. discounting methods
  • Validation: smoothing methods which utilize a
    second batch of test data

21
Laplace's Law (Adding One)
22
Laplace's Law (Adding One)
23
Laplace's Law
24
Smoothing Techniques
  • Add-One Smoothing (Laplace's Law)
  • Let us look at unigrams first
  • Unsmoothed maximum likelihood estimate
  • With Add-One Smoothing
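The two unigram formulas referred to in the last two bullets appeared as images on the original slide; a standard reconstruction, with c(w) the count of w, N the corpus size, and V the vocabulary size, is:

```latex
\begin{align*}
P_{\mathit{MLE}}(w) &= \frac{c(w)}{N} \\
P_{+1}(w) &= \frac{c(w) + 1}{N + V}
\end{align*}
```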

25
Smoothing Techniques
  • Add-One Smoothing: alternative computation
  • Adjusted count c*: adding one to the count and
    multiplying by a normalizing factor
    (where V is the vocabulary size)
  • Adjusted count
  • Probabilities
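The adjusted-count computation sketched in the bullets above (the slide's own equations are missing from the transcript) would be, for unigrams:

```latex
\begin{align*}
c^*(w) &= (c(w) + 1)\,\frac{N}{N + V} \\
p^*(w) &= \frac{c^*(w)}{N} = \frac{c(w) + 1}{N + V}
\end{align*}
```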

26
Add-One Smoothing for Bigrams
  • Unsmoothed bigram probabilities
  • Add-one-smoothed version
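A reconstruction of the two bigram formulas (the slide images are not in the transcript):

```latex
\begin{align*}
P(w_n \mid w_{n-1}) &= \frac{C(w_{n-1} w_n)}{C(w_{n-1})} \\
P_{+1}(w_n \mid w_{n-1}) &= \frac{C(w_{n-1} w_n) + 1}{C(w_{n-1}) + V}
\end{align*}
```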

27
Original relative frequency example
  • (Illustration from Jurafsky/Martin)
  • Bigram counts from the Berkeley Restaurant
    Project

             I     want   to    eat   Chinese  food  lunch
    I        8     1087   0     13    0        0     0
    want     3     0      786   0     6        8     6
    to       3     0      10    860   3        0     12
    eat      0     0      2     0     19       2     52
    Chinese  2     0      0     0     0        120   1
    food     19    0      17    0     0        0     0
    lunch    4     0      0     0     0        1     0

28
Original relative frequency example
  • Unigram counts from corpus
  • I 3437
  • want 1215
  • to 3256
  • eat 938
  • Chinese 213
  • food 1506
  • lunch 459

29
Original relative frequency example
  • Bigram probabilities (after normalizing by
    dividing by the unigram counts)

             I       want    to      eat     Chinese  food    lunch
    I        .0023   .32     0       .0038   0        0       0
    want     .0025   0       .65     0       .0049    .0066   .0049
    to       .00092  0       .0031   .26     .00092   0       .0037
    eat      0       0       .0021   0       .020     .0021   .055
    Chinese  .0094   0       0       0       0        .56     .0047
    food     .013    0       .011    0       0        0       0
    lunch    .0087   0       0       0       0        .0022   0

30
Effect of Adding One
  • Bigram counts after adding one

             I     want   to    eat   Chinese  food  lunch
    I        9     1088   1     14    1        1     1
    want     4     1      787   1     7        9     7
    to       4     1      11    861   4        1     13
    eat      1     1      3     1     20       3     53
    Chinese  3     1      1     1     1        121   2
    food     20    1      18    1     1        1     1
    lunch    5     1      1     1     1        2     1

31
Effect of Adding One
  • Adjusted unigram counts (unigram count plus
    V = 1616)

    I        3437 + 1616 = 5053
    want     1215 + 1616 = 2831
    to       3256 + 1616 = 4872
    eat       938 + 1616 = 2554
    Chinese   213 + 1616 = 1829
    food     1506 + 1616 = 3122
    lunch     459 + 1616 = 2075

32
Smoothed bigram probabilities
             I       want    to      eat     Chinese  food    lunch
    I        .0018   .22     .00020  .0028   .00020   .00020  .00020
    want     .0014   .00035  .28     .00035  .0025    .0032   .0025
    to       .00082  .00021  .0023   .18     .00082   .00021  .0027
    eat      .00039  .00039  .0012   .00039  .0078    .0012   .021
    Chinese  .0016   .00055  .00055  .00055  .00055   .066    .0011
    food     .0064   .00032  .0058   .00032  .00032   .00032  .00032
    lunch    .0024   .00048  .00048  .00048  .00048   .00096  .00048

33
Smoothed bigram counts
  • Adjusted bigram counts

             I     want  to    eat   Chinese  food  lunch
    I        6     740   .68   10    .68      .68   .68
    want     2     .42   331   .42   3        4     3
    to       3     .69   8     594   3        .69   9
    eat      .37   .37   1     .37   7.4      1     20
    Chinese  .36   .12   .12   .12   .12      15    .24
    food     10    .48   9     .48   .48      .48   .48
    lunch    1.1   .22   .22   .22   .22      .44   .22

  • C(I want): 1088 → 740; C(want to): 787 → 331
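A minimal Python sketch (not part of the original slides; the variable and function names are my own) that recomputes a few cells of the tables above from the raw Berkeley Restaurant counts, assuming the vocabulary size V = 1616 implied by the adjusted unigram counts:

```python
# Add-one (Laplace) smoothing, illustrated on a few cells of the Berkeley
# Restaurant Project tables above. V = 1616 is the vocabulary size implied
# by the adjusted unigram counts (e.g. 3437 + 1616 = 5053).
V = 1616

unigram = {"I": 3437, "want": 1215}   # unigram counts from the corpus
bigram = {("I", "want"): 1087}        # raw bigram counts; missing pairs count as 0


def p_add_one(w1: str, w2: str) -> float:
    """Add-one-smoothed bigram probability P(w2 | w1) = (C(w1 w2) + 1) / (C(w1) + V)."""
    return (bigram.get((w1, w2), 0) + 1) / (unigram[w1] + V)


def c_star(w1: str, w2: str) -> float:
    """Reconstituted bigram count c* = (C(w1 w2) + 1) * C(w1) / (C(w1) + V)."""
    return (bigram.get((w1, w2), 0) + 1) * unigram[w1] / (unigram[w1] + V)


print(round(p_add_one("I", "want"), 2))   # 0.22 -- matches the smoothed probability table
print(round(c_star("I", "want")))         # 740  -- matches the adjusted count table
print(round(c_star("I", "to"), 2))        # 0.68 -- the adjusted count for zero-count bigrams in the I row
```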

34
Discounts
  • An alternative view of smoothing
  • Discounting (lowering) some non-zero counts in
    order to get the probability mass that will be
    assigned to the zero counts
  • Discount
  • I: 0.68
  • want: 0.42
  • to: 0.69
  • eat: 0.37
  • Chinese: 0.12
  • food: 0.48
  • lunch: 0.22
35
Problems with Adding-One
  • Too much probability mass is taken away from the
    observed events
  • Alternatives: add-one-half, add-one-thousandth
  • Lidstone's Law (sketched below)
  • Held-out estimation
  • More sophisticated smoothing/discounting
    techniques
  • Witten-Bell Discounting
  • Good-Turing Smoothing
  • Special Methods for N-Grams
  • Backoff
  • Linear Interpolation
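Lidstone's Law, mentioned above, generalizes add-one smoothing by adding a fractional count λ instead of 1 (λ = 1 gives Laplace's Law, λ = 0.5 the add-one-half variant); a standard formulation for unigrams would be:

```latex
P_{\mathit{Lid}}(w) = \frac{c(w) + \lambda}{N + \lambda V}, \qquad \lambda > 0
```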