CS60057 Speech

1
CS60057 Speech and Natural Language Processing
  • Autumn 2007

Lecture 7, 8 August 2007
2
A Simple Example
  • P(I want to eat Chinese food)
  • = P(I) P(want | I) P(to | want) P(eat | to)
    P(Chinese | eat) P(food | Chinese)
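A minimal sketch (mine, not from the lecture) of how this bigram decomposition is evaluated. Most probabilities come from the BERP values quoted on later slides; the value for P(Chinese | eat) is back-filled so the product matches the .00015 figure quoted later and is only illustrative.

  # Score a sentence with a bigram model: P(w1..wn) ~ product of P(wi | wi-1).
  bigram_p = {
      ("<s>", "i"): 0.25, ("i", "want"): 0.32, ("want", "to"): 0.65,
      ("to", "eat"): 0.26, ("eat", "chinese"): 0.020, ("chinese", "food"): 0.56,
  }

  def sentence_prob(words, bigram_p):
      p, prev = 1.0, "<s>"
      for w in words:
          p *= bigram_p.get((prev, w), 0.0)   # unseen bigram -> 0 without smoothing
          prev = w
      return p

  print(sentence_prob("i want to eat chinese food".split(), bigram_p))   # ~ .00015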

3
A Bigram Grammar Fragment from BERP
4
(No Transcript)
5
  • P(I want to eat British food) = P(I)
    P(want | I) P(to | want) P(eat | to) P(British | eat)
    P(food | British) = .25 x .32 x .65 x .26 x .001 x .60
    = .0000081
  • vs. P(I want to eat Chinese food) = .00015
  • Probabilities seem to capture "syntactic" facts and
    "world knowledge"
  • eat is often followed by an NP
  • British food is not too popular
  • N-gram models can be trained by counting and
    normalization

6
BERP Bigram Counts
7
BERP Bigram Probabilities
  • Normalization: divide each row's counts by the
    appropriate unigram count for wn-1
  • Computing the bigram probability of I | I
  • C(I I) / C(all I)
  • P(I | I) = 8 / 3437 = .0023
  • Maximum Likelihood Estimation (MLE): relative
    frequency, e.g. P(wn | wn-1) = C(wn-1 wn) / C(wn-1)
    (see the sketch below)
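A toy sketch (mine, not the course's code) of the counting-and-normalization recipe for the MLE estimate above:

  from collections import Counter

  # MLE relative frequency: P(wn | wn-1) = C(wn-1 wn) / C(wn-1).
  tokens = "i want to eat chinese food i want to drink".split()   # toy corpus
  bigram_c = Counter(zip(tokens, tokens[1:]))
  unigram_c = Counter(tokens)

  def p_mle(prev, w):
      return bigram_c[(prev, w)] / unigram_c[prev]

  print(p_mle("want", "to"))   # 2 / 2 = 1.0 on this toy corpus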

8
What do we learn about the language?
  • What's being captured with ...
  • P(want | I) = .32
  • P(to | want) = .65
  • P(eat | to) = .26
  • P(food | Chinese) = .56
  • P(lunch | eat) = .055
  • What about...
  • P(I | I) = .0023
  • P(I | want) = .0025
  • P(I | food) = .013

9
  • P(I | I) = .0023    I I I I want
  • P(I | want) = .0025    I want I want
  • P(I | food) = .013    the kind of food I want is ...

10
Approximating Shakespeare
  • As we increase the value of N, the accuracy of
    the n-gram model increases, since choice of next
    word becomes increasingly constrained
  • Generating sentences with random unigrams...
  • Every enter now severally so, let
  • Hill he late speaks or! a more to leg less first
    you enter
  • With bigrams...
  • What means, sir. I confess she? then all sorts,
    he is trim, captain.
  • Why dost stand forth thy canopy, forsooth he is
    this palpable hit the King Henry.

11
  • Trigrams
  • Sweet prince, Falstaff shall die.
  • This shall forbid it should be branded, if renown
    made it empty.
  • Quadrigrams
  • What! I will go seek the traitor Gloucester.
  • Will you not tell me who I am?
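The kind of random generation behind these examples can be sketched as follows (my own illustration, not the actual Shakespeare corpus or code): collect bigram counts, then repeatedly sample the next word from P(w | previous word).

  import random
  from collections import Counter, defaultdict

  def train_bigrams(tokens):
      """Collect next-word counts for each context word."""
      nexts = defaultdict(Counter)
      for prev, w in zip(tokens, tokens[1:]):
          nexts[prev][w] += 1
      return nexts

  def generate(nexts, start="<s>", max_len=15):
      """Sample a word sequence by repeatedly drawing from P(w | prev)."""
      out, prev = [], start
      for _ in range(max_len):
          if prev not in nexts:
              break
          words, counts = zip(*nexts[prev].items())
          prev = random.choices(words, weights=counts)[0]
          if prev == "</s>":
              break
          out.append(prev)
      return " ".join(out)

  toy = "<s> sweet prince , falstaff shall die . </s> <s> will you not tell me who i am ? </s>".split()
  print(generate(train_bigrams(toy)))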

12
  • There are 884,647 tokens, with 29,066 word form
    types, in about a one million word Shakespeare
    corpus
  • Shakespeare produced 300,000 bigram types out of
    844 million possible bigrams; so 99.96% of the
    possible bigrams were never seen (have zero
    entries in the table)
  • Quadrigrams are worse: what's coming out looks like
    Shakespeare because it is Shakespeare

13
N-Gram Training Sensitivity
  • If we repeated the Shakespeare experiment but
    trained our n-grams on a Wall Street Journal
    corpus, what would we get?
  • This has major implications for corpus selection
    or design
  • Dynamically adapting language models to different
    genres

14
Unknown words
  • Unknown or Out Of Vocabulary (OOV) words
  • Open Vocabulary system: model the unknown word
    with a special token <UNK>; training is as follows
  • Choose a vocabulary
  • Convert any word in the training set not belonging to
    this set to <UNK>
  • Estimate the probabilities for <UNK> from its
    counts
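A minimal sketch of this <UNK> recipe (the toy corpus and the count cutoff of 2 are my own choices, not from the lecture):

  from collections import Counter

  # Open-vocabulary training with <UNK>.
  train_tokens = "i want to eat , i want to drink , i crave rambutan".split()
  vocab = {w for w, c in Counter(train_tokens).items() if c >= 2}   # 1. choose a vocabulary
  mapped = [w if w in vocab else "<UNK>" for w in train_tokens]     # 2. map OOV words to <UNK>
  counts = Counter(mapped)                                          # 3. estimate <UNK> like any word
  print(counts["<UNK>"])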

15
Evaluating n-grams - Perplexity
  • Evaluating applications (like speech recognition) is
    potentially expensive
  • Need a metric to quickly evaluate potential
    improvements in a language model
  • Perplexity
  • Intuition: the better model has a tighter fit to
    the test data (assigns higher probability to the test
    data)
  • PP(W) = P(w1 w2 ... wN)^(-1/N)  (see the sketch below)
  • (pg 14, chapter 4)
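A sketch (mine) of perplexity computed in log space to avoid underflow; bigram_p is assumed to be a smoothed model that never returns zero:

  import math

  def perplexity(test_tokens, bigram_p, start="<s>"):
      """PP(W) = P(w1..wN)^(-1/N), via the average negative log probability."""
      log_p, prev = 0.0, start
      for w in test_tokens:
          log_p += math.log(bigram_p(prev, w))   # model must assign p > 0 (smoothed)
          prev = w
      return math.exp(-log_p / len(test_tokens))

  # Toy usage with a uniform model over a 100-word vocabulary: PP = 100.
  print(perplexity("i want to eat".split(), lambda prev, w: 1 / 100))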

16
Some Useful Empirical Observations
  • A small number of events occur with high
    frequency
  • A large number of events occur with low frequency
  • You can quickly collect statistics on the high
    frequency events
  • You might have to wait an arbitrarily long time
    to get valid statistics on low frequency events
  • Some of the zeroes in the table are really zeros.
    But others are simply low-frequency events you
    haven't seen yet. How do we address this?

17
Smoothing: None
  • Called the Maximum Likelihood estimate.
  • Terrible on test data: if there are no occurrences of
    an n-gram xyz (C(xyz) = 0), its probability is 0.

18
Smoothing Techniques
  • Every n-gram training matrix is sparse, even for
    very large corpora (Zipf's law)
  • Solution: estimate the likelihood of unseen
    n-grams
  • Problem: how do you adjust the rest of the
    corpus to accommodate these phantom n-grams?

19
Smoothing: Redistributing Probability Mass

20
Smoothing Techniques
  • Every n-gram training matrix is sparse, even for
    very large corpora (Zipf's law)
  • Solution: estimate the likelihood of unseen
    n-grams
  • Problem: how do you adjust the rest of the
    corpus to accommodate these phantom n-grams?

21
Add-one Smoothing
  • For unigrams
  • Add 1 to every word (type) count
  • Normalize by N (tokens) / (N (tokens) + V (types))
  • Smoothed count (adjusted for additions to N) is
    ci* = (ci + 1) N / (N + V)
  • Normalize by N to get the new unigram
    probability: pi* = (ci + 1) / (N + V)
  • For bigrams
  • Add 1 to every bigram count: c(wn-1 wn) + 1
  • Increment each unigram count by the vocabulary size:
    c(wn-1) + V  (see the sketch below)
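The bigram recipe above can be sketched in a few lines (my own toy counts, not BERP data):

  from collections import Counter

  def add_one_bigram_prob(prev, w, bigram_counts, unigram_counts, V):
      """P*(w | prev) = (c(prev, w) + 1) / (c(prev) + V)."""
      return (bigram_counts[(prev, w)] + 1) / (unigram_counts[prev] + V)

  tokens = "i want to eat chinese food".split()     # toy corpus
  bigram_counts = Counter(zip(tokens, tokens[1:]))
  unigram_counts = Counter(tokens)
  V = len(set(tokens))                              # vocabulary size (types)

  print(add_one_bigram_prob("want", "to", bigram_counts, unigram_counts, V))    # (1+1)/(1+6)
  print(add_one_bigram_prob("want", "food", bigram_counts, unigram_counts, V))  # unseen: 1/(1+6)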

22
Effect on BERP bigram counts
23
Add-one bigram probabilities
24
The problem
25
The problem
  • Add-one has a huge effect on probabilities: e.g.,
    P(to | want) went from .65 to .28!
  • Too much probability gets removed from n-grams
    actually encountered
  • (more precisely, the discount factor is too large)

26
  • Discount: ratio of new counts to old (e.g.
    add-one smoothing changes the BERP bigram count for
    (to | want) from 786 to 331 (d = c*/c = .42) and
    P(to | want) from .65 to .28)
  • But this changes counts drastically
  • too much weight is given to unseen n-grams
  • in practice, unsmoothed bigrams often work better!

27
Smoothing
  • Add-one smoothing
  • Works very badly.
  • Add-delta smoothing
  • Still very bad.

based on slides by Joshua Goodman
28
Witten-Bell Discounting
  • A zero n-gram is just an n-gram you haven't seen
    yet, but every n-gram in the corpus was unseen
    once, so...
  • How many times did we see an n-gram for the first
    time? Once for each n-gram type (T)
  • Estimate the total probability of unseen bigrams as
    T / (N + T)
  • View the training corpus as a series of events, one for
    each token (N) and one for each new type (T)
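A small sketch (mine) of these Witten-Bell quantities on a toy corpus; the unconditioned T / (N + T) form above is my reading of the formula missing from the transcript:

  from collections import Counter

  # Witten-Bell estimate of the total probability mass reserved for unseen bigrams.
  tokens = "i want to eat chinese food i want to drink".split()
  bigrams = list(zip(tokens, tokens[1:]))
  N = len(bigrams)                      # bigram tokens (events)
  T = len(set(bigrams))                 # bigram types (first-time events)
  V = len(set(tokens))

  unseen_mass = T / (N + T)             # total mass reserved for unseen bigrams
  Z = V * V - T                         # number of unseen bigram types
  print(unseen_mass, unseen_mass / Z)   # total mass, and the equal share per unseen bigram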

29
  • We can divide the probability mass equally among
    unseen bigrams, or we can condition the
    probability of an unseen bigram on the first word
    of the bigram
  • Discount values for Witten-Bell are much more
    reasonable than for Add-One

30
Good-Turing Discounting
  • Re-estimate the amount of probability mass for zero
    (or low count) n-grams by looking at n-grams with
    higher counts
  • Nc = the number of n-grams with frequency c
  • Estimate the smoothed count as c* = (c + 1) Nc+1 / Nc
  • E.g. N0's adjusted count is a function of the
    count of n-grams that occur once, N1
  • P(things with frequency zero) = N1 / N
  • Assumes
  • word bigrams follow a binomial distribution
  • We know the number of unseen bigrams (V x V - seen)
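A sketch (mine, not the lecture's code) of Good-Turing re-estimation on toy bigram counts; real implementations also smooth the Nc curve, since Nc+1 can be zero for large c:

  from collections import Counter

  counts = Counter({("a", "b"): 1, ("a", "c"): 1, ("b", "c"): 1, ("c", "a"): 2, ("a", "a"): 3})
  N_c = Counter(counts.values())        # N1 = 3, N2 = 1, N3 = 1
  N = sum(counts.values())              # total bigram tokens = 8

  def c_star(c):
      """Smoothed count: c* = (c + 1) * Nc+1 / Nc."""
      return (c + 1) * N_c[c + 1] / N_c[c]

  print(c_star(1))        # singleton count discounted to 2 * N2 / N1 = 0.67
  print(N_c[1] / N)       # probability mass shifted to unseen bigrams: N1 / N = 0.375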

31
Interpolation and Backoff
  • Typically used in addition to smoothing
    techniques/ discounting
  • Example: trigrams
  • Smoothing gives some probability mass to all the
    trigram types not observed in the training data
  • We could make a more informed decision! How?
  • If backoff finds an unobserved trigram in the
    test data, it will back off to bigrams (and
    ultimately to unigrams)
  • Backoff doesn't treat all unseen trigrams alike
  • When we have observed a trigram, we will rely
    solely on the trigram counts
  • Interpolation generally takes bigrams and
    unigrams into account for trigram probability

32
Backoff methods (e.g. Katz 87)
  • For, e.g., a trigram model
  • Compute unigram, bigram and trigram probabilities
  • In use
  • Where the trigram is unavailable, back off to the bigram if
    available, otherwise to the unigram probability
  • E.g. "an omnivorous unicorn"
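A much-simplified backoff sketch (my illustration; Katz's actual method also applies discounting and back-off weights so the distribution still sums to 1):

  def backoff_prob(w3, w2, w1, tri, bi, uni):
      """Use the highest-order estimate whose history was actually observed."""
      if tri.get((w1, w2, w3)):
          return tri[(w1, w2, w3)]
      if bi.get((w2, w3)):
          return bi[(w2, w3)]
      return uni.get(w3, 1e-8)          # tiny floor for completely unseen words

  # e.g. "an omnivorous unicorn": the trigram is unseen, so we back off to the bigram.
  tri, bi, uni = {}, {("omnivorous", "unicorn"): 0.01}, {"unicorn": 0.0001}
  print(backoff_prob("unicorn", "omnivorous", "an", tri, bi, uni))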

33
Smoothing: Simple Interpolation
  • Trigram is very context specific, very noisy
  • Unigram is context-independent, smooth
  • Interpolate trigram, bigram and unigram estimates for the best
    combination
  • Find interpolation weights λ (each between 0 and 1, summing to 1)
  • Almost good enough
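A sketch of simple linear interpolation with fixed weights; the λ values and probabilities below are placeholders, and in practice the weights are tuned on held-out data as the next slide describes:

  def interp_prob(w3, w2, w1, tri, bi, uni, lambdas=(0.6, 0.3, 0.1)):
      """P(w3 | w1 w2) = l1*P_tri + l2*P_bi + l3*P_uni, with l1 + l2 + l3 = 1."""
      l1, l2, l3 = lambdas
      return (l1 * tri.get((w1, w2, w3), 0.0)
              + l2 * bi.get((w2, w3), 0.0)
              + l3 * uni.get(w3, 0.0))

  tri = {("i", "want", "to"): 0.5}
  bi = {("want", "to"): 0.65}
  uni = {"to": 0.05}
  print(interp_prob("to", "want", "i", tri, bi, uni))   # 0.6*0.5 + 0.3*0.65 + 0.1*0.05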

34
Smoothing: Held-out estimation
  • Finding parameter values
  • Split data into training, heldout, test
  • Try lots of different values for λ on the heldout
    data, pick the best
  • Test on test data
  • Sometimes, can use tricks like EM (expectation
    maximization) to find values
  • Joshua Goodman: "I prefer to use a generalized
    search algorithm, 'Powell search'; see Numerical
    Recipes in C"

based on slides by Joshua Goodman
35
Held-out estimation splitting data
  • How much data for training, heldout, test?
  • Some people say things like 1/3, 1/3, 1/3 or
    80%, 10%, 10%. They are WRONG.
  • Heldout should have (at least) 100-1000 words per
    parameter.
  • Answer: enough test data to be statistically
    significant (1000s of words, perhaps).

based on slides by Joshua Goodman
36
Summary
  • N-gram probabilities can be used to estimate the
    likelihood
  • Of a word occurring in a context (the previous N-1 words)
  • Of a sentence occurring at all
  • Smoothing techniques deal with problems of unseen
    words in a corpus

37
Practical Issues
  • Represent and compute language model
    probabilities in log format
  • p1 x p2 x p3 x p4 = exp(log p1 + log p2 + log p3 + log p4)
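For example (values arbitrary):

  import math

  probs = [0.25, 0.32, 0.65, 0.26]            # arbitrary n-gram probabilities
  log_p = sum(math.log(p) for p in probs)     # add log probabilities instead of multiplying
  print(log_p, math.exp(log_p))               # exp only when a real probability is needed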

38
Class-based n-grams
  • P(wi | wi-1) ≈ P(ci | ci-1) x P(wi | ci)
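A minimal sketch of this class-based estimate; the classes and probabilities are invented for illustration:

  # P(w_i | w_{i-1}) ~ P(c_i | c_{i-1}) * P(w_i | c_i)
  word_class = {"shanghai": "CITY", "to": "PREP"}              # hypothetical class assignments
  p_class_bigram = {("PREP", "CITY"): 0.2}                     # P(c_i | c_{i-1})
  p_word_given_class = {("shanghai", "CITY"): 0.05}            # P(w_i | c_i)

  def class_bigram_prob(prev, w):
      ci, cprev = word_class[w], word_class[prev]
      return p_class_bigram.get((cprev, ci), 0.0) * p_word_given_class.get((w, ci), 0.0)

  print(class_bigram_prob("to", "shanghai"))    # 0.2 * 0.05 = 0.01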