1
Statistische Methoden in der Computerlinguistik /
Statistical Methods in Computational Linguistics
  • 4. Probabilistic Language Models
  • Jonas Kuhn
  • Universität Potsdam, 2007

2
Outline
  • Viewing natural language text as the result of a
    random process
  • N-gram language models
  • Applications, Motivation
  • Assumptions
  • Training language models (1)
  • relative frequency estimates
  • Toolkits for N-gram LM training
  • Preview: Training language models (2)
  • smoothing techniques
  • Python programming: using dictionaries, counting,
    calculating relative frequencies, etc.
  • Task: implement a bigram and a trigram language
    model on individual characters

3
N-gram language models
  • Viewing natural language text as the result of a
    random process
  • Based on Jurafsky/Martin, ch. 6

4
Guessing the next word in a text
  • Given a sequence of words, what will be the next
    word?
  • Hard to guess, but if we don't demand extremely
    high accuracy, it is not that hard
  • I'd like to make a collect ...
  • call
  • telephone
  • international

5
Guessing the next word in a text
  • Applications
  • Speech recognition (language modeling)
  • Hand-writing recognition
  • Augmentative communication systems for the
    disabled
  • Context-sensitive spelling error correction (see
    example on next slide)
  • Further applications: statistical machine
    translation
  • Note: there are always other information sources,
    so prediction of the next word is used to choose
    among alternative hypotheses

6
Some real-world spelling errors (Kukich 1992)
  • They are leaving in about fifteen minuets to go
    to her house.
  • The study was conducted mainly be John Black.
  • The design an construction of the system will
    take more than a year.
  • Hopefully, all with continue smoothly in my
    absence.
  • Can they lave him my messages?
  • I need to notified the bank of this problem.
  • He is trying to fine out.

7
Probability of a sequence of words
  • Two closely related problems
  • Guessing the next word
  • Computing the probability of a sequence of words

8
Counting words in corpora
  • To estimate probabilities, we need to count
    frequencies
  • What do people count?
  • Word forms
  • Lemmas
  • The type/token distinction
  • Number of (word form) types: distinct words in a
    corpus (i.e., the size of the vocabulary)
  • Number of (word form) tokens: total number of
    running words (a counting sketch follows below)
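  A minimal counting sketch in Python (the file name corpus.txt and
  plain whitespace tokenization are assumptions, not part of the
  slides):

    # Count word-form tokens and types in a plain-text corpus.
    # Real corpora usually need proper tokenization and case handling.
    from collections import Counter

    counts = Counter()
    with open("corpus.txt", encoding="utf-8") as f:
        for line in f:
            counts.update(line.split())

    num_tokens = sum(counts.values())   # running words (tokens)
    num_types = len(counts)             # distinct word forms (types)
    print(num_tokens, num_types)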

9
Counting words in corpora
  • Switchboard corpus (spoken English)
  • 2.4 million word form tokens
  • c. 20,000 word form types
  • Shakespeare's complete works
  • 884,647 word form tokens
  • 29,066 word form types
  • Brown corpus
  • 1 million word form tokens
  • 61,805 word form types (37,851 lemma types)

10
Estimating word probabilities
  • How probable is an English word (form) w1 as the
    next word in a sequence?
  • Simplest model: every word has the same
    probability of occurring
  • Assume a vocabulary size of 100,000
  • Single word: the probability of finding w1 is
    P(w1) = 1/100,000 = 0.00001
  • Word wn in a sequence, assuming conditional
    independence from the context:
    P(wn) = 1/100,000

11
Estimating word probabilities
  • Sequence of two words w1 w2, still assuming
  • that each word form is equally likely
  • that w1 and w2 are conditionally independent from
    each other
  • that w1 and w2 are conditionally independent from
    the context
  • Under these assumptions:
    P(w1 w2) = 1/100,000 × 1/100,000 = 10^-10

12
A slightly more complex model
  • Still assume that any word can follow any other
    word
  • Take into account that different word forms occur
    with different frequencies
  • the occurs 69,971 times in the 1,000,000 tokens
    of the Brown corpus
  • rabbit occurs 11 times in the Brown corpus
  • Austin occurs 20 times
  • linguist occurs 13 times

13
A slightly more complex model
  • Estimating probabilities based on relative
    frequency
  • Sample: 1,000,000 trials of producing a random
    English word (N = 1,000,000)
  • Relative frequency of outcome u: count(u) / N
    (see the sketch below)
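  A minimal sketch of relative-frequency estimation; the toy counts
  are assumptions for illustration, except that 69,971 occurrences of
  "the" in 1,000,000 tokens is the Brown figure quoted above:

    # Relative frequency: P(u) = count(u) / N
    def relative_frequencies(counts):
        n = sum(counts.values())
        return {w: c / n for w, c in counts.items()}

    counts = {"the": 69971, "rabbit": 11, "OTHER": 1000000 - 69971 - 11}
    print(round(relative_frequencies(counts)["the"], 2))   # 0.07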

14
Conditional probability of a word
  • Relative frequencies are not a good model for the
    probability of words in a given context
  • Just then, the white ...
  • P(the) = .07, P(rabbit) = .00001
  • We should take the previous words that have
    occurred into account
  • We will get P(rabbit | white) > P(rabbit)

15
Language model: probability of a string
  • Using the chain rule of probability:
    P(w1 ... wn) = P(w1) · P(w2 | w1) · P(w3 | w1 w2)
    · ... · P(wn | w1 ... wn-1)
  • But how can we estimate such probabilities?
  • If we wanted to count the frequency of every word
    appearing after a long sequence of other words, we
    would need far too large a corpus as a sample

16
Language model: probability of a string
  • Approximate the probability
  • We have to form equivalence classes over word
    contexts, so we get a larger sample from which we
    estimate probabilities
  • Simple approximation: look only at one preceding
    word, i.e., P(wn | w1 ... wn-1) ≈ P(wn | wn-1)

17
Bigram model
  • Approximate
    P(rabbit | Just the other day I saw a)
    by
    P(rabbit | a)
  • Markov assumption: predicting a future event
    based on a limited window of past events
  • Bigrams: first-order Markov model (looking back
    one token into the past)

18
N-gram models
  • Bigram model
  • first-order Markov model
  • looking back one token
  • Trigram model
  • second-order Markov model
  • looking back two tokens
  • N-gram model
  • (N-1)th-order Markov model
  • looking back N-1 tokens

19
Bigram approximation of string prob.
  • Simplifying assumption (Markov):
    P(wn | w1 ... wn-1) ≈ P(wn | wn-1)
  • Resulting equation (bigram language model):
    P(w1 ... wn) ≈ P(w1) · P(w2 | w1) · ... · P(wn | wn-1)

20
Bigram language model example
  • Berkeley Restaurant Project (corpus of c. 10,000
    sentences)
  • Most likely words to follow "eat"
  • eat on .16
  • eat some .06
  • eat lunch .06
  • eat dinner .05
  • eat at .04
  • eat a .04
  • eat Indian .04
  • eat today .03
  • eat Thai .03
  • eat breakfast .03
  • eat in .02
  • eat Chinese .02
  • eat Mexican .02
  • eat tomorrow .01
  • eat dessert .007
  • eat British .001

21
Bigram probabilities
  • <s> I .25      I want .32     want to .65
  • <s> I'd .06    I would .29    want a .05
  • <s> Tell .04   I don't .08    want some .04
  • <s> I'm .02    I have .04     want thai .01
  • to eat .26     British food .60
  • to have .14    British restaurant .15
  • to spend .09   British cuisine .01
  • to be .02      British lunch .01

22
Computing the probability of a sentence
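  As an illustration (the example sentence is ours), using the bigram
  probabilities from the previous slide:

    P(I want to eat British food)
      ≈ P(I | <s>) · P(want | I) · P(to | want) · P(eat | to)
        · P(British | eat) · P(food | British)
      = .25 × .32 × .65 × .26 × .001 × .60
      ≈ 0.0000081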
23
Training N-gram models
  • Counting and normalizing
  • Count occurrences of a bigram (say, "eat lunch")
  • Divide by the total count of bigrams sharing the
    first word (i.e., "eat w" for some w); a sketch
    follows below
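  A minimal sketch of bigram counting and normalizing (MLE); the
  function name, the sentence-boundary markers and the toy input are
  assumptions, not from the slides:

    # Estimate P(w2 | w1) = C(w1 w2) / sum over w of C(w1 w)
    from collections import Counter, defaultdict

    def train_bigram_lm(sentences):
        bigram_counts = defaultdict(Counter)
        for sent in sentences:
            tokens = ["<s>"] + sent + ["</s>"]
            for w1, w2 in zip(tokens, tokens[1:]):
                bigram_counts[w1][w2] += 1
        # normalize each row by the total count of bigrams starting with w1
        return {w1: {w2: c / sum(row.values()) for w2, c in row.items()}
                for w1, row in bigram_counts.items()}

    lm = train_bigram_lm([["I", "want", "to", "eat", "lunch"]])
    print(lm["to"]["eat"])   # 1.0 in this toy example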

24
Training N-gram models
  • General case of N-gram parameter estimation
  • Relative frequency estimate:
    P(wn | wn-N+1 ... wn-1) = C(wn-N+1 ... wn) / C(wn-N+1 ... wn-1)
  • This is an example of the Maximum Likelihood
    Estimation (MLE) technique

25
Maximum Likelihood Estimation (MLE)
  • Estimating the parameters of a probability model so
    that the likelihood of the training data given the
    model (i.e., P(T | M)) is maximized
  • There are better ways of estimating N-gram
    probabilities, building on top of relative
    frequencies

26
Relative frequency example
  • Bigram counts from Berkeley Restaurant Project
  • I want to eat Chinese food lunch
  • I 8 1087 0 13 0 0 0
  • want 3 0 786 0 6 8 6
  • to 3 0 10 860 3 0 12
  • eat 0 0 2 0 19 2 52
  • Chinese 2 0 0 0 0 120 1
  • food 19 0 17 0 0 0 0
  • lunch 4 0 0 0 0 1 0

27
Relative frequency example
  • Bigram probabilities (after normalizing, i.e.
    dividing the counts by the unigram count of the
    first word; see the note below)
    (rows = first word, columns = second word)

              I       want  to     eat    Chinese  food   lunch
    I         .0023   .32   0      .0038  0        0      0
    want      .0025   0     .65    0      .0049    .0066  .0049
    to        .00092  0     .0031  .26    .00092   0      .0037
    eat       0       0     .0021  0      .020     .0021  .055
    Chinese   .0094   0     0      0      0        .56    .0047
    food      .013    0     .011   0      0        0      0
    lunch     .0087   0     0      0      0        .0022  0
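  These probabilities are the counts from the previous slide divided
  by the unigram count of the row word. For example, from
  P(want | I) = C(I want) / C(I) = .32 and C(I want) = 1087 it follows
  that C(I) ≈ 1087 / .32 ≈ 3400 (a value inferred here; the unigram
  counts themselves are not listed on the slides).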

28
The CMU Statistical Language Modeling Toolkit
  • Version 2 available from
    http://mi.eng.cam.ac.uk/~prc14/toolkit.html

29
Using the CMU SLM Toolkit
  • Given a large corpus of text in a file a.text,
    but no specified vocabulary
  • Compute the word unigram counts:
    cat a.text | text2wfreq > a.wfreq
  • Convert the word unigram counts into a vocabulary
    consisting of the 20,000 most common words:
    cat a.wfreq | wfreq2vocab -top 20000 > a.vocab

30
Using the CMU SLM Toolkit
  • Generate a binary id 3-gram file of the training
    text, based on this vocabulary:
    cat a.text | text2idngram -vocab a.vocab > a.idngram
  • Convert the idngram into a binary-format language
    model:
    idngram2lm -idngram a.idngram -vocab a.vocab \
               -binary a.binlm

31
Using the CMU SLM Toolkit
  • Compute the perplexity of the language model with
    respect to some test text b.text:
    evallm -binary a.binlm
    Reading in language model from file a.binlm
    Done.
    evallm : perplexity -text b.text
    Computing perplexity of the language model with
    respect to the text b.text
    Perplexity = 128.15, Entropy = 7.00 bits
    Computation based on 8842804 words.
    Number of 3-grams hit = 6806674  (76.97%)
    Number of 2-grams hit = 1766798  (19.98%)
    Number of 1-grams hit = 269332   (3.05%)
    1218322 OOVs (12.11%) and 576763 context cues were
    removed from the calculation.
    evallm : quit
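  By the standard relation perplexity = 2^entropy, 2^7.00 = 128, which
  matches the reported perplexity of 128.15 up to rounding.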

32
Using the CMU SLM Toolkit

33
Python Exercise
  • Implement a bigram and a trigram language model
    training program
  • N-gram model on individual characters
  • Use Europarl data (de, en, fr, es) as training data
  • Collect a number of sample texts for testing from
    the Web
  • Implement a program for applying the LMs to the
    test texts to determine the language of a text
    (see the sketch below)
  • Assume P(Language) to be uniform for all languages
  • Keep track of test results (we will discuss this
    in more detail)
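  A minimal sketch of the character-bigram approach for language
  identification; the training file names (de.txt, en.txt, fr.txt,
  es.txt) and the small probability floor for unseen bigrams (in place
  of the smoothing techniques previewed above) are assumptions, not
  part of the exercise specification:

    # Character-bigram language models for language identification.
    import math
    from collections import Counter, defaultdict

    def train_char_bigrams(text):
        counts = defaultdict(Counter)
        padded = "\n" + text              # "\n" as a simple start symbol
        for c1, c2 in zip(padded, padded[1:]):
            counts[c1][c2] += 1
        return {c1: {c2: n / sum(row.values()) for c2, n in row.items()}
                for c1, row in counts.items()}

    def log_prob(model, text, floor=1e-6):
        padded = "\n" + text
        return sum(math.log(model.get(c1, {}).get(c2, floor))
                   for c1, c2 in zip(padded, padded[1:]))

    def guess_language(models, text):
        # P(Language) is uniform, so the argmax over P(text | Language) suffices.
        return max(models, key=lambda lang: log_prob(models[lang], text))

    models = {lang: train_char_bigrams(open(lang + ".txt", encoding="utf-8").read())
              for lang in ["de", "en", "fr", "es"]}
    print(guess_language(models, "This is a short test sentence."))

  A trigram version conditions each character on the two preceding
  characters in the same way.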