1
Statistische Methoden in der Computerlinguistik /
Statistical Methods in Computational Linguistics
  • 4. Probabilistic Language Models
  • Jonas Kuhn
  • Universität Potsdam, 2007

2
Outline
  • Viewing natural language text as the result of a
    random process
  • N-gram language models
  • Applications, Motivation
  • Assumptions
  • Training language models (1)
  • relative frequency estimates
  • Toolkits for N-gram LM training
  • Preview: Training language models (2)
  • smoothing techniques
  • Python programming: using dictionaries, counting,
    calculating relative frequencies, etc.
  • Task: implement a bigram and a trigram language
    model on individual characters

3
N-gram language models
  • Viewing natural language text as the result of a
    random process
  • Based on Jurafsky/Martin, ch. 6

4
Guessing the next word in a text
  • Given a sequence of words, what will be the next
    word?
  • Hard to guess, but if we don't demand extremely
    high accuracy, it is not that hard
  • I'd like to make a collect ...
  • call
  • telephone
  • international

5
Guessing the next word in a text
  • Applications
  • Speech recognition (language modeling)
  • Hand-writing recognition
  • Augmentative communication systems for the
    disabled
  • Context-sensitive spelling error correction (see
    example on next slide)
  • Further applications: statistical machine
    translation
  • Note: there are always other information sources,
    so prediction of the next word is used to choose
    among alternative hypotheses

6
Some real-world spelling errors (Kukich 1992)
  • They are leaving in about fifteen minuets to go
    to her house.
  • The study was conducted mainly be John Black.
  • The design an construction of the system will
    take more than a year.
  • Hopefully, all with continue smoothly in my
    absence.
  • Can they lave him my messages?
  • I need to notified the bank of this problem.
  • He is trying to fine out.

7
Probability of a sequence of words
  • Two closely related problems
  • Guessing the next word
  • Computing the probability of a sequence of words

8
Counting words in corpora
  • To estimate probabilities, we need to count
    frequencies
  • What do people count?
  • Word forms
  • Lemmas
  • The type/token distinction
  • Number of (word form) types: distinct words in a
    corpus (i.e., the size of the vocabulary)
  • Number of (word form) tokens: total number of
    running words (a counting sketch follows below)
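  A minimal counting sketch in Python (the file name corpus.txt and
  plain whitespace tokenization are assumptions, not part of the
  slides):

    # Count word-form tokens and types in a plain-text corpus.
    # Real corpora usually need proper tokenization and case handling.
    from collections import Counter

    counts = Counter()
    with open("corpus.txt", encoding="utf-8") as f:
        for line in f:
            counts.update(line.split())

    num_tokens = sum(counts.values())   # running words (tokens)
    num_types = len(counts)             # distinct word forms (types)
    print(num_tokens, num_types)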

9
Counting words in corpora
  • Switchboard corpus (spoken English)
  • 2.4 million word form tokens
  • c. 20,000 word form types
  • Shakespeare's complete works
  • 884,647 word form tokens
  • 29,066 word form types
  • Brown corpus
  • 1 million word form tokens
  • 61,805 word form types (37,851 lemma types)

10
Estimating word probabilities
  • How probable is an English word (form) w1 as the
    next word in a sequence?
  • Simplest model: every word has the same
    probability of occurring
  • Assume a vocabulary size of 100,000
  • Single word: the probability of finding w1 is
    P(w1) = 1/100,000 = 0.00001
  • Word wn in a sequence, assuming conditional
    independence from the context:
    P(wn) = 1/100,000

11
Estimating word probabilities
  • Sequence of two words w1 w2, still assuming
  • that each word form is equally likely
  • that w1 and w2 are conditionally independent from
    each other
  • that w1 and w2 are conditionally independent from
    the context
  • Under these assumptions:
    P(w1 w2) = 1/100,000 × 1/100,000 = 10^-10

12
A slightly more complex model
  • Still assume that any word can follow any other
    word
  • Take into account that different word forms occur
    with different frequencies
  • the occurs 69,971 times in the 1,000,000 tokens
    of the Brown corpus
  • rabbit occurs 11 times in the Brown corpus
  • Austin occurs 20 times
  • linguist occurs 13 times

13
A slightly more complex model
  • Estimating probabilities based on relative
    frequency
  • Sample: 1,000,000 trials of producing a random
    English word (N = 1,000,000)
  • Relative frequency of outcome u: count(u) / N
    (see the sketch below)
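  A minimal sketch of relative-frequency estimation; the toy counts
  are assumptions for illustration, except that 69,971 occurrences of
  "the" in 1,000,000 tokens is the Brown figure quoted above:

    # Relative frequency: P(u) = count(u) / N
    def relative_frequencies(counts):
        n = sum(counts.values())
        return {w: c / n for w, c in counts.items()}

    counts = {"the": 69971, "rabbit": 11, "OTHER": 1000000 - 69971 - 11}
    print(round(relative_frequencies(counts)["the"], 2))   # 0.07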

14
Conditional probability of a word
  • Relative frequencies are not a good model for the
    probability of words in a given context
  • Just then, the white ...
  • P(the) = .07, P(rabbit) = .00001
  • We should take the previous words that have
    occurred into account
  • We will get P(rabbit | white) > P(rabbit)

15
Language model: probability of a string
  • Using the chain rule of probability:
    P(w1 ... wn) = P(w1) · P(w2 | w1) · P(w3 | w1 w2)
    · ... · P(wn | w1 ... wn-1)
  • But how can we estimate such probabilities?
  • If we wanted to count the frequency of every word
    appearing after a long sequence of other words, we
    would need far too large a corpus as a sample

16
Language model: probability of a string
  • Approximate the probability
  • We have to form equivalence classes over word
    contexts, so we get a larger sample from which we
    estimate probabilities
  • Simple approximation: look only at one preceding
    word, i.e., P(wn | w1 ... wn-1) ≈ P(wn | wn-1)

17
Bigram model
  • Approximate
    P(rabbit | Just the other day I saw a)
    by
    P(rabbit | a)
  • Markov assumption: predicting a future event
    based on a limited window of past events
  • Bigrams: first-order Markov model (looking back
    one token into the past)

18
N-gram models
  • Bigram model
  • first-order Markov model
  • looking back one token
  • Trigram model
  • second-order Markov model
  • looking back two tokens
  • N-gram model
  • (N-1)th-order Markov model
  • looking back N-1 tokens

19
Bigram approximation of string prob.
  • Simplifying assumption (Markov):
    P(wn | w1 ... wn-1) ≈ P(wn | wn-1)
  • Resulting equation (bigram language model):
    P(w1 ... wn) ≈ P(w1) · P(w2 | w1) · ... · P(wn | wn-1)

20
Bigram language model example
  • Berkeley Restaurant Project (corpus of c. 10,000
    sentences)
  • Most likely words to follow "eat"
  • eat on .16
  • eat some .06
  • eat lunch .06
  • eat dinner .05
  • eat at .04
  • eat a .04
  • eat Indian .04
  • eat today .03
  • eat Thai .03
  • eat breakfast .03
  • eat in .02
  • eat Chinese .02
  • eat Mexican .02
  • eat tomorrow .01
  • eat dessert .007
  • eat British .001

21
Bigram probabilities
  • <s> I .25      I want .32     want to .65
  • <s> I'd .06    I would .29    want a .05
  • <s> Tell .04   I don't .08    want some .04
  • <s> I'm .02    I have .04     want thai .01
  • to eat .26     British food .60
  • to have .14    British restaurant .15
  • to spend .09   British cuisine .01
  • to be .02      British lunch .01

22
Computing the probability of a sentence
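  As an illustration (the example sentence is ours), using the bigram
  probabilities from the previous slide:

    P(I want to eat British food)
      ≈ P(I | <s>) · P(want | I) · P(to | want) · P(eat | to)
        · P(British | eat) · P(food | British)
      = .25 × .32 × .65 × .26 × .001 × .60
      ≈ 0.0000081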
23
Training N-gram models
  • Counting and normalizing
  • Count occurrences of a bigram (say, "eat lunch")
  • Divide by the total count of bigrams sharing the
    first word (i.e., "eat w" for some w); a sketch
    follows below
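  A minimal sketch of bigram counting and normalizing (MLE); the
  function name, the sentence-boundary markers and the toy input are
  assumptions, not from the slides:

    # Estimate P(w2 | w1) = C(w1 w2) / sum over w of C(w1 w)
    from collections import Counter, defaultdict

    def train_bigram_lm(sentences):
        bigram_counts = defaultdict(Counter)
        for sent in sentences:
            tokens = ["<s>"] + sent + ["</s>"]
            for w1, w2 in zip(tokens, tokens[1:]):
                bigram_counts[w1][w2] += 1
        # normalize each row by the total count of bigrams starting with w1
        return {w1: {w2: c / sum(row.values()) for w2, c in row.items()}
                for w1, row in bigram_counts.items()}

    lm = train_bigram_lm([["I", "want", "to", "eat", "lunch"]])
    print(lm["to"]["eat"])   # 1.0 in this toy example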

24
Training N-gram models
  • General case of N-gram parameter estimation
  • Relative frequency estimate:
    P(wn | wn-N+1 ... wn-1) = C(wn-N+1 ... wn) / C(wn-N+1 ... wn-1)
  • This is an example of the Maximum Likelihood
    Estimation (MLE) technique

25
Maximum Likelihood Estimation (MLE)
  • Estimating the parameters of a probability model so
    that the likelihood of the training data given the
    model (i.e., P(T | M)) is maximized
  • There are better ways of estimating N-gram
    probabilities, building on top of relative
    frequencies

26
Relative frequency example
  • Bigram counts from Berkeley Restaurant Project
  • I want to eat Chinese food lunch
  • I 8 1087 0 13 0 0 0
  • want 3 0 786 0 6 8 6
  • to 3 0 10 860 3 0 12
  • eat 0 0 2 0 19 2 52
  • Chinese 2 0 0 0 0 120 1
  • food 19 0 17 0 0 0 0
  • lunch 4 0 0 0 0 1 0

27
Relative frequency example
  • Bigram probabilities (after normalizing, i.e.
    dividing the counts by the unigram count of the
    first word; see the note below)
    (rows = first word, columns = second word)

              I       want  to     eat    Chinese  food   lunch
    I         .0023   .32   0      .0038  0        0      0
    want      .0025   0     .65    0      .0049    .0066  .0049
    to        .00092  0     .0031  .26    .00092   0      .0037
    eat       0       0     .0021  0      .020     .0021  .055
    Chinese   .0094   0     0      0      0        .56    .0047
    food      .013    0     .011   0      0        0      0
    lunch     .0087   0     0      0      0        .0022  0
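  These probabilities are the counts from the previous slide divided
  by the unigram count of the row word. For example, from
  P(want | I) = C(I want) / C(I) = .32 and C(I want) = 1087 it follows
  that C(I) ≈ 1087 / .32 ≈ 3400 (a value inferred here; the unigram
  counts themselves are not listed on the slides).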

28
The CMU Statistical Language Modeling Toolkit
  • Version 2 available from
    http://mi.eng.cam.ac.uk/~prc14/toolkit.html

29
Using the CMU SLM Toolkit
  • Given a large corpus of text in a file a.text,
    but no specified vocabulary
  • Compute the word unigram counts:
    cat a.text | text2wfreq > a.wfreq
  • Convert the word unigram counts into a vocabulary
    consisting of the 20,000 most common words:
    cat a.wfreq | wfreq2vocab -top 20000 > a.vocab

30
Using the CMU SLM Toolkit
  • Generate a binary id 3-gram file of the training
    text, based on this vocabulary:
    cat a.text | text2idngram -vocab a.vocab > a.idngram
  • Convert the idngram into a binary-format language
    model:
    idngram2lm -idngram a.idngram -vocab a.vocab \
               -binary a.binlm

31
Using the CMU SLM Toolkit
  • Compute the perplexity of the language model with
    respect to some test text b.text:
    evallm -binary a.binlm
    Reading in language model from file a.binlm
    Done.
    evallm : perplexity -text b.text
    Computing perplexity of the language model with
    respect to the text b.text
    Perplexity = 128.15, Entropy = 7.00 bits
    Computation based on 8842804 words.
    Number of 3-grams hit = 6806674  (76.97%)
    Number of 2-grams hit = 1766798  (19.98%)
    Number of 1-grams hit = 269332   (3.05%)
    1218322 OOVs (12.11%) and 576763 context cues were
    removed from the calculation.
    evallm : quit
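  By the standard relation perplexity = 2^entropy, 2^7.00 = 128, which
  matches the reported perplexity of 128.15 up to rounding.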

32
Using the CMU SLM Toolkit

33
Python Exercise
  • Implement a bigram and a trigram language model
    training program
  • N-gram model on individual characters
  • Use Europarl data (de, en, fr, es) as training data
  • Collect a number of sample texts for testing from
    the Web
  • Implement a program for applying the LMs to the
    test texts to determine the language of a text
    (see the sketch below)
  • Assume P(Language) to be uniform for all languages
  • Keep track of test results (we will discuss this
    in more detail)
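  A minimal sketch of the character-bigram approach for language
  identification; the training file names (de.txt, en.txt, fr.txt,
  es.txt) and the small probability floor for unseen bigrams (in place
  of the smoothing techniques previewed above) are assumptions, not
  part of the exercise specification:

    # Character-bigram language models for language identification.
    import math
    from collections import Counter, defaultdict

    def train_char_bigrams(text):
        counts = defaultdict(Counter)
        padded = "\n" + text              # "\n" as a simple start symbol
        for c1, c2 in zip(padded, padded[1:]):
            counts[c1][c2] += 1
        return {c1: {c2: n / sum(row.values()) for c2, n in row.items()}
                for c1, row in counts.items()}

    def log_prob(model, text, floor=1e-6):
        padded = "\n" + text
        return sum(math.log(model.get(c1, {}).get(c2, floor))
                   for c1, c2 in zip(padded, padded[1:]))

    def guess_language(models, text):
        # P(Language) is uniform, so the argmax over P(text | Language) suffices.
        return max(models, key=lambda lang: log_prob(models[lang], text))

    models = {lang: train_char_bigrams(open(lang + ".txt", encoding="utf-8").read())
              for lang in ["de", "en", "fr", "es"]}
    print(guess_language(models, "This is a short test sentence."))

  A trigram version conditions each character on the two preceding
  characters in the same way.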