Transcript and Presenter's Notes

Title: Maximum Entropy Model


1
Maximum Entropy Model and Its Application in NLP
  • Luo Weihua
  • Software Division, ICT
  • 2002-12-26

2
Main content
  • An overview of ME philosophy
  • Mathematical structure of MEM
  • Algorithm for parameter estimation
  • Method for feature selection
  • Examples of the application of ME in NLP

3
An overview of ME philosophy
  • Essential tasks of statistical modelling:
  – Feature selection: e.g., in a sequence of words uttered by a speaker,
    informative statistics include the average number of words per
    sentence, the percentage of nouns, and the number of times each
    different word is used
  – Model selection: predicting the future output of the process

4
An overview of ME philosophy
  • A simple example: a translation decision
  • Constraints:
  • p(dans) + p(en) + p(à) + p(au cours de) + p(pendant) = 1
  • p(dans) = 1?
  • p(dans) = 0.5 and p(en) = 0.5?
  • p(dans) = p(en) = p(à) = p(au cours de) = p(pendant) = 1/5?

(Figure: the English word "in" and its five candidate French translations: dans, en, à, au cours de, pendant)
5
An overview of ME philosophy
  • Update 1:
  • p(dans) + p(en) = 3/10
  • p(dans) + p(en) + p(à) + p(au cours de) + p(pendant) = 1
  • ⇒ p(dans) = p(en) = 3/20,
    p(à) = p(au cours de) = p(pendant) = 7/30
  • Update 2:
  • p(dans) + p(à) = 1/2
  • p(dans) + p(en) = 3/10
  • p(dans) + p(en) + p(à) + p(au cours de) + p(pendant) = 1
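
After Update 2 the most uniform distribution satisfying all three constraints is no longer obvious by inspection; finding it is exactly what the maximum entropy machinery provides. A numerical check of the idea, assuming NumPy and SciPy are available (constraint values taken from the slide):

  import numpy as np
  from scipy.optimize import minimize

  words = ["dans", "en", "à", "au cours de", "pendant"]
  constraints = [
      {"type": "eq", "fun": lambda p: p.sum() - 1.0},       # total probability
      {"type": "eq", "fun": lambda p: p[0] + p[1] - 3/10},  # p(dans) + p(en)
      {"type": "eq", "fun": lambda p: p[0] + p[2] - 1/2},   # p(dans) + p(à)
  ]

  def neg_entropy(p):
      return float(np.sum(p * np.log(p + 1e-12)))  # minimize -H(p)

  res = minimize(neg_entropy, np.full(5, 0.2),
                 bounds=[(0.0, 1.0)] * 5, constraints=constraints)
  print(dict(zip(words, res.x.round(3))))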

6
An overview of ME philosophy
  • Occam's Razor:
  • "Nunquam ponenda est pluralitas sine necessitate" (plurality should
    never be posited without necessity)
  • Laplace's Principle of Insufficient Reason: when one has no
    information to distinguish between the probabilities of two events,
    the best strategy is to consider them equally likely
  • E.T. Jaynes's Principle of Maximum Entropy

7
Maximum Entropy Modelling
  • Contextual information and its representation
  • Feature Function
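
A feature here is a binary-valued function of the context x and the output y. A minimal illustration in Python, in the spirit of the "in"-translation example (the event layout with a next_word field is hypothetical):

  def f_april(x, y):
      # Fires when "in" is translated as "en" and "April" follows it.
      return 1 if y == "en" and x.get("next_word") == "April" else 0

  x = {"word": "in", "next_word": "April"}   # a context
  print(f_april(x, "en"))                    # -> 1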

8
Maximum Entropy Modelling
  • Constraint equation / constraint
  • Simplification:
  • p(y, x) ≈ p(y|x) · p̃(x)

9
Maximum Entropy Modelling
  • Question: the collection of candidate features is vast?
  • Solution: separate the wheat from the chaff
  • Note: feature vs. constraint
  – feature: a binary-valued function of (y, x)
  – constraint: an equation that imposes equality between the expected
    value of the feature function in the model and its expected value in
    the data
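
The two expectations that a constraint equates, sketched in Python (p_model is a hypothetical conditional model p(y|x); events is the observed (x, y) sample):

  def empirical_expectation(events, f):
      # E_p~[f]: average value of f over the observed (x, y) pairs.
      return sum(f(x, y) for x, y in events) / len(events)

  def model_expectation(events, p_model, f, labels):
      # E_p[f] = sum_x p~(x) sum_y p(y|x) f(x, y); duplicates in the
      # sample implement the empirical weighting p~(x).
      xs = [x for x, _ in events]
      return sum(p_model(y, x) * f(x, y)
                 for x in xs for y in labels) / len(xs)

  # The constraint on f requires:
  # model_expectation(...) == empirical_expectation(...)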

10
The maximum entropy principle
  • Given
  • Task: devise a model based on the constraints

11
The maximum entropy principle
  • What does "uniform" mean?
  • → Shannon entropy
  • Principle of ME:
  • To select a model p from a set C of allowed probability
    distributions, choose the distribution p_ME ∈ C with maximum
    entropy H(p):
  • p_ME = argmax_{p ∈ C} H(p)
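
Written out, the conditional entropy being maximized (the standard form in this framework) is:

  H(p) = -\sum_{x,y} \tilde{p}(x)\, p(y \mid x) \log p(y \mid x)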

12
Parametric Form
  • Preliminary:
  – Definition 1
  – Definition 2
  • Lagrange multiplier analysis
  • Suppose λ* maximizes Ψ(λ); then p_λ*(y|x) is the model in C that
    maximizes H(p)
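
The two definitions referenced above appear as images in the original slides; as a reconstruction following the standard development (Berger et al.), Definition 1 is the log-linear family with normalizer Z, and Definition 2 is the dual objective Ψ(λ):

  p_\lambda(y \mid x) = \frac{1}{Z_\lambda(x)} \exp\Big(\sum_i \lambda_i f_i(x,y)\Big),
  \qquad Z_\lambda(x) = \sum_y \exp\Big(\sum_i \lambda_i f_i(x,y)\Big)

  \Psi(\lambda) = -\sum_x \tilde{p}(x) \log Z_\lambda(x) + \sum_i \lambda_i\, \tilde{p}(f_i)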

13
Parametric Form
  • Relation to maximum likelihood
  • Re-interpretation in terms of log-likelihood
  • The model p* ∈ C with maximum entropy is the model in the
    parametric family p_λ(y|x) that maximizes the likelihood of the
    sample
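
The log-likelihood in question, for reference:

  L_{\tilde{p}}(p) \equiv \log \prod_{x,y} p(y \mid x)^{\tilde{p}(x,y)}
                 = \sum_{x,y} \tilde{p}(x,y) \log p(y \mid x)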

14
Parameters Estimation
  • Improved Iterative Scaling algorithm
  • (1) Start with λi = 1 for all i ∈ {1, …, n}
  • (2) Repeat step (3) until all λi values have converged to within a
    preset threshold
  • (3) For each i ∈ {1, 2, …, n} do the following:
    (a) let Δλi be the solution to the IIS update equation (see the
        sketch below)
    (b) update the value of λi according to λi ← λi + Δλi
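
The slide's update equation is an image; in the standard IIS formulation it reads Σ_{x,y} p̃(x) p_λ(y|x) f_i(x, y) exp(Δλ_i f#(x, y)) = E_p̃[f_i], with f#(x, y) = Σ_i f_i(x, y). A minimal Python sketch solving it by bisection, assuming binary features and a small label set (events is the training sample; not the authors' code):

  import math

  def iis_pass(lam, feats, events, labels, tol=1e-6):
      # One sweep of step (3): update each weight lam[i] in turn.
      # feats: binary feature functions f_i(x, y); events: (x, y) pairs.
      n = len(events)
      xs = [x for x, _ in events]

      def p_model(y, x):
          # Current conditional model p_lambda(y | x).
          scores = {yy: math.exp(sum(l * f(x, yy) for l, f in zip(lam, feats)))
                    for yy in labels}
          return scores[y] / sum(scores.values())

      def f_sharp(x, y):
          return sum(f(x, y) for f in feats)  # total active features

      for i, f in enumerate(feats):
          emp = sum(f(x, y) for x, y in events) / n  # E_p~[f_i]
          if emp == 0:
              continue  # feature never fires in the sample; skip it
          lo, hi = -10.0, 10.0  # bisection brackets for delta = Δλ_i
          while hi - lo > tol:
              mid = (lo + hi) / 2
              # Left-hand side of the update equation at delta = mid.
              val = sum(p_model(y, x) * f(x, y) * math.exp(mid * f_sharp(x, y))
                        for x in xs for y in labels) / n
              if val < emp:
                  lo = mid
              else:
                  hi = mid
          lam[i] += (lo + hi) / 2  # step (b)
      return lam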

15
Feature Selection
  • A universe of candidate constraints for the translation of "in"
  • An example:
  • in → rhinocéros

16
Feature Selection
  • Two objectives in feature selection:
  – sufficiently rich to capture as much information about the random
    process as possible
  – robust enough to generalize to new data
  • Two categories of constraints:
  – those likely to occur in other samples of output from the same
    process
  – those that are just random fluctuations of the data

17
Feature Selection
  • An intuitive solution: select the constraints corresponding to
    features observed many times in the empirical data
  • Its danger(!):
  • Ep̃ f1 = 1000/2000 = 1/2
  • Ep̃ f2 = 1001/2000

18
Feature Selection
  • Incremental feature selection algorithm
  • (1) Begin with S empty
  • (2) Compute the maximum entropy model p_S(y|x) that is consistent
    with all the constraints in S
  • (3) For each candidate constraint f not in S, evaluate the gain G(f)
    of adding f to S as follows:
    (a) compute the maximum entropy model p_{S∪f} that is consistent
        with f and the constraints already in S
    (b) let G(f) be the increase in conditional log-likelihood:
        G(f) = L(p_{S∪f}) − L(p_S)

19
Feature Selection
  • (4) Select the constraint f with the largest gain. If this gain is
    statistically significant, then add f to S and go to step (2).
    Otherwise, stop.
  • Efficiently computing gains:
  – Bottleneck: step 3(a) is an exorbitant albeit exact computation
  – Restriction: in general, all parameter values change with the
    imposition of a new constraint
  – Assumption: the addition of a new feature does not affect the
    optimal values of the parameters associated with the other features
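
Steps (1)-(4) as a Python sketch, with the step-3(a) bottleneck visible as the refit inside the loop; train_model (fits the ME model consistent with a constraint set) and log_likelihood are hypothetical helpers, and min_gain stands in for the significance test:

  def select_features(candidates, train_model, log_likelihood, min_gain=1e-3):
      S = []                                  # step (1): begin with S empty
      p_S = train_model(S)                    # step (2): fit p_S
      while True:
          gains = {}
          for f in candidates:
              if f in S:
                  continue
              p_Sf = train_model(S + [f])     # step 3(a): exact but expensive
              gains[f] = log_likelihood(p_Sf) - log_likelihood(p_S)  # 3(b)
          if not gains:
              return S, p_S
          best = max(gains, key=gains.get)    # step (4): largest gain
          if gains[best] < min_gain:          # significance proxy
              return S, p_S
          S.append(best)
          p_S = train_model(S)                # back to step (2)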

20
Feature Selection
  • Approximation: hold the weights of the features already in S fixed
    and adjust only the weight α of the candidate feature f, where
  • α_max = argmax_α L(p^α_{S∪f})
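
Spelled out (a reconstruction of the slide's image content, following Berger et al.): with the old weights held fixed and only α varied,

  G_{S \cup f}(\alpha) \equiv L\big(p_{S \cup f}^{\alpha}\big) - L(p_S), \qquad
  \widetilde{G}(f) \equiv \max_{\alpha} G_{S \cup f}(\alpha) = G_{S \cup f}(\alpha_{\max})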

21
Case studies
  • Word Translation
  • Segmentation
  • Word Reordering

22
Word Translation
  • Background
  • Consider
  • Viterbi alignment

(Diagram: E → F translation; unwieldy?)
23
Word Translation
  • Estimation of:
  • p(n|e), the probability that the English word e generates n French
    words
  • p(f|e), the probability that the English word e generates the
    French word f
  • distortion probabilities d(F, A|E), which model how the words are
    reordered in the French sentence after they are generated
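
Schematically, these parameters compose the standard alignment-model translation probability (a reconstruction; the slide's own formulas are images):

  P(F \mid E) = \sum_{A} P(F, A \mid E)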

24
Word Translation
  • Brown model: EM algorithm

25
Word Translation
  • Major shortcoming of the Brown model: the probabilities are
    context-free

26
Word Translation
  • Modelling p(f|e) with MEM
  • Improvement
  • Form of the features

27
Word Translation
  • feature templates
  • Training events
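
The template and training-event images are not reproduced here; a hypothetical instance of such a template in Python (the nearby_words field and the choice of trigger word are illustrative, not from the slides):

  def make_template_feature(french_word, trigger_word):
      # Fires when y is `french_word` and `trigger_word` occurs in the
      # English context window around "in".
      def f(x, y):
          return 1 if y == french_word and trigger_word in x["nearby_words"] else 0
      return f

  f1 = make_template_feature("pendant", "weeks")  # e.g. "... in the weeks ..."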

28
Word Translation
  • Consider

(Figure: a Template 1 feature)
29
Word Translation
  • Construction of a ME model p_in(f|x)
  • Sample: 10,000 instances from the Hansard corpus

30
Word Translation
  • ME model to predict the French translation of "to run"

31
Word Translation
  • Resulting improved French-to-English translations

32
Segmentation
  • Definition:
  • rift: a position i in a French sentence such that for all j < i,
    a_j ≤ a_i, and for all j > i, a_j ≥ a_i
  • Task:
  • r_i = 1 if and only if a rift belongs at f_i
  • Modelling p(r_i|x), denoted by p(rift_i|x)
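
A direct encoding of the rift definition, assuming a[j] gives the English position aligned to French word j:

  def is_rift(i, a):
      # Position i is a rift iff a_j <= a_i for all j < i
      # and a_j >= a_i for all j > i.
      return (all(a[j] <= a[i] for j in range(i)) and
              all(a[j] >= a[i] for j in range(i + 1, len(a))))

  a = [1, 2, 2, 4, 3, 5]                               # a toy alignment
  print([i for i in range(len(a)) if is_rift(i, a)])   # -> [0, 1, 2, 5]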

33
Segmentation
  • Conditioning information x
  • The problem of overfitting
  – Phenomenon: beyond a certain point, the model p fits itself to
    quirks in the empirical data
  – Solution: cross-validation
  – empirical data = training data + held-out data
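
A minimal sketch of the held-out discipline (the split ratio is illustrative):

  def split(events, held_out_fraction=0.1):
      # Partition the empirical sample into training and held-out parts.
      k = int(len(events) * (1 - held_out_fraction))
      return events[:k], events[k:]

  # Grow the feature set on the training part; stop adding features
  # once log-likelihood on the held-out part stops improving.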

34
Segmentation
35
Segmentation
36
Word Reordering
  • noun de noun phrases and their English equivalents

37
Word Reordering
  • (y,x) pair
  • Sets used in noun de noun word reordering

38
Word Reordering
  • Examples of features

39
Word Reordering
40
Word Reordering
  • noun de noun model performance: simple approach vs. maximum entropy

41
Word Reordering
42
Thank you!