1
Grammar induction by Bayesian model averaging
  • Guy Lebanon
  • LARG meeting
  • May 2001
  • Based on Andreas Stolcke's thesis, UC Berkeley, 1994

2
Why automatic grammar induction (AGI)
  • Enables using domain-dependent grammars without
    expert intervention.
  • Enables using person-dependent grammars without
    expert intervention.
  • Can be used on different languages (without a
    linguist familiar with the particular language).
  • A process of grammar induction with expert
    guidance may be more accurate than a hand-written
    grammar, since computers are more adept than
    humans at analyzing large corpora.

3
Why statistical approaches to AGI
  • In practice, languages are not logical structures.
  • Sentences as actually uttered are often not
    precisely grammatical; expanding the grammar to
    cover them leads to an explosion of grammar rules.
  • A large grammar leads to many parses of the same
    sentence. Clearly, some parses are more accurate
    than others. Statistical approaches make it
    possible to include a large set of grammar rules
    while assigning a probability to each parse.
  • Statistics provides known optimality conditions
    and optimization procedures.

4
Some Bayesian statistics
  • For each grammar M (a set of rules together with
    their rule probabilities), a prior probability
    p(M) is assigned. This value may represent an
    expert's opinion about how likely this grammar is.
  • Upon introduction of a training set X (an
    unlabeled corpus), the model posterior is computed
    by Bayes' law: p(M | X) = p(X | M) p(M) / p(X)
    (see the sketch below).
  • Either the grammar that maximizes the posterior is
    kept (as the best grammar), or the set of all
    grammars and their posteriors is kept (better).
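A minimal sketch of the Bayes-law computation over a handful of candidate grammars; the grammar names, prior values, and likelihood values below are hypothetical placeholders, and p(X | M) is assumed to come from some parsing routine.

```python
# Posterior over a small set of candidate grammars via Bayes' law:
# p(M | X) = p(X | M) p(M) / sum_M' p(X | M') p(M')

def grammar_posteriors(priors, likelihoods):
    """priors[m] = p(M=m), likelihoods[m] = p(X | M=m); returns p(M=m | X)."""
    joint = {m: priors[m] * likelihoods[m] for m in priors}
    evidence = sum(joint.values())           # p(X), the normalizing constant
    return {m: joint[m] / evidence for m in joint}

# Hypothetical numbers: two candidate grammars, the expert slightly favors M1.
posterior = grammar_posteriors(
    priors={"M1": 0.6, "M2": 0.4},
    likelihoods={"M1": 1e-12, "M2": 4e-12},  # p(corpus | grammar), made up
)
print(posterior)  # M2 wins despite its smaller prior
```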

5
Priors for CF grammars
  • The prior of a grammar p(M) is split into two
    parts: a prior over the rule set (the grammar
    structure) and a prior over the rule probabilities.
  • The structure component is taken to introduce a
    bias towards short grammars (fewer rules). One way
    of doing that, though still heuristic, is minimum
    description length (MDL).
  • The prior for the rule probabilities is taken to
    be a uniform Dirichlet prior, which has the effect
    of smoothing low rule-usage counts (see the sketch
    below).
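A concrete illustration of the smoothing effect (not code from the thesis): under a symmetric Dirichlet(alpha) prior, the posterior-mean probability of each rule is (count + alpha) / (total + K·alpha). The rule names and counts below are made up.

```python
def dirichlet_mean_probs(rule_counts, alpha=1.0):
    """Posterior-mean rule probabilities for one non-terminal under a
    symmetric Dirichlet(alpha) prior: (count + alpha) / (total + K*alpha)."""
    total = sum(rule_counts.values())
    k = len(rule_counts)
    return {r: (c + alpha) / (total + k * alpha) for r, c in rule_counts.items()}

# How often each expansion of NP was used in the parses (hypothetical counts).
counts = {"NP -> Det N": 40, "NP -> N": 9, "NP -> NP PP": 1}
print(dirichlet_mean_probs(counts))
# The rare rule NP -> NP PP gets probability 2/53 instead of the MLE 1/50:
# low counts are pulled towards the uniform distribution.
```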

6
Grammar posterior
  • Too hard to maximize over the posterior of both
    the rules and the probabilities. Instead, the
    search is done to maximize the posterior of the
    rules only
  • Where V is the Viterbi derivation of x. The last
    integral has a closed form solution.
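The closed form is the standard Dirichlet integral over rule probabilities; a sketch (mine, not from the slides) of its log value for the Viterbi rule-usage counts of a single non-terminal, under the uniform Dirichlet prior:

```python
from math import lgamma

def log_dirichlet_marginal(counts, alpha=1.0):
    """log of the integral of prod_i theta_i^c_i * Dirichlet(theta | a,...,a)
    = lgamma(K*a) - K*lgamma(a) + sum_i lgamma(a + c_i) - lgamma(K*a + N)."""
    k = len(counts)
    n = sum(counts)
    return (lgamma(k * alpha) - k * lgamma(alpha)
            + sum(lgamma(alpha + c) for c in counts)
            - lgamma(k * alpha + n))

# Viterbi-derivation usage counts for the expansions of one non-terminal (made up).
print(log_dirichlet_marginal([40, 9, 1]))
# Summing this over all non-terminals and adding log p(s) gives the
# (approximate) log posterior of the rule set, up to a constant.
```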

7
Maximizing the posterior
  • Even though computing an approximation to the
    posterior is possible in closed form, coming up
    with a grammar that maximizes it is still a hard
    problem.
  • A. Stolcke: start with many rules, then apply
    greedy merging operations so as to maximize the
    posterior.
  • Model merging was applied to hidden Markov
    models, probabilistic context-free grammars and
    probabilistic attribute grammars (PCFGs with
    semantic features tied to non-terminals).

8
A concrete example: PCFG
  • A specific PCFG consists of a list of rules s and
    a set of production probabilities θ.
  • For a given s, it is possible to learn the
    production probabilities with EM. Coming up with
    an optimal s is still an open problem. Stolcke's
    model merging is an attempt to tackle this
    problem.
  • Given a corpus (a set of sentences), an initial
    set of rules is constructed that generates exactly
    the observed sentences: one top-level rule per
    sentence, with one new non-terminal per terminal
    symbol (see the sketch below).
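A sketch of this data-incorporation step under the scheme just described; the rule representation (a list of (lhs, rhs) pairs) and the example corpus are my own choices, not the thesis code.

```python
def initial_grammar(corpus):
    """Build an initial rule set that generates exactly the training sentences:
    one new non-terminal per terminal occurrence, one top-level rule per sentence."""
    rules = []                            # (lhs, rhs) pairs; rhs is a symbol tuple
    fresh = 0
    for sentence in corpus:
        rhs = []
        for word in sentence.split():
            nt = "X%d" % fresh
            fresh += 1
            rules.append((nt, (word,)))   # lexical rule   X_i -> word
            rhs.append(nt)
        rules.append(("S", tuple(rhs)))   # top-level rule S -> X_1 ... X_n
    return rules

corpus = ["the dog barks", "the cat barks"]
for lhs, rhs in initial_grammar(corpus):
    print(lhs, "->", " ".join(rhs))
```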

9
Merging operators
  • Non-terminal merging: replace two existing
    non-terminals with a single new non-terminal.
  • Non-terminal chunking: given an ordered sequence
    of non-terminals X1 ... Xk, create a new
    non-terminal Y that expands to X1 ... Xk, and
    replace occurrences of X1 ... Xk in right-hand
    sides with Y (see the sketch below).
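A sketch of the two operators on the (lhs, rhs) rule representation used above; these are simplified illustrations, not Stolcke's implementation.

```python
def merge_nonterminals(rules, a, b, new):
    """Replace non-terminals a and b everywhere by the single non-terminal new."""
    ren = lambda s: new if s in (a, b) else s
    merged = {(ren(lhs), tuple(ren(s) for s in rhs)) for lhs, rhs in rules}
    return sorted(merged)                        # identical rules collapse into one

def chunk(rules, seq, new):
    """Add new -> seq and replace occurrences of seq in right-hand sides by new."""
    seq = list(seq)
    def replace(rhs):
        rhs, out, i = list(rhs), [], 0
        while i < len(rhs):
            if rhs[i:i + len(seq)] == seq:
                out.append(new); i += len(seq)
            else:
                out.append(rhs[i]); i += 1
        return tuple(out)
    return [(lhs, replace(rhs)) for lhs, rhs in rules] + [(new, tuple(seq))]

# Example: X0 and X2 both expand to "the", so merge them into one non-terminal.
rules = [("S", ("X0", "X1")), ("X0", ("the",)), ("X1", ("dog",)),
         ("S", ("X2", "X3")), ("X2", ("the",)), ("X3", ("cat",))]
print(merge_nonterminals(rules, "X0", "X2", "DET"))
```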

10
PCFG priors
  • Prior for rule probabilities: the uniform
    Dirichlet prior described earlier.
  • Prior for rules: for a non-lexical rule (one that
    doesn't produce a terminal symbol) the description
    length is the number of bits needed to encode its
    non-terminal symbols; for a lexical rule (one that
    produces a terminal symbol) the encoding of the
    produced terminal is counted as well.
  • The prior was taken to be either exponentially
    decreasing or Poisson in the description length
    (see the sketch below).
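A sketch of such a description-length prior, assuming a simplified encoding in which each symbol of a rule costs log2(alphabet size) bits (the exact encoding in the thesis may differ) and assuming non-terminals are named with capitalized symbols:

```python
from math import log2, lgamma, log

def rule_description_length(rule, n_nonterminals, n_terminals):
    """Bits to encode one rule under an assumed, simplified encoding:
    log2(#non-terminals) per non-terminal symbol, log2(#terminals) per terminal."""
    lhs, rhs = rule
    bits = log2(n_nonterminals)                  # encode the left-hand side
    for sym in rhs:
        is_nonterminal = sym[0].isupper()        # naming convention assumed here
        bits += log2(n_nonterminals) if is_nonterminal else log2(n_terminals)
    return bits

def log_structure_prior(rules, n_nt, n_t, kind="exp", rate=1.0):
    """Unnormalized log prior decreasing in the total description length DL:
    exponential: -rate*DL ; Poisson: DL*log(rate) - rate - lgamma(DL + 1)."""
    dl = sum(rule_description_length(r, n_nt, n_t) for r in rules)
    if kind == "exp":
        return -rate * dl
    return dl * log(rate) - rate - lgamma(dl + 1)

rules = [("S", ("DET", "N")), ("DET", ("the",)), ("N", ("dog",))]
print(log_structure_prior(rules, n_nt=3, n_t=2))   # fewer/shorter rules score higher
```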
11
(No Transcript)
12
Search strategy
  • Start with the initial rules.
  • Try applying all possible merge operations. For
    each resulting grammar compute the posterior, and
    choose the merge that results in the highest
    posterior (see the sketch below).
  • Search strategies:
  • Best-first search
  • Best-first search with look-ahead
  • Beam search
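A sketch of the plain best-first (greedy) variant; log_posterior and candidate_merges are hypothetical placeholders that would wrap the pieces sketched earlier. Look-ahead and beam search differ only in exploring deeper, or keeping several candidate grammars, before committing to a merge.

```python
def best_first_search(rules, candidate_merges, log_posterior):
    """Greedy best-first model merging: repeatedly apply the single merge
    (or chunk) that most increases the approximate log posterior, and stop
    when no candidate improves it.  candidate_merges(rules) is assumed to
    yield the rule sets reachable by one merging operation."""
    best_score = log_posterior(rules)
    while True:
        improved = None
        for merged in candidate_merges(rules):
            score = log_posterior(merged)
            if score > best_score:
                best_score, improved = score, merged
        if improved is None:              # local maximum of the posterior
            return rules
        rules = improved
```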

13
Now some examples