Learning PCFGs: Estimating Parameters, Learning Grammar Rules
1
Learning PCFGs: Estimating Parameters, Learning Grammar Rules
  • Many slides are taken or adapted from slides by Dan Klein

2
Treebanks
An example tree from the Penn Treebank
3
The Penn Treebank
  • 1 million tokens
  • In 50,000 sentences, each labeled with
  • A POS tag for each token
  • Labeled constituents
  • Extra information
  • Phrase annotations like -TMP
  • Empty constituents for wh-movement traces,
    empty subjects for raising constructions

4
Supervised PCFG Learning
  • Preprocess the treebank
  • Remove all extra information (empties, extra
    annotations)
  • Convert to Chomsky Normal Form
  • Possibly prune some punctuation, lower-case all
    words, compute word shapes, and do other
    preprocessing to combat sparsity
  • Count the occurrences of each nonterminal c(N)
    and of each observed production rule c(N → NL NR)
    and c(N → w)
  • Set the probability for each rule to the MLE
    (a code sketch follows below)
  • P(N → NL NR) = c(N → NL NR) / c(N)
  • P(N → w) = c(N → w) / c(N)
  • Easy, peasy, lemon-squeezy.
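A minimal sketch of this supervised MLE estimation, assuming the treebank is already in CNF and each tree is encoded as nested (label, children) tuples; the encoding and function names are illustrative assumptions, not from the slides.

from collections import Counter

def count_rules(tree, nonterm_counts, rule_counts):
    # A tree is (label, [left, right]) for a binary rule, (label, word) for a lexical rule
    label, children = tree
    nonterm_counts[label] += 1
    if isinstance(children, str):                       # lexical rule N -> w
        rule_counts[(label, children)] += 1
    else:                                               # binary rule N -> NL NR
        rule_counts[(label, tuple(c[0] for c in children))] += 1
        for child in children:
            count_rules(child, nonterm_counts, rule_counts)

def mle_pcfg(treebank):
    # P(N -> rhs) = c(N -> rhs) / c(N)
    nonterm_counts, rule_counts = Counter(), Counter()
    for tree in treebank:
        count_rules(tree, nonterm_counts, rule_counts)
    return {(lhs, rhs): c / nonterm_counts[lhs]
            for (lhs, rhs), c in rule_counts.items()}

# Toy treebank: one tree for "dogs bark"
treebank = [("S", [("NP", "dogs"), ("VP", "bark")])]
print(mle_pcfg(treebank))   # every observed rule gets probability 1.0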

5
Complications
  • Smoothing
  • Especially for lexicalized grammars, many test
    productions will never be observed during
    training
  • We don't necessarily want to assign these
    productions zero probability
  • Instead, define backoff distributions, e.g.
  • Pfinal(VP[transmogrified] → V[transmogrified] PP[into])
      = α · P(VP[transmogrified] → V[transmogrified] PP[into])
      + (1 − α) · P(VP → V PP[into])
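A small illustration of the interpolation idea, assuming a fixed α and a flat dictionary encoding of rules; the data structures and function name here are my own, not the slides'.

def backoff_rule_prob(lex_rule, backoff_rule, p_lex, p_backoff, alpha=0.7):
    # alpha * P_lexicalized + (1 - alpha) * P_backoff
    return alpha * p_lex.get(lex_rule, 0.0) + (1 - alpha) * p_backoff.get(backoff_rule, 0.0)

# The lexicalized rule was never seen in training (MLE = 0), but the
# unlexicalized backoff keeps the final probability non-zero.
p_lex = {}
p_backoff = {("VP", ("V", "PP[into]")): 0.1}
print(round(backoff_rule_prob(("VP[transmogrified]", ("V[transmogrified]", "PP[into]")),
                              ("VP", ("V", "PP[into]")), p_lex, p_backoff), 3))   # 0.03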

6
Problems with Supervised PCFG Learning
  • Coming up with labeled data is hard!
  • Time-consuming
  • Expensive
  • Hard to adapt to new domains, tasks, languages
  • Corpus availability drives research (instead of
    tasks driving the research)
  • The Penn Treebank took many person-years to
    annotate manually.

7
Unsupervised Learning of PCFGs: Feasible?
8
Unsupervised Learning
  • Systems take raw data and automatically discover
    structure
  • Why?
  • More data is available
  • Kids learn (some aspects of) language with no
    supervision
  • Insights into machine learning and clustering

9
Grammar Induction and Learnability
  • Some have argued that learning syntax from
    positive data alone is impossible
  • Gold, 1967: non-identifiability in the limit
  • Chomsky, 1980: poverty of the stimulus
  • Surprising result: it's possible to get entirely
    unsupervised parsing to work (reasonably) well.

10
Learnability
  • Learnability: formal conditions under which a
    class of languages can be learned
  • Setup
  • Class of languages ℒ
  • Algorithm H (the learner)
  • H sees a sequence X of strings x1 … xn
  • H maps sequences X to languages L in ℒ
  • Question: for which classes ℒ do learners H
    exist?

11
Learnability: Gold, 1967
  • Criterion: identification in the limit
  • A presentation of L is an infinite sequence of
    x's from L in which each x occurs at least once
  • A learner H identifies L in the limit if, for any
    presentation of L, from some point n onwards, H
    always outputs L
  • A class ℒ is identifiable in the limit if there
    is some single H which correctly identifies in
    the limit every L in ℒ.
  • Example: ℒ = {{a}, {a, b}} is identifiable in the
    limit.
  • Theorem (Gold, 1967): any ℒ which contains all
    finite languages and at least one infinite
    language (i.e., is superfinite) is unlearnable in
    this sense.

12
Learnability: Gold, 1967
  • Proof sketch
  • Assume ℒ is superfinite and that H identifies ℒ
    in the limit
  • There exists a chain L1 ⊂ L2 ⊂ … ⊂ L∞
  • Construct the following misleading sequence
  • Present strings from L1 until H outputs L1
  • Present strings from L2 until H outputs L2
  • …
  • This is a presentation of L∞,
    but H never outputs L∞

13
Learnability: Horning, 1969
  • Problem: IIL requires that H succeeds on all
    examples, even the weird ones
  • Another criterion: measure-one identification
  • Assume a distribution PL(x) for each L
  • Assume PL(x) puts non-zero probability on all and
    only the x in L
  • Assume an infinite presentation of x drawn i.i.d.
    from PL(x)
  • H measure-one identifies L if the probability of
    drawing a sequence X from which H can identify
    L is 1.
  • Theorem (Horning, 1969): PCFGs can be identified
    in this sense.
  • Note: there can be misleading sequences, but
    they have to be (infinitely) unlikely

14
Learnability: Horning, 1969
  • Proof sketch
  • Assume ℒ is a recursively enumerable set of
    recursive languages (e.g., the set of all PCFGs)
  • Assume an ordering on all strings x1 < x2 < …
  • Define: two sequences A and B agree through n
    iff for all x < xn, x is in A ⇔ x is in B.
  • Define the error set E(L, n, m):
  • All sequences such that the first m elements do
    not agree with L through n
  • These are the sequences which contain early
    strings outside of L (can't happen), or which
    fail to contain all of the early strings in L
    (happens less as m increases)
  • Claim: P(E(L, n, m)) goes to 0 as m goes to ∞
  • Let dL(n) be the smallest m such that P(E) < 2^(−n)
  • Let d(n) be the largest dL(n) among the first n
    languages
  • Learner: after d(n) examples, pick the first L
    that agrees with the evidence through n
  • This can only fail for sequences X if X keeps
    showing up in E(L, n, d(n)), which happens
    infinitely often with probability zero.

15
Learnability
  • Gold's results say little about real learners
    (the requirements are too strong)
  • Horning's algorithm is completely impractical
  • It needs astronomical amounts of data
  • Even measure-one identification doesn't say
    anything about tree structures
  • It only talks about learning grammatical sets
  • Strong generative vs. weak generative capacity

16
Unsupervised POS Tagging
  • Some (discouraging) experiments
  • Merialdo 94
  • Setup
  • You know the set of allowable tags for each word
    (but not the frequency of each tag)
  • Learn a supervised model on k training sentences
  • Learn P(w | t) and P(ti | ti−1, ti−2) on these
    sentences (a sketch of this step is below)
  • On the remaining n > k sentences, re-estimate
    with EM
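A minimal sketch of the supervised step, assuming the k training sentences are given as lists of (word, tag) pairs; smoothing, the tag dictionary, and the subsequent EM re-estimation on untagged data are omitted, and all names here are illustrative.

from collections import Counter

def supervised_trigram_tagger(tagged_sentences):
    # MLE estimates of P(w | t) and P(t_i | t_{i-2}, t_{i-1}) from tagged data
    emit, tag, trans, ctx = Counter(), Counter(), Counter(), Counter()
    for sent in tagged_sentences:
        tags = ["<s>", "<s>"] + [t for _, t in sent]
        for w, t in sent:
            emit[(t, w)] += 1
            tag[t] += 1
        for i in range(2, len(tags)):
            trans[(tags[i - 2], tags[i - 1], tags[i])] += 1
            ctx[(tags[i - 2], tags[i - 1])] += 1
    p_emit = {(t, w): c / tag[t] for (t, w), c in emit.items()}
    p_trans = {k: c / ctx[k[:2]] for k, c in trans.items()}
    return p_emit, p_trans

# k = 1 tagged training sentence
p_emit, p_trans = supervised_trigram_tagger([[("dogs", "NNS"), ("bark", "VBP")]])
print(p_emit[("NNS", "dogs")], p_trans[("<s>", "<s>", "NNS")])   # 1.0 1.0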

17
Merialdo Results
18
Grammar Induction
  • Unsupervised Learning of Grammars and Parameters

19
Right-branching Baseline
  • In English (but not necessarily in other
    languages), trees tend to be right-branching
  • A simple, English-specific baseline is to choose
    the fully right-branching (right-chain) structure
    for each sentence, as sketched below.
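A toy illustration of that baseline, building an unlabeled right-branching bracketing over a token list; purely illustrative, not from the slides.

def right_branching(tokens):
    # Fully right-branching binary bracketing: [w1, [w2, [w3, ...]]]
    if len(tokens) == 1:
        return tokens[0]
    return [tokens[0], right_branching(tokens[1:])]

print(right_branching(["the", "dog", "barked", "loudly"]))
# ['the', ['dog', ['barked', 'loudly']]]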

20
Distributional Clustering
21
Nearest Neighbors
22
Learn PCFGs with EM: Lari and Young, 1990
  • Setup
  • Full binary grammar with n nonterminals X1, …, Xn
  • (that is, at the beginning, the grammar has all
    possible rules; a sketch of this starting grammar
    follows below)
  • Parse uniformly/randomly at first
  • Re-estimate rule expectations off of the parses
  • Repeat
  • Their conclusion: it doesn't really work
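For concreteness, a minimal sketch of that starting point, assuming a fixed vocabulary and a uniform distribution over each nonterminal's expansions; the function name and rule representation are mine, not the slides'.

from itertools import product

def full_uniform_grammar(n, vocabulary):
    # All binary rules X_i -> X_j X_k plus all lexical rules X_i -> w,
    # with a uniform distribution over each left-hand side's expansions
    nts = [f"X{i}" for i in range(1, n + 1)]
    expansions_per_lhs = n * n + len(vocabulary)
    p_binary = {(a, (b, c)): 1.0 / expansions_per_lhs
                for a, b, c in product(nts, nts, nts)}
    p_lex = {(a, w): 1.0 / expansions_per_lhs for a in nts for w in vocabulary}
    return p_binary, p_lex

p_binary, p_lex = full_uniform_grammar(2, ["dogs", "bark"])
print(len(p_binary), len(p_lex))   # 8 4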

23
EM for PCFGs: Details
  • Start with a full grammar, with all possible
    binary rules for our nonterminals N1 … Nk.
    Designate one of them as the start symbol, say N1
  • Assign some starting distribution to the rules,
    such as
  • Random
  • Uniform
  • Some smart initialization techniques (see the
    assigned reading)
  • E-step: take an unannotated sentence S and
    compute, for all nonterminals N, NL, NR and all
    terminals w
  • E(N | S), E(N → NL NR, N is used | S),
    E(N → w, N is used | S)
  • M-step: reset the rule probabilities to the MLE
  • P(N → NL NR) = E(N → NL NR | S) / E(N | S)
  • P(N → w) = E(N → w | S) / E(N | S)
  • Repeat the E-step and M-step until the rule
    probabilities stabilize (converge)

24
Definitions
This is the sum of P(T, S | G) over all possible
trees T for w1 … wm where the root is N1.
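The equations on this slide did not survive the transcript; assuming the standard inside–outside notation (as in Lari & Young, 1990, and Manning & Schütze, ch. 11), the quantities being defined are:

\[ \beta_j(p, q) \;=\; P(w_p \cdots w_q \mid N^j \text{ spans } p..q,\; G) \qquad \text{(inside probability)} \]
\[ \alpha_j(p, q) \;=\; P(w_1 \cdots w_{p-1},\; N^j \text{ spans } p..q,\; w_{q+1} \cdots w_m \mid G) \qquad \text{(outside probability)} \]
\[ P(w_1 \cdots w_m \mid G) \;=\; \beta_1(1, m) \;=\; \sum_{T} P(T, S \mid G), \]

where the last sum ranges over all trees T for w1 … wm whose root is N1.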
25
E-Step
  • We can define the expectations we want in terms
    of the sentence probability and the α, β
    quantities (reconstructed below)
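The formulas themselves were lost in the transcript; assuming the standard inside–outside expectations (same notation as above), they take the form:

\[ E(N^j \text{ used} \mid S) \;=\; \frac{1}{P(S \mid G)} \sum_{1 \le p \le q \le m} \alpha_j(p, q)\, \beta_j(p, q) \]
\[ E(N^j \to N^l N^r \mid S) \;=\; \frac{1}{P(S \mid G)} \sum_{p \le d < q} \alpha_j(p, q)\, P(N^j \to N^l N^r)\, \beta_l(p, d)\, \beta_r(d+1, q) \]
\[ E(N^j \to w \mid S) \;=\; \frac{1}{P(S \mid G)} \sum_{h \,:\, w_h = w} \alpha_j(h, h)\, P(N^j \to w) \]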

26
Inside Probabilities
  • Base case
  • Induction

[Diagram: N^j spans w_p … w_q, splitting into N^l over w_p … w_d and N^r over w_{d+1} … w_q; a code sketch follows]
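A compact sketch of this inside computation (base case: lexical rules; induction: sum over binary rules and split points), using rule dictionaries shaped like the ones in the earlier sketches; the names and data layout are illustrative assumptions.

from collections import defaultdict

def inside(words, nonterminals, p_binary, p_lex):
    # beta[(N, p, q)] = P(N derives words[p..q]), with
    # p_binary[(N, (NL, NR))] and p_lex[(N, w)] as rule probabilities
    m = len(words)
    beta = defaultdict(float)
    for p, w in enumerate(words):                       # base case: N -> w
        for N in nonterminals:
            beta[(N, p, p)] = p_lex.get((N, w), 0.0)
    for span in range(2, m + 1):                        # induction on span length
        for p in range(m - span + 1):
            q = p + span - 1
            for (N, (NL, NR)), prob in p_binary.items():
                beta[(N, p, q)] += sum(prob * beta[(NL, p, d)] * beta[(NR, d + 1, q)]
                                       for d in range(p, q))
    return beta

# Toy grammar: S -> NP VP, NP -> dogs, VP -> bark
beta = inside(["dogs", "bark"], ["S", "NP", "VP"],
              {("S", ("NP", "VP")): 1.0},
              {("NP", "dogs"): 1.0, ("VP", "bark"): 1.0})
print(beta[("S", 0, 1)])   # 1.0 = P(sentence | grammar)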
27
Outside Probabilities
  • Base case
  • Induction
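The base case and induction equations here were also lost in the transcript; assuming the same standard notation, with N^1 as the start symbol over w1 … wm, they are:

\[ \alpha_1(1, m) = 1, \qquad \alpha_j(1, m) = 0 \;\text{ for } j \neq 1 \]
\[ \alpha_j(p, q) \;=\; \sum_{f, g} \Big[ \sum_{e = q+1}^{m} \alpha_f(p, e)\, P(N^f \to N^j N^g)\, \beta_g(q+1, e) \;+\; \sum_{e = 1}^{p-1} \alpha_f(e, q)\, P(N^f \to N^g N^j)\, \beta_g(e, p-1) \Big] \]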

28
Problem: Model Symmetries
29
Distributional Syntax?
30
Problem: Identifying Constituents
31
A nested distributional model
  • We'd like a model that
  • Ties spans to linear contexts (like
    distributional clustering)
  • Considers only proper tree structures (like
    PCFGs)
  • Has no symmetries to break (like a dependency
    model)

32
Constituent Context Model (CCM)
33
Results: Constituency
34
Results: Dependencies
35
Results: Combined Models
36
Multilingual Results