Learning PCFGs: Estimating Parameters, Learning Grammar Rules
1
Learning PCFGs: Estimating Parameters, Learning Grammar Rules
  • Many slides are taken or adapted from slides by Dan Klein

2
Treebanks
An example tree from the Penn Treebank
3
The Penn Treebank
  • 1 million tokens
  • In 50,000 sentences, each labeled with
  • A POS tag for each token
  • Labeled constituents
  • Extra information
  • Phrase annotations like -TMP
  • Empty constituents for wh-movement traces,
    empty subjects for raising constructions

4
Supervised PCFG Learning
  • Preprocess the treebank
  • Remove all extra information (empties, extra
    annotations)
  • Convert to Chomsky Normal Form
  • Possibly prune some punctuation, lower-case all
    words, compute word shapes, and do other
    preprocessing to combat sparsity
  • Count the occurrences of each nonterminal c(N)
    and of each observed production rule c(N → NL NR)
    and c(N → w)
  • Set the probability for each rule to the MLE
    (a code sketch follows below)
  • P(N → NL NR) = c(N → NL NR) / c(N)
  • P(N → w) = c(N → w) / c(N)
  • Easy, peasy, lemon-squeezy.
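A minimal sketch of this supervised MLE estimation, assuming the treebank is already in CNF and each tree is encoded as nested (label, children) tuples; the encoding and function names are illustrative assumptions, not from the slides.

from collections import Counter

def count_rules(tree, nonterm_counts, rule_counts):
    # A tree is (label, [left, right]) for a binary rule, (label, word) for a lexical rule
    label, children = tree
    nonterm_counts[label] += 1
    if isinstance(children, str):                       # lexical rule N -> w
        rule_counts[(label, children)] += 1
    else:                                               # binary rule N -> NL NR
        rule_counts[(label, tuple(c[0] for c in children))] += 1
        for child in children:
            count_rules(child, nonterm_counts, rule_counts)

def mle_pcfg(treebank):
    # P(N -> rhs) = c(N -> rhs) / c(N)
    nonterm_counts, rule_counts = Counter(), Counter()
    for tree in treebank:
        count_rules(tree, nonterm_counts, rule_counts)
    return {(lhs, rhs): c / nonterm_counts[lhs]
            for (lhs, rhs), c in rule_counts.items()}

# Toy treebank: one tree for "dogs bark"
treebank = [("S", [("NP", "dogs"), ("VP", "bark")])]
print(mle_pcfg(treebank))   # every observed rule gets probability 1.0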

5
Complications
  • Smoothing
  • Especially for lexicalized grammars, many test
    productions will never be observed during
    training
  • We don't necessarily want to assign these
    productions zero probability
  • Instead, define backoff distributions, e.g.
  • Pfinal(VP[transmogrified] → V[transmogrified] PP[into])
      = α · P(VP[transmogrified] → V[transmogrified] PP[into])
      + (1 − α) · P(VP → V PP[into])
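A small illustration of the interpolation idea, assuming a fixed α and a flat dictionary encoding of rules; the data structures and function name here are my own, not the slides'.

def backoff_rule_prob(lex_rule, backoff_rule, p_lex, p_backoff, alpha=0.7):
    # alpha * P_lexicalized + (1 - alpha) * P_backoff
    return alpha * p_lex.get(lex_rule, 0.0) + (1 - alpha) * p_backoff.get(backoff_rule, 0.0)

# The lexicalized rule was never seen in training (MLE = 0), but the
# unlexicalized backoff keeps the final probability non-zero.
p_lex = {}
p_backoff = {("VP", ("V", "PP[into]")): 0.1}
print(round(backoff_rule_prob(("VP[transmogrified]", ("V[transmogrified]", "PP[into]")),
                              ("VP", ("V", "PP[into]")), p_lex, p_backoff), 3))   # 0.03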

6
Problems with Supervised PCFG Learning
  • Coming up with labeled data is hard!
  • Time-consuming
  • Expensive
  • Hard to adapt to new domains, tasks, languages
  • Corpus availability drives research (instead of
    tasks driving the research)
  • The Penn Treebank took many person-years to
    annotate manually.

7
Unsupervised Learning of PCFGs: Feasible?
8
Unsupervised Learning
  • Systems take raw data and automatically discover
    structure
  • Why?
  • More data is available
  • Kids learn (some aspects of) language with no
    supervision
  • Insights into machine learning and clustering

9
Grammar Induction and Learnability
  • Some have argued that learning syntax from
    positive data alone is impossible
  • Gold, 1967: non-identifiability in the limit
  • Chomsky, 1980: poverty of the stimulus
  • Surprising result: it's possible to get entirely
    unsupervised parsing to work (reasonably) well.

10
Learnability
  • Learnability: formal conditions under which a
    class of languages can be learned
  • Setup
  • Class of languages ℒ
  • Algorithm H (the learner)
  • H sees a sequence X of strings x1 … xn
  • H maps sequences X to languages L in ℒ
  • Question: for which classes ℒ do learners H
    exist?

11
Learnability: Gold, 1967
  • Criterion: identification in the limit
  • A presentation of L is an infinite sequence of
    x's from L in which each x occurs at least once
  • A learner H identifies L in the limit if, for any
    presentation of L, from some point n onwards, H
    always outputs L
  • A class ℒ is identifiable in the limit if there
    is some single H which correctly identifies in
    the limit every L in ℒ.
  • Example: ℒ = {{a}, {a, b}} is identifiable in the
    limit.
  • Theorem (Gold, 1967): any ℒ which contains all
    finite languages and at least one infinite
    language (i.e., is superfinite) is unlearnable in
    this sense.

12
Learnability: Gold, 1967
  • Proof sketch
  • Assume ℒ is superfinite and that H identifies ℒ
    in the limit
  • There exists a chain L1 ⊂ L2 ⊂ … ⊂ L∞
  • Construct the following misleading sequence
  • Present strings from L1 until H outputs L1
  • Present strings from L2 until H outputs L2
  • …
  • This is a presentation of L∞,
    but H never outputs L∞

13
Learnability: Horning, 1969
  • Problem: IIL requires that H succeeds on all
    examples, even the weird ones
  • Another criterion: measure-one identification
  • Assume a distribution PL(x) for each L
  • Assume PL(x) puts non-zero probability on all and
    only the x in L
  • Assume an infinite presentation of x drawn i.i.d.
    from PL(x)
  • H measure-one identifies L if the probability of
    drawing a sequence X from which H can identify
    L is 1.
  • Theorem (Horning, 1969): PCFGs can be identified
    in this sense.
  • Note: there can be misleading sequences, but
    they have to be (infinitely) unlikely

14
Learnability: Horning, 1969
  • Proof sketch
  • Assume ℒ is a recursively enumerable set of
    recursive languages (e.g., the set of all PCFGs)
  • Assume an ordering on all strings x1 < x2 < …
  • Define: two sequences A and B agree through n
    iff for all x < xn, x is in A ⇔ x is in B.
  • Define the error set E(L, n, m):
  • All sequences such that the first m elements do
    not agree with L through n
  • These are the sequences which contain early
    strings outside of L (can't happen), or which
    fail to contain all of the early strings in L
    (happens less as m increases)
  • Claim: P(E(L, n, m)) goes to 0 as m goes to ∞
  • Let dL(n) be the smallest m such that P(E) < 2^(−n)
  • Let d(n) be the largest dL(n) among the first n
    languages
  • Learner: after d(n) examples, pick the first L
    that agrees with the evidence through n
  • This can only fail for sequences X if X keeps
    showing up in E(L, n, d(n)), which happens
    infinitely often with probability zero.

15
Learnability
  • Gold's results say little about real learners
    (the requirements are too strong)
  • Horning's algorithm is completely impractical
  • It needs astronomical amounts of data
  • Even measure-one identification doesn't say
    anything about tree structures
  • It only talks about learning grammatical sets
  • Strong generative vs. weak generative capacity

16
Unsupervised POS Tagging
  • Some (discouraging) experiments
  • Merialdo 94
  • Setup
  • You know the set of allowable tags for each word
    (but not the frequency of each tag)
  • Learn a supervised model on k training sentences
  • Learn P(w | t) and P(ti | ti−1, ti−2) on these
    sentences (a sketch of this step is below)
  • On the remaining n > k sentences, re-estimate
    with EM
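A minimal sketch of the supervised step, assuming the k training sentences are given as lists of (word, tag) pairs; smoothing, the tag dictionary, and the subsequent EM re-estimation on untagged data are omitted, and all names here are illustrative.

from collections import Counter

def supervised_trigram_tagger(tagged_sentences):
    # MLE estimates of P(w | t) and P(t_i | t_{i-2}, t_{i-1}) from tagged data
    emit, tag, trans, ctx = Counter(), Counter(), Counter(), Counter()
    for sent in tagged_sentences:
        tags = ["<s>", "<s>"] + [t for _, t in sent]
        for w, t in sent:
            emit[(t, w)] += 1
            tag[t] += 1
        for i in range(2, len(tags)):
            trans[(tags[i - 2], tags[i - 1], tags[i])] += 1
            ctx[(tags[i - 2], tags[i - 1])] += 1
    p_emit = {(t, w): c / tag[t] for (t, w), c in emit.items()}
    p_trans = {k: c / ctx[k[:2]] for k, c in trans.items()}
    return p_emit, p_trans

# k = 1 tagged training sentence
p_emit, p_trans = supervised_trigram_tagger([[("dogs", "NNS"), ("bark", "VBP")]])
print(p_emit[("NNS", "dogs")], p_trans[("<s>", "<s>", "NNS")])   # 1.0 1.0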

17
Merialdo Results
18
Grammar Induction
  • Unsupervised Learning of Grammars and Parameters

19
Right-branching Baseline
  • In English (but not necessarily in other
    languages), trees tend to be right-branching
  • A simple, English-specific baseline is to choose
    the fully right-branching (right-chain) structure
    for each sentence, as sketched below.
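A toy illustration of that baseline, building an unlabeled right-branching bracketing over a token list; purely illustrative, not from the slides.

def right_branching(tokens):
    # Fully right-branching binary bracketing: [w1, [w2, [w3, ...]]]
    if len(tokens) == 1:
        return tokens[0]
    return [tokens[0], right_branching(tokens[1:])]

print(right_branching(["the", "dog", "barked", "loudly"]))
# ['the', ['dog', ['barked', 'loudly']]]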

20
Distributional Clustering
21
Nearest Neighbors
22
Learn PCFGs with EM: Lari and Young, 1990
  • Setup
  • Full binary grammar with n nonterminals X1, …, Xn
  • (that is, at the beginning, the grammar has all
    possible rules; a sketch of this starting grammar
    follows below)
  • Parse uniformly/randomly at first
  • Re-estimate rule expectations off of the parses
  • Repeat
  • Their conclusion: it doesn't really work
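For concreteness, a minimal sketch of that starting point, assuming a fixed vocabulary and a uniform distribution over each nonterminal's expansions; the function name and rule representation are mine, not the slides'.

from itertools import product

def full_uniform_grammar(n, vocabulary):
    # All binary rules X_i -> X_j X_k plus all lexical rules X_i -> w,
    # with a uniform distribution over each left-hand side's expansions
    nts = [f"X{i}" for i in range(1, n + 1)]
    expansions_per_lhs = n * n + len(vocabulary)
    p_binary = {(a, (b, c)): 1.0 / expansions_per_lhs
                for a, b, c in product(nts, nts, nts)}
    p_lex = {(a, w): 1.0 / expansions_per_lhs for a in nts for w in vocabulary}
    return p_binary, p_lex

p_binary, p_lex = full_uniform_grammar(2, ["dogs", "bark"])
print(len(p_binary), len(p_lex))   # 8 4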

23
EM for PCFGs: Details
  • Start with a full grammar, with all possible
    binary rules for our nonterminals N1 … Nk.
    Designate one of them as the start symbol, say N1
  • Assign some starting distribution to the rules,
    such as
  • Random
  • Uniform
  • Some smart initialization techniques (see the
    assigned reading)
  • E-step: take an unannotated sentence S and
    compute, for all nonterminals N, NL, NR and all
    terminals w
  • E(N | S), E(N → NL NR, N is used | S),
    E(N → w, N is used | S)
  • M-step: reset the rule probabilities to the MLE
  • P(N → NL NR) = E(N → NL NR | S) / E(N | S)
  • P(N → w) = E(N → w | S) / E(N | S)
  • Repeat the E-step and M-step until the rule
    probabilities stabilize (converge)

24
Definitions
This is the sum of P(T, S | G) over all possible
trees T for w1 … wm where the root is N1.
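The equations on this slide did not survive the transcript; assuming the standard inside–outside notation (as in Lari & Young, 1990, and Manning & Schütze, ch. 11), the quantities being defined are:

\[ \beta_j(p, q) \;=\; P(w_p \cdots w_q \mid N^j \text{ spans } p..q,\; G) \qquad \text{(inside probability)} \]
\[ \alpha_j(p, q) \;=\; P(w_1 \cdots w_{p-1},\; N^j \text{ spans } p..q,\; w_{q+1} \cdots w_m \mid G) \qquad \text{(outside probability)} \]
\[ P(w_1 \cdots w_m \mid G) \;=\; \beta_1(1, m) \;=\; \sum_{T} P(T, S \mid G), \]

where the last sum ranges over all trees T for w1 … wm whose root is N1.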
25
E-Step
  • We can define the expectations we want in terms
    of the sentence probability and the α, β
    quantities (reconstructed below)
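The formulas themselves were lost in the transcript; assuming the standard inside–outside expectations (same notation as above), they take the form:

\[ E(N^j \text{ used} \mid S) \;=\; \frac{1}{P(S \mid G)} \sum_{1 \le p \le q \le m} \alpha_j(p, q)\, \beta_j(p, q) \]
\[ E(N^j \to N^l N^r \mid S) \;=\; \frac{1}{P(S \mid G)} \sum_{p \le d < q} \alpha_j(p, q)\, P(N^j \to N^l N^r)\, \beta_l(p, d)\, \beta_r(d+1, q) \]
\[ E(N^j \to w \mid S) \;=\; \frac{1}{P(S \mid G)} \sum_{h \,:\, w_h = w} \alpha_j(h, h)\, P(N^j \to w) \]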

26
Inside Probabilities
  • Base case
  • Induction

[Diagram: N^j spans w_p … w_q, splitting into N^l over w_p … w_d and N^r over w_{d+1} … w_q; a code sketch follows]
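A compact sketch of this inside computation (base case: lexical rules; induction: sum over binary rules and split points), using rule dictionaries shaped like the ones in the earlier sketches; the names and data layout are illustrative assumptions.

from collections import defaultdict

def inside(words, nonterminals, p_binary, p_lex):
    # beta[(N, p, q)] = P(N derives words[p..q]), with
    # p_binary[(N, (NL, NR))] and p_lex[(N, w)] as rule probabilities
    m = len(words)
    beta = defaultdict(float)
    for p, w in enumerate(words):                       # base case: N -> w
        for N in nonterminals:
            beta[(N, p, p)] = p_lex.get((N, w), 0.0)
    for span in range(2, m + 1):                        # induction on span length
        for p in range(m - span + 1):
            q = p + span - 1
            for (N, (NL, NR)), prob in p_binary.items():
                beta[(N, p, q)] += sum(prob * beta[(NL, p, d)] * beta[(NR, d + 1, q)]
                                       for d in range(p, q))
    return beta

# Toy grammar: S -> NP VP, NP -> dogs, VP -> bark
beta = inside(["dogs", "bark"], ["S", "NP", "VP"],
              {("S", ("NP", "VP")): 1.0},
              {("NP", "dogs"): 1.0, ("VP", "bark"): 1.0})
print(beta[("S", 0, 1)])   # 1.0 = P(sentence | grammar)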
27
Outside Probabilities
  • Base case
  • Induction
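The base case and induction equations here were also lost in the transcript; assuming the same standard notation, with N^1 as the start symbol over w1 … wm, they are:

\[ \alpha_1(1, m) = 1, \qquad \alpha_j(1, m) = 0 \;\text{ for } j \neq 1 \]
\[ \alpha_j(p, q) \;=\; \sum_{f, g} \Big[ \sum_{e = q+1}^{m} \alpha_f(p, e)\, P(N^f \to N^j N^g)\, \beta_g(q+1, e) \;+\; \sum_{e = 1}^{p-1} \alpha_f(e, q)\, P(N^f \to N^g N^j)\, \beta_g(e, p-1) \Big] \]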

28
Problem: Model Symmetries
29
Distributional Syntax?
30
Problem: Identifying Constituents
31
A nested distributional model
  • We'd like a model that
  • Ties spans to linear contexts (like
    distributional clustering)
  • Considers only proper tree structures (like
    PCFGs)
  • Has no symmetries to break (like a dependency
    model)

32
Constituent Context Model (CCM)
33
Results: Constituency
34
Results: Dependencies
35
Results: Combined Models
36
Multilingual Results