1
Overview of PCFGs and the inside/outside
algorithms
based on the book by E. Charniak, 1993
  • Presented by Jeff Bilmes

2
Outline
  • Shift/Reduce & Chart Parsing
  • PCFGs
  • Inside/Outside
  • inside
  • outside
  • use for EM training.

3
Shift-Reduce Parsing
  • Top-down parsing
  • start at non-terminal symbols and recurse down
    trying to find a parse of a given sentence.
  • Can get stuck in loops (left-recursion)
  • Bottom-up parsing
  • start at the terminal level and build up
    non-terminal constituents as we go along.
  • Shift-reduce parsers (used often for machine
    languages, which are designed to be as
    unambiguous as possible) are quite useful
  • Find sequences of terminals that match the
    right-hand side of CFG productions, and reduce
    them to non-terminals. Do the same for sequences
    of terminals/non-terminals to reduce them to
    non-terminals.

4
Example Arithmetic
5
(No Transcript)
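The arithmetic example on these two slides is an image that was not transcribed. For illustration (my own example, not from the slides), a shift-reduce trace for a tiny grammar E → E + T | T, T → n on the input "n + n" might proceed:

    Stack      Input     Action
    (empty)    n + n     shift n
    n          + n       reduce T → n
    T          + n       reduce E → T
    E          + n       shift +
    E +        n         shift n
    E + n      (empty)   reduce T → n
    E + T      (empty)   reduce E → E + T
    E          (empty)   accept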
6
Shift-Reduce Parsing
  • This is great for compiler writers (grammars are
    often unambiguous, and can sometimes be designed
    to be so).
  • Problem: ambiguity. One rule might suggest a
    shift while another might suggest a reduce.
  • S → if E then S | if E then S else S
  • On stack: if E then S
  • Input: else
  • Should we shift (getting the 2nd rule) or reduce
    (getting the 1st rule)?
  • Can get reduce-reduce conflicts as well, given
    two production rules of the form A → γ, B → γ
  • Solution: backtracking (a general search method)
  • In ambiguous situations, choose one (using a
    heuristic)
  • whenever we get to a situation where we can't
    parse, we backtrack to the location of the
    ambiguity

7
Chart Parsing vs. Shift-Reduce
  • Shift-reduce parsing: computation is wasted if
    back-tracking occurs
  • the long NP in "Big brown funny dogs are hungry"
    remains parsed while we search for the next
    constituent, but it becomes undone if a category
    below it on the stack needs to be reparsed
  • Chart parsing avoids reparsing constituents that
    have already been found grammatical, by storing
    all grammatical substrings for the duration of
    the parse.

8
Chart Parsing
  • Uses three main data structures
  • 1) a key list: a list of the next constituents to
    insert into the chart (e.g., terminals or
    non-terminals)
  • 2) a chart: a triangular table that keeps track
    of the starting position (x-axis) and length
    (y-axis) of a constituent
  • 3) a set of edges: markers of positions on the
    chart, recording rules/productions and how far
    along in each rule the parse currently is (it is
    this last structure that helps avoid reparsing);
    a sketch of these structures follows below.
  • We can use chart parsing for producing (and
    summing) parses for PCFGs and for producing
    inside/outside probabilities. But what are these?
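To make the three data structures concrete, here is a minimal Python sketch (my illustration; the class and field names are assumptions, not from the slides):

    # Minimal sketch of the three chart-parsing data structures.
    from dataclasses import dataclass, field

    @dataclass
    class Edge:
        lhs: str      # non-terminal on the left of the production
        rhs: tuple    # symbols on the right-hand side
        dot: int      # how far along in the rule the parse currently is
        start: int    # position in the sentence where this edge begins

    @dataclass
    class Chart:
        n: int        # sentence length
        # cells[(start, length)] -> completed constituents (triangular table)
        cells: dict = field(default_factory=dict)

    key_list = []     # next constituents (terminals/non-terminals) to insert
    edges = []        # set of edges marking positions on the chart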

9
PCFGs
  • A CFG with a P in front.
  • A (typically inherently ambiguous) grammar with
    probabilities associated with each production
    rule for a non-terminal symbol.

10
PCFGs
  • Example parse (tree figure; its rule
    probabilities are listed below)
  • Prob of this parse is
  • 0.8 × 0.2 × 0.4 × 0.05 × 0.45 × 0.3 × 0.4 × 0.4
    × 0.5 ≈ 3.5 × 10⁻⁵

(parse-tree figure; rule probabilities shown:
0.8, 0.3, 0.2, 0.4, 0.4, 0.05, 0.45, 0.4, 0.5)
11
PCFGs
  • Gives probabilities for a string of words, where
    t_{1,n} varies over all possible parse trees for
    the word string w_{1,n} (the slide's equation is
    reconstructed below)
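The equation on this slide is an image in the original; presumably it is the standard PCFG sentence probability, summing over parses (my reconstruction):

\[ P(w_{1,n}) \;=\; \sum_{t_{1,n}} P(t_{1,n}, w_{1,n}) \;=\; \sum_{t_{1,n}:\; \mathrm{yield}(t_{1,n}) = w_{1,n}} P(t_{1,n}), \]

where each parse-tree probability is the product of the probabilities of the productions used in the tree.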

12
Domination Notation
  • Non-terminal dominating terminal symbols
  • So in general, N^j_{k,l} means that the j-th
    non-terminal (N^j) dominates the terminals
    starting at k and ending at l (inclusive).

13
Probability Rules
  • for a non-terminal dominating the terminals
    starting at k and ending at l, we get (see the
    reconstruction below)
  • for any k, l and any m, n, ..., q with
    k ≤ m < n ≤ ... ≤ q < l
  • X, Y, ..., Z can correspond to any valid set of
    terminals or non-terminals that partition the
    terminals from k to l.
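The equation itself is an image in the original; based on the surrounding text it presumably states that a rule's probability does not depend on the span it covers, e.g. in the binary (Chomsky-normal-form) case (my reconstruction):

\[ P(N^j_{k,l} \to N^p_{k,m}\, N^q_{m+1,l}) \;=\; P(N^j \to N^p N^q) \quad \text{for any } k \le m < l, \]

and analogously for longer right-hand sides X, Y, ..., Z partitioning the terminals from k to l.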

14
Chomsky Hierarchy (reminder)
15
Regular grammar (reminder)
  • Generates regular languages (e.g., regular
    expressions without bells/whistles, FSAs, etc.)
  • A → a, where A is a non-terminal and a is a
    terminal.
  • A → aB, where A and B are non-terminals and a is
    a terminal.
  • A → ε, where A is a non-terminal.
  • Ex: {aⁿbᵐ : m, n > 0} can be generated by (a
    sample derivation follows below)
  • S → aA
  • A → aA
  • A → bB
  • B → bB
  • B → ε
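For instance, the string aab (n = 2, m = 1) can be derived as S ⇒ aA ⇒ aaA ⇒ aabB ⇒ aab, using S → aA, A → aA, A → bB, and B → ε in turn.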

16
Context-Free (reminder)
  • Generates context-free languages
  • Left-hand side of a production rule can be only
    one non-terminal symbol
  • Every rule is of the form
  • A → α
  • where α is a string of terminals or
    non-terminals.
  • context-free, since the non-terminal A can always
    be replaced by α regardless of the context in
    which A occurs.

17
Context-Free (reminder)
  • Example: generate {aⁿbⁿ : n ≥ 0}
  • A → aAb
  • A → ε
  • The above language can't be generated by a
    regular grammar (the language is not regular)
  • Often used for compilers
  • No probabilities (yet).

18
Context-Sensitive (reminder)
  • Generates context-sensitive languages
  • Generated by rules of the form
  • αAβ → αγβ
  • (with S → ε allowed as a special case, so the
    empty string can be generated)
  • where A is a non-terminal, α is any string of
    terminals/non-terminals, β is any string of
    terminals/non-terminals, and γ is any non-empty
    string of terminals/non-terminals.

19
Context-Sensitive (reminder)
  • Example: {aⁿbⁿcⁿ : n ≥ 1}
  • A → abc
  • A → aABc
  • cB → Bc
  • bB → bb
  • Example 2:
  • {aⁿ : n prime}
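For the first grammar, a sample derivation of aabbcc is: A ⇒ aABc ⇒ aabcBc (using A → abc) ⇒ aabBcc (cB → Bc) ⇒ aabbcc (bB → bb).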

20
Type-0 (unrestricted) grammar
  • all languages that can be recognized by a Turing
    machine (i.e., ones for which the TM can say yes
    and then halt).
  • Known as recursively enumerable languages.
  • More general than context-sensitive.
  • Example (I think)
  • {aⁿbᵐ : n prime, m is the prime following n}
  • aabbb, aaabbbbb, aaaaabbbbbbb, etc.

21
Independence Assumptions
  • PCFGs are CFGs with probabilities, but there are
    also statistical independence assumptions made
    about the random events.
  • Probabilities are unaffected by certain parts of
    the tree, given certain other non-terminals.

22
Example
  • The following parse tree has the equation shown
    on the slide (figure not transcribed)

(figure annotations: B_{1,3} is outside C_{4,5};
A_{1,5} is above C_{4,5})
23
Chomsky vs. Conditional Independence
  • The fact that the grammar is context-free means
    that only certain production rules can occur in
    the grammar specification. Context-freeness
    determines the set of possible trees, not their
    probabilities.
  • This alone does not mean that the probabilities
    may not be influenced by parses outside a given
    non-terminal. This is where we need the
    conditional independence assumptions over
    probabilistic events.
  • The events are parse events, or just dominance
    events. E.g., without independence assumptions,
    p(w₄,w₅ | C_{4,5}, B_{2,3}) might not be the same
    as p(w₄,w₅ | C_{4,5}, B_{1,3}), or could even
    change depending on how we parse some other set
    of constituents.

24
Issues
  • We want to be able to compute p(w_{1,n} | G)
  • But the number of parses of a given sentence can
    grow exponentially in the length of the sentence
    for a given grammar, so naïve summing won't work.

25
Problems to Solve
  • 1) compute the probability of a sentence for a
    given grammar, p(w_{1,n} | G)
  • 2) find the most likely parse for a given
    sentence.
  • 3) train the grammar probabilities in some good
    way (e.g., maximum likelihood).
  • These can all be solved by the inside-outside
    algorithm.

26
A Note on comparisons to HMMs
  • The text says that HMMs and probabilistic regular
    grammars (PRGs) assign different probabilities to
    strings.
  • A PRG is such that summing the probabilities of
    all sentences of all lengths should be one.
  • An HMM says that summing the probabilities of all
    sentences of a given length T should be one.
  • But HMMs have an implicit conditioning on T. The
    HMM probability is (reconstructed below)
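The equation on this slide is an image in the original; presumably the point is that an HMM really defines a length-conditional distribution (my reconstruction):

\[ P_{\mathrm{HMM}}(w_{1,T}) = P(w_{1,T} \mid T), \qquad \sum_{w_{1,T}} P(w_{1,T} \mid T) = 1 \;\text{ for each } T, \]

whereas a PRG normalizes over sentences of all lengths: \( \sum_T \sum_{w_{1,T}} P(w_{1,T}) = 1 \).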

27
A Note on comparisons to HMMs
  • But HMMs can be defined to have an explicit
    length distribution.
  • This can be done implicitly in an HMM by having a
    start and a stop state, thus defining P(T).
  • We can alternatively explicitly make P(T) equal
    to the PRG probability obtained by summing over
    all sentences of length T.

28
Example: a PRG and its parses (figure not transcribed)
29
forward/backward in HMMs
  • Forward (alpha) recursion
  • Backward (beta) recursion
  • Probability of a sentence (the standard equations
    are reconstructed below)
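The equations are images in the original; these are the standard HMM forward/backward recursions (my reconstruction, with transition probabilities a_{ij} and emission probabilities b_j(w)):

\[ \alpha_t(j) = P(w_{1,t}, q_t = j) = \sum_i \alpha_{t-1}(i)\, a_{ij}\, b_j(w_t), \]
\[ \beta_t(i) = P(w_{t+1,T} \mid q_t = i) = \sum_j a_{ij}\, b_j(w_{t+1})\, \beta_{t+1}(j), \]
\[ P(w_{1,T}) = \sum_i \alpha_t(i)\, \beta_t(i) \quad \text{for any } t. \]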

30
association of α/β with outside/inside
  • For a PRG, the backwards (β) probability is like
    the stuff below a given non-terminal, or perhaps
    inside the non-terminal.
  • For a PRG, the forwards (α) probability is like
    the stuff above a given non-terminal, or perhaps
    outside the non-terminal.

31
Inside/Outside (β/α) probability defs.
  • Inside probability: the probability of the stuff
    that is inside (or that is dominated by) a given
    non-terminal, corresponding to the terminals in
    the range k through l.
  • Outside probability: the probability of the stuff
    that is outside the terminal range k through l.
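In the usual notation (my reconstruction of the slide's definitions, following the standard inside-outside convention):

\[ \beta_j(k,l) = P(w_{k,l} \mid N^j_{k,l}, G) \quad \text{(inside)}, \]
\[ \alpha_j(k,l) = P(w_{1,k-1},\, N^j_{k,l},\, w_{l+1,n} \mid G) \quad \text{(outside)}. \]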

32
Inside/Outside probability defs.
(figure: the region outside N^j vs. the region
inside N^j)
33
Inside/Outside probability
  • As in HMMs, we can get sentence probabilities
    from the inside/outside probabilities for any
    non-terminal (see below).
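For instance, if N^1 is the start symbol, the sentence probability is the inside probability of the root span (my reconstruction):

\[ P(w_{1,n} \mid G) = \beta_1(1,n). \]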

34
Inside probability computation
  • For Chomsky normal form
  • (for the more general case, see today's handout
    from the Charniak text; we need to use details
    about edges in the chart)
  • Base case (to then go from a terminal on up the
    parse tree)
  • Next case: consider the range k through l, and
    all non-terminals that could generate w_{k,l}.
    Since we are in Chomsky normal form, we have two
    non-terminals N^p and N^q, but we must consider
    all possible terminal ranges within k through l
    so that these two non-terminals jointly dominate
    k through l.

35
Inside probability computation
sum over all pairs of non-terminals p,q and the
location of the split m.
by the chain rule of probability.
PCFG conditional independence.
Rename. (The full recursion is reconstructed below.)
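The equations are images in the original; for a CNF grammar the standard inside recursion they describe is (my reconstruction):

\[ \beta_j(k,k) = P(N^j \to w_k) \quad \text{(base case)}, \]
\[ \beta_j(k,l) = \sum_{p,q} \sum_{m=k}^{l-1} P(N^j \to N^p N^q)\, \beta_p(k,m)\, \beta_q(m+1,l). \]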
36
Inside probability computation
  • m = number of non-terminals (so up to m³ unique
    production rules in CNF)
  • n = length of the string
  • computational complexity is
  • O(n³m³)
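To make this complexity concrete, here is a minimal Python sketch of the CNF inside recursion (my illustration, not from the slides; the grammar dictionaries binary_rules and lexical_rules and all names are assumptions). The three span/split loops give the n³ factor; iterating over rule triples (j, p, q) gives the m³ factor.

    # Minimal sketch of the CNF inside algorithm (illustrative only).
    #   binary_rules:  {(j, p, q): P(N^j -> N^p N^q)}
    #   lexical_rules: {(j, word): P(N^j -> word)}
    from collections import defaultdict

    def inside_probabilities(words, binary_rules, lexical_rules):
        n = len(words)
        beta = defaultdict(float)  # beta[(j, k, l)], 0-based inclusive spans
        # Base case: preterminals generating single words.
        for k, w in enumerate(words):
            for (j, word), p in lexical_rules.items():
                if word == w:
                    beta[(j, k, k)] += p
        # Recursion: build each span from all pairs of smaller spans.
        for span in range(2, n + 1):
            for k in range(n - span + 1):
                l = k + span - 1
                for (j, p_nt, q_nt), rule_p in binary_rules.items():
                    for m in range(k, l):  # split point between children
                        beta[(j, k, l)] += (rule_p * beta[(p_nt, k, m)]
                                                   * beta[(q_nt, m + 1, l)])
        return beta

The sentence probability is then beta[(start_symbol, 0, n - 1)], matching β₁(1,n) in the slides' 1-based notation.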

37
Outside probability
  • Outside probability
  • it can also be used to compute the word
    probability, as follows (reconstructed below)
  • how do we get this?
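The equations are images in the original; the standard statement is (my reconstruction):

\[ P(w_{1,n} \mid G) = \sum_j \alpha_j(k,k)\, P(N^j \to w_k) \quad \text{for any position } k, \]

i.e., the word probability can be computed from the outside probabilities of single-word spans.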

38
Outside probability computation
  • Note again this is true for any k at all.
  • So how do we compute outside probabilities?
  • Inside probabilities are calculated bottom-up,
    and (not surprisingly) outside probabilities are
    calculated top-down, so we start at the top

un-marginalizing (summing in) all non-terminals j.
chain rule.
conditional independence and rename.
39
Outside probability computation
  • Starting at the top
  • Next, to calculate the outside probability for
    N^j_{k,l} (and considering we've got Chomsky
    normal form), there are two possibilities for how
    this could have come from higher constituents:
    namely, N^j is either on the left or the right of
    its sibling

These two events are exclusive and exhaustive, so
this means we'll need to sum the probabilities of
the two cases (the recursion is reconstructed
below).
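The equations on the next slides are images in the original; the standard CNF outside recursion they correspond to is (my reconstruction, with N^1 the start symbol):

\[ \alpha_1(1,n) = 1, \qquad \alpha_j(1,n) = 0 \;\text{ for } j \neq 1 \quad \text{(base case)}, \]
\[ \alpha_j(k,l) = \sum_{f,g} \sum_{m=l+1}^{n} \alpha_f(k,m)\, P(N^f \to N^j N^g)\, \beta_g(l+1,m) \;+\; \sum_{f,g} \sum_{m=1}^{k-1} \alpha_f(m,l)\, P(N^f \to N^g N^j)\, \beta_g(m,k-1), \]

where the first sum covers the case with N^j on the left of its sibling and the second the case with N^j on the right.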
40
Outside probability computation
41
Outside probability computation
42
Outside probability computation
  • m = number of non-terminals (so up to m³ unique
    production rules in CNF)
  • n = length of the string
  • computational complexity is
  • O(n³m³)

43
Other notes
  • We can form products of inside/outside values
    (but they have a slightly different meaning than
    with HMMs).
  • So the final sum is the probability of the word
    sequence together with some non-terminal
    constituent spanning k through l (different from
    HMMs, where it is just the observations). We get
    (reconstructed below)

follows again from conditional independence.
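The equations are images in the original; presumably they are (my reconstruction):

\[ \alpha_j(k,l)\, \beta_j(k,l) = P(w_{1,n},\, N^j_{k,l} \mid G), \]
\[ \sum_j \alpha_j(k,l)\, \beta_j(k,l) = P(w_{1,n},\, \text{some constituent spans } k \ldots l \mid G). \]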
44
Training PCFG probabilities
  • The inside/outside procedure can be used to
    compute expected sufficient statistics (e.g.,
    posteriors over production rules) so that we can
    use these in an EM iterative update procedure.
  • This is very useful when we have a database of
    text but we don't have a database of observed
    parses for it (i.e., when we don't have a
    treebank). We can use EM inside/outside to find
    a maximum-likelihood solution for the probability
    values.
  • All we need are equations that give us expected
    counts.
  • Note: the book's notation mentions ratios of
    actual counts C, but they really are expected
    counts E[count], which are sums of probabilities,
    as in normal EM. Recall that the expected value
    of an indicator is its probability:
  • E[1_{event}] = Pr(event)

45
Training PCFG probabilities
  • First, we need the definition of the posterior as
    a ratio of expected counts for a production
  • Next, we need the expected counts (both are
    reconstructed below)
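The equations are images in the original; the standard EM re-estimation formulas they correspond to are (my reconstruction, for a single sentence w_{1,n}):

\[ \hat P(N^j \to N^p N^q) = \frac{E[\mathrm{count}(N^j \to N^p N^q)]}{E[\mathrm{count}(N^j)]}, \]
\[ E[\mathrm{count}(N^j \to N^p N^q)] = \frac{1}{P(w_{1,n})} \sum_{k \le m < l} \alpha_j(k,l)\, P(N^j \to N^p N^q)\, \beta_p(k,m)\, \beta_q(m+1,l), \]
\[ E[\mathrm{count}(N^j)] = \frac{1}{P(w_{1,n})} \sum_{k \le l} \alpha_j(k,l)\, \beta_j(k,l). \]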

46
Training PCFG probabilities
  • Lastly, we need terminal node expected counts
    that non-terminal Ni produced vocab word wj
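Again reconstructing the image equation in the standard form:

\[ E[\mathrm{count}(N^i \to w^j)] = \frac{1}{P(w_{1,n})} \sum_{k :\; w_k = w^j} \alpha_i(k,k)\, P(N^i \to w^j). \]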

47
Summary
  • PCFGs are powerful
  • There are good training algorithms for them.
  • But are they enough? Context-sensitive grammars
    were really developed for natural (written and
    spoken) language, but algorithms for them are not
    efficient.