Weighted Deduction as a Programming Language - PowerPoint PPT Presentation


Transcript and Presenter's Notes

Title: Weighted Deduction as a Programming Language


1
Weighted Deduction as a Programming Language
  • Jason Eisner

co-authors on various parts of this work: Eric
Goldlust, Noah A. Smith, John Blatz, Wes Filardo,
Wren Thornton
CMU and Google, May 2008
2
An Anecdote from ACL05
-Michael Jordan
3
An Anecdote from ACL05
-Michael Jordan
4
Conclusions to draw from that talk
  • Mike & his students are great.
  • Graphical models are great. (because they're
    flexible)
  • Gibbs sampling is great. (because it works with
    nearly any graphical model)
  • Matlab is great. (because it frees up Mike and
    his students to doodle all day and then execute
    their doodles)

5
Could NLP be this nice?
  • Mike & his students are great.
  • Graphical models are great. (because they're
    flexible)
  • Gibbs sampling is great. (because it works with
    nearly any graphical model)
  • Matlab is great. (because it frees up Mike and
    his students to doodle all day and then execute
    their doodles)

6
Systems are big! Large-scale noisy data, complex
models, search approximations, software
engineering
7
Systems are big! Large-scale noisy data, complex
models, search approximations, software
engineering
  • Maybe a bit smaller outside NLP
  • But still big and carefully engineered
  • And will get bigger, e.g., as machine vision
    systems do more scene analysis and compositional
    object modeling

8
Systems are big! Large-scale noisy data, complex
models, search approximations, software
engineering
  • Consequences
  • Barriers to entry
  • Small number of players
  • Significant investment to be taken seriously
  • Need to know & implement the standard tricks
  • Barriers to experimentation
  • Too painful to tear up and reengineer your old
    system, to try a cute idea of unknown payoff
  • Barriers to education and sharing
  • Hard to study or combine systems
  • Potentially general techniques are described and
    implemented only one context at a time

9
How to spend one's life?
Didn't I just implement something like this last
month?
  • chart management / indexing
  • cache-conscious data structures
  • memory layout, file formats, integerization
  • prioritization of partial solutions (best-first, A*)
  • lazy k-best, forest reranking
  • parameter management
  • inside-outside formulas, gradients
  • different algorithms for training and decoding
  • conjugate gradient, annealing, ...
  • parallelization
I thought computers were supposed to automate
drudgery
10
Solution
  • Presumably, we ought to add another layer of
    abstraction.
  • After all, this is CS.
  • Hope to convince you that a substantive new layer
    exists.
  • But what would it look like?
  • What's shared by many programs?

11
Can toolkits help?
12
Can toolkits help?
  • Hmm, there are a lot of toolkits.
  • And they're big too.
  • Plus, they don't always cover what you want.
  • Which is why people keep writing them.
  • E.g., I love & use OpenFST and have learned lots
    from its implementation! But sometimes I also
    want ...
  • So what is common across toolkits?
  • automata with > 2 tapes
  • infinite alphabets
  • parameter training
  • A* decoding
  • automatic integerization
  • automata defined by policy
  • mixed sparse/dense implementation (per state)
  • parallel execution
  • hybrid models (90% finite-state)

13
The Dyna language
  • A toolkit's job is to abstract away the
    semantics, operations, and algorithms for a
    particular domain.
  • In contrast, Dyna is domain-independent.
  • (like MapReduce, Bigtable, etc.)
  • Manages data & computations that you specify.
  • Toolkits or applications can be built on top.

14
Warning
  • Lots more beyond this talk
  • See http://dyna.org
  • read our papers
  • download an earlier prototype
  • sign up for updates by email
  • wait for the totally revamped next version

15
A Quick Sketch of Dyna
16
How you build a system (big picture slide)
cool model
practical equations
PCFG
pseudocode (execution order)
tuned C++ implementation (data structures, etc.)
for width from 2 to n:  for i from 0 to n-width:
k = i+width;  for j from i+1 to k-1: ...
17
How you build a system (big picture slide)
cool model
Dyna language specifies these equations.
Most programs just need to compute some values
from other values. Any order is ok.
Feed-forward! Dynamic programming! Message
passing! (including Gibbs)
Must quickly figure out what influences what:
compute Markov blanket, compute transitions in state machine
practical equations
PCFG
pseudocode (execution order)
tuned C++ implementation (data structures, etc.)
for width from 2 to n:  for i from 0 to n-width:
k = i+width;  for j from i+1 to k-1: ...
18
How you build a system (big picture slide)
cool model
  • Dyna language specifies these equations.
  • Most programs just need to compute some values
    from other values. Any order is ok.
  • Some programs also need to update the outputs if
    the inputs change
  • spreadsheets, makefiles, email readers
  • dynamic graph algorithms
  • EM and other iterative optimization
  • Energy of a proposed configuration for MCMC
  • leave-one-out training of smoothing params

practical equations
PCFG
pseudocode (execution order)
tuned C++ implementation (data structures, etc.)
for width from 2 to n:  for i from 0 to n-width:
k = i+width;  for j from i+1 to k-1: ...
19
How you build a system (big picture slide)
cool model
practical equations
PCFG
Compilation strategies (we'll come back
to this)
pseudocode (execution order)
tuned C++ implementation (data structures, etc.)
for width from 2 to n:  for i from 0 to n-width:
k = i+width;  for j from i+1 to k-1: ...
20
Writing equations in Dyna
  • int a.
  • a = b * c.
  • a will be kept up to date if b or c changes.
  • b += x.  b += y.  (equivalent to b = x+y.)
  • b is a sum of two variables. Also kept up to
    date.
  • c += z(1).  c += z(2).  c += z(3).
  • c += z(four).  c += z(foo(bar,5)).

c += z(N).
c is a sum of all nonzero z(...) values. At
compile time, we don't know how many!
21
More interesting use of patterns
  • a = b * c.
  • scalar multiplication
  • a(I) = b(I) * c(I).
  • pointwise multiplication
  • a += b(I) * c(I).  means a = sum over I of b(I)*c(I)
  • dot product; could be sparse
  • a(I,K) += b(I,J) * c(J,K).  sum over J of b(I,J)*c(J,K)
  • matrix multiplication; could be sparse
  • J is free on the right-hand side, so we sum over
    it (see the sketch below)
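The matrix-multiplication rule can be read operationally: join b and c on the shared variable J, multiply, and aggregate with +=. Below is a minimal Python sketch of that reading (the toy dictionaries standing in for the b and c items are made up for illustration; this is not Dyna itself).

# Sparse reading of  a(I,K) += b(I,J) * c(J,K)  with J summed out.
from collections import defaultdict

b = {(0, 1): 2.0, (0, 2): 3.0}           # b(I,J)
c = {(1, 5): 10.0, (2, 5): 1.0}          # c(J,K)

a = defaultdict(float)
for (i, j), bv in b.items():
    for (j2, k), cv in c.items():
        if j == j2:                      # join on the shared variable J
            a[(i, k)] += bv * cv         # aggregate with +=

print(dict(a))                           # {(0, 5): 23.0}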

22
Dyna vs. Prolog
  • By now you may see what we're up to!
  • Prolog has Horn clauses:
  • a(I,K) :- b(I,J), c(J,K).
  • Dyna has Horn equations:
  • a(I,K) += b(I,J) * c(J,K).

Like Prolog: Allow nested terms. Syntactic sugar
for lists, etc. Turing-complete.
Unlike Prolog: Charts, not backtracking! Compile
to efficient C++ classes. Terms have values.
23
Some connections and intellectual debts
  • Deductive parsing schemata (preferably weighted)
  • Goodman, Nederhof, Pereira, McAllester, Warren,
    Shieber, Schabes, Sikkel
  • Deductive databases (preferably with aggregation)
  • Ramakrishnan, Zukowski, Freitag, Specht, Ross,
    Sagiv,
  • Query optimization
  • Usually limited to decidable fragments, e.g.,
    Datalog
  • Theorem proving
  • Theorem provers, term rewriting, etc.
  • Nonmonotonic reasoning
  • Programming languages
  • Efficient Prologs (Mercury, XSB, ...)
  • Probabilistic programming languages (PRISM, IBAL,
    ...)
  • Declarative networking (P2)
  • XML processing languages (XTatic, CDuce)
  • Functional logic programming (Curry, ...)
  • Self-adjusting computation, adaptive memoization
    (Acar et al.)

Increasing interest in resurrecting declarative
and logic-based system specifications.
24
Example CKY and Variations
25
The CKY inside algorithm in Dyna
phrase(X,I,J) += rewrite(X,W) * word(W,I,J).
phrase(X,I,J) += rewrite(X,Y,Z) * phrase(Y,I,Mid) * phrase(Z,Mid,J).
goal += phrase(s,0,sentence_length).

using namespace cky;
chart c;
c[rewrite(s,np,vp)] = 0.7;
c[word("Pierre",0,1)] = 1;
c[sentence_length] = 30;
cin >> c;          // get more axioms from stdin
cout << c[goal];   // print total weight of all parses
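For concreteness, here is a small Python sketch of the same inside computation, with a made-up toy grammar and sentence (the rewrite table and words below are assumptions for illustration only, not part of the slides).

# CKY inside algorithm over a toy grammar.
from collections import defaultdict

words = ["Pierre", "sleeps"]
rewrite = {("np", "Pierre"): 1.0, ("vp", "sleeps"): 1.0, ("s", "np", "vp"): 0.7}
n = len(words)

phrase = defaultdict(float)                  # phrase[(X,I,J)] = inside weight
for i, w in enumerate(words):                # phrase(X,I,J) += rewrite(X,W) * word(W,I,J)
    for rule, p in rewrite.items():
        if len(rule) == 2 and rule[1] == w:
            phrase[(rule[0], i, i + 1)] += p

for width in range(2, n + 1):                # phrase(X,I,J) += rewrite(X,Y,Z) * phrase(Y,I,Mid) * phrase(Z,Mid,J)
    for i in range(0, n - width + 1):
        j = i + width
        for mid in range(i + 1, j):
            for rule, p in rewrite.items():
                if len(rule) == 3:
                    x, y, z = rule
                    phrase[(x, i, j)] += p * phrase[(y, i, mid)] * phrase[(z, mid, j)]

goal = phrase[("s", 0, n)]                   # goal += phrase(s,0,sentence_length)
print(goal)                                  # 0.7 for this toy example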
26
Visual debugger Browse the proof forest
27
Visual debugger Browse the proof forest
28
Parameterization
phrase(X,I,J) += rewrite(X,W) * word(W,I,J).
phrase(X,I,J) += rewrite(X,Y,Z) * phrase(Y,I,Mid) * phrase(Z,Mid,J).
goal += phrase(s,0,sentence_length).
  • rewrite(X,Y,Z) doesn't have to be an atomic
    parameter
  • urewrite(X,Y,Z) *= weight1(X,Y).
  • urewrite(X,Y,Z) *= weight2(X,Z).
  • urewrite(X,Y,Z) *= weight3(Y,Z).
  • urewrite(X,Same,Same) *= weight4.
  • urewrite(X) += urewrite(X,Y,Z).
    normalizing constant
  • rewrite(X,Y,Z) = urewrite(X,Y,Z) / urewrite(X).
    normalize

29
Related algorithms in Dyna?
phrase(X,I,J) += rewrite(X,W) * word(W,I,J).
phrase(X,I,J) += rewrite(X,Y,Z) * phrase(Y,I,Mid) * phrase(Z,Mid,J).
goal += phrase(s,0,sentence_length).
  • Viterbi parsing?
  • Logarithmic domain?
  • Lattice parsing?
  • Incremental (left-to-right) parsing?
  • Log-linear parsing?
  • Lexicalized or synchronous parsing?
  • Binarized CKY?
  • Earley's algorithm?

30
Related algorithms in Dyna?
phrase(X,I,J) += rewrite(X,W) * word(W,I,J).
phrase(X,I,J) += rewrite(X,Y,Z) * phrase(Y,I,Mid) * phrase(Z,Mid,J).
goal += phrase(s,0,sentence_length).
(Viterbi: change += to max= in each rule)
  • Viterbi parsing?
  • Logarithmic domain?
  • Lattice parsing?
  • Incremental (left-to-right) parsing?
  • Log-linear parsing?
  • Lexicalized or synchronous parsing?
  • Binarized CKY?
  • Earley's algorithm?

31
Related algorithms in Dyna?
phrase(X,I,J) += rewrite(X,W) * word(W,I,J).
phrase(X,I,J) += rewrite(X,Y,Z) * phrase(Y,I,Mid) * phrase(Z,Mid,J).
goal += phrase(s,0,sentence_length).
(Viterbi: change += to max= in each rule)
(log domain: change each * to +, and += to log+=)
  • Viterbi parsing?
  • Logarithmic domain?
  • Lattice parsing?
  • Incremental (left-to-right) parsing?
  • Log-linear parsing?
  • Lexicalized or synchronous parsing?
  • Binarized CKY?
  • Earley's algorithm?

32
Related algorithms in Dyna?
phrase(X,I,J) += rewrite(X,W) * word(W,I,J).
phrase(X,I,J) += rewrite(X,Y,Z) * phrase(Y,I,Mid) * phrase(Z,Mid,J).
goal += phrase(s,0,sentence_length).
  • Viterbi parsing?
  • Logarithmic domain?
  • Lattice parsing?
  • Incremental (left-to-right) parsing?
  • Log-linear parsing?
  • Lexicalized or synchronous parsing?
  • Binarized CKY?
  • Earley's algorithm?

(figure: for lattice parsing, word axioms generalize from
c[word("Pierre",0,1)] = 1 to weighted lattice arcs, e.g.,
word("Pierre", state(5), state(9)) = 0.2,
word("P", state(5), state(8)) = 0.5,
word("air", state(8), state(9)) = 0.3)
33
Related algorithms in Dyna?
phrase(X,I,J) += rewrite(X,W) * word(W,I,J).
phrase(X,I,J) += rewrite(X,Y,Z) * phrase(Y,I,Mid) * phrase(Z,Mid,J).
goal += phrase(s,0,sentence_length).
  • Viterbi parsing?
  • Logarithmic domain?
  • Lattice parsing?
  • Incremental (left-to-right) parsing?
  • Log-linear parsing?
  • Lexicalized or synchronous parsing?
  • Binarized CKY?
  • Earley's algorithm?

Just add words one at a time to the chart.
Check at any time what can be derived from the
words so far. Similarly, dynamic grammars.
34
Related algorithms in Dyna?
phrase(X,I,J) += rewrite(X,W) * word(W,I,J).
phrase(X,I,J) += rewrite(X,Y,Z) * phrase(Y,I,Mid) * phrase(Z,Mid,J).
goal += phrase(s,0,sentence_length).
  • Viterbi parsing?
  • Logarithmic domain?
  • Lattice parsing?
  • Incremental (left-to-right) parsing?
  • Log-linear parsing?
  • Lexicalized or synchronous parsing?
  • Binarized CKY?
  • Earley's algorithm?

Again, no change to the Dyna program
35
Related algorithms in Dyna?
phrase(X,I,J) += rewrite(X,W) * word(W,I,J).
phrase(X,I,J) += rewrite(X,Y,Z) * phrase(Y,I,Mid) * phrase(Z,Mid,J).
goal += phrase(s,0,sentence_length).
  • Viterbi parsing?
  • Logarithmic domain?
  • Lattice parsing?
  • Incremental (left-to-right) parsing?
  • Log-linear parsing?
  • Lexicalized or synchronous parsing?
  • Binarized CKY?
  • Earley's algorithm?

Basically, just add extra arguments to the terms
above
36
Related algorithms in Dyna?
phrase(X,I,J) += rewrite(X,W) * word(W,I,J).
phrase(X,I,J) += rewrite(X,Y,Z) * phrase(Y,I,Mid) * phrase(Z,Mid,J).
goal += phrase(s,0,sentence_length).
  • Viterbi parsing?
  • Logarithmic domain?
  • Lattice parsing?
  • Incremental (left-to-right) parsing?
  • Log-linear parsing?
  • Lexicalized or synchronous parsing?
  • Binarized CKY?
  • Earley's algorithm?

37
Rule binarization
phrase(X,I,J) += phrase(Y,I,Mid) * phrase(Z,Mid,J) * rewrite(X,Y,Z).
(figure: X over [I,J] is built from Y over [I,Mid] and Z over [Mid,J])
38
Rule binarization
phrase(X,I,J) += phrase(Y,I,Mid) * phrase(Z,Mid,J) * rewrite(X,Y,Z).
graphical models, constraint programming, multi-way
database join
39
Program transformations
cool model
Blatz & Eisner (FG 2007): Lots of
equivalent ways to write a system of
equations! Transforming from one to another
may improve efficiency. Many parsing tricks
can be generalized into automatic
transformations that help other programs, too!
practical equations
PCFG
pseudocode (execution order)
tuned C++ implementation (data structures, etc.)
for width from 2 to n:  for i from 0 to n-width:
k = i+width;  for j from i+1 to k-1: ...
40
Related algorithms in Dyna?
phrase(X,I,J) += rewrite(X,W) * word(W,I,J).
phrase(X,I,J) += rewrite(X,Y,Z) * phrase(Y,I,Mid) * phrase(Z,Mid,J).
goal += phrase(s,0,sentence_length).
  • Viterbi parsing?
  • Logarithmic domain?
  • Lattice parsing?
  • Incremental (left-to-right) parsing?
  • Log-linear parsing?
  • Lexicalized or synchronous parsing?
  • Binarized CKY?
  • Earley's algorithm?

41
Earley's algorithm in Dyna
phrase(X,I,J) += rewrite(X,W) * word(W,I,J).
phrase(X,I,J) += rewrite(X,Y,Z) * phrase(Y,I,Mid) * phrase(Z,Mid,J).
goal += phrase(s,0,sentence_length).
magic templates transformation (as noted by
Minnen 1996)
42
Related algorithms in Dyna?
phrase(X,I,J) += rewrite(X,W) * word(W,I,J).
phrase(X,I,J) += rewrite(X,Y,Z) * phrase(Y,I,Mid) * phrase(Z,Mid,J).
goal += phrase(s,0,sentence_length).
  • Viterbi parsing?
  • Logarithmic domain?
  • Lattice parsing?
  • Incremental (left-to-right) parsing?
  • Log-linear parsing?
  • Lexicalized or synchronous parsing?
  • Binarized CKY?
  • Earley's algorithm?
  • Epsilon symbols?

word(epsilon,I,I) = 1.  (i.e., epsilons are freely
available everywhere)
43
Some examples from my lab (as of 2006,
w/prototype)
  • Parsing using
  • factored dependency models (Dreyer, Smith &
    Smith CONLL'06)
  • with annealed risk minimization (Smith & Eisner
    EMNLP'06)
  • constraints on dependency length (Eisner & Smith
    IWPT'05)
  • unsupervised learning of deep transformations (see
    Eisner EMNLP'02)
  • lexicalized algorithms (see Eisner & Satta
    ACL'99, etc.)
  • Grammar induction using
  • partial supervision (Dreyer & Eisner EMNLP'06)
  • structural annealing (Smith & Eisner ACL'06)
  • contrastive estimation (Smith & Eisner GIA'05)
  • deterministic annealing (Smith & Eisner ACL'04)
  • Machine translation using
  • Very large neighborhood search of
    permutations (Eisner & Tromble, NAACL-W'06)
  • Loosely syntax-based MT (Smith & Eisner in
    prep.)
  • Synchronous cross-lingual parsing (Smith & Smith
    EMNLP'04)
  • Finite-state methods for morphology, phonology,
    IE, even syntax
  • Unsupervised cognate discovery (Schafer &
    Yarowsky '05, '06)
  • Unsupervised log-linear models via contrastive
    estimation (Smith & Eisner ACL'05)
  • Context-based morph. disambiguation (Smith,
    Smith & Tromble EMNLP'05)

Easy to try stuff out! Programs are very short &
easy to change!
- see also Eisner ACL'03
44
Can it express everything in NLP?
  • Remember, it integrates tightly with C++, so you
    only have to use it where it's helpful, and write
    the rest in C++. Small is beautiful.
  • Of course, it is Turing-complete.

45
One Execution Strategy (forward chaining)
46
How you build a system (big picture slide)
cool model
practical equations
PCFG
Propagate updates from right-to-left through the
equations. a.k.a. agenda algorithm, forward
chaining, bottom-up inference, semi-naïve
bottom-up
pseudocode (execution order)
tuned C++ implementation (data structures, etc.)
for width from 2 to n:  for i from 0 to n-width:
k = i+width;  for j from i+1 to k-1: ...
use a general method
47
Bottom-up inference
agenda of pending updates
rules of program:
  s(I,K) += np(I,J) * vp(J,K)
  pp(I,K) += prep(I,J) * np(J,K)
chart of derived items with current values:
  np(3,5) = 0.1, vp(5,7) = 0.7, vp(5,9) = 0.5, prep(2,3) = 1.0, ...
We pop the update np(3,5) += 0.3: what else must therefore change?
(If np(3,5) hadn't been in the chart already, we would have added it.)
Query vp(5,K): matches vp(5,7) = 0.7 and vp(5,9) = 0.5,
  so push s(3,7) += 0.21 and s(3,9) += 0.15 onto the agenda.
Query prep(I,3): matches prep(2,3) = 1.0, so push pp(2,5) += 0.3;
  no more matches to this query.
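A minimal Python sketch of this agenda loop, hard-coding the two rules and the chart values shown above, reproduces the same three pending updates (the data structures here are illustrative assumptions, not the compiler's).

from collections import defaultdict

chart = defaultdict(float)
chart.update({("np", 3, 5): 0.1, ("vp", 5, 7): 0.7, ("vp", 5, 9): 0.5, ("prep", 2, 3): 1.0})

agenda = [(("np", 3, 5), 0.3)]            # the update np(3,5) += 0.3

while agenda:
    (label, i, j), delta = agenda.pop()
    chart[(label, i, j)] += delta         # apply the update to the chart
    if label == "np":
        # s(I,K) += np(I,J) * vp(J,K): query vp(j, K)
        for (l2, j2, k), v in list(chart.items()):
            if l2 == "vp" and j2 == j:
                agenda.append((("s", i, k), delta * v))
        # pp(I,K) += prep(I,J) * np(J,K): query prep(I2, i)
        for (l2, i2, j2), v in list(chart.items()):
            if l2 == "prep" and j2 == i:
                agenda.append((("pp", i2, j), v * delta))

print(chart[("s", 3, 7)], chart[("s", 3, 9)], chart[("pp", 2, 5)])   # 0.21 0.15 0.3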
48
How you build a system (big picture slide)
cool model
practical equations
PCFG
What's going on under the hood?
pseudocode (execution order)
tuned C++ implementation (data structures, etc.)
for width from 2 to n:  for i from 0 to n-width:
k = i+width;  for j from i+1 to k-1: ...
49
Compiler provides
agenda of pending updates
rules of program
s(I,K) += np(I,J) * vp(J,K)
np(3,5) += 0.3
copy, compare, hash terms fast, via
integerization (interning)
efficient storage of terms (given static type
info) (implicit storage, symbiotic storage,
various data structures, support for
indices, stack vs. heap, ...)
chart of derived items with current values
50
Beware double-counting!
agenda of pending updates
combining with itself
rules of program
n(I,K) += n(I,J) * n(J,K)
n(5,5) = 0.2
n(5,5)?
n(5,5) += 0.3
to make another copy of itself
epsilon constituent
chart of derived items with current values
51
More issues in implementing inference
  • Handling non-distributive updates
  • Replacement
  • p max= q(X).  what if the current max, q(0), is
    reduced?
  • Retraction
  • p max= q(X).  what if q(0) becomes unprovable
    (no value)?
  • Non-distributive rules
  • p += 1/q(X).  adding Δ to q(0) doesn't simply
    add to p
  • Backpointers (hyperedges in the derivation
    forest)
  • Efficient storage, or on-demand recomputation
  • Information flow between f(3), f(int X), f(X)

52
More issues in implementing inference
  • User-defined priorities
  • priority(phrase(X,I,J)) = -(J-I).  CKY (narrow
    to wide)
  • priority(phrase(X,I,J)) = phrase(X,I,J).
    uniform-cost
  • Can we learn a good priority function? (can be
    dynamic)
  • User-defined parallelization
  • host(phrase(X,I,J)) = J.
  • Can we learn a host-choosing function? (can be
    dynamic)
  • User-defined convergence tests

(multiply in heuristic(X,I,J) for A* search)
53
More issues in implementing inference
  • Time-space tradeoffs
  • Which queries to index, and how?
  • Selective or temporary memoization
  • Can we learn a policy?
  • On-demand computation (backward chaining)
  • Prioritizing subgoals, query planning
  • Safely invalidating memos
  • Mixing forward-chaining and backward-chaining
  • Can we choose a good mixed strategy?

54
Parameter training
objective function as a theorem's value
  • Maximize some objective function.
  • Use Dyna to compute the function.
  • Then how do you differentiate it?
  • for gradient ascent,conjugate gradient, etc.
  • gradient of log-partition function also tells
    us the expected counts for EM

e.g., inside algorithm computes likelihood of the
sentence
  • Two approaches are supported:
  • Tape algorithm: remember agenda order and run it
    backwards.
  • Program transformation: automatically derive the
    outside formulas.

55
Automatic differentiation via the gradient
transform
  • a += b*c.  ⇒
  • Now g(x) denotes ∂f/∂x, f being the objective
    func.
  • Examples
  • Backprop for neural networks
  • Backward algorithm for HMMs and CRFs
  • Outside algorithm for PCFGs
  • g(b) += g(a) * c.
  • g(c) += g(a) * b.

Dyna implementation also supports tape-based
differentiation.
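As a sanity check, here is a tiny Python sketch of reverse-mode differentiation for a single rule a += b*c, assuming the standard reverse-mode recipe (g(b) += g(a)*c and g(c) += g(a)*b), compared against finite differences; the values 2.0 and 5.0 are made up.

b, c = 2.0, 5.0
a = b * c                 # forward pass: a += b*c

g_a = 1.0                 # d(objective)/d(a), taking the objective to be a itself
g_b = g_a * c             # gradient rule: g(b) += g(a) * c
g_c = g_a * b             # gradient rule: g(c) += g(a) * b

eps = 1e-6                # finite-difference check
print(g_b, ((b + eps) * c - a) / eps)   # both approximately 5.0
print(g_c, (b * (c + eps) - a) / eps)   # both approximately 2.0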
56
More on Program Transformations
57
Program transformations
  • An optimizing compiler would like the freedom to
    radically rearrange your code.
  • Easier in a declarative language than in C.
  • Don't need to reconstruct the source program's
    intended semantics.
  • Also, the source program is much shorter.
  • Search problem (open): Find a good sequence of
    transformations (on a given workload).

58
Variable elimination
  • Dechter's bucket elimination for hard
    constraints
  • But how do we do it for soft constraints?
  • How do we join soft constraints?

Bucket E:  E ≠ D,  E ≠ C
Bucket D:  D ≠ A
Bucket C:  C ≠ B
Bucket B:  B ≠ A
Bucket A:
join all constraints in E's bucket,
yielding a new constraint on D (and C);
now join all constraints in D's bucket
figure thanks to Rina Dechter
59
Variable elimination via a folding transform
  • goal max= f1(A,B)*f2(A,C)*f3(A,D)*f4(C,E)*f5(D,E).
  • tempE(C,D)
  • tempE(C,D) max= f4(C,E)*f5(D,E).
  • Undirected graphical model

to eliminate E, join constraints mentioning
E, and project E out
figure thanks to Rina Dechter
60
Variable elimination via a folding transform
  • goal max= f1(A,B)*f2(A,C)*f3(A,D)*tempE(C,D).
  • tempD(A,C)
  • tempD(A,C) max= f3(A,D)*tempE(C,D).
  • tempE(C,D) max= f4(C,E)*f5(D,E).
  • Undirected graphical model

to eliminate D, join constraints mentioning
D, and project D out
figure thanks to Rina Dechter
61
Variable elimination via a folding transform
  • goal max= f1(A,B)*f2(A,C)*tempD(A,C).
  • tempC(A)
  • tempC(A) max= f2(A,C)*tempD(A,C).
  • tempD(A,C) max= f3(A,D)*tempE(C,D).
  • tempE(C,D) max= f4(C,E)*f5(D,E).
  • Undirected graphical model


figure thanks to Rina Dechter
62
Variable elimination via a folding transform
  • goal max= tempC(A)*f1(A,B).
  • tempB(A) max= f1(A,B).
  • tempC(A) max= f2(A,C)*tempD(A,C).
  • tempD(A,C) max= f3(A,D)*tempE(C,D).
  • tempE(C,D) max= f4(C,E)*f5(D,E).
  • Undirected graphical model


tempB(A)
figure thanks to Rina Dechter
63
Variable elimination via a folding transform
  • goal max= tempC(A)*tempB(A).
  • tempB(A) max= f1(A,B).
  • tempC(A) max= f2(A,C)*tempD(A,C).
  • tempD(A,C) max= f3(A,D)*tempE(C,D).
  • tempE(C,D) max= f4(C,E)*f5(D,E).
  • Undirected graphical model


could replace max= with += throughout, to compute
partition function Z
figure thanks to Rina Dechter
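A small Python sketch of one elimination step under the max semiring, with made-up factors f4 and f5 over a two-value domain, shows how tempE(C,D) is built by maximizing out E.

import itertools

vals = [0, 1]                      # tiny domains for illustration
f4 = {(c, e): 0.1 + 0.2 * c + 0.3 * e for c in vals for e in vals}   # f4(C,E)
f5 = {(d, e): 0.4 + 0.1 * d * e for d in vals for e in vals}          # f5(D,E)

tempE = {}                         # tempE(C,D) max= f4(C,E) * f5(D,E)
for c, d in itertools.product(vals, vals):
    tempE[(c, d)] = max(f4[(c, e)] * f5[(d, e)] for e in vals)

print(tempE)                       # the new factor that replaces f4 and f5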
64
Grammar specialization as an unfolding transform
  • phrase(X,I,J) += rewrite(X,Y,Z) * phrase(Y,I,Mid)
    * phrase(Z,Mid,J).
  • rewrite(s,np,vp) = 0.7.
  • phrase(s,I,J) += 0.7 * phrase(np,I,Mid)
    * phrase(vp,Mid,J).
  • s(I,J) += 0.7 * np(I,Mid)
    * vp(Mid,J).

unfolding
term flattening
(actually handled implicitly by subtype storage
declarations)
65
On-demand computation via a magic templates
transform
  • a :- b, c.  ⇒
  • Examples
  • Earley's algorithm for parsing
  • Left-corner filter for parsing
  • On-the-fly composition of FSTs
  • The weighted generalization turns out to be the
    generalized A* algorithm (coarse-to-fine
    search).
  • a :- magic(a), b, c.
  • magic(b) :- magic(a).
  • magic(c) :- magic(a), b.

66
Speculation transformation(generalization of
folding)
  • Perform some portion of computation
    speculatively, before we have all the inputs we
    need
  • Fill those inputs in later
  • Examples from parsing
  • Gap passing in categorial grammar
  • Build an S/NP (a sentence missing its direct
    object NP)
  • Transform a parser so that it preprocesses the
    grammar
  • E.g., unary rule closure or epsilon closure
  • Build phrase(np,I,K) from a phrase(s,I,K) we
    don't have yet (so we haven't yet chosen a
    particular I, K)
  • Transform lexical context-free parsing from O(n^5)
    to O(n^3)
  • Add left children to a constituent we don't have
    yet (without committing to its width)
  • Derive Eisner & Satta (1999) algorithm

67
A few more language details
  • So you'll understand the examples

68
Terms (generalized from Prolog)
  • These are the Objects of the language
  • Primitives
  • 3, 3.14159, myUnicodeString
  • user-defined primitive types
  • Variables
  • X
  • int X: a type-restricted variable; types are tree
    automata
  • Compound terms
  • atom
  • atom(subterm1, subterm2, ...) e.g.,
    f(g(h(3),X,Y), Y)
  • Adding support for keyword arguments (similar to
    R, but must support unification)

69
Fixpoint semantics
  • A Dyna program is a finite rule set that defines
    a partial function (map)
  • Map only defines values for ground terms
  • Variables (X, Y, ...) let us define values for infinitely
    many ground terms
  • Compute a map that satisfies the equations in the
    program
  • Not guaranteed to halt (Dyna is Turing-complete,
    unlike Datalog)
  • Not guaranteed to be unique

Map
70
Fixpoint semantics
  • A Dyna program is a finite rule set that defines
    a partial function (map)
  • Map only defines values for ground terms
  • Map may accept modifications at runtime
  • Runtime input
  • Adjustments to input (dynamic algorithms)
  • Retraction (remove input), detachment (forget
    input but preserve output)

Map
71
Object-oriented features
  • Maps are terms, i.e., first-class objects
  • Maps can appear as subterms or as values
  • Useful for encapsulating data and passing it
    around
  • fst3 = compose(fst1, fst2).  value of fst3 is
    a chart
  • forest = parse(sentence).
  • Typed by their public interface
  • fst4->edge(Q,R) = fst3->edge(R,Q).
  • Maps can be stored in files and loaded from files
  • Human-readable format (looks like a Dyna program)
  • Binary format (mimics in-memory layout)

72
Functional features Auto-evaluation
  • Terms can have values.
  • So by default, subterms are evaluated in place.
  • Arranged by a simple desugaring transformation
  • foo( X ) = 3*bar(X).
  • ⇒ foo( X ) = Result  where  B is bar(X), Result is
    3*B.
  • Possible to suppress evaluation of f(x) or to force
    it
  • Some contexts also suppress evaluation.
  • Variables are replaced with their bindings but
    not otherwise evaluated.

2 things to evaluate here: bar and *
73
Functional features Auto-evaluation
  • Terms can have values.
  • So by default, subterms are evaluated in place.
  • Arranged by a simple desugaring transformation
  • foo(f(X)) = 3*bar(g(X)).
  • ⇒ foo( F ) = Result  where
  • Possible to suppress evaluation of f(x) or to force
    it
  • Some contexts also suppress evaluation.
  • Variables are replaced with their bindings but
    not otherwise evaluated.

F is f(X), G is g(X), B is bar(G), Result is
3*B.
74
Other handy features
  • fact(0) = 1.
  • fact(int N) = N * fact(N-1)  if N > 0.
  • 0! = 1.
  • (int N)! = N * (N-1)!  if N ≥ 1.

user-defined syntactic sugar
Unicode
75
Aggregation operators
  • f(X) = 3.       immutable
  • f(X) += 3.      can be incremented later
  • f(X) min= 3.    can be reduced later
  • f(X) := 3.      can be arbitrarily changed
    later
  • f(X) > 3.       like := but can be overridden
    by a more specific rule

76
Aggregation operators
  • f(X) := 1.      can be arbitrarily changed
    later
  • Non-monotonic reasoning
  • flies(bird X) := true.
  • flies(bird X) := penguin(X), false.   overrides
  • flies(bigbird) := false.
    also overrides
  • Iterative update algorithms (EM, Gibbs, BP)
  • a := init_a.
  • a := updated_a(b).   will override once b is
    proved
  • b := updated_b(a).

77
Declarations (ultimately, should be chosen
automatically)
  • at term level
  • lazy vs. eager computational strategies
  • memoization and flushing strategies
  • prioritization, parallelization, etc.
  • at class level
  • class = an implementation of a type
  • type = some subset of the term universe
  • class specifies storage strategy
  • classes may implement overlapping types

78
Frozen variables
  • Dyna map semantics concerns ground terms.
  • But want to be able to reason about non-ground
    terms, too.
  • Manipulate Dyna rules (which are non-ground
    terms)
  • Work with classes of ground terms (specified by
    non-ground terms)
  • Queries, memoized queries
  • Memoization, updating, prioritization of updates,
  • So, allow ground terms that contain frozen
    variables
  • Treatment under unification is beyond scope of
    this talk
  • priority(f(X)) = f(X).   for each X
  • priority(f(X)) = infinity.   frozen
    non-ground term

79
Gensyms
80
Some More Examples
  • Shortest paths
  • Neural nets
  • Vector-space IR
  • FST composition
  • Generalized A* parsing
  • n-gram smoothing
  • Arc consistency
  • Game trees
  • Edit distance
81
Path-finding in Prolog
  • pathto(1).   the start of all paths
    pathto(V) :- edge(U,V), pathto(U).
  • When is the query pathto(14) really inefficient?
  • What's wrong with this swapped version?
  • pathto(V) :- pathto(U), edge(U,V).
82
Shortest paths in Dyna
  • Single source
  • pathto(start) min= 0.
  • pathto(W) min= pathto(V) + edge(V,W).
  • All pairs
  • path(U,U) min= 0.
  • path(U,W) min= path(U,V) + edge(V,W).
  • This hint gives Dijkstra's algorithm (pqueue):
  • priority(pathto(V) min= Delta) = Delta.
  • Must also declare that pathto(V) has converged as
    soon as it pops off the priority queue; this is
    true if the heuristic is admissible.

can change min= to += to sum over paths (e.g.,
PageRank)
(for A*, add heuristic(V) to the priority)
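A minimal Python sketch of the single-source program with the Dijkstra hint: the agenda is a min-heap keyed by Delta, and an item is frozen the first time it pops (the graph below is a made-up example).

import heapq

edge = {("s", "a"): 1.0, ("s", "b"): 4.0, ("a", "b"): 2.0, ("b", "t"): 1.0}
start = "s"

pathto = {}                               # converged values
agenda = [(0.0, start)]                   # pathto(start) min= 0
while agenda:
    delta, v = heapq.heappop(agenda)
    if v in pathto:                       # already converged (popped once already)
        continue
    pathto[v] = delta
    for (u, w), cost in edge.items():     # pathto(W) min= pathto(V) + edge(V,W)
        if u == v:
            heapq.heappush(agenda, (delta + cost, w))

print(pathto)                             # {'s': 0.0, 'a': 1.0, 'b': 3.0, 't': 4.0}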
83
Neural networks in Dyna
  • out(Node) = sigmoid(in(Node)).
  • sigmoid(X) = 1/(1+exp(-X)).
  • in(Node) += weight(Node,Previous)*out(Previous).
  • in(Node) += input(Node).
  • error += (out(Node)-target(Node))**2.

Recurrent neural net is ok
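A small Python sketch of one forward pass through these rules, for a made-up feed-forward wiring (the weights, inputs, and target below are assumptions for illustration only).

import math

weight = {("h", "x1"): 0.5, ("h", "x2"): -0.3, ("y", "h"): 1.2}   # weight(Node,Previous)
inputs = {"x1": 1.0, "x2": 2.0}                                    # input(Node)
target = {"y": 1.0}

def sigmoid(x):                       # sigmoid(X) = 1/(1+exp(-X))
    return 1.0 / (1.0 + math.exp(-x))

out = {}
for node in ["x1", "x2", "h", "y"]:   # process in topological order
    in_val = inputs.get(node, 0.0)    # in(Node) += input(Node)
    for (n, prev), w in weight.items():
        if n == node:
            in_val += w * out[prev]   # in(Node) += weight(Node,Previous)*out(Previous)
    out[node] = sigmoid(in_val)       # out(Node) = sigmoid(in(Node))

error = sum((out[n] - t) ** 2 for n, t in target.items())   # error += (out-target)^2
print(out["y"], error)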
84
Vector-space IR in Dyna
  • bestscore(Query) max= score(Query,Doc).
  • score(Query,Doc) += tf(Query,Word)*tf(Doc,Word)*
    idf(Word).
  • idf(Word) = 1/log(df(Word)).
  • df(Word) += 1 whenever tf(Doc,Word) > 0.

85
Weighted FST composition in Dyna(epsilon-free
case)
  • start(A o B) = start(A) x start(B).
  • stop(A o B, Q x R) = stop(A, Q) * stop(B, R).
  • arc(A o B, Q1 x R1, Q2 x R2, In, Out) += arc(A,
    Q1, Q2, In, Match) * arc(B, R1, R2, Match,
    Out).   (see the sketch below)
  • Computes full cross-product.
  • Use magic templates transform to build only
    reachable states.

86
n-gram smoothing in Dyna
  • These values all update automatically during
    leave-one-out jackknifing.
  • mle_prob(X,Y,Z) = count(X,Y,Z)/count(X,Y).
  • smoothed_prob(X,Y,Z) = λ*mle_prob(X,Y,Z) +
    (1-λ)*mle_prob(Y,Z).
  • for arbitrary-length contexts, could use lists
  • count_of_count(X,Y,count(X,Y,Z)) += 1.
  • Used for Good-Turing and Kneser-Ney smoothing.
  • E.g., count_of_count("the", "big", 1) is the number
    of word types that appeared exactly once after
    "the big". (see the sketch below)
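A tiny numeric sketch of the interpolation rule, with made-up counts and λ = 0.75 (names and numbers are illustrative assumptions only).

count = {("the", "big", "dog"): 2, ("the", "big"): 10, ("big", "dog"): 3, ("big",): 20}
lam = 0.75

def mle3(x, y, z): return count.get((x, y, z), 0) / count[(x, y)]   # trigram MLE
def mle2(y, z):    return count.get((y, z), 0) / count[(y,)]        # bigram MLE

smoothed = lam * mle3("the", "big", "dog") + (1 - lam) * mle2("big", "dog")
print(smoothed)    # 0.75*0.2 + 0.25*0.15 = 0.1875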

87
Arc consistency (= 2-consistency)
Agenda algorithm
(figure, thanks to Rina Dechter, modified: variables X, Y, Z, T each
range over {1,2,3}, with pairwise constraints among them, the last
being X < T)
X=3 has no support in Y, so kill it off
Y=1 has no support in X, so kill it off
Z=1 just lost its only support in Y, so kill it off
Note: these steps can occur in somewhat arbitrary order
88
Arc consistency in Dyna (AC-4 algorithm)
  • Axioms (alternatively, could define them by
    rule)
  • indomain(Var:Val)   define some
    values true
  • consistent(Var:Val, Var2:Val2)
  • Define to be true or false if Var, Var2 are
    co-constrained.
  • Otherwise, leave undefined (or define as true).
  • For Var:Val to be kept, Val must be in-domain and
    also not ruled out by any Var2 that cares
  • possible(Var:Val) &= indomain(Var:Val).
  • possible(Var:Val) &= supported(Var:Val, Var2).
  • Var2 cares if it's co-constrained with Var:Val
  • supported(Var:Val, Var2) |=
    consistent(Var:Val, Var2:Val2) &
    possible(Var2:Val2).

89
Propagating bounds consistency in Dyna
  • E.g., suppose we have a constraint A < B (as
    well as other constraints on A). Then
  • maxval(a) min= maxval(b).
  • if B's max is reduced, then A's should be
    too
  • minval(b) max= minval(a).   by symmetry
  • Similarly, if C+D = 10, then
  • maxval(c) min= 10-minval(d).
  • maxval(d) min= 10-minval(c).
  • minval(c) max= 10-maxval(d).
  • minval(d) max= 10-maxval(c).

90
Game-tree analysis
  • All values represent total advantage to player 1
    starting at this board.
  • how good is Board for player 1, if it's player
    1's move?
  • best(Board) max= stop(player1, Board).
  • best(Board) max= move(player1, Board, NewBoard)
    + worst(NewBoard).
  • how good is Board for player 1, if it's player
    2's move? (player 2 is trying to make player 1
    lose: zero-sum game)
  • worst(Board) min= stop(player2, Board).
  • worst(Board) min= move(player2, Board,
    NewBoard) + best(NewBoard).
  • How good for player 1 is the starting board?
  • goal = best(Board) if start(Board).  (see the sketch below)
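A minimal Python sketch of the same mutual recursion on a made-up game tree (per-move rewards are omitted here, so a move simply passes the opponent's value up).

moves1 = {"root": ["L", "R"]}                 # move(player1, Board, NewBoard)
moves2 = {"L": ["LL", "LR"], "R": ["RL"]}     # move(player2, Board, NewBoard)
stop1  = {"LL": 3, "LR": -1, "RL": 5}         # stop(player1, Board): terminal values
stop2  = {}                                   # stop(player2, Board)

def best(board):                              # player 1 to move: maximize
    options = [stop1[board]] if board in stop1 else []
    options += [worst(nb) for nb in moves1.get(board, [])]
    return max(options)

def worst(board):                             # player 2 to move: minimize
    options = [stop2[board]] if board in stop2 else []
    options += [best(nb) for nb in moves2.get(board, [])]
    return min(options)

print(best("root"))    # max(min(3, -1), min(5)) = 5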

91
Edit distance between two strings
Traditional picture
92
Edit distance in Dyna
  • dist([], []) = 0.
  • dist([X|Xs],Ys) min= dist(Xs,Ys) + delcost(X).
  • dist(Xs,[Y|Ys]) min= dist(Xs,Ys) + inscost(Y).
  • dist([X|Xs],[Y|Ys]) min= dist(Xs,Ys) +
    substcost(X,Y).
  • substcost(L,L) = 0.
  • result = align([c, l, a, r, a], [c,
    a, c, a]).  (see the sketch below)
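A small Python sketch of the same recurrences with unit costs, computing the distance between "clara" and "caca" by dynamic programming.

def edit_distance(xs, ys, delcost=1, inscost=1, subcost=1):
    n, m = len(xs), len(ys)
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0] = dist[i - 1][0] + delcost            # delete all of xs
    for j in range(1, m + 1):
        dist[0][j] = dist[0][j - 1] + inscost            # insert all of ys
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if xs[i - 1] == ys[j - 1] else subcost   # substcost(L,L) = 0
            dist[i][j] = min(dist[i - 1][j] + delcost,
                             dist[i][j - 1] + inscost,
                             dist[i - 1][j - 1] + sub)
    return dist[n][m]

print(edit_distance("clara", "caca"))   # 2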

93
Edit distance in Dyna on input lattices
  • dist(S,T) min= dist(S,T,Q,R) whenever S->final(Q)
    & T->final(R).
  • dist(S,T, S->start, T->start) min= 0.
  • dist(S,T, I2, J) min= dist(S,T, I, J)
    + S->arc(I,I2,X) + delcost(X).
  • dist(S,T, I, J2) min= dist(S,T, I, J)
    + T->arc(J,J2,Y) + inscost(Y).
  • dist(S,T, I2,J2) min= dist(S,T, I, J)
    + S->arc(I,I2,X) + T->arc(J,J2,Y)
    + substcost(X,Y).
  • substcost(L,L) = 0.
  • result = dist(lattice1, lattice2).
  • lattice1: start = state(0).
  • arc(state(0),state(1),c) = 0.3.
  • arc(state(1),state(2),l) = 0.
  • final(state(5)).

94
Generalized A* parsing (CKY)
  • Get Viterbi outside probabilities.
  • Isomorphic to automatic differentiation
    (reverse mode).
  • outside(goal) = 1.
  • outside(Body) max= outside(Head)
    whenever rule(Head max= Body).
  • outside(phrase B) max= (phrase A) *
    outside((A,B)).
  • outside(phrase A) max= outside((A,B)) * (phrase
    B).
  • Prioritize by outside estimates from a coarsened
    grammar.
  • priority(phrase P) = (phrase P) * outside(coarsen(P)).
  • priority(phrase P) = 1 if P == coarsen(P).
    can't coarsen any further

95
Generalized A* parsing (CKY)
  • coarsen nonterminals:
  • coa("PluralNoun") = "Noun".
  • coa("Noun") = "Anything".
  • coa("Anything") = "Anything".
  • coarsen phrases:
  • coarsen(phrase(X,I,J)) = phrase(coa(X),I,J).
  • make successively coarser grammars
  • each is an admissible estimate for the
    next-finer one.
  • coarsen(rewrite(X,Y,Z)) = rewrite(coa(X),coa(Y),coa(Z)).
  • coarsen(rewrite(X,Word)) = rewrite(coa(X),Word).
  • coarsen(Rule) max= Rule.
  • i.e., Coarse max= Rule whenever
    Coarse == coarsen(Rule).

96
Lightweight information interchange?
  • Easy for Dyna terms to represent
  • XML data (Dyna types are analogous to DTDs)
  • RDF triples (semantic web)
  • Annotated corpora
  • Ontologies
  • Graphs, automata, social networks
  • Also provides facilities missing from semantic
    web
  • Queries against these data
  • State generalizations (rules, defaults) using
    variables
  • Aggregate data and draw conclusions
  • Keep track of provenance (backpointers)
  • Keep track of confidence (weights)
  • Map deductive database in a box
  • Like a spreadsheet, but more powerful, safer to
    maintain, and can communicate with outside world

97
How fast was the prototype version?
  • It used one-size-fits-all strategies
  • Asymptotically optimal, but
  • 4 times slower than Mark Johnson's inside-outside
  • 4-11 times slower than Klein & Manning's Viterbi
    parser
  • 5-6x speedup not too hard to get

98
Are you going to make it faster?
(yup!)
  • Static analysis
  • Mixed storage strategies
  • store X in an array
  • store Y in a hash
  • Mixed inference strategies
  • don't store Z (compute on demand)
  • Choose strategies by
  • User declarations
  • Automatically by execution profiling

99
Summary
  • AI systems are too hard to write and modify.
  • Need a new layer of abstraction.
  • Dyna is a language for computation (no I/O)
  • Simple, powerful idea
  • Define values from other values by weighted
    logic.
  • Produces classes that interface with C++, etc.
  • Compiler supports many implementation strategies
  • Tries to abstract and generalize many tricks
  • Fitting a strategy to the workload is a great
    opportunity for learning!
  • Natural fit to fine-grained parallelization
  • Natural fit to web services

100
Dyna contributors!
  • Prototype (available)
  • Eric Goldlust (core compiler), Noah A. Smith
    (parameter training), Markus Dreyer (front-end
    processing), David A. Smith, Roy Tromble,
    Asheesh Laroia
  • All-new version (under development)
  • Nathaniel Filardo (core compiler), Wren Ng
    Thornton (core compiler), Jay Van Der Wall
    (source language parser), John Blatz
    (transformations and inference), Johnny
    Graettinger (early design), Eric Northup (early
    design)
  • Dynasty hypergraph browser (usable)
  • Michael Kornbluh (initial version), Gordon
    Woodhull (graph layout), Samuel Huang (latest
    version), George Shafer, Raymond Buse,
    Constantinos Michael

101
FIN
102
the case for Little Languages
  • declarative programming
  • small is beautiful

103
Sapir-Whorf hypothesis
  • Language shapes thought
  • At least, it shapes conversation
  • Computer language shapes thought
  • At least, it shapes experimental research
  • Lots of cute ideas that we never pursue
  • Or if we do pursue them, it takes 6-12 months to
    implement on large-scale data
  • Have we turned into a lab science?

104
Declarative Specifications
  • State what is to be done
  • (How should the computer do it? Turn that over
    to a general solver that handles the
    specification language.)
  • Hundreds of domain-specific little languages
    out there. Some have sophisticated solvers.

105
dot (www.graphviz.org)
digraph g {
  graph [rankdir = "LR"];
  node [fontsize = "16", shape = "ellipse"];
  edge [];
  "node0" [label = "<f0> 0x10ba8 | <f1>", shape = "record"];
  "node1" [label = "<f0> 0xf7fc4380 | <f1> | <f2> -1", shape = "record"];
  "node0":f0 -> "node1":f0 [id = 0];
  "node0":f1 -> "node2":f0 [id = 1];
  "node1":f0 -> "node3":f0 [id = 2];
}
nodes
edges
What's the hard part? Making a nice
layout! Actually, it's NP-hard
106
dot (www.graphviz.org)
107
LilyPond (www.lilypond.org)
108
LilyPond (www.lilypond.org)
109
Declarative Specs in NLP
  • Regular expression (for a FST toolkit)
  • Grammar (for a parser)
  • Feature set (for a maxent distribution, SVM,
    etc.)
  • Graphical model (DBNs for ASR, IE, etc.)

Claim of this talk: Sometimes it's best to peek
under the shiny surface. Declarative methods are
still great, but should be layered; we need them
one level lower, too.
110
Declarative Specs in NLP
  • Regular expression (for a FST toolkit)
  • Grammar (for a parser)
  • Feature set (for a maxent distribution, SVM,
    etc.)

111
New examples of dynamic programming in NLP
  • Parameterized finite-state machines

112
Parameterized FSMs
  • An FSM whose arc probabilities depend on
    parameters: they are formulas.

113
Parameterized FSMs
  • An FSM whose arc probabilities depend on
    parameters: they are formulas.

114
Parameterized FSMs
  • An FSM whose arc probabilities depend on
    parameters: they are formulas.

Expert first: Construct the FSM (topology &
parameterization). Automatic takes over: Given
training data, find parameter values that
optimize arc probs.
115
Parameterized FSMs
Knight & Graehl 1997 - transliteration
116
Parameterized FSMs
Knight & Graehl 1997 - transliteration
Would like to get some of that expert knowledge
in here. Use probabilistic regexps like
(a*.7 b) +.5 (ab*.6). If the probabilities are
variables, (a*x b) +y (ab*z), then the arc weights
of the compiled machine are nasty formulas.
(Especially after minimization!)
117
Finite-State Operations
  • Projection GIVES YOU marginal distribution

domain( p(x,y) )
118
Finite-State Operations
  • Probabilistic union GIVES YOU mixture model

p(x) +0.3 q(x)
119
Finite-State Operations
  • Probabilistic union GIVES YOU mixture model

p(x) +λ q(x)
Learn the mixture parameter λ!
120
Finite-State Operations
  • Composition GIVES YOU chain rule

p(x|y)
o
p(y|z)
  • The most popular statistical FSM operation
  • Cross-product construction

121
Finite-State Operations
  • Concatenation, probabilistic closure:
    HANDLE unsegmented text

(figure: glue machines p(x), q(x), p(x) together;
probabilistic closure uses a loop probability such as 0.3)
  • Just glue together machines for the different
    segments, and let them figure out how to align
    with the text

122
Finite-State Operations
  • Directed replacement: MODELS noise or
    postprocessing

p(x,y)
o
  • Resulting machine compensates for noise or
    postprocessing

123
Finite-State Operations
  • Intersection GIVES YOU product models
  • e.g., exponential / maxent, perceptron, Naïve
    Bayes,
  • Need a normalization op too: computes Σx f(x),
    the pathsum or
    partition function

p(x) & q(x)
  • Cross-product construction (like composition)

124
Finite-State Operations
  • Conditionalization (new operation)

condit( p(x,y) )
  • Resulting machine can be composed with other
    distributions: p(y | x) o q(x)

125
New examples of dynamic programming in NLP
  • Parameterized infinite-state machines

126
Universal grammar as a parameterized FSA over an
infinite state space
127
New examples of dynamic programming in NLP
  • More abuses of finite-state machines

128
Huge-alphabet FSAs for OT phonology
Gen proposes all candidates that include this
input.
(figure: underlying and surface tiers of C and V slots,
with autosegmental features such as voi and velar linked to them)
129
Huge-alphabet FSAs for OT phonology
encode this candidate as a string:
at each moment, need to describe what's going
on on many tiers
(figure: one symbol per time slice, bundling the C/V slots
and features such as voi and velar on all tiers)
130
Directional Best Paths construction
  • Keep best output string for each input string
  • Yields a new transducer (size up to 3^n)

For input abc: output abc or axc.  For input abd: output axd.
Must allow the red arc just if the next input is d
131
Minimization of semiring-weighted FSAs
  • New definition of λ for pushing:
  • λ(q) = weight of the shortest path from
    q, breaking ties alphabetically on input
    symbols
  • Computation is simple, well-defined, independent
    of (K, ⊗)
  • Breadth-first search back from final states

Compute λ(q) in O(1) time as soon as we visit
q. Whole alg. is linear.
(figure: example automaton with arcs labeled a, b, c, d)
Faster than finding min-weight path à la Mohri.
distance 2
λ(q) = k ⊗ λ(r)
132
New examples of dynamic programming in NLP
  • Tree-to-tree alignment

133
Synchronous Tree Substitution Grammar
Two training trees, showing a free translation
from French to English.
beaucoup d'enfants donnent un baiser à Sam →
kids kiss Sam quite often
134
Synchronous Tree Substitution Grammar
Two training trees, showing a free translation
from French to English. A possible alignment is
shown in orange.
(figure: aligned tree nodes with glosses: donnent (give), baiser (kiss),
à (to), un (a), beaucoup (lots), d' (of), enfants (kids), Sam;
English side: kids, kiss, Sam, quite, often; NP nodes)
beaucoup d'enfants donnent un baiser à Sam →
kids kiss Sam quite often
135
Synchronous Tree Substitution Grammar
Two training trees, showing a free translation
from French to English. A possible alignment is
shown in orange. Alignment shows how trees are
generated synchronously from little trees ...
beaucoup d'enfants donnent un baiser à Sam →
kids kiss Sam quite often
136
New examples of dynamic programming in NLP
  • Bilexical parsing in O(n^3)
  • (with Giorgio Satta)

137
Lexicalized CKY
(figure: lexicalized parse of "Mary loves the girl outdoors")
138
Lexicalized CKY is O(n^5), not O(n^3)
... advocate visiting relatives
... hug visiting relatives
(figure: B spans [i,j], C spans [j+1,k]: already O(n^3)
combinations of i, j, k)
139
Idea 1
  • Combine B with what C?
  • must try different-width Cs (vary k)
  • must try differently-headed Cs (vary h)
  • Separate these!

140
Idea 1
(the old CKY way)
141
Idea 2
  • Some grammars allow

142
Idea 2
  • Combine what B and C?
  • must try different-width Cs (vary k)
  • must try different midpoints j
  • Separate these!

143
Idea 2
(the old CKY way)
144
Idea 2
(figure: the old CKY way, combining constituents headed at h
across midpoint j up to k)
145
An O(n^3) algorithm (with G. Satta)
(figure: parsing "Mary loves the girl outdoors" with the new algorithm)
146
(No Transcript)
147
New examples of dynamic programming in NLP
  • O(n)-time partial parsing by limiting dependency
    length
  • (with Noah A. Smith)

148
Short-Dependency Preference
  • A word's dependents (adjuncts, arguments)
  • tend to fall near it
  • in the string.

149
length of a dependency = surface distance
(figure: an example parse whose dependencies have lengths 3, 1, 1, 1)
150
50% of English dependencies have length 1,
another 20% have length 2, 10% have length 3 ...
(figure: fraction of all dependencies vs.
length)
151
Related Ideas
  • Score parses based on what's between a head and
    child
  • (Collins, 1997; Zeman, 2004; McDonald et al.,
    2005)
  • Assume short → faster human processing
  • (Church, 1980; Gibson, 1998)
  • "Attach low" heuristic for PPs (English)
  • (Frazier, 1979; Hobbs and Bear, 1990)
  • Obligatory and optional re-orderings (English)
  • (see paper)

152
Going to Extremes
Longer dependencies are less likely.
What if we eliminate them completely?
153
Hard Constraints
  • Disallow dependencies between words of distance >
    b ...
  • Risk: best parse contrived, or no parse at all!
  • Solution: allow fragments (partial parsing;
    Hindle, 1990, inter alia).
  • Why not model the sequence of fragments?

154
Building a Vine SBG Parser
  • Grammar generates sequence of trees from
  • Parser recognizes sequences of trees without
    long dependencies
  • Need to modify training data
  • so the model is consistent
  • with the parser.

155

(figure, from the Penn Treebank: a dependency tree for "According to
some estimates , the rule changes would cut insider filings by more
than a third ."; each dependency is labeled with its surface length,
e.g., 9, 8, 4, 3, 2, 1)
156

(figure: the same Penn Treebank tree with all dependencies of
length > 4 removed, i.e., b = 4)
157

(figure: the same Penn Treebank tree with b = 3)
158

(figure: the same Penn Treebank tree with b = 2)
159

(figure: the same Penn Treebank tree with b = 1)
160

(figure: the same Penn Treebank tree with b = 0; every word becomes
its own fragment: According to some estimates , the rule changes
would cut insider filings by more than a third .)
161
Vine Grammar is Regular
  • Even for small b, bunches can grow to arbitrary
    size
  • But arbitrary center embedding is out

162
Vine Grammar is Regular
  • Could compile into an FSA and get O(n) parsing!
  • Problem: what's the grammar constant?

EXPONENTIAL
  • insider has no parent
  • cut and would can have more children
  • can have more children

FSA
According to some estimates , the rule changes
would cut insider ...
163
Alternative
  • Instead, we adapt
  • an SBG chart parser
  • which implicitly shares fragments of stack state
  • to the vine case,
  • eliminating unnecessary work.

164
Limiting dependency length
  • Linear-time partial parsing

Finite-state model of root sequence
NP
S
NP
Bounded dependency length within each chunk (but a
chunk could be arbitrarily wide if right- or left-
branching)
  • Natural-language dependencies tend to be short
  • So even if you don't have enough data to model
    what the heads are
  • you might want to keep track of where they are.

165
Limiting dependency length
  • Linear-time partial parsing
  • Don't convert into an FSA!
  • Less structure sharing
  • Explosion of states for different stack
    configurations
  • Hard to get your parse back

Finite-state model of root sequence
NP
S
NP
Bounded dependency length within each chunk