Transcript and Presenter's Notes

Title: Declarative Specification of NLP Systems


1
Declarative Specification of NLP Systems
  • Jason Eisner

student co-authors on various parts of this work:
Eric Goldlust, Noah A. Smith, John Blatz, Roy Tromble
IBM, May 2006
2
An Anecdote from ACL'05
- Michael Jordan
3
An Anecdote from ACL'05
- Michael Jordan
4
Conclusions to draw from that talk
  1. Mike & his students are great.
  2. Graphical models are great. (because they're flexible)
  3. Gibbs sampling is great. (because it works with nearly any graphical model)
  4. Matlab is great. (because it frees up Mike and his students to doodle all day and then execute their doodles)

5
Could NLP be this nice?
  1. Mike & his students are great.
  2. Graphical models are great. (because they're flexible)
  3. Gibbs sampling is great. (because it works with nearly any graphical model)
  4. Matlab is great. (because it frees up Mike and his students to doodle all day and then execute their doodles)

6
Could NLP be this nice?
  • Parts of it already are
  • Language modeling
  • Binary classification (e.g., SVMs)
  • Finite-state transductions
  • Linear-chain graphical models

Toolkits available: you don't have to be an expert.
But other parts aren't: context-free and beyond; machine translation.
Efficient parsers and MT systems are complicated and painful to write.
7
Could NLP be this nice?
  • This talk: a toolkit that's general enough for these cases.
  • (stretches from finite-state to Turing machines)
  • Dyna

But other parts aren't: context-free and beyond; machine translation.
Efficient parsers and MT systems are complicated and painful to write.
8
Warning
  • Lots more beyond this talk
  • see the EMNLP'05 and FG'06 papers
  • see http://dyna.org
  • (download & documentation)
  • sign up for updates by email
  • wait for the totally revamped next version

9
the case for Little Languages
  • declarative programming
  • small is beautiful

10
Sapir-Whorf hypothesis
  • Language shapes thought
  • At least, it shapes conversation
  • Computer language shapes thought
  • At least, it shapes experimental research
  • Lots of cute ideas that we never pursue
  • Or if we do pursue them, it takes 6-12 months to
    implement on large-scale data
  • Have we turned into a lab science?

11
Declarative Specifications
  • State what is to be done
  • (How should the computer do it? Turn that over
    to a general solver that handles the
    specification language.)
  • Hundreds of domain-specific little languages
    out there. Some have sophisticated solvers.

12
dot (www.graphviz.org)
digraph g {
  graph [rankdir = "LR"];
  node [fontsize = "16", shape = "ellipse"];
  edge [];
  "node0" [label = "<f0> 0x10ba8 | <f1>", shape = "record"];
  "node1" [label = "<f0> 0xf7fc4380 | <f1> | <f2> -1", shape = "record"];
  "node0":f0 -> "node1":f0 [id = 0];
  "node0":f1 -> "node2":f0 [id = 1];
  "node1":f0 -> "node3":f0 [id = 2];
  ...
}
(first the nodes, then the edges)
What's the hard part? Making a nice layout! Actually, it's NP-hard ...
13
dot (www.graphviz.org)
14
LilyPond (www.lilypond.org)
15
LilyPond (www.lilypond.org)
16
Declarative Specs in NLP
  • Regular expression (for a FST toolkit)
  • Grammar (for a parser)
  • Feature set (for a maxent distribution, SVM,
    etc.)
  • Graphical model (DBNs for ASR, IE, etc.)

Claim of this talk: sometimes it's best to peek under the shiny surface. Declarative methods are still great, but should be layered: we need them one level lower, too.
17
Declarative Specs in NLP
  • Regular expression (for a FST toolkit)
  • Grammar (for a parser)
  • Feature set (for a maxent distribution, SVM,
    etc.)

18
Declarative Specification of Algorithms
19
How you build a system (big picture slide)
cool model
practical equations
PCFG
pseudocode (execution order)
tuned C++ implementation (data structures, etc.)
for width from 2 to n: for i from 0 to n-width:
k = i+width; for j from i+1 to k-1: ...
20
Wait a minute
Didn't I just implement something like this last month?
chart management / indexing; cache-conscious data structures; prioritization of partial solutions (best-first, A*); parameter management; inside-outside formulas; different algorithms for training and decoding; conjugate gradient, annealing, ...; parallelization?
I thought computers were supposed to automate drudgery.
21
How you build a system (big picture slide)
cool model
  • Dyna language specifies these equations.
  • Most programs just need to compute some values
    from other values. Any order is ok.
  • Some programs also need to update the outputs if
    the inputs change
  • spreadsheets, makefiles, email readers
  • dynamic graph algorithms
  • EM and other iterative optimization
  • leave-one-out training of smoothing params

practical equations
PCFG
pseudocode (execution order)
tuned C++ implementation (data structures, etc.)
for width from 2 to n: for i from 0 to n-width:
k = i+width; for j from i+1 to k-1: ...
22
How you build a system (big picture slide)
cool model
practical equations
PCFG
Compilation strategies (we'll come back to this)
pseudocode (execution order)
tuned C++ implementation (data structures, etc.)
for width from 2 to n: for i from 0 to n-width:
k = i+width; for j from i+1 to k-1: ...
23
Writing equations in Dyna
  • :- int a.
  • a = b * c.
  • a will be kept up to date if b or c changes.
  • b += x.  b += y.    (equivalent to b = x + y.)
  • b is a sum of two variables. Also kept up to date.
  • c += z(1).  c += z(2).  c += z(3).
  • c += z("four").  c += z(foo(bar,5)).

c += z(N).
c is a sum of all nonzero z(...) values. At
compile time, we don't know how many!
24
More interesting use of patterns
  • a = b * c.
  • scalar multiplication
  • a(I) = b(I) * c(I).
  • pointwise multiplication
  • a += b(I) * c(I).   means a = Σ_I b(I)*c(I)
  • dot product; could be sparse
  • a(I,K) += b(I,J) * c(J,K).   Σ_J b(I,J)*c(J,K)
  • matrix multiplication; could be sparse
  • J is free on the right-hand side, so we sum over it (see the sketch below)
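To make the summed-variable semantics concrete, here is a minimal Python sketch (my illustration, not part of Dyna) of what a solver effectively computes for the matrix-product rule above, with sparse dicts standing in for the items:

    # a(I,K) += b(I,J) * c(J,K)  -- sum over the free variable J, sparsely.
    from collections import defaultdict

    def sparse_matmul(b, c):
        """b and c map (row, col) -> value; returns their product as a dict."""
        c_by_row = defaultdict(list)          # index c by its first argument J
        for (j, k), v in c.items():
            c_by_row[j].append((k, v))
        a = defaultdict(float)
        for (i, j), bv in b.items():
            for k, cv in c_by_row[j]:
                a[(i, k)] += bv * cv          # the += accumulates over J
        return dict(a)

    b = {(0, 0): 2.0, (0, 1): 3.0}
    c = {(0, 5): 10.0, (1, 5): 100.0}
    print(sparse_matmul(b, c))                # {(0, 5): 320.0}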

25
Dyna vs. Prolog
  • By now you may see what we're up to!
  • Prolog has Horn clauses:
  • a(I,K) :- b(I,J), c(J,K).
  • Dyna has Horn equations:
  • a(I,K) += b(I,J) * c(J,K).

Like Prolog: allows nested terms; syntactic sugar
for lists, etc.; Turing-complete.
Unlike Prolog: charts, not backtracking! Compiles
to efficient C++ classes. Integrates with your C++
code.
26
The CKY inside algorithm in Dyna
:- double item = 0.
:- bool length = false.
constit(X,I,J) += word(W,I,J) * rewrite(X,W).
constit(X,I,J) += constit(Y,I,Mid) * constit(Z,Mid,J) * rewrite(X,Y,Z).
goal += constit(s,0,N) if length(N).

using namespace cky;
chart c;
c[rewrite("s","np","vp")] = 0.7;
c[word("Pierre",0,1)] = 1;
c[length(30)] = true;   // 30-word sentence
cin >> c;               // get more axioms from stdin
cout << c[goal];        // print total weight of all parses
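For readers who want to see the same computation procedurally, here is a minimal Python sketch (mine, not generated by Dyna) of the inside algorithm those three rules describe, run on a toy grammar:

    from collections import defaultdict

    def inside(words, unary, binary):
        """unary: (X, w) -> prob; binary: (X, Y, Z) -> prob.
        Returns the chart of inside weights and the goal value."""
        n = len(words)
        constit = defaultdict(float)
        for i, w in enumerate(words):               # constit(X,I,J) += word(W,I,J) * rewrite(X,W).
            for (x, ww), p in unary.items():
                if ww == w:
                    constit[(x, i, i + 1)] += p
        for width in range(2, n + 1):               # constit(X,I,J) += constit(Y,I,Mid) * constit(Z,Mid,J) * rewrite(X,Y,Z).
            for i in range(0, n - width + 1):
                k = i + width
                for mid in range(i + 1, k):
                    for (x, y, z), p in binary.items():
                        constit[(x, i, k)] += constit[(y, i, mid)] * constit[(z, mid, k)] * p
        return constit, constit[("s", 0, n)]        # goal += constit(s,0,N) if length(N).

    unary = {("np", "Pierre"): 1.0, ("vp", "sleeps"): 1.0}
    binary = {("s", "np", "vp"): 0.7}
    _, goal = inside(["Pierre", "sleeps"], unary, binary)
    print(goal)    # 0.7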
27
visual debugger: browse the proof forest
ambiguity
shared substructure
28
Related algorithms in Dyna?
constit(X,I,J) += word(W,I,J) * rewrite(X,W).
constit(X,I,J) += constit(Y,I,Mid) * constit(Z,Mid,J) * rewrite(X,Y,Z).
goal += constit(s,0,N) if length(N).
  • Viterbi parsing?
  • Logarithmic domain?
  • Lattice parsing?
  • Incremental (left-to-right) parsing?
  • Log-linear parsing?
  • Lexicalized or synchronous parsing?
  • Earley's algorithm?
  • Binarized CKY?

29
Related algorithms in Dyna?
constit(X,I,J) += word(W,I,J) * rewrite(X,W).
constit(X,I,J) += constit(Y,I,Mid) * constit(Z,Mid,J) * rewrite(X,Y,Z).
goal += constit(s,0,N) if length(N).
max=   max=   max=      (Viterbi parsing: each += becomes max=)
  • Viterbi parsing?
  • Logarithmic domain?
  • Lattice parsing?
  • Incremental (left-to-right) parsing?
  • Log-linear parsing?
  • Lexicalized or synchronous parsing?
  • Earley's algorithm?
  • Binarized CKY?

30
Related algorithms in Dyna?
constit(X,I,J) += word(W,I,J) * rewrite(X,W).
constit(X,I,J) += constit(Y,I,Mid) * constit(Z,Mid,J) * rewrite(X,Y,Z).
goal += constit(s,0,N) if length(N).
max=   max=   max=      (Viterbi parsing: each += becomes max=)
log    log    log       (logarithmic domain: store log weights, so each * becomes +)
  • Viterbi parsing?
  • Logarithmic domain?
  • Lattice parsing?
  • Incremental (left-to-right) parsing?
  • Log-linear parsing?
  • Lexicalized or synchronous parsing?
  • Earley's algorithm?
  • Binarized CKY?

31
Related algorithms in Dyna?
constit(X,I,J) += word(W,I,J) * rewrite(X,W).
constit(X,I,J) += constit(Y,I,Mid) * constit(Z,Mid,J) * rewrite(X,Y,Z).
goal += constit(s,0,N) if length(N).
  • Viterbi parsing?
  • Logarithmic domain?
  • Lattice parsing?
  • Incremental (left-to-right) parsing?
  • Log-linear parsing?
  • Lexicalized or synchronous parsing?
  • Earley's algorithm?
  • Binarized CKY?

c[word("Pierre", 0, 1)] = 1    becomes, for lattice input, something like    c[word("Pierre", state(5), state(9))] = 0.2
(figure: a word lattice with arcs Pierre/0.2, P/0.5, air/0.3 among states 5, 8, 9)
32
Related algorithms in Dyna?
constit(X,I,J) += word(W,I,J) * rewrite(X,W).
constit(X,I,J) += constit(Y,I,Mid) * constit(Z,Mid,J) * rewrite(X,Y,Z).
goal += constit(s,0,N) if length(N).
  • Viterbi parsing?
  • Logarithmic domain?
  • Lattice parsing?
  • Incremental (left-to-right) parsing?
  • Log-linear parsing?
  • Lexicalized or synchronous parsing?
  • Earley's algorithm?
  • Binarized CKY?

Just add words one at a time to the chart.
Check at any time what can be derived from the words so far.
Similarly, dynamic grammars.
33
Related algorithms in Dyna?
constit(X,I,J) += word(W,I,J) * rewrite(X,W).
constit(X,I,J) += constit(Y,I,Mid) * constit(Z,Mid,J) * rewrite(X,Y,Z).
goal += constit(s,0,N) if length(N).
  • Viterbi parsing?
  • Logarithmic domain?
  • Lattice parsing?
  • Incremental (left-to-right) parsing?
  • Log-linear parsing?
  • Lexicalized or synchronous parsing?
  • Earley's algorithm?
  • Binarized CKY?

Again, no change to the Dyna program
34
Related algorithms in Dyna?
constit(X,I,J) += word(W,I,J) * rewrite(X,W).
constit(X,I,J) += constit(Y,I,Mid) * constit(Z,Mid,J) * rewrite(X,Y,Z).
goal += constit(s,0,N) if length(N).
  • Viterbi parsing?
  • Logarithmic domain?
  • Lattice parsing?
  • Incremental (left-to-right) parsing?
  • Log-linear parsing?
  • Lexicalized or synchronous parsing?
  • Earley's algorithm?
  • Binarized CKY?

Basically, just add extra arguments to the terms above
35
Related algorithms in Dyna?
constit(X,I,J) += word(W,I,J) * rewrite(X,W).
constit(X,I,J) += constit(Y,I,Mid) * constit(Z,Mid,J) * rewrite(X,Y,Z).
goal += constit(s,0,N) if length(N).
  • Viterbi parsing?
  • Logarithmic domain?
  • Lattice parsing?
  • Incremental (left-to-right) parsing?
  • Log-linear parsing?
  • Lexicalized or synchronous parsing?
  • Earley's algorithm?
  • Binarized CKY?

36
Earley's algorithm in Dyna
constit(X,I,J) += word(W,I,J) * rewrite(X,W).
constit(X,I,J) += constit(Y,I,Mid) * constit(Z,Mid,J) * rewrite(X,Y,Z).
goal += constit(s,0,N) if length(N).
magic templates transformation (as noted by
Minnen 1996)
37
Program transformations
cool model
Blatz & Eisner (FG 2006): Lots of equivalent ways to write a system of equations! Transforming from one to another may improve efficiency. Many parsing tricks can be generalized into automatic transformations that help other programs, too!
practical equations
PCFG
pseudocode (execution order)
tuned C++ implementation (data structures, etc.)
for width from 2 to n: for i from 0 to n-width:
k = i+width; for j from i+1 to k-1: ...
38
Related algorithms in Dyna?
constit(X,I,J) += word(W,I,J) * rewrite(X,W).
constit(X,I,J) += constit(Y,I,Mid) * constit(Z,Mid,J) * rewrite(X,Y,Z).
goal += constit(s,0,N) if length(N).
  • Viterbi parsing?
  • Logarithmic domain?
  • Lattice parsing?
  • Incremental (left-to-right) parsing?
  • Log-linear parsing?
  • Lexicalized or synchronous parsing?
  • Earley's algorithm?
  • Binarized CKY?

39
Rule binarization
constit(X,I,J) += constit(Y,I,Mid) * constit(Z,Mid,J) * rewrite(X,Y,Z).
(figure: the rule drawn as a constituent X over [I,J] built from Y over [I,Mid] and Z over [Mid,J])
40
Rule binarization
constit(X,I,J) += constit(Y,I,Mid) * constit(Z,Mid,J) * rewrite(X,Y,Z).
cf. graphical models, constraint programming, multi-way database join
(a folded, binarized version is sketched below)
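As a rough illustration of the payoff (my Python sketch, not the compiler's actual output), folding the three-way product into two two-way joins introduces an intermediate item, here called temp(X,Y,Mid,J); each step then joins only two relations, which an indexed implementation can do cheaply:

    from collections import defaultdict

    def binarized_update(constit, rewrite):
        """constit: (X, i, j) -> weight; rewrite: (X, Y, Z) -> weight."""
        temp = defaultdict(float)                 # temp(X,Y,Mid,J) += rewrite(X,Y,Z) * constit(Z,Mid,J)
        for (x, y, z), rw in rewrite.items():
            for (z2, mid, j), cw in constit.items():
                if z2 == z:
                    temp[(x, y, mid, j)] += rw * cw
        new = defaultdict(float)                  # constit(X,I,J) += constit(Y,I,Mid) * temp(X,Y,Mid,J)
        for (x, y, mid, j), tw in temp.items():
            for (y2, i, mid2), cw in constit.items():
                if y2 == y and mid2 == mid:
                    new[(x, i, j)] += cw * tw
        return new

    constit = {("np", 0, 1): 1.0, ("vp", 1, 2): 1.0}
    rewrite = {("s", "np", "vp"): 0.7}
    print(dict(binarized_update(constit, rewrite)))   # {('s', 0, 2): 0.7}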
41
More program transformations
  • Examples that add new semantics
  • Compute gradient (e.g., derive outside algorithm from inside)
  • Compute upper bounds for A* (e.g., Klein & Manning ACL'03)
  • Coarse-to-fine (e.g., Johnson & Charniak NAACL'06)
  • Examples that preserve semantics
  • On-demand computation: by analogy with Earley's algorithm
  • On-the-fly composition of FSTs
  • Left-corner filter for parsing
  • Program specialization as unfolding: e.g., compile out the grammar
  • Rearranging computations: by analogy with categorial grammar
  • Folding reinterpreted as slashed categories
  • Speculative computation using slashed categories
  • abstract away repeated computation to do it once only: by analogy with unary rule closure or epsilon-closure
  • derives Eisner & Satta ACL'99 O(n³) bilexical parser

42
How you build a system (big picture slide)
cool model
practical equations
PCFG
Propagate updates from right-to-left through the equations.
a.k.a. agenda algorithm, forward chaining, bottom-up inference, semi-naive bottom-up
pseudocode (execution order)
tuned C++ implementation (data structures, etc.)
for width from 2 to n: for i from 0 to n-width:
k = i+width; for j from i+1 to k-1: ...
use a general method
43
Bottom-up inference
agenda of pending updates
rules of program
s(I,K) += np(I,J) * vp(J,K)
pp(I,K) += prep(I,J) * np(J,K)
queries: prep(I,3) = ?    vp(5,K) = ?
chart entries: prep(2,3) = 1.0, s(3,9) = 0.15, s(3,7) = 0.21, vp(5,9) = 0.5, pp(2,5) = 0.3, vp(5,7) = 0.7, np(3,5) = 0.1
update: np(3,5) += 0.3
We updated np(3,5): what else must therefore change?
If np(3,5) hadn't been in the chart already, we would have added it.
no more matches to this query
chart of derived items with current values
(the agenda loop itself is sketched below)
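Here is a minimal Python sketch (my own illustration, not the Dyna runtime) of that agenda loop for the single rule s(I,K) += np(I,J) * vp(J,K):

    from collections import defaultdict

    def agenda_run(updates):
        """updates: list of (item, delta) axioms, item e.g. ('np', 3, 5)."""
        chart = defaultdict(float)
        agenda = list(updates)
        while agenda:
            item, delta = agenda.pop()
            chart[item] += delta                      # apply the update to the chart
            kind, a, b = item
            if kind == 'np':                          # np(I,J) changed: find matching vp(J,K)
                for (k2, j, k), v in list(chart.items()):
                    if k2 == 'vp' and j == b:
                        agenda.append((('s', a, k), delta * v))
            elif kind == 'vp':                        # vp(J,K) changed: find matching np(I,J)
                for (k2, i, j), v in list(chart.items()):
                    if k2 == 'np' and j == a:
                        agenda.append((('s', i, b), v * delta))
        return dict(chart)

    print(agenda_run([(('np', 3, 5), 0.3), (('vp', 5, 9), 0.5)]))
    # s(3,9) ends up with 0.3 * 0.5 = 0.15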
44
How you build a system (big picture slide)
cool model
practical equations
PCFG
What's going on under the hood?
pseudocode (execution order)
tuned C++ implementation (data structures, etc.)
for width from 2 to n: for i from 0 to n-width:
k = i+width; for j from i+1 to k-1: ...
45
Compiler provides
agenda of pending updates
rules of program
s(I,K) += np(I,J) * vp(J,K)
np(3,5) += 0.3
copy, compare, hash terms fast, via integerization (interning)
efficient storage of terms (use native C++ types, symbiotic storage, garbage collection, serialization, ...)
chart of derived items with current values
46
Beware double-counting!
agenda of pending updates
combining with itself
rules of program
n(I,K) += n(I,J) * n(J,K)
n(5,5) += 0.2     n(5,5) = ?     n(5,5) = 0.3
to make another copy of itself
epsilon constituent
If np(3,5) hadn't been in the chart already, we would have added it.
chart of derived items with current values
47
Parameter training
objective function as a theorem's value
  • Maximize some objective function.
  • Use Dyna to compute the function.
  • Then how do you differentiate it?
  • for gradient ascent, conjugate gradient, etc.
  • gradient also tells us the expected counts for EM!

e.g., inside algorithm computes likelihood of the sentence
  • Two approaches (a worked identity follows below):
  • Program transformation: automatically derive the outside formulas.
  • Back-propagation: run the agenda algorithm backwards.
  • works even with pruning, early stopping, etc.
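For concreteness, here is the standard inside-outside identity (not specific to Dyna) that both approaches compute: if goal is the inside weight of the sentence under the CKY program above, then for a binary rule weight

\[
\frac{\partial\,\mathrm{goal}}{\partial\,\mathrm{rewrite}(X,Y,Z)}
  \;=\; \sum_{I < Mid < J} \mathrm{outer}\big(\mathrm{constit}(X,I,J)\big)\,
        \mathrm{constit}(Y,I,Mid)\;\mathrm{constit}(Z,Mid,J),
\]

and multiplying by the rule weight and dividing by goal gives that rule's expected count for EM.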

48
What can Dyna do beyond CKY?
49
Some examples from my lab
  • Parsing using
  • factored dependency models (Dreyer, Smith & Smith CONLL'06)
  • with annealed risk minimization (Smith & Eisner EMNLP'06)
  • constraints on dependency length (Eisner & Smith IWPT'05)
  • unsupervised learning of deep transformations (see Eisner EMNLP'02)
  • lexicalized algorithms (see Eisner & Satta ACL'99, etc.)
  • Grammar induction using
  • partial supervision (Dreyer & Eisner EMNLP'06)
  • structural annealing (Smith & Eisner ACL'06)
  • contrastive estimation (Smith & Eisner GIA'05)
  • deterministic annealing (Smith & Eisner ACL'04)
  • Machine translation using
  • very large neighborhood search of permutations (Eisner & Tromble NAACL-W'06)
  • loosely syntax-based MT (Smith & Eisner, in prep.)
  • synchronous cross-lingual parsing (Smith & Smith EMNLP'04)
  • Finite-state methods for morphology, phonology, IE, even syntax
  • unsupervised cognate discovery (Schafer & Yarowsky '05, '06)
  • unsupervised log-linear models via contrastive estimation (Smith & Eisner ACL'05)
  • context-based morph. disambiguation (Smith, Smith & Tromble EMNLP'05)

Easy to try stuff out! Programs are very short & easy to change!
(see also Eisner ACL'03)
50
Can it express everything in NLP?
  • Remember, it integrates tightly with C++, so you only have to use it where it's helpful, and write the rest in C++. Small is beautiful.
  • We're currently extending the class of allowed formulas beyond the semiring
  • cf. Goodman (1999)
  • will be able to express smoothing, neural nets, etc.
  • Of course, it is Turing complete.

51
Smoothing in Dyna
  • mle_prob(X,Y,Z) = count(X,Y,Z) / count(X,Y).    (X,Y is the context)
  • smoothed_prob(X,Y,Z) = lambda*mle_prob(X,Y,Z) + (1-lambda)*mle_prob(Y,Z).
  • for arbitrary n-grams, can use lists
  • count_count(N) += 1 whenever N is count(Anything).
  • updates automatically during leave-one-out jackknifing

52
Information retrieval in Dyna
  • score(Doc) += tf(Doc,Word) * tf(Query,Word) * idf(Word).
  • idf(Word) = 1/log(df(Word)).
  • df(Word) += 1 whenever tf(Doc,Word) > 0.
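A minimal Python rendering (illustration only) of what those three rules compute, given term-frequency tables; the small guard for df = 1 is my own adjustment to avoid dividing by log(1):

    import math
    from collections import defaultdict

    def score_docs(tf, query_tf):
        """tf: (doc, word) -> count; query_tf: word -> count."""
        df = defaultdict(int)                     # df(Word) += 1 whenever tf(Doc,Word) > 0.
        for (doc, word), c in tf.items():
            if c > 0:
                df[word] += 1
        idf = {w: (1.0 / math.log(n) if n > 1 else 0.0)   # idf(Word) = 1/log(df(Word)); df=1 zeroed here
               for w, n in df.items()}
        score = defaultdict(float)                # score(Doc) += tf(Doc,Word)*tf(Query,Word)*idf(Word).
        for (doc, word), c in tf.items():
            score[doc] += c * query_tf.get(word, 0) * idf[word]
        return dict(score)

    tf = {("d1", "red"): 2, ("d2", "red"): 1, ("d2", "jay"): 3}
    print(score_docs(tf, {"red": 1, "jay": 1}))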

53
Neural networks in Dyna
  • out(Node) = sigmoid(in(Node)).
  • in(Node) += input(Node).
  • in(Node) += weight(Node,Kid) * out(Kid).
  • error += (out(Node) - target(Node))**2 if ?target(Node).
  • Recurrent neural net is ok
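A rough Python equivalent (my sketch) for an acyclic network, evaluating nodes in topological order; the names inputs, weight, and target mirror the items above:

    import math

    def forward(nodes, inputs, weight, target):
        """nodes: topologically ordered ids; inputs: node -> external input;
        weight: (node, kid) -> weight; target: node -> desired output."""
        sigmoid = lambda z: 1.0 / (1.0 + math.exp(-z))
        out = {}
        for node in nodes:
            total = inputs.get(node, 0.0)                 # in(Node) += input(Node).
            for (n, kid), w in weight.items():
                if n == node:
                    total += w * out[kid]                 # in(Node) += weight(Node,Kid) * out(Kid).
            out[node] = sigmoid(total)                    # out(Node) = sigmoid(in(Node)).
        error = sum((out[n] - t) ** 2 for n, t in target.items())   # error += (out - target)^2
        return out, error

    out, err = forward(["h", "y"], {"h": 0.5}, {("y", "h"): 2.0}, {"y": 1.0})
    print(out, err)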

54
Game-tree analysis in Dyna
  • goal = best(Board) if start(Board).
  • best(Board) max= stop(player1, Board).
  • best(Board) max= move(player1, Board, NewBoard) + worst(NewBoard).
  • worst(Board) min= stop(player2, Board).
  • worst(Board) min= move(player2, Board, NewBoard) + best(NewBoard).
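The same recursion written as a tiny Python minimax (illustrative only; the toy game, move payoffs, and stop values below are my assumptions):

    def best(board, moves, stop_value):
        """Max player's value: either stop now, or move and let the min player reply."""
        options = [stop_value(board, "player1")]
        options += [payoff + worst(nb, moves, stop_value) for nb, payoff in moves(board, "player1")]
        return max(options)

    def worst(board, moves, stop_value):
        """Min player's value: either stop now, or move and let the max player reply."""
        options = [stop_value(board, "player2")]
        options += [payoff + best(nb, moves, stop_value) for nb, payoff in moves(board, "player2")]
        return min(options)

    # toy game: from board n, the mover may go to n-1 at payoff 1, until 0
    moves = lambda board, player: [(board - 1, 1)] if board > 0 else []
    stop_value = lambda board, player: 0
    print(best(3, moves, stop_value))   # 1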

55
Weighted FST composition in Dyna(epsilon-free
case)
  • :- bool item = false.
  • start(A o B, Q x R) |= start(A, Q) & start(B, R).
  • stop(A o B, Q x R) |= stop(A, Q) & stop(B, R).
  • arc(A o B, Q1 x R1, Q2 x R2, In, Out) |= arc(A, Q1, Q2, In, Match) & arc(B, R1, R2, Match, Out).
  • Inefficient? How do we fix this?
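A small Python sketch (not Dyna output) of the same epsilon-free cross-product construction, with FSTs represented as sets of tuples:

    def compose(start_a, stop_a, arcs_a, start_b, stop_b, arcs_b):
        """Each FST: start/stop = sets of states; arcs = set of (q1, q2, inp, out)."""
        starts = {(q, r) for q in start_a for r in start_b}
        stops = {(q, r) for q in stop_a for r in stop_b}
        arcs = {((q1, r1), (q2, r2), inp, out)
                for (q1, q2, inp, match) in arcs_a
                for (r1, r2, match2, out) in arcs_b
                if match == match2}              # A's output symbol must match B's input symbol
        return starts, stops, arcs

    # A maps "a"->"x"; B maps "x"->"z"; the composition maps "a"->"z".
    A = ({0}, {1}, {(0, 1, "a", "x")})
    B = ({0}, {1}, {(0, 1, "x", "z")})
    print(compose(*A, *B))

The nested loop over all arc pairs is exactly the inefficiency the last bullet asks about; indexing one machine's arcs by the shared symbol avoids it.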

56
Constraint programming (arc consistency)
  • :- bool indomain = false.
  • :- bool consistent = true.
  • variable(Var) |= indomain(Var:Val).
  • possible(Var:Val) &= indomain(Var:Val).
  • possible(Var:Val) &= support(Var:Val, Var2) whenever variable(Var2).
  • support(Var:Val, Var2) |= possible(Var2:Val2) & consistent(Var:Val, Var2:Val2).

57
Edit distance in Dyna version 1
  • letter1("c",0,1).  letter1("l",1,2).  letter1("a",2,3).  ...   % clara
  • letter2("c",0,1).  letter2("a",1,2).  letter2("c",2,3).  ...   % caca
  • end1(5).  end2(4).  delcost(L) = 1.  inscost(L) = 1.  substcost(L1,L2) = 1.
  • align(0,0) = 0.
  • align(I1,J2) min= align(I1,I2) + letter2(L2,I2,J2) + inscost(L2).
  • align(J1,I2) min= align(I1,I2) + letter1(L1,I1,J1) + delcost(L1).
  • align(J1,J2) min= align(I1,I2) + letter1(L1,I1,J1) + letter2(L2,I2,J2) + substcost(L1,L2).
  • align(J1,J2) min= align(I1,I2) + letter1(L,I1,J1) + letter2(L,I2,J2).
  • goal min= align(N1,N2) whenever end1(N1) & end2(N2).
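For reference, the same min-plus recurrence as a short Python function (a standard Levenshtein-style DP; unit costs assumed):

    def edit_distance(s1, s2, delcost=1, inscost=1, substcost=1):
        """align[i][j] = cheapest alignment of s1[:i] with s2[:j] (cf. align(I1,I2) above)."""
        n1, n2 = len(s1), len(s2)
        align = [[0] * (n2 + 1) for _ in range(n1 + 1)]
        for i in range(1, n1 + 1):
            align[i][0] = align[i - 1][0] + delcost            # delete s1[i-1]
        for j in range(1, n2 + 1):
            align[0][j] = align[0][j - 1] + inscost            # insert s2[j-1]
        for i in range(1, n1 + 1):
            for j in range(1, n2 + 1):
                align[i][j] = min(
                    align[i - 1][j] + delcost,                  # deletion
                    align[i][j - 1] + inscost,                  # insertion
                    align[i - 1][j - 1] + (0 if s1[i - 1] == s2[j - 1] else substcost))
        return align[n1][n2]

    print(edit_distance("clara", "caca"))    # 2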

58
Edit distance in Dyna version 2
  • input(["c","l","a","r","a"], ["c","a","c","a"]) = 0.
  • delcost = 1.  inscost = 1.  substcost = 1.
  • alignupto(Xs,Ys) min= input(Xs,Ys).
  • alignupto(Xs,Ys) min= alignupto([X|Xs],Ys) + delcost.
  • alignupto(Xs,Ys) min= alignupto(Xs,[Y|Ys]) + inscost.
  • alignupto(Xs,Ys) min= alignupto([X|Xs],[Y|Ys]) + substcost.
  • alignupto(Xs,Ys) min= alignupto([A|Xs],[A|Ys]).
  • goal min= alignupto([], []).

How about different costs for different letters?
59
Edit distance in Dyna version 2
  • input(["c","l","a","r","a"], ["c","a","c","a"]) = 0.
  • delcost(X) = 1.  inscost(Y) = 1.  substcost(X,Y) = 1.
  • alignupto(Xs,Ys) min= input(Xs,Ys).
  • alignupto(Xs,Ys) min= alignupto([X|Xs],Ys) + delcost(X).
  • alignupto(Xs,Ys) min= alignupto(Xs,[Y|Ys]) + inscost(Y).
  • alignupto(Xs,Ys) min= alignupto([X|Xs],[Y|Ys]) + substcost(X,Y).
  • alignupto(Xs,Ys) min= alignupto([L|Xs],[L|Ys]).
  • goal min= alignupto([], []).

(just add the arguments (X), (Y), (X,Y) to the cost items)
60
Is it fast enough?
(sort of)
  • Asymptotically efficient
  • 4 times slower than Mark Johnson's inside-outside
  • 4-11 times slower than Klein & Manning's Viterbi parser

61
Are you going to make it faster?
(yup!)
  • Currently rewriting the term classes to match hand-tuned code
  • Will support mix-and-match implementation strategies
  • store X in an array
  • store Y in a hash
  • don't store Z (compute on demand)
  • Eventually, choose strategies automatically by execution profiling

62
Synopsis: your idea → experimental results, fast!
  • Dyna is a language for computation (no I/O).
  • Especially good for dynamic programming.
  • It tries to encapsulate the black art of NLP.
  • Much prior work in this vein:
  • Deductive parsing schemata (preferably weighted): Goodman, Nederhof, Pereira, Warren, Shieber, Schabes, Sikkel
  • Deductive databases (preferably with aggregation): Ramakrishnan, Zukowski, Freitag, Specht, Ross, Sagiv, ...
  • Probabilistic programming languages (implemented): Zhao, Sato, Pfeffer (also efficient Prologish languages)

63
Dyna contributors!
  • Jason Eisner
  • Eric Goldlust, Eric Northup, Johnny Graettinger
    (compiler backend)
  • Noah A. Smith (parameter training)
  • Markus Dreyer, David Smith (compiler frontend)
  • Mike Kornbluh, George Shafer, Gordon Woodhull,
    Constantinos Michael, Ray Buse (visual
    debugger)
  • John Blatz (program transformations)
  • Asheesh Laroia (web services)

64
New examples of dynamic programming in NLP
65
Some examples from my lab
  • Parsing using
  • factored dependency models (Dreyer, Smith & Smith CONLL'06)
  • with annealed risk minimization (Smith & Eisner EMNLP'06)
  • constraints on dependency length (Eisner & Smith IWPT'05)
  • unsupervised learning of deep transformations (see Eisner EMNLP'02)
  • lexicalized algorithms (see Eisner & Satta ACL'99, etc.)
  • Grammar induction using
  • partial supervision (Dreyer & Eisner EMNLP'06)
  • structural annealing (Smith & Eisner ACL'06)
  • contrastive estimation (Smith & Eisner GIA'05)
  • deterministic annealing (Smith & Eisner ACL'04)
  • Machine translation using
  • very large neighborhood search of permutations (Eisner & Tromble NAACL-W'06)
  • loosely syntax-based MT (Smith & Eisner, in prep.)
  • synchronous cross-lingual parsing (Smith & Smith EMNLP'04)
  • Finite-state methods for morphology, phonology, IE, even syntax
  • unsupervised cognate discovery (Schafer & Yarowsky '05, '06)
  • unsupervised log-linear models via contrastive estimation (Smith & Eisner ACL'05)
  • context-based morph. disambiguation (Smith, Smith & Tromble EMNLP'05)

(see also Eisner ACL'03)
66
New examples of dynamic programming in NLP
  • Parameterized finite-state machines

67
Parameterized FSMs
  • An FSM whose arc probabilities depend on parameters: they are formulas.

68
Parameterized FSMs
  • An FSM whose arc probabilities depend on parameters: they are formulas.

69
Parameterized FSMs
  • An FSM whose arc probabilities depend on parameters: they are formulas.

Expert first: construct the FSM (topology & parameterization).
Automatic takes over: given training data, find parameter values that optimize arc probs.
70
Parameterized FSMs
Knight & Graehl 1997: transliteration
71
Parameterized FSMs
Knight & Graehl 1997: transliteration
Would like to get some of that expert knowledge in here. Use probabilistic regexps like (a*.7 b) +.5 (ab*.6). If the probabilities are variables, (a*x b) +y (ab*z), then the arc weights of the compiled machine are nasty formulas. (Especially after minimization!)
72
Finite-State Operations
  • Projection GIVES YOU marginal distribution: domain(p(x,y))
73
Finite-State Operations
  • Probabilistic union GIVES YOU mixture model: p(x) +0.3 q(x)
74
Finite-State Operations
  • Probabilistic union GIVES YOU mixture model: p(x) +λ q(x)

Learn the mixture parameter λ!
75
Finite-State Operations
  • Composition GIVES YOU chain rule: p(x|y) o p(y|z)
  • The most popular statistical FSM operation
  • Cross-product construction

76
Finite-State Operations
  • Concatenation, probabilistic closure HANDLE unsegmented text: p(x) q(x), p(x)*.3
  • Just glue together machines for the different segments, and let them figure out how to align with the text

77
Finite-State Operations
  • Directed replacement MODELS noise or postprocessing: p(x,y) o (noise/postprocessing transducer)
  • Resulting machine compensates for noise or postprocessing

78
Finite-State Operations
  • Intersection GIVES YOU product models: p(x) & q(x)
  • e.g., exponential / maxent, perceptron, Naïve Bayes, ...
  • Need a normalization op too: computes Σ_x f(x), the pathsum or partition function
  • Cross-product construction (like composition)

79
Finite-State Operations
  • Conditionalization (new operation): condit(p(x,y))
  • Resulting machine can be composed with other distributions: p(y | x) o q(x)

80
New examples of dynamic programming in NLP
  • Parameterized infinite-state machines

81
Universal grammar as a parameterized FSA over an
infinite state space
82
New examples of dynamic programming in NLP
  • More abuses of finite-state machines

83
Huge-alphabet FSAs for OT phonology
Gen proposes all candidates that include this input.
(figure: candidates shown as underlying and surface tiers of C/V segments carrying features such as voi and velar)
84
Huge-alphabet FSAs for OT phonology
encode this candidate as a string
at each moment, need to describe what's going on on many tiers
(figure: one candidate's tiers of C/V segments with voi and velar features)
85
Directional Best Paths construction
  • Keep best output string for each input string
  • Yields a new transducer (size ?? 3n)

For input abc: abc, axc.  For input abd: axd.
Must allow the red arc just if the next input is d
86
Minimization of semiring-weighted FSAs
  • New definition of λ for pushing
  • λ(q) = weight of the shortest path from q, breaking ties alphabetically on input symbols
  • Computation is simple, well-defined, independent of (K, ⊗)
  • Breadth-first search back from final states

Compute λ(q) in O(1) time as soon as we visit q. Whole alg. is linear.
(figure: a small weighted automaton with arcs labeled a, b, c, d; distance 2)
Faster than finding min-weight path à la Mohri.
λ(q) = k ⊗ λ(r)
87
New examples of dynamic programming in NLP
  • Tree-to-tree alignment

88
Synchronous Tree Substitution Grammar
Two training trees, showing a free translation
from French to English.
beaucoup d'enfants donnent un baiser à Sam →
kids kiss Sam quite often
89
Synchronous Tree Substitution Grammar
Two training trees, showing a free translation from French to English. A possible alignment is shown in orange.
(figure: the two aligned trees over the French words beaucoup (lots), d' (of), enfants (kids), donnent (give), un (a), baiser (kiss), à (to), Sam and the English words kids, kiss, Sam, quite, often)
beaucoup d'enfants donnent un baiser à Sam → kids kiss Sam quite often
90
Synchronous Tree Substitution Grammar
Two training trees, showing a free translation
from French to English. A possible alignment is
shown in orange. Alignment shows how trees are
generated synchronously from little trees ...
beaucoup d'enfants donnent un baiser à Sam →
kids kiss Sam quite often
91
New examples of dynamic programming in NLP
  • Bilexical parsing in O(n³)
  • (with Giorgio Satta)

92
Lexicalized CKY
(figure: lexicalized parse chart over the words loves, Mary, the, girl, outdoors)
93
Lexicalized CKY is O(n⁵), not O(n³)
... advocate visiting relatives
... hug visiting relatives
(figure: combining constituents B over [i, j] and C over [j+1, k]: O(n³) combinations of i, j, k)
94
Idea 1
  • Combine B with what C?
  • must try different-width Cs (vary k)
  • must try differently-headed Cs (vary h)
  • Separate these!

95
Idea 1
(the old CKY way)
96
Idea 2
  • Some grammars allow

97
Idea 2
  • Combine what B and C?
  • must try different-width Cs (vary k)
  • must try different midpoints j
  • Separate these!

98
Idea 2
(the old CKY way)
99
Idea 2
(figure: chart items A, B, C with span indices j, k and head position h, comparing the old CKY combination with the two-step combination)
100
An O(n³) algorithm (with G. Satta)
(figure: step-by-step chart items over the words loves, Mary, the, girl, outdoors)
101
(No Transcript)
102
New examples of dynamic programming in NLP
  • O(n)-time partial parsing by limiting dependency
    length
  • (with Noah A. Smith)

103
Short-Dependency Preference
  • A word's dependents (adjuncts, arguments) tend to fall near it in the string.

104
length of a dependency = surface distance
(figure: example dependencies of lengths 3, 1, 1, 1)
105
50% of English dependencies have length 1, another 20% have length 2, 10% have length 3, ...
(figure: fraction of all dependencies vs. length)
106
Related Ideas
  • Score parses based on what's between a head and child (Collins, 1997; Zeman, 2004; McDonald et al., 2005)
  • Assume short → faster human processing (Church, 1980; Gibson, 1998)
  • Attach low heuristic for PPs (English) (Frazier, 1979; Hobbs and Bear, 1990)
  • Obligatory and optional re-orderings (English) (see paper)

107
Going to Extremes
Longer dependencies are less likely.
What if we eliminate them completely?
108
Hard Constraints
  • Disallow dependencies between words of distance > b ...
  • Risk: best parse contrived, or no parse at all!
  • Solution: allow fragments (partial parsing; Hindle, 1990, inter alia).
  • Why not model the sequence of fragments?

109
Building a Vine SBG Parser
  • Grammar generates sequence of trees from
  • Parser recognizes sequences of trees without
    long dependencies
  • Need to modify training data
  • so the model is consistent
  • with the parser.

110

(figure: a dependency tree from the Penn Treebank for According to some estimates , the rule changes would cut insider filings by more than a third . , with each dependency labeled by its surface length)
(from the Penn Treebank)
111

(figure: the same tree keeping only dependencies of length ≤ 4; b = 4)
(from the Penn Treebank)
112

(figure: the same tree keeping only dependencies of length ≤ 3; b = 3)
(from the Penn Treebank)
113

(figure: the same tree keeping only dependencies of length ≤ 2; b = 2)
(from the Penn Treebank)
114

(figure: the same tree keeping only dependencies of length ≤ 1; b = 1)
(from the Penn Treebank)
115

(figure: the same tree with no dependencies kept; b = 0)
(from the Penn Treebank)
116
Vine Grammar is Regular
  • Even for small b, bunches can grow to arbitrary
    size
  • But arbitrary center embedding is out

117
Vine Grammar is Regular
  • Could compile into an FSA and get O(n) parsing!
  • Problem: what's the grammar constant?

EXPONENTIAL
(figure: the FSA state must remember, e.g., that insider has no parent yet and that cut and would can still take more children)
FSA
According to some estimates , the rule changes would cut insider ...
118
Alternative
  • Instead, we adapt
  • an SBG chart parser
  • which implicitly shares fragments of stack state
  • to the vine case,
  • eliminating unnecessary work.

119
Limiting dependency length
  • Linear-time partial parsing

Finite-state model of root sequence
NP
S
NP
Bounded dependency length within each chunk (but a chunk could be arbitrarily wide if right- or left-branching)
  • Natural-language dependencies tend to be short
  • So even if you don't have enough data to model what the heads are ...
  • ... you might want to keep track of where they are.

120
Limiting dependency length
  • Linear-time partial parsing
  • Don't convert into an FSA!
  • Less structure sharing
  • Explosion of states for different stack
    configurations
  • Hard to get your parse back

Finite-state model of root sequence
NP
S
NP
Bounded dependency length within each chunk (but a chunk could be arbitrarily wide if right- or left-branching)
121
Limiting dependency length
  • Linear-time partial parsing

NP
S
NP
Each piece is at most k words wide. No dependencies between pieces.
Finite-state model of the sequence → linear time! O(k²n)
122
Limiting dependency length
  • Linear-time partial parsing

Each piece is at most k words wide. No dependencies between pieces.
Finite-state model of the sequence → linear time! O(k²n)
123
Quadratic Recognition/Parsing
goal
(figure: chart items for vine parsing)
O(n²b)
only construct trapezoids such that k - i ≤ b
O(nb²)
(versus O(n³) combinations of i, j, k in ordinary parsing)
124

O(nb) vine construction, b = 4
  • According to some , the new changes would cut insider filings by more than a third .
(figure: the vine of parse fragments over this sentence, all of width ≤ 4)
125
Parsing Algorithm
  • Same grammar constant as Eisner and Satta (1999)
  • O(n³) → O(nb²) runtime
  • Includes some overhead (low-order term) for constructing the vine
  • Reality check ... is it worth it?

126
F-measure & runtime of a limited-dependency-length parser (POS seqs)
127
Precision & recall of a limited-dependency-length parser (POS seqs)
128
Results Penn Treebank
evaluation against original ungrafted Treebank
non-punctuation only
(figure: results for b = 1 up to b = 20)
129
Results Chinese Treebank
evaluation against original ungrafted Treebank
non-punctuation only
(figure: results for b = 1 up to b = 20)
130
Results TIGER Corpus
evaluation against original ungrafted Treebank
non-punctuation only
(figure: results for b = 1 up to b = 20)
131
Type-Specific Bounds
  • b can be specific to dependency type
  • e.g., b(V-O) can be longer than b(S-V)
  • b specific to parent, child, direction
  • gradually tighten based on training data

132
  • English: 50% runtime, no loss
  • Chinese: 55% runtime, no loss
  • German: 44% runtime, 2% loss

133
Related Work
  • Nederhof (2000) surveys finite-state approximation of context-free languages.
  • CFG → FSA
  • We limit all dependency lengths (not just center-embedding), and derive weights from the Treebank (not by approximation).
  • Chart parser → reasonable grammar constant.

134
Softer Modeling of Dep. Length
When running the parsing algorithm, just multiply in these probabilities at the appropriate time.
p (DEFICIENT): p(3 | r, a, L) · p(2 | r, b, L) · p(1 | b, c, R)
p: p(1 | r, d, R) · p(1 | d, e, R) · p(1 | e, f, R)
135
Generating with SBGs

  1. Start with left wall
  2. Generate root w₀
  3. Generate left children w₋₁, w₋₂, ..., w₋ℓ from the FSA λ_{w₀}
  4. Generate right children w₁, w₂, ..., w_r from the FSA ρ_{w₀}
  5. Recurse on each w_i for i in -ℓ, ..., -1, 1, ..., r, sampling its subtree a_i (steps 2-4)
  6. Return a₋ℓ ... a₋₁ w₀ a₁ ... a_r

(figure: the tree being generated, with the FSAs λ and ρ attached to each word)
136
Very Simple Model for λ_w and ρ_w
We parse POS tag sequences, not words.
p(child | first, parent, direction)    p(stop | first, parent, direction)
p(child | not first, parent, direction)    p(stop | not first, parent, direction)
(figure: λ_takes and ρ_takes attached to takes in It takes two to ...)
137
Baseline
(table, three languages)
baseline:    test-set recall (%): 73, 61, 77    test-set runtime (items/word): 90, 149, 49
138
Modeling Dependency Length
(table, three languages)
baseline:         test-set recall (%): 73, 61, 77    test-set runtime (items/word): 90, 149, 49
+ length model:   test-set recall (%): 76, 62, 75    test-set runtime (items/word): 67, 103, 31
change (%):       +4.1, +1.6, -2.6                   -26, -31, -37
139
Conclusion
  • Modeling dependency length can
  • cut runtime of simple models by 26-37%
  • with effects ranging from -3% to +4% on recall.
  • (Loss on recall perhaps due to deficient/MLE estimation.)

140
Future Work
apply to state-of-the-art parsing models

better parameter estimation
applications: MT, IE, grammar induction
141
This Talk in a Nutshell
(figure: length of a dependency = surface distance; example lengths 3, 1, 1, 1)
  • Empirical results (English, Chinese, German)
  • Hard constraints cut runtime in half or more with no accuracy loss (English, Chinese), or by 44% with -2.2% accuracy (German).
  • Soft constraints affect accuracy of simple models by -3% to 24% and cut runtime by 25% to 40%.
  • Formal results
  • A hard bound b on dependency length
  • results in a regular language.
  • allows O(nb²) parsing.

142
New examples of dynamic programming in NLP
  • Grammar induction by initially limiting
    dependency length
  • (with Noah A. Smith)

143
Soft bias toward short dependencies
Multiply the probability of parse t by e^{δS}, where S = Σ_{(j,k) ∈ t} |j - k| is its total dependency length (δ < 0 rewards short dependencies), and renormalize:
p(t, x_i) = Z(δ)^{-1} · p_T(t, x_i) · e^{δS}
(figure: δ axis from -8 to +8; MLE baseline at δ = 0; linear structure preferred at the negative extreme)
144
Soft bias toward short dependencies
  • Multiply parse probability by e^{δS}, where S is the total length of all dependencies (δ < 0 rewards short dependencies)
  • Then renormalize probabilities

(figure: δ axis from -8 to +8; MLE baseline at δ = 0; linear structure preferred at the negative extreme)
145
Structural Annealing
(figure: δ axis from -8 to +8; MLE baseline at δ = 0)
Start here: train a model.
Repeat: increase δ and retrain,
until performance stops improving on a small validation dataset.
146
Grammar Induction
Other structural biases can be annealed. We tried annealing on connectivity (# of fragments), and got similar results.
147
A 6/9-Accurate Parse
These errors look like ones made by a supervised parser in 2000!
(figure: the Treebank dependency parse vs. the parse from MLE with a locality bias for the gene can thus prevent a plant from fertilizing itself; the induced parse's errors: verb instead of modal as root, preposition misattachment, misattachment of the adverb thus)
148
Accuracy Improvements
language     random tree   Klein & Manning (2004)   Smith & Eisner (2006)   state-of-the-art, supervised
German       27.5          50.3                     70.0                    82.6 (1)
English      30.3          41.6                     61.8                    90.9 (2)
Bulgarian    30.4          45.6                     58.4                    85.9 (1)
Mandarin     22.6          50.1                     57.2                    84.6 (1)
Turkish      29.8          48.0                     62.4                    69.6 (1)
Portuguese   30.6          42.3                     71.8                    86.5 (1)

(1) CoNLL-X shared task, best system.  (2) McDonald et al., 2005
149
Combining with Contrastive Estimation
  • This generally gives us our best results

150
New examples of dynamic programming in NLP
  • Contrastive estimation for HMM and grammar
    induction
  • Uses lattice parsing
  • (with Noah A. Smith)

151
Contrastive Estimation: Training Log-Linear Models on Unlabeled Data
  • Noah A. Smith and Jason Eisner
  • Department of Computer Science /
  • Center for Language and Speech Processing
  • Johns Hopkins University
  • {nasmith,jason}@cs.jhu.edu

152
Contrastive Estimation: (Efficiently) Training Log-Linear Models (of Sequences) on Unlabeled Data
  • Noah A. Smith and Jason Eisner
  • Department of Computer Science /
  • Center for Language and Speech Processing
  • Johns Hopkins University
  • {nasmith,jason}@cs.jhu.edu

153
Nutshell Version
unannotated text
tractable training
contrastive estimation with lattice neighborhoods
Experiments on unlabeled data: POS tagging: 46% error rate reduction (relative to EM); max-ent features make it possible to survive damage to the tag dictionary. Dependency parsing: 21% attachment error reduction (relative to EM).
max ent features
sequence models
154
Red leaves don't hide blue jays.
155
Maximum Likelihood Estimation(Supervised)
(figure: tags y = JJ NNS MD VB JJ NNS over words x = red leaves don't hide blue jays; maximize p_θ(x, y), whose denominator sums p_θ over all of Σ* with all taggings)
156
Maximum Likelihood Estimation(Unsupervised)
(figure: the same words x = red leaves don't hide blue jays with unknown tags; maximize p_θ(x) = Σ_y p_θ(x, y), again with a denominator summing over all of Σ*)
This is what EM does.
157
Focusing Probability Mass
numerator
denominator
158
Conditional Estimation(Supervised)
(figure: tags y = JJ NNS MD VB JJ NNS over words x = red leaves don't hide blue jays; numerator p_θ(x, y))
A different denominator!
(denominator: p_θ(x, y') summed over all taggings y' of the observed words red leaves don't hide blue jays)
159
Objective Functions
Objective                    Optimization Algorithm                          Numerator        Denominator
MLE                          Count & Normalize                               tags & words     Σ*  (for generative models)
MLE with hidden variables    EM                                              words            Σ*  (for generative models)
Conditional Likelihood       Iterative Scaling                               tags & words     (words) with all possible taggings
Perceptron                   Backprop                                        tags & words     hypothesized tags & words
Contrastive Estimation       generic numerical solvers (here, LMVM L-BFGS)   observed data (here, the raw word sequence, summed over all possible taggings)   ?
160
  • This talk is about denominators ...
  • in the unsupervised case.
  • A good denominator can improve
  • accuracy
  • and
  • tractability.

161
Language Learning (Syntax)
At last! My own language learning device!
Why did he pick that sequence for those words? Why not say "leaves red ..." or "... hide don't ..." or ...
Why didn't he say "birds fly" or "dancing granola" or "the wash dishes" or any other sequence of words?
EM
162
  • What is a syntax model supposed to explain?
  • Each learning hypothesis
  • corresponds to
  • a denominator / neighborhood.

163
The Job of Syntax
  • Explain why each word is necessary.
  • → DEL1WORD neighborhood

164
The Job of Syntax
  • Explain the (local) order of the words.
  • → TRANS1 neighborhood

165
(figure: p_θ of the observed sentence red leaves don't hide blue jays, with unknown tags, divided by the total p_θ of the sentences in its TRANS1 neighborhood)
166
www.dyna.org (shameless self promotion)
(figure: the TRANS1 neighborhood of red leaves don't hide blue jays, i.e., all one-bigram transpositions such as leaves red don't hide blue jays or red leaves don't hide jays blue, each considered with any tagging)
sentences in TRANS1 neighborhood
167
The New Modeling Imperative
A good sentence hints that a set of bad ones is
nearby.
numerator
denominator (neighborhood)
Make the good sentence likely, at the expense
of those bad neighbors.
168
  • This talk is about denominators ...
  • in the unsupervised case.
  • A good denominator can improve
  • accuracy
  • and
  • tractability.

169
Log-Linear Models
score of x, y
partition function
Computing Z is undesirable! It sums over all possible taggings of all possible sentences!
Contrastive Estimation (Unsupervised): sum over a few sentences instead
Conditional Estimation (Supervised): sum over 1 sentence
(a worked form of these objectives follows below)
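Spelling this out in standard log-linear notation (my rendering, not a quotation of the slides): with score exp(θ·f(x,y)),

\[
p_\theta(x,y) \;=\; \frac{\exp\big(\theta\cdot f(x,y)\big)}{Z(\theta)},
\qquad
Z(\theta) \;=\; \sum_{x'\in\Sigma^*}\;\sum_{y'} \exp\big(\theta\cdot f(x',y')\big).
\]

Conditional estimation replaces the sum over x' ∈ Σ* with the single observed sentence x, while contrastive estimation maximizes

\[
\prod_i \frac{\sum_{y} \exp\big(\theta\cdot f(x_i,y)\big)}
             {\sum_{x'\in N(x_i)}\;\sum_{y} \exp\big(\theta\cdot f(x',y)\big)},
\]

where N(x_i) is a small neighborhood of the observed sentence x_i.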
170
A Big Picture Sequence Model Estimation
(diagram: estimation methods arranged by whether they give tractable sums on unannotated data and whether they allow overlapping features)
  • generative, MLE: p(x, y)
  • generative, EM: p(x)
  • log-linear, MLE: p(x, y)
  • log-linear, conditional estimation: p(y | x)
  • log-linear, EM: p(x)
  • log-linear, CE with lattice neighborhoods
171
Contrastive Neighborhoods
  • Guide the learner toward models that do what
    syntax is supposed to do.
  • Lattice representation → efficient algorithms.

There is an art to choosing neighborhood
functions.
172
Neighborhoods
neighborhood        size     lattice arcs   perturbations
DEL1WORD            n+1      O(n)           delete up to 1 word
TRANS1              n        O(n)           transpose any bigram
DELORTRANS1         O(n)     O(n)           DEL1WORD ∪ TRANS1
DEL1SUBSEQUENCE     O(n²)    O(n²)          delete any contiguous subsequence
Σ* (EM)             ∞        -              replace each word with anything
(a TRANS1 construction is sketched below)
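As a concrete illustration (plain Python over token lists, not the lattice encoding used in the paper), the TRANS1 neighborhood of a sentence can be built like this:

    def trans1_neighborhood(words):
        """The observed sentence plus every sentence obtained by transposing one adjacent pair."""
        neighbors = [list(words)]
        for i in range(len(words) - 1):
            swapped = list(words)
            swapped[i], swapped[i + 1] = swapped[i + 1], swapped[i]
            neighbors.append(swapped)
        return neighbors

    for n in trans1_neighborhood("red leaves don't hide blue jays".split()):
        print(" ".join(n))
    # 6 words -> the original plus 5 transposed variants: 6 sentences in all,
    # matching the size-n entry in the table above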
173
The Merialdo (1994) Task
  • Given unlabeled text
  • and a POS dictionary
  • (that tells all possible tags for each word
    type),
  • learn to tag.

A form of supervision.
174
Trigram Tagging Model
JJ
NNS
MD
VB
JJ
NNS
red
leaves
don't
hide
blue
jays
feature set: tag trigrams; tag/word pairs from a POS dictionary
175
(figure: tagging accuracy of contrastive estimation with the LENGTH, TRANS1, DELORTRANS1, DEL1WORD, and DEL1SUBSEQUENCE neighborhoods, versus random, EM (Merialdo 1994), log-linear EM, DA (Smith & Eisner 2004), and supervised HMM and CRF baselines)
  • 96K words
  • full POS dictionary
  • uninformative initializer
  • best of 8 smoothing conditions

176
  • Dictionary includes ...
  • all words
  • words from 1st half of corpus
  • words with count ≥ 2
  • words with count ≥ 3
  • Dictionary excludes
  • OOV words,
  • which can get any tag.

What if we damage the POS dictionary?
  • 96K words
  • 17 coarse POS tags
  • uninformative initializer

(figure: accuracy of EM, random, LENGTH, and DELORTRANS1 under each dictionary condition)
177
Trigram Tagging Model Spelling
JJ
NNS
MD
VB
JJ
NNS
red
leaves
don't
hide
blue
jays
feature set: tag trigrams; tag/word pairs from a POS dictionary; 1- to 3-character suffixes; contains hyphen, digit
178
Log-linear spelling features aided recovery ...
... but only with a smart neighborhood.
(figure: accuracy of EM, random, LENGTH, LENGTH + spelling, DELORTRANS1, and DELORTRANS1 + spelling)
179
  • The model need not be finite-state.

180
Unsupervised Dependency Parsing
Klein Manning (2004)
attachment accuracy
EM
LENGTH
TRANS1
initializer
181
To Sum Up ...
Contrastive Estimation means
picking your own denominator:
for tractability
or for accuracy
(or, as in our case, for both).
Now we can use the task to guide the unsupervised learner
(like discriminative techniques do for supervised learners).
It's a particularly good fit for log-linear models
with max-ent features &
unsupervised sequence models,
all in time for ACL 2006.
182
(No Transcript)