Declarative Specification of NLP Systems

- Jason Eisner

student co-authors on various parts of this work: Eric Goldlust, Noah A. Smith, John Blatz, Roy Tromble

IBM, May 2006

An Anecdote from ACL05

-Michael Jordan

Conclusions to draw from that talk

- Mike & his students are great.
- Graphical models are great. (because they're flexible)
- Gibbs sampling is great. (because it works with nearly any graphical model)
- Matlab is great. (because it frees up Mike and his students to doodle all day and then execute their doodles)

Could NLP be this nice?

- Parts of it already are
  - Language modeling
  - Binary classification (e.g., SVMs)
  - Finite-state transductions
  - Linear-chain graphical models

Toolkits available; you don't have to be an expert.

But other parts aren't: context-free and beyond, machine translation.

Efficient parsers and MT systems are complicated and painful to write.

Could NLP be this nice?

- This talk: A toolkit that's general enough for these cases.
  - (stretches from finite-state to Turing machines)
  - Dyna

Warning

- Lots more beyond this talk
- see the EMNLP'05 and FG'06 papers
- see http://dyna.org
  - (download + documentation)
  - sign up for updates by email
  - wait for the totally revamped next version

The case for Little Languages

- declarative programming
- small is beautiful

Sapir-Whorf hypothesis

- Language shapes thought
- At least, it shapes conversation
- Computer language shapes thought
- At least, it shapes experimental research
- Lots of cute ideas that we never pursue
- Or if we do pursue them, it takes 6-12 months to implement on large-scale data
- Have we turned into a lab science?

Declarative Specifications

- State what is to be done
- (How should the computer do it? Turn that over to a general "solver" that handles the specification language.)
- Hundreds of domain-specific "little languages" out there. Some have sophisticated solvers.

dot (www.graphviz.org)

digraph g {
  graph [rankdir = "LR"];
  node [fontsize = "16" shape = "ellipse"];
  edge [];

  // nodes
  "node0" [label = "<f0> 0x10ba8| <f1>" shape = "record"];
  "node1" [label = "<f0> 0xf7fc4380| <f1> | <f2> |-1" shape = "record"];

  // edges
  "node0":f0 -> "node1":f0 [id = 0];
  "node0":f1 -> "node2":f0 [id = 1];
  "node1":f0 -> "node3":f0 [id = 2];
}

What's the hard part? Making a nice layout! Actually, it's NP-hard.

LilyPond (www.lilypond.org)

Declarative Specs in NLP

- Regular expression (for a FST toolkit)
- Grammar (for a parser)
- Feature set (for a maxent distribution, SVM, etc.)
- Graphical model (DBNs for ASR, IE, etc.)

Claim of this talk: Sometimes it's best to peek under the shiny surface. Declarative methods are still great, but should be layered: we need them one level lower, too.

Declarative Specification of Algorithms

How you build a system (big picture slide)

cool model -> practical equations (PCFG) -> pseudocode (execution order) -> tuned C++ implementation (data structures, etc.)

for width from 2 to n
  for i from 0 to n-width
    k = i+width
    for j from i+1 to k-1

Wait a minute ...

Didn't I just implement something like this last month?

- chart management / indexing
- cache-conscious data structures
- prioritization of partial solutions (best-first, A*)
- parameter management
- inside-outside formulas
- different algorithms for training and decoding
- conjugate gradient, annealing, ...
- parallelization?

I thought computers were supposed to automate drudgery.

How you build a system (big picture slide)

cool model -> practical equations (PCFG) -> pseudocode (execution order) -> tuned C++ implementation (data structures, etc.)

- Dyna language specifies these equations.
- Most programs just need to compute some values from other values. Any order is ok.
- Some programs also need to update the outputs if the inputs change
  - spreadsheets, makefiles, email readers
  - dynamic graph algorithms
  - EM and other iterative optimization
  - leave-one-out training of smoothing params

How you build a system (big picture slide)

cool model -> practical equations (PCFG) -> compilation strategies (we'll come back to this) -> pseudocode (execution order) -> tuned C++ implementation (data structures, etc.)

Writing equations in Dyna

- int a.
- a = b * c.
  - a will be kept up to date if b or c changes.
- b += x. b += y. (equivalent to b = x+y.)
  - b is a sum of two variables. Also kept up to date.
- c += z(1). c += z(2). c += z(3).
- c += z("four"). c += z(foo(bar,5)).
- c += z(N).
  - c is a sum of all nonzero z(...) values. At compile time, we don't know how many!
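The "kept up to date" behavior of `c += z(N).` can be sketched in Python. This is only an illustration, not how Dyna is implemented; the class name `SumOfZ` and the method `set_z` are hypothetical. The point is that a change to one `z(...)` value is propagated as a difference, without re-summing everything.

```python
class SumOfZ:
    """Keeps c equal to the sum of all z(...) values, as in `c += z(N).`"""

    def __init__(self):
        self.z = {}    # sparse map from argument to value; absent means 0
        self.c = 0.0

    def set_z(self, arg, value):
        # Propagate only the change in z(arg) instead of recomputing the sum.
        self.c += value - self.z.get(arg, 0.0)
        self.z[arg] = value


cell = SumOfZ()
cell.set_z(1, 2.0)
cell.set_z("four", 3.0)
cell.set_z(("foo", "bar", 5), 1.5)   # a nested term, encoded here as a tuple
cell.set_z(1, 4.0)                   # revise z(1); c is adjusted by the difference
print(cell.c)
```

The arguments can be integers, strings, or nested terms, and their number is not known in advance, which mirrors the remark that at compile time we don't know how many z values there are.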

More interesting use of patterns

- a = b * c.
  - scalar multiplication
- a(I) = b(I) * c(I).
  - pointwise multiplication
- a += b(I) * c(I). means a = sum over I of b(I)*c(I)
  - dot product; could be sparse
- a(I,K) += b(I,J) * c(J,K). means a(I,K) = sum over J of b(I,J)*c(J,K)
  - matrix multiplication; could be sparse
  - J is free on the right-hand side, so we sum over it
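To make the "could be sparse" remark concrete, here is a minimal Python sketch of `a(I,K) += b(I,J) * c(J,K)` over sparse inputs. The dictionary encoding and the function name `sparse_matmul` are assumptions made for illustration; they are not Dyna's data structures.

```python
from collections import defaultdict


def sparse_matmul(b, c):
    """a(I,K) += b(I,J) * c(J,K), with b and c given as {(row, col): value}."""
    # Index c by its first argument so each b(I,J) meets only matching c(J,K).
    c_by_row = defaultdict(list)
    for (j, k), v in c.items():
        c_by_row[j].append((k, v))

    a = defaultdict(float)
    for (i, j), u in b.items():
        for k, v in c_by_row.get(j, ()):
            a[(i, k)] += u * v     # J is free on the right, so we sum over it
    return dict(a)


b = {(0, 0): 1.0, (0, 2): 2.0, (1, 1): 3.0}
c = {(0, 1): 4.0, (2, 1): 5.0, (1, 0): 6.0}
a = sparse_matmul(b, c)
print(a)
```

Only nonzero entries are stored or visited, so the work is proportional to the number of matching (b, c) pairs, not to the full matrix dimensions.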

Dyna vs. Prolog

- By now you may see what we're up to!
- Prolog has Horn clauses
  - a(I,K) :- b(I,J) , c(J,K).
- Dyna has "Horn equations"
  - a(I,K) += b(I,J) * c(J,K).

Like Prolog:
- Allow nested terms
- Syntactic sugar for lists, etc.
- Turing-complete

Unlike Prolog:
- Charts, not backtracking!
- Compile -> efficient C++ classes
- Integrates with your C++ code

The CKY inside algorithm in Dyna

:- double item = 0.
:- bool length = false.
constit(X,I,J) += word(W,I,J) * rewrite(X,W).
constit(X,I,J) += constit(Y,I,Mid) * constit(Z,Mid,J) * rewrite(X,Y,Z).
goal += constit(s,0,N) if length(N).

using namespace cky;
chart c;
c[rewrite("s","np","vp")] = 0.7;
c[word("Pierre",0,1)] = 1;
c[length(30)] = true;   // 30-word sentence
cin >> c;               // get more axioms from stdin
cout << c[goal];        // print total weight of all parses

Visual debugger: browse the proof forest (ambiguity, shared substructure).

Related algorithms in Dyna?

constit(X,I,J) += word(W,I,J) * rewrite(X,W).
constit(X,I,J) += constit(Y,I,Mid) * constit(Z,Mid,J) * rewrite(X,Y,Z).
goal += constit(s,0,N) if length(N).

- Viterbi parsing?
- Logarithmic domain?
- Lattice parsing?
- Incremental (left-to-right) parsing?
- Log-linear parsing?
- Lexicalized or synchronous parsing?
- Earley's algorithm?
- Binarized CKY?

Related algorithms in Dyna?

- Viterbi parsing? In the CKY program, replace each of the three += with max=.

Related algorithms in Dyna?

- Logarithmic domain? In the CKY program, replace each of the three += with log+= (or keep max= for Viterbi), and each * with +.
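The effect of swapping the aggregation and product operators can be sketched in Python by passing them in as parameters. This is an illustrative sketch, not Dyna's mechanism; `chart_parse`, `logaddexp`, and the toy grammar are hypothetical names and data.

```python
import math


def chart_parse(words, unary, binary, plus, times, zero, root="s"):
    """CKY where `plus` aggregates alternatives and `times` combines parts."""
    n = len(words)
    chart = {}

    def get(x, i, j):
        return chart.get((x, i, j), zero)

    def add(x, i, j, v):
        chart[(x, i, j)] = plus(get(x, i, j), v)

    for i, w in enumerate(words):
        for (x, word), p in unary.items():
            if word == w:
                add(x, i, i + 1, p)
    for width in range(2, n + 1):
        for i in range(n - width + 1):
            k = i + width
            for j in range(i + 1, k):
                for (x, y, z), p in binary.items():
                    add(x, i, k, times(times(get(y, i, j), get(z, j, k)), p))
    return get(root, 0, n)


def logaddexp(a, b):
    """log(exp(a) + exp(b)), treating -inf as log(0)."""
    if a == -math.inf:
        return b
    if b == -math.inf:
        return a
    m = max(a, b)
    return m + math.log(math.exp(a - m) + math.exp(b - m))


words = ["a", "a", "a"]
unary = {("s", "a"): 0.5}
binary = {("s", "s", "s"): 1.0}
mul = lambda a, b: a * b
add_ = lambda a, b: a + b

total = chart_parse(words, unary, binary, add_, mul, 0.0)        # += and *
best = chart_parse(words, unary, binary, max, mul, 0.0)          # max= and *
log_unary = {k: math.log(v) for k, v in unary.items()}
log_binary = {k: math.log(v) for k, v in binary.items()}
log_total = chart_parse(words, log_unary, log_binary,
                        logaddexp, add_, -math.inf)              # log+= and +
print(total, best, math.exp(log_total))
```

The same loop computes the total weight, the best-parse weight, or the log of the total, depending only on which operators are supplied.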

Related algorithms in Dyna?

- Lattice parsing? Index words by lattice states instead of string positions: in c[word("Pierre", 0, 1)] = 1, replace 0 and 1 with state(5) and state(9), and replace the weight 1 with the arc weight 0.2.

[Figure: word lattice with states 5, 8, 9 and arcs labeled Pierre/0.2, P/0.5, air/0.3.]

Related algorithms in Dyna?

- Incremental (left-to-right) parsing? Just add words one at a time to the chart. Check at any time what can be derived from words so far. Similarly, dynamic grammars.

Related algorithms in Dyna?

- Log-linear parsing? Again, no change to the Dyna program.

Related algorithms in Dyna?

- Lexicalized or synchronous parsing? Basically, just add extra arguments to the terms of the CKY program.

Related algorithms in Dyna?

- Earley's algorithm?

Earley's algorithm in Dyna

Earley's algorithm can be derived from the CKY program by the magic templates transformation (as noted by Minnen 1996).

Program transformations

cool model -> practical equations (PCFG) -> pseudocode (execution order) -> tuned C++ implementation (data structures, etc.)

Blatz & Eisner (FG 2006): Lots of equivalent ways to write a system of equations! Transforming from one to another may improve efficiency. Many parsing "tricks" can be generalized into automatic transformations that help other programs, too!

Related algorithms in Dyna?

- Binarized CKY?

Rule binarization

constit(X,I,J) += constit(Y,I,Mid) * constit(Z,Mid,J) * rewrite(X,Y,Z).

[Figure: X spanning I to J, built from Y (I to Mid) and Z (Mid to J).]

Related ideas: graphical models, constraint programming, multi-way database join.
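One way to read rule binarization is as a rearrangement of the three-way product: first combine constit(Z,Mid,J) with rewrite(X,Y,Z), summing out Z, and only then combine with constit(Y,I,Mid). The Python sketch below checks on toy data that the two orders give the same values. The intermediate table `temp` and all item values are hypothetical, and this is an assumption about the intended transformation, not the transcript's own code.

```python
from collections import defaultdict
import math


def unfolded(left, right, rewrite):
    """Three-way join: left[(Y,I,Mid)] * right[(Z,Mid,J)] * rewrite[(X,Y,Z)]."""
    a = defaultdict(float)
    for (y, i, mid), u in left.items():
        for (z, mid2, j), v in right.items():
            if mid2 != mid:
                continue
            for (x, y2, z2), p in rewrite.items():
                if y2 == y and z2 == z:
                    a[(x, i, j)] += u * v * p
    return dict(a)


def folded(left, right, rewrite):
    """Same values via an intermediate item that does not mention I."""
    temp = defaultdict(float)                  # temp[(X, Y, Mid, J)], Z summed out
    for (z, mid, j), v in right.items():
        for (x, y, z2), p in rewrite.items():
            if z2 == z:
                temp[(x, y, mid, j)] += v * p
    a = defaultdict(float)
    for (y, i, mid), u in left.items():
        for (x, y2, mid2, j), t in temp.items():
            if y2 == y and mid2 == mid:
                a[(x, i, j)] += u * t
    return dict(a)


left = {("np", 0, 1): 0.5, ("np", 0, 2): 0.25, ("det", 1, 2): 1.0}
right = {("vp", 1, 3): 0.4, ("vp", 2, 3): 0.2, ("n", 2, 3): 0.5}
rewrite = {("s", "np", "vp"): 0.7, ("np", "det", "n"): 0.9, ("s", "np", "n"): 0.1}

u_result = unfolded(left, right, rewrite)
f_result = folded(left, right, rewrite)
print(u_result)
```

Because `temp` is independent of I, it is computed once per (Mid, J) and reused for every left item, which is the same kind of saving as variable elimination in graphical models or join ordering in databases.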

More program transformations

- Examples that add new semantics
  - Compute gradient (e.g., derive outside algorithm from inside)
  - Compute upper bounds for A* (e.g., Klein & Manning ACL'03)
  - Coarse-to-fine (e.g., Johnson & Charniak NAACL'06)
- Examples that preserve semantics
  - On-demand computation, by analogy with Earley's algorithm
    - On-the-fly composition of FSTs
    - Left-corner filter for parsing
  - Program specialization as unfolding, e.g., compile out the grammar
  - Rearranging computations, by analogy with categorial grammar
    - Folding reinterpreted as slashed categories
    - Speculative computation using slashed categories
      - abstract away repeated computation to do it once only, by analogy with unary rule closure or epsilon-closure
      - derives Eisner & Satta ACL'99 O(n^3) bilexical parser

How you build a system (big picture slide)

cool model -> practical equations (PCFG) -> pseudocode (execution order) -> tuned C++ implementation (data structures, etc.)

To get from the equations to an execution order, use a general method: propagate updates from right-to-left through the equations. a.k.a. "agenda algorithm", "forward chaining", "bottom-up inference", "semi-naïve bottom-up".

Bottom-up inference

Rules of program:

s(I,K) += np(I,J) * vp(J,K)
pp(I,K) += prep(I,J) * np(J,K)

Agenda of pending updates: pop np(3,5) += 0.3. We updated np(3,5); what else must therefore change?

Chart of derived items with current values: np(3,5) = 0.1 + 0.3 (if np(3,5) hadn't been in the chart already, we would have added it), vp(5,9) = 0.5, vp(5,7) = 0.7, prep(2,3) = 1.0.

- Query vp(5,K)? matches vp(5,9) = 0.5 and vp(5,7) = 0.7, so push s(3,9) += 0.15 and s(3,7) += 0.21. No more matches to this query.
- Query prep(I,3)? matches prep(2,3) = 1.0, so push pp(2,5) += 0.3.

How you build a system (big picture slide)

cool model -> practical equations (PCFG) -> pseudocode (execution order) -> tuned C++ implementation (data structures, etc.)

What's going on under the hood?

Compiler provides

[Figure: the same agenda / rules / chart diagram, with rule s(I,K) += np(I,J) * vp(J,K) and update np(3,5) += 0.3.]

- agenda of pending updates
- rules of program
- chart of derived items with current values
- copy, compare, & hash terms fast, via integerization (interning)
- efficient storage of terms (use native C++ types, "symbiotic" storage, garbage collection, serialization, ...)

Beware double-counting!

Rules of program:

n(I,K) += n(I,J) * n(J,K)

[Figure: agenda of pending updates and chart of derived items, showing n(5,5) with the values 0.2 and 0.3 and the query n(5,5)?.]

An epsilon constituent such as n(5,5) matches its own query: it is combining with itself to make another copy of itself.

Parameter training

- Maximize some objective function.
- Use Dyna to compute the function (objective function as a theorem's value; e.g., inside algorithm computes likelihood of the sentence).
- Then how do you differentiate it?
  - for gradient ascent, conjugate gradient, etc.
  - gradient also tells us the expected counts for EM!
- Two approaches
  - Program transformation: automatically derive the "outside" formulas.
  - Back-propagation: run the agenda algorithm "backwards."
    - works even with pruning, early stopping, etc.

What can Dyna do beyond CKY?

Some examples from my lab

- Parsing using
  - factored dependency models (Dreyer, Smith & Smith CONLL'06)
  - with annealed risk minimization (Smith and Eisner EMNLP'06)
  - constraints on dependency length (Eisner & Smith IWPT'05)
  - unsupervised learning of deep transformations (see Eisner EMNLP'02)
  - lexicalized algorithms (see Eisner & Satta ACL'99, etc.)
- Grammar induction using
  - partial supervision (Dreyer & Eisner EMNLP'06)
  - structural annealing (Smith & Eisner ACL'06)
  - contrastive estimation (Smith & Eisner GIA'05)
  - deterministic annealing (Smith & Eisner ACL'04)
- Machine translation using
  - Very large neighborhood search of permutations (Eisner & Tromble, NAACL-W'06)
  - Loosely syntax-based MT (Smith & Eisner in prep.)
  - Synchronous cross-lingual parsing (Smith & Smith EMNLP'04)
- Finite-state methods for morphology, phonology, IE, even syntax
  - Unsupervised cognate discovery (Schafer & Yarowsky '05, '06)
  - Unsupervised log-linear models via contrastive estimation (Smith & Eisner ACL'05)
  - Context-based morph. disambiguation (Smith, Smith & Tromble EMNLP'05)
  - (see also Eisner ACL'03)

Easy to try stuff out! Programs are very short & easy to change!

Can it express everything in NLP?

- Remember, it integrates tightly with C++, so you only have to use it where it's helpful, and write the rest in C++. Small is beautiful.
- We're currently extending the class of allowed formulas "beyond the semiring"
  - cf. Goodman (1999)
  - will be able to express smoothing, neural nets, etc.
- Of course, it is Turing complete

Smoothing in Dyna

- mle_prob(X,Y,Z) = count(X,Y,Z)/count(X,Y). (X,Y is the context)
- smoothed_prob(X,Y,Z) = lambda*mle_prob(X,Y,Z) + (1-lambda)*mle_prob(Y,Z).
  - for arbitrary n-grams, can use lists
- count_count(N) += 1 whenever N is count(Anything).
  - updates automatically during leave-one-out jackknifing

Information retrieval in Dyna

- score(Doc) += tf(Doc,Word) * tf(Query,Word) * idf(Word).
- idf(Word) = 1/log(df(Word)).
- df(Word) += 1 whenever tf(Doc,Word) > 0.
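The three retrieval equations translate almost line for line into Python. In this sketch the function name `scores` and the toy documents are hypothetical, and one assumption goes beyond the slide: words with df = 1 are skipped, because 1/log(1) is undefined.

```python
import math
from collections import Counter


def scores(docs, query):
    """docs: {name: list of words}; query: list of words."""
    tf = {name: Counter(words) for name, words in docs.items()}
    q_tf = Counter(query)
    # df(Word) += 1 whenever tf(Doc,Word) > 0.
    df = Counter(w for counts in tf.values() for w in counts)
    out = {}
    for name, counts in tf.items():
        total = 0.0
        for w, q in q_tf.items():
            # Assumption: skip df == 1, where idf = 1/log(1) is undefined.
            if counts[w] > 0 and df[w] > 1:
                total += counts[w] * q / math.log(df[w])
        out[name] = total
    return out


docs = {
    "d1": ["parse", "parse", "parse", "tree"],
    "d2": ["parse", "chart"],
    "d3": ["tree", "chart", "chart"],
}
s = scores(docs, ["parse", "chart"])
print(s)
```

Each query word here occurs in two documents, so every term is weighted by 1/log(2); d1 wins because it contains "parse" three times.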

Neural networks in Dyna

- out(Node) = sigmoid(in(Node)).
- in(Node) += input(Node).
- in(Node) += weight(Node,Kid) * out(Kid).
- error += (out(Node)-target(Node))**2 if ?target(Node).
- Recurrent neural net is ok
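A Python sketch of the same equations for the acyclic case follows. The slide notes that recurrent nets are also fine in Dyna; this sketch does not cover them, since it assumes the nodes are supplied in an order where every kid precedes its parent. The names `forward` and `squared_error` are hypothetical.

```python
import math


def forward(inputs, weights, order):
    """Evaluate out(Node) for an acyclic net.

    inputs:  {node: external input}         plays the role of input(Node)
    weights: {(node, kid): weight}          plays the role of weight(Node,Kid)
    order:   nodes listed so that every kid precedes its parent (assumption)
    """
    out = {}
    for node in order:
        total = inputs.get(node, 0.0)
        total += sum(w * out[kid]
                     for (parent, kid), w in weights.items() if parent == node)
        out[node] = 1.0 / (1.0 + math.exp(-total))   # sigmoid(in(Node))
    return out


def squared_error(out, target):
    """error += (out(Node) - target(Node))**2 for nodes that have a target."""
    return sum((out[node] - t) ** 2 for node, t in target.items())


out = forward({"x": 0.0}, {("y", "x"): 2.0}, ["x", "y"])
err = squared_error(out, {"y": 1.0})
print(out, err)
```

The explicit `order` argument is exactly the kind of execution-order detail that the declarative program leaves unspecified.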

Game-tree analysis in Dyna

- goal = best(Board) if start(Board).
- best(Board) max= stop(player1, Board).
- best(Board) max= move(player1, Board, NewBoard) + worst(NewBoard).
- worst(Board) min= stop(player2, Board).
- worst(Board) min= move(player2, Board, NewBoard) + best(NewBoard).
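These mutually recursive equations are a minimax recurrence, which can be sketched in Python with memoization. The function name `solve`, the dictionary encoding of `stop` and `move`, and the toy game are assumptions for illustration; the game graph is assumed to be acyclic and boards are assumed hashable.

```python
import math
from functools import lru_cache


def solve(start, stop, move):
    """stop: {(player, board): payoff}
    move: {(player, board): [(new_board, reward), ...]}
    """
    @lru_cache(maxsize=None)
    def best(board):                       # player1 maximizes
        options = [stop[("player1", board)]] if ("player1", board) in stop else []
        options += [reward + worst(new)
                    for new, reward in move.get(("player1", board), [])]
        return max(options, default=-math.inf)

    @lru_cache(maxsize=None)
    def worst(board):                      # player2 minimizes
        options = [stop[("player2", board)]] if ("player2", board) in stop else []
        options += [reward + best(new)
                    for new, reward in move.get(("player2", board), [])]
        return min(options, default=math.inf)

    return best(start)


move = {
    ("player1", "a"): [("b", 0.0), ("c", 1.0)],
    ("player2", "b"): [("d", 0.0), ("e", 0.0)],
}
stop = {("player2", "c"): 2.0, ("player1", "d"): 5.0, ("player1", "e"): 9.0}
value = solve("a", stop, move)
print(value)
```

Moving to "b" lets player2 pick the smaller of 5 and 9, which still beats the value 1 + 2 obtained by moving to "c".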

Weighted FST composition in Dyna (epsilon-free case)

- :- bool item = false.
- start(A o B, Q x R) |= start(A, Q) & start(B, R).
- stop(A o B, Q x R) |= stop(A, Q) & stop(B, R).
- arc(A o B, Q1 x R1, Q2 x R2, In, Out) |= arc(A, Q1, Q2, In, Match) & arc(B, R1, R2, Match, Out).
- Inefficient? How do we fix this?

Constraint programming (arc consistency)

- :- bool indomain = false.
- :- bool consistent = true.
- variable(Var) |= indomain(Var:Val).
- possible(Var:Val) &= indomain(Var:Val).
- possible(Var:Val) &= support(Var:Val, Var2) whenever variable(Var2).
- support(Var:Val, Var2) |= possible(Var2:Val2) & consistent(Var:Val, Var2:Val2).

Edit distance in Dyna version 1

- letter1("c",0,1). letter1("l",1,2). letter1("a",2,3). ... (clara)
- letter2("c",0,1). letter2("a",1,2). letter2("c",2,3). ... (caca)
- end1(5). end2(4). delcost = 1. inscost = 1. substcost = 1.
- align(0,0) = 0.
- align(I1,J2) min= align(I1,I2) + letter2(L2,I2,J2) + inscost(L2).
- align(J1,I2) min= align(I1,I2) + letter1(L1,I1,J1) + delcost(L1).
- align(J1,J2) min= align(I1,I2) + letter1(L1,I1,J1) + letter2(L2,I2,J2) + subcost(L1,L2).
- align(J1,J2) min= align(I1,I2) + letter1(L,I1,J1) + letter2(L,I2,J2).
- goal = align(N1,N2) whenever end1(N1) & end2(N2).
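The align equations are the standard edit-distance recurrence. A hand-written Python sketch, assuming uniform unit costs as on the slide (the function name `edit_distance` and its keyword arguments are hypothetical), looks like this:

```python
def edit_distance(xs, ys, inscost=1, delcost=1, substcost=1):
    """align[i][j] = cheapest alignment of xs[:i] with ys[:j]."""
    align = [[0] * (len(ys) + 1) for _ in range(len(xs) + 1)]
    for i in range(len(xs) + 1):
        for j in range(len(ys) + 1):
            if i == 0 and j == 0:
                continue                      # align(0,0) = 0
            candidates = []
            if j > 0:                         # insert ys[j-1]
                candidates.append(align[i][j - 1] + inscost)
            if i > 0:                         # delete xs[i-1]
                candidates.append(align[i - 1][j] + delcost)
            if i > 0 and j > 0:               # substitute, or match for free
                same = xs[i - 1] == ys[j - 1]
                candidates.append(align[i - 1][j - 1] + (0 if same else substcost))
            align[i][j] = min(candidates)
    return align[len(xs)][len(ys)]


distance = edit_distance("clara", "caca")
print(distance)
```

For "clara" and "caca", one cheapest alignment deletes "l" and substitutes "r" with "c". The table-filling order is fixed by hand here, whereas the Dyna rules only state which values depend on which.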

Edit distance in Dyna version 2

- input([c, l, a, r, a], [c, a, c, a]) = 0.
- delcost = 1. inscost = 1. substcost = 1.
- alignupto(Xs,Ys) min= input(Xs,Ys).
- alignupto(Xs,Ys) min= alignupto([X|Xs],Ys) + delcost.
- alignupto(Xs,Ys) min= alignupto(Xs,[Y|Ys]) + inscost.
- alignupto(Xs,Ys) min= alignupto([X|Xs],[Y|Ys]) + substcost.
- alignupto(Xs,Ys) min= alignupto([A|Xs],[A|Ys]).
- goal min= alignupto([], []).

How about different costs for different letters?

Edit distance in Dyna version 2

With different costs for different letters, the costs take the letters as arguments:

- input([c, l, a, r, a], [c, a, c, a]) = 0.
- alignupto(Xs,Ys) min= input(Xs,Ys).
- alignupto(Xs,Ys) min= alignupto([X|Xs],Ys) + delcost(X).
- alignupto(Xs,Ys) min= alignupto(Xs,[Y|Ys]) + inscost(Y).
- alignupto(Xs,Ys) min= alignupto([X|Xs],[Y|Ys]) + substcost(X,Y).
- alignupto(Xs,Ys) min= alignupto([L|Xs],[L|Ys]).
- goal min= alignupto([], []).

Is it fast enough?

(sort of)

- Asymptotically efficient
- 4 times slower than Mark Johnson's inside-outside
- 4-11 times slower than Klein & Manning's Viterbi parser

Are you going to make it faster?

(yup!)

- Currently rewriting the term classes to match hand-tuned code
- Will support "mix-and-match" implementation strategies
  - store X in an array
  - store Y in a hash
  - don't store Z (compute on demand)
- Eventually, choose strategies automatically by execution profiling

Synopsis: your idea -> experimental results fast!

- Dyna is a language for computation (no I/O).
- Especially good for dynamic programming.
- It tries to encapsulate the black art of NLP.
- Much prior work in this vein
  - Deductive parsing schemata (preferably weighted)
    - Goodman, Nederhof, Pereira, Warren, Shieber, Schabes, Sikkel
  - Deductive databases (preferably with aggregation)
    - Ramakrishnan, Zukowski, Freitag, Specht, Ross, Sagiv, ...
  - Probabilistic programming languages (implemented)
    - Zhao, Sato, Pfeffer (also efficient Prologish languages)

Dyna contributors!

- Jason Eisner
- Eric Goldlust, Eric Northup, Johnny Graettinger (compiler backend)
- Noah A. Smith (parameter training)
- Markus Dreyer, David Smith (compiler frontend)
- Mike Kornbluh, George Shafer, Gordon Woodhull, Constantinos Michael, Ray Buse (visual debugger)
- John Blatz (program transformations)
- Asheesh Laroia (web services)

New examples of dynamic programming in NLP

- Parameterized finite-state machines

Parameterized FSMs

- An FSM whose arc probabilities depend on parameters: they are formulas.

Expert first: Construct the FSM (topology & parameterization). Automatic takes over: Given training data, find parameter values that optimize arc probs.

Parameterized FSMs

Knight & Graehl 1997 - transliteration

Would like to get some of that expert knowledge in here. Use probabilistic regexps like (a*.7 b) +.5 (ab*.6). If the probabilities are variables, as in (a*x b) +y (ab*z), then arc weights of the compiled machine are nasty formulas. (Especially after minimization!)

Finite-State Operations

- Projection GIVES YOU marginal distribution

domain( p(x,y) )

Finite-State Operations

- Probabilistic union GIVES YOU mixture model

p(x) +0.3 q(x)

p(x) +λ q(x): learn the mixture parameter λ!

Finite-State Operations

- Composition GIVES YOU chain rule

p(x|y) o p(y|z)

- The most popular statistical FSM operation
- Cross-product construction

Finite-State Operations

- Concatenation, probabilistic closure HANDLE unsegmented text

[Figure: p(x) concatenated with q(x); probabilistic closure of p(x) with probability 0.3.]

- Just glue together machines for the different segments, and let them figure out how to align with the text

Finite-State Operations

- Directed replacement MODELS noise or postprocessing

[Figure: p(x,y) composed (o) with a transducer that models the noise or postprocessing.]

- Resulting machine compensates for noise or postprocessing

Finite-State Operations

- Intersection GIVES YOU product models
  - e.g., exponential / maxent, perceptron, Naïve Bayes, ...
  - Need a normalization op too: computes Σ_x f(x), the "pathsum" or "partition function"

p(x) & q(x)

- Cross-product construction (like composition)

Finite-State Operations

- Conditionalization (new operation)

condit( p(x,y) )

- Resulting machine can be composed with other distributions: p(y | x) * q(x)

New examples of dynamic programming in NLP

- Parameterized infinite-state machines

Universal grammar as a parameterized FSA over an

infinite state space

New examples of dynamic programming in NLP

- More abuses of finite-state machines

Huge-alphabet FSAs for OT phonology

etc.

Gen proposes all candidates that include this

input.

[Figure: Gen maps the input's underlying tiers to candidates that pair them with surface tiers; tiers carry C, V, voi, and velar constituents.]

Huge-alphabet FSAs for OT phonology

Encode this candidate as a string: at each moment, need to describe what's going on on many tiers.

[Figure: one candidate with its tiers (C, V, voi, velar) sliced into moments.]

Directional Best Paths construction

- Keep best output string for each input string
- Yields a new transducer (size ?? 3n)

For input abc: abc, axc. For input abd: axd. Must allow red arc just if next input is d.

Minimization of semiring-weighted FSAs

- New definition of λ for pushing:
  - λ(q) = weight of the shortest path from q, breaking ties alphabetically on input symbols
- Computation is simple, well-defined, independent of (K, ⊗)
- Breadth-first search back from final states

Compute λ(q) in O(1) time as soon as we visit q. Whole alg. is linear.

[Figure: automaton with arcs labeled a, b, c, d; a state q at distance 2 from the final states, with λ(q) = k ⊗ λ(r).]

Faster than finding min-weight path à la Mohri.

New examples of dynamic programming in NLP

- Tree-to-tree alignment

Synchronous Tree Substitution Grammar

Two training trees, showing a free translation from French to English. A possible alignment is shown in orange. Alignment shows how trees are generated synchronously from "little trees" ...

"beaucoup d'enfants donnent un baiser à Sam" -> "kids kiss Sam quite often"

[Figure: the two aligned trees. French nodes: donnent (give), baiser (kiss), un (a), à (to), Sam, beaucoup (lots), d' (of), enfants (kids). English nodes: kiss, Sam, kids, quite, often. NP marks substitution sites.]

New examples of dynamic programming in NLP

- Bilexical parsing in O(n^3)
  - (with Giorgio Satta)

Lexicalized CKY

[Figure: lexicalized parse over the words loves, Mary, the, girl, outdoors.]

Lexicalized CKY is O(n^5) not O(n^3)

... advocate visiting relatives
... hug visiting relatives

[Figure: constituent B over (i, j) combining with constituent C over (j+1, k); O(n^3) combinations of i, j, k.]

Idea 1

- Combine B with what C?
  - must try different-width C's (vary k)
  - must try differently-headed C's (vary h)
- Separate these!

[Figure: the new combination step compared with "the old CKY way".]

Idea 2

- Some grammars allow
- Combine what B and C?
  - must try different-width C's (vary k)
  - must try different midpoints j
- Separate these!

[Figure: items A, B, C with heads h and endpoints j, k, compared with "the old CKY way".]

An O(n^3) algorithm (with G. Satta)

[Figure: the O(n^3) derivation over the words loves, Mary, the, girl, outdoors.]

New examples of dynamic programming in NLP

- O(n)-time partial parsing by limiting dependency length
  - (with Noah A. Smith)

Short-Dependency Preference

- A word's dependents (adjuncts, arguments) tend to fall near it in the string.

length of a dependency ≈ surface distance

[Figure: example dependencies labeled with lengths 3, 1, 1, 1.]

50% of English dependencies have length 1, another 20% have length 2, 10% have length 3 ...

[Plot: fraction of all dependencies vs. length.]

Related Ideas

- Score parses based on what's between a head and child
  - (Collins, 1997; Zeman, 2004; McDonald et al., 2005)
- Assume short -> faster human processing
  - (Church, 1980; Gibson, 1998)
- "Attach low" heuristic for PPs (English)
  - (Frazier, 1979; Hobbs and Bear, 1990)
- Obligatory and optional re-orderings (English)
  - (see paper)

Going to Extremes

Longer dependencies are less likely.

What if we eliminate them completely?

Hard Constraints

- Disallow dependencies between words of distance > b ...
- Risk: best parse contrived, or no parse at all!
- Solution: allow fragments (partial parsing; Hindle, 1990 inter alia).
- Why not model the sequence of fragments?

Building a Vine SBG Parser

- Grammar generates sequence of trees from
- Parser recognizes sequences of trees without long dependencies
- Need to modify training data so the model is consistent with the parser.

[Figure sequence (from the Penn Treebank): dependency parse of "According to some estimates , the rule changes would cut insider filings by more than a third .", with each dependency labeled by its length. Successive frames drop every dependency longer than b, for b = 4, 3, 2, 1, 0, so the tree breaks into more and more fragments.]

Vine Grammar is Regular

- Even for small b, "bunches" can grow to arbitrary size
- But arbitrary center embedding is out
- Could compile into an FSA and get O(n) parsing!
- Problem: what's the grammar constant? EXPONENTIAL

FSA state after reading "According to some estimates , the rule changes would cut insider ...":

- insider has no parent
- cut and would can have more children
- can have more children

Alternative

- Instead, we adapt an SBG chart parser, which implicitly shares fragments of stack state, to the vine case, eliminating unnecessary work.

Limiting dependency length

- Linear-time partial parsing

[Figure: finite-state model of root sequence (NP, S, NP). Bounded dependency length within each chunk (but chunk could be arbitrarily wide: right- or left-branching).]

- Natural-language dependencies tend to be short
- So even if you don't have enough data to model what the heads are
- you might want to keep track of where they are.

Limiting dependency length

- Linear-time partial parsing
- Don't convert into an FSA!
  - Less structure sharing
  - Explosion of states for different stack configurations
  - Hard to get your parse back

Each piece is at most k words wide. No dependencies between pieces. Finite-state model of sequence -> Linear time! O(k^2 n)

Quadratic Recognition/Parsing

[Figure: chart items over positions i, j, k combining toward goal.]

- only construct trapezoids such that k - i <= b
- combinations of i, j, k: O(n^3) without the bound; O(n^2 b) and O(n b^2) with it
- O(nb) vine construction

Example with b = 4: "According to some , the new changes would cut insider filings by more than a third ." (all width 4)

Parsing Algorithm

- Same grammar constant as Eisner and Satta (1999)
- O(n^3) -> O(nb^2) runtime
- Includes some overhead (low-order term) for constructing the vine
- Reality check ... is it worth it?

[Plot: F-measure & runtime of a limited-dependency-length parser (POS seqs).]

[Plot: Precision & recall of a limited-dependency-length parser (POS seqs).]

[Plots: Results for the Penn Treebank, the Chinese Treebank, and the TIGER Corpus, each from b = 20 down to b = 1; evaluation against original ungrafted Treebank, non-punctuation only.]

Type-Specific Bounds

- b can be specific to dependency type
- e.g., b(V-O) can be longer than b(S-V)
- b specific to parent, child, direction
- gradually tighten based on training data

- English: 50% runtime, no loss
- Chinese: 55% runtime, no loss
- German: 44% runtime, 2% loss
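One simple way to obtain a bound specific to parent, child, and direction is to take the longest dependency of each type observed in training data. The sketch below does only that. The slides do not say how the bounds were tightened, so the max-observed-length rule, the treebank format (tag list plus 1-based head list, 0 for root), and the toy sentences are all assumptions for illustration.

```python
from collections import defaultdict


def type_specific_bounds(treebank):
    """Map (parent_tag, child_tag, direction) to the longest observed length.

    treebank: iterable of (tags, heads) pairs, where heads[i] is the
    1-based position of the parent of word i+1, or 0 for the root.
    direction is "L" if the child is left of its parent, else "R".
    """
    bounds = defaultdict(int)
    for tags, heads in treebank:
        for child, head in enumerate(heads, start=1):
            if head == 0:
                continue  # root attachment: no bound recorded
            direction = "L" if child < head else "R"
            key = (tags[head - 1], tags[child - 1], direction)
            bounds[key] = max(bounds[key], abs(head - child))
    return dict(bounds)


# Hypothetical two-sentence training set over POS tags.
toy_treebank = [
    (["N", "V", "D", "N"], [2, 0, 4, 2]),
    (["N", "V", "N"], [2, 0, 2]),
]
toy_bounds = type_specific_bounds(toy_treebank)
```

On this toy data the verb-object style bound (V, N, "R") comes out longer than the subject-verb style bound (V, N, "L"), mirroring the b(V-O) versus b(S-V) remark above.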

Related Work

- Nederhof (2000) surveys finite-state approximation of context-free languages.
- CFG → FSA
- We limit all dependency lengths (not just center-embedding), and derive weights from the Treebank (not by approximation).
- Chart parser → reasonable grammar constant.

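Softer Modeling of Dep. Length

When running the parsing algorithm, just multiply in these probabilities at the appropriate time.

[Figure: dependency tree whose arcs are annotated with length probabilities p(3 | r, a, L), p(2 | r, b, L), p(1 | b, c, R), p(1 | r, d, R), p(1 | d, e, R), p(1 | e, f, R). The resulting model is DEFICIENT.]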

Generating with SBGs

1. Start with left wall
2. Generate root w_0
3. Generate left children w_-1, w_-2, ..., w_-l from the left-child FSA of w_0
4. Generate right children w_1, w_2, ..., w_r from the right-child FSA of w_0
5. Recurse on each w_i for i in {-l, ..., -1, 1, ..., r}, sampling a_i (steps 2-4)
6. Return a_-l ... a_-1 w_0 a_1 ... a_r

[Figure: root w_0 with left children w_-1, w_-2, ..., w_-l and right children w_1, w_2, ..., w_r, each child expanded by its own left and right FSAs.]
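The generative story above can be sketched as a short recursive sampler. This is an illustrative sketch only: each word's left and right FSA is replaced by an assumed two-state stop/continue automaton keyed on (parent, direction, first-child-or-not), the root is chosen by the caller, and the names `sample_children`, `generate`, and `toy_model` are hypothetical.

```python
import random


def sample_children(parent, direction, model, rng):
    """Sample one child sequence, closest child first.

    model[(parent, direction, first)] is a dict mapping each possible
    child (or the marker "STOP") to its probability (assumed format).
    """
    children = []
    first = True
    while True:
        dist = model[(parent, direction, first)]
        outcomes = list(dist)
        weights = [dist[o] for o in outcomes]
        choice = rng.choices(outcomes, weights=weights)[0]
        if choice == "STOP":
            return children
        children.append(choice)
        first = False


def generate(word, model, rng):
    """Return the yield a_-l ... a_-1 w_0 a_1 ... a_r rooted at word."""
    left = sample_children(word, "L", model, rng)
    right = sample_children(word, "R", model, rng)
    output = []
    for child in reversed(left):  # farthest left child comes first
        output.extend(generate(child, model, rng))
    output.append(word)
    for child in right:
        output.extend(generate(child, model, rng))
    return output


# Deterministic toy model over POS tags, so the output is predictable.
toy_model = {
    ("V", "L", True): {"N": 1.0}, ("V", "L", False): {"STOP": 1.0},
    ("V", "R", True): {"N": 1.0}, ("V", "R", False): {"STOP": 1.0},
    ("N", "L", True): {"D": 1.0}, ("N", "L", False): {"STOP": 1.0},
    ("N", "R", True): {"STOP": 1.0},
    ("D", "L", True): {"STOP": 1.0},
    ("D", "R", True): {"STOP": 1.0},
}
sampled = generate("V", toy_model, random.Random(0))
```

Because every distribution in the toy model puts all its mass on one outcome, sampling from root V always yields the tag sequence D N V D N. A real model would need stop probabilities high enough for the recursion to terminate.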

Very Simple Model for the Left and Right FSAs of Each Word w

We parse POS tag sequences, not words.

- p(child | first, parent, direction)
- p(stop | first, parent, direction)
- p(child | not first, parent, direction)
- p(stop | not first, parent, direction)

[Figure: the left and right FSAs for "takes" in "It takes two to ...".]

Baseline

                       test set recall (%)      test set runtime (items/word)
baseline               73     61     77         90     149    49

Modeling Dependency Length

                       test set recall (%)      test set runtime (items/word)
baseline               73     61     77         90     149    49
with length model      76     62     75         67     103    31
relative change (%)    4.1    1.6    -2.6       -26    -31    -37

Conclusion

- Modeling dependency length can cut runtime of simple models by 26-37%, with effects ranging from -3% to 4% on recall.
- (Loss on recall perhaps due to deficient/MLE estimation.)

Future Work

apply to state-of-the-art parsing models

better parameter estimation

applications: MT, IE, grammar induction

This Talk in a Nutshell

[Figure: dependency tree with arcs labeled by their lengths 3, 1, 1, 1.]

length of a dependency = surface distance

- Empirical results (English, Chinese, German)
- Hard constraints cut runtime in half or more with no accuracy loss (English, Chinese) or by 44% with -2.2% accuracy (German).
- Soft constraints affect accuracy of simple models by -3% to 24% and cut runtime by 25% to 40%.

- Formal results
- A hard bound b on dependency length
- results in a regular language.
- allows O(nb2) parsing.

New examples of dynamic programming in NLP

- Grammar induction by initially limiting dependency length
- (with Noah A. Smith)

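Soft bias toward short dependencies

where $p(t, x_i) = Z^{-1}(\delta)\, p_\Theta(t, x_i)\, e^{-\delta \sum_{(j,k) \in t} |j - k|}$

[Figure: δ axis running from -∞ through δ = 0 (MLE baseline) to ∞; linear structure preferred at one extreme.]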

Soft bias toward short dependencies

- Multiply parse probability by exp(-δS)
- where S is the total length of all dependencies
- Then renormalize probabilities

[Figure: δ axis running from -∞ through δ = 0 (MLE baseline) to ∞; linear structure preferred at one extreme.]
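The reweighting on this slide can be illustrated over an explicit list of candidate parses. This is a minimal sketch under assumptions: a real system would renormalize with dynamic programming over all parses rather than an enumerated list, the sign convention follows this slide's wording exp(-δS), and the function name and toy numbers are hypothetical.

```python
import math


def apply_length_bias(parses, delta):
    """Reweight and renormalize a list of candidate parses.

    parses: list of (probability, total_dependency_length) pairs.
    Each probability is multiplied by exp(-delta * S) and the results
    are divided by their sum, so the output is a distribution.
    """
    weights = [p * math.exp(-delta * s) for p, s in parses]
    z = sum(weights)
    return [w / z for w in weights]


# Two equally likely toy parses with total lengths 4 and 2.
toy_parses = [(0.5, 4), (0.5, 2)]
unbiased = apply_length_bias(toy_parses, 0.0)
biased = apply_length_bias(toy_parses, math.log(2))
```

With δ = 0 the distribution is unchanged. With δ = ln 2 each unit of length halves the weight, so the shorter parse ends up with probability 0.8 and the longer one with 0.2.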

Structural Annealing

[Figure: δ axis running from -∞ through δ = 0 (MLE baseline) to ∞.]

Start here: train a model.

Repeat: increase δ and retrain,

until performance stops improving on a small validation dataset.
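The annealing schedule above is a plain loop around training. The sketch below shows its control flow only. The `train` and `evaluate` callables, warm-starting each round from the previous model, the fixed step size, and the `max_rounds` cap are assumptions supplied for illustration and are not details given on the slide.

```python
def structural_annealing(train, evaluate, delta_start, step, max_rounds=50):
    """Increase delta and retrain until validation score stops improving.

    train(delta, previous_model) -> model   (hypothetical callable)
    evaluate(model) -> validation score, higher is better (hypothetical)
    Returns the best (model, delta, score) seen.
    """
    delta = delta_start
    model = train(delta, None)            # start here: train a model
    best = (model, delta, evaluate(model))
    for _ in range(max_rounds):
        delta += step                     # increase delta ...
        model = train(delta, model)       # ... and retrain
        score = evaluate(model)
        if score <= best[2]:              # performance stopped improving
            break
        best = (model, delta, score)
    return best


# Toy stand-ins: the "model" is just delta, and validation peaks at 3.
best_model, best_delta, best_score = structural_annealing(
    train=lambda delta, previous: delta,
    evaluate=lambda model: -(model - 3) ** 2,
    delta_start=-5,
    step=1,
)
```

In the toy run the score improves at every step from δ = -5 up to δ = 3 and then drops at δ = 4, so the loop stops and returns the δ = 3 model.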

Grammar Induction

Other structural biases can be annealed. We tried annealing on connectivity (# of fragments), and got similar results.

A 6/9-Accurate Parse

These errors look like ones made by a supervised parser in 2000!

[Figure: two dependency parses of "the gene thus can prevent a plant from fertilizing itself", one from the Treebank and one from MLE with locality bias. Errors marked on the second parse: verb instead of modal as root; preposition misattachment; misattachment of adverb "thus".]

Accuracy Improvements

language      random tree   Klein & Manning (2004)   Smith & Eisner (2006)   state-of-the-art, supervised
German        27.5          50.3                     70.0                    82.6¹
English       30.3          41.6                     61.8                    90.9²
Bulgarian     30.4          45.6                     58.4                    85.9¹
Mandarin      22.6          50.1                     57.2                    84.6¹
Turkish       29.8          48.0                     62.4                    69.6¹
Portuguese    30.6          42.3                     71.8                    86.5¹

¹ CoNLL-X shared task, best system. ² McDonald et al., 2005

Combining with Contrastive Estimation

- This generally gives us our best results

New examples of dynamic programming in NLP

- Contrastive estimation for HMM and grammar induction
- Uses lattice parsing
- (with Noah A. Smith)

Contrastive Estimation: Training Log-Linear Models on Unlabeled Data

- Noah A. Smith and Jason Eisner
- Department of Computer Science /
- Center for Language and Speech Processing
- Johns Hopkins University
- {nasmith,jason}@cs.jhu.edu

Contrastive Estimation: (Efficiently) Training Log-Linear Models (of Sequences) on Unlabeled Data


Nutshell Version

[Figure: contrastive estimation with lattice neighborhoods brings together sequence models, max ent features, unannotated text, and tractable training.]

Experiments on unlabeled data:

- POS tagging: 46% error rate reduction (relative to EM)
- Max ent features make it possible to survive damage to tag dictionary
- Dependency parsing: 21% attachment error reduction (relative to EM)

Red leaves don't hide blue jays.

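Maximum Likelihood Estimation (Supervised)

y: JJ NNS MD VB JJ NNS

x: red leaves don't hide blue jays

[Figure: the probability p of the observed tags y and words x, relative to p summed over Σ* (any sentence, any tagging).]

Maximum Likelihood Estimation (Unsupervised)

y: ? ? ? ? ? ?

x: red leaves don't hide blue jays

[Figure: p summed over all taggings of the observed words x, relative to p summed over Σ* (any sentence, any tagging).]

This is what EM does.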

Focusing Probability Mass

numerator

denominator

Conditional Estimation (Supervised)

y: JJ NNS MD VB JJ NNS

x: red leaves don't hide blue jays

A different denominator!

[Figure: the probability p of the observed tags y and words x, relative to p summed over the same words x with any tagging (? ? ? ? ? ?).]

Objective Functions

Objective                    Optimization Algorithm      Numerator         Denominator
MLE                          Count & Normalize           tags & words      Σ* (with any tagging)
MLE with hidden variables    EM                          words             Σ* (with any tagging)
Conditional Likelihood       Iterative Scaling           tags & words      (words) with any tagging
Perceptron                   Backprop                    tags & words      hypothesized tags & words
Contrastive Estimation       generic numerical solvers   observed data     ?

Contrastive Estimation details: the solver in this talk is LMVM / L-BFGS; the observed data in this talk is the raw word sequence, summed over all possible taggings.

Count & Normalize and EM: for generative models.

- This talk is about denominators ... in the unsupervised case.
- A good denominator can improve accuracy and tractability.

Language Learning (Syntax)

At last! My own language learning device!

Why did he pick that sequence for those words? Why not say "leaves red ..." or "... hide don't ..." or ...

Why didn't he say "birds fly" or "dancing granola" or "the wash dishes" or any other sequence of words?

EM

- What is a syntax model supposed to explain?
- Each learning hypothesis corresponds to a denominator / neighborhood.

The Job of Syntax

- Explain why each word is necessary.
- → DEL1WORD neighborhood

The Job of Syntax

- Explain the (local) order of the words.
- → TRANS1 neighborhood

[Figure: the observed sentence "red leaves don't hide blue jays" with unknown tags (? ? ? ? ? ?) in the numerator; the denominator covers the sentences in its TRANS1 neighborhood (with any tagging), drawn as a lattice of transposed word pairs.]

www.dyna.org (shameless self promotion)

The New Modeling Imperative

A good sentence hints that a set of bad ones is nearby.

numerator

denominator (neighborhood)

Make the good sentence likely, at the expense of those bad neighbors.

- This talk is about denominators ... in the unsupervised case.
- A good denominator can improve accuracy and tractability.

Log-Linear Models

score of x, y

partition function

Computing Z is undesirable! It sums over all possible taggings of all possible sentences!

- Contrastive Estimation (Unsupervised): sums over a few sentences
- Conditional Estimation (Supervised): sums over 1 sentence
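To make the "few sentences" denominator concrete, here is a brute-force sketch of a contrastive objective for one sentence: the log of the summed exponentiated scores of the observed words under every tagging, minus the same quantity summed over every sentence in the neighborhood. It assumes the standard log-linear form in which a pair (x, y) is weighted by exp(score(x, y)). The `score` callable, the tag set, and the enumeration of all taggings are illustrative assumptions, since the talk relies on lattices and dynamic programming rather than enumeration.

```python
import itertools
import math


def log_sum_over_taggings(words, tags, score):
    """log of the sum over all taggings y of exp(score(words, y))."""
    values = [score(words, y)
              for y in itertools.product(tags, repeat=len(words))]
    top = max(values)  # subtract the max for numerical stability
    return top + math.log(sum(math.exp(v - top) for v in values))


def contrastive_log_likelihood(words, neighborhood, tags, score):
    """Numerator: the observed sentence. Denominator: its neighborhood."""
    numerator = log_sum_over_taggings(words, tags, score)
    parts = [log_sum_over_taggings(x, tags, score) for x in neighborhood]
    top = max(parts)
    denominator = top + math.log(sum(math.exp(p - top) for p in parts))
    return numerator - denominator


# Toy check with a constant score: every tagged sentence weighs the same.
observed = ("red", "leaves", "hide")
neighbors = [observed, ("leaves", "red", "hide"), ("red", "hide", "leaves")]
value = contrastive_log_likelihood(
    observed, neighbors, tags=("JJ", "NNS"), score=lambda x, y: 0.0)
```

With a constant score the observed sentence gets exactly one third of the neighborhood's mass, so the objective equals log(1/3). Training would adjust the score's parameters to push this value toward 0.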

A Big Picture: Sequence Model Estimation

[Figure: estimation methods arranged by three properties: unannotated data, tractable sums, overlapping features.]

- generative, EM: p(x)
- generative, MLE: p(x, y)
- log-linear, EM: p(x)
- log-linear, conditional estimation: p(y | x)
- log-linear, MLE: p(x, y)
- log-linear, CE with lattice neighborhoods

Contrastive Neighborhoods

- Guide the learner toward models that do what syntax is supposed to do.
- Lattice representation → efficient algorithms.

There is an art to choosing neighborhood functions.

Neighborhoods

neighborhood       size      lattice arcs   perturbations
DEL1WORD           n+1       O(n)           delete up to 1 word
TRANS1             n         O(n)           transpose any bigram
DELORTRANS1        O(n)      O(n)           DEL1WORD ∪ TRANS1
DEL1SUBSEQUENCE    O(n^2)    O(n^2)         delete any contiguous subsequence
Σ* (EM)            ∞         -              replace each word with anything
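The finite neighborhoods in the table can be enumerated directly for a short sentence. This sketch builds them as explicit sets rather than lattices, which is enough to check the sizes. Including the original sentence in each neighborhood and allowing DEL1SUBSEQUENCE to delete the whole sentence are assumptions of this sketch.

```python
def del1word(words):
    """The sentence itself plus every result of deleting one word."""
    words = tuple(words)
    result = {words}
    for i in range(len(words)):
        result.add(words[:i] + words[i + 1:])
    return result


def trans1(words):
    """The sentence itself plus every result of swapping one adjacent pair."""
    words = tuple(words)
    result = {words}
    for i in range(len(words) - 1):
        result.add(words[:i] + (words[i + 1], words[i]) + words[i + 2:])
    return result


def delortrans1(words):
    """Union of the DEL1WORD and TRANS1 neighborhoods."""
    return del1word(words) | trans1(words)


def del1subsequence(words):
    """The sentence itself plus every deletion of one contiguous span."""
    words = tuple(words)
    result = {words}
    for i in range(len(words)):
        for j in range(i + 1, len(words) + 1):
            result.add(words[:i] + words[j:])
    return result


sentence = "red leaves don't hide blue jays".split()
sizes = {
    "DEL1WORD": len(del1word(sentence)),
    "TRANS1": len(trans1(sentence)),
    "DELORTRANS1": len(delortrans1(sentence)),
    "DEL1SUBSEQUENCE": len(del1subsequence(sentence)),
}
```

For this six-word sentence the DEL1WORD and TRANS1 sets have n+1 = 7 and n = 6 members, matching the table. Sentences with repeated words can produce smaller sets because duplicates collapse.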

The Merialdo (1994) Task

- Given unlabeled text
- and a POS dictionary (that tells all possible tags for each word type),
- learn to tag.

The POS dictionary is a form of supervision.

Trigram Tagging Model

[Figure: tags JJ NNS MD VB JJ NNS over the words "red leaves don't hide blue jays".]

feature set:

- tag trigrams
- tag/word pairs from a POS dictionary

[Figure: tagging results plot. Labels: random; EM; Merialdo (1994) EM; DA, Smith & Eisner (2004); log-linear EM; LENGTH; DEL1WORD; TRANS1; DELORTRANS1; DEL1SUBSEQUENCE; supervised HMM and CRF; "10 data".]

- 96K words
- full POS dictionary
- uninformative initializer
- best of 8 smoothing conditions

- Dictionary includes ...
- all words
- words from 1st half of corpus
- words with count ≥ 2
- words with count ≥ 3
- Dictionary excludes OOV words, which can get any tag.

What if we damage the POS dictionary?

- 96K words
- 17 coarse POS tags
- uninformative initializer

[Figure: results for random, EM, LENGTH, and DELORTRANS1.]

Trigram Tagging Model + Spelling

[Figure: tags JJ NNS MD VB JJ NNS over the words "red leaves don't hide blue jays".]

feature set:

- tag trigrams
- tag/word pairs from a POS dictionary
- 1- to 3-character suffixes, contains hyphen, digit

Log-linear spelling features aided recovery ...

... but only with a smart neighborhood.

[Figure: results for random, EM, LENGTH, LENGTH + spelling, DELORTRANS1, and DELORTRANS1 + spelling.]

- The model need not be finite-state.

Unsupervised Dependency Parsing

Klein & Manning (2004)

[Figure: attachment accuracy for EM, LENGTH, and TRANS1, plotted by initializer.]

To Sum Up ...

Contrastive Estimation means picking your own denominator for tractability or for accuracy (or, as in our case, for both).

Now we can use the task to guide the unsupervised learner (like discriminative techniques do for supervised learners).

It's a particularly good fit for log-linear models:

- with max ent features
- unsupervised sequence models
- all in time for ACL 2006.
