Title: Finite-State Methods in Natural Language Processing
1 Finite-State Methods in Natural Language Processing
- Lauri Karttunen
- LSA 2005 Summer Institute
- August 3, 2005
2
- August 1
- Non-concatenative morphotactics
- Reduplication, interdigitation
- Realizational morphology
- Readings
- Chapter 8. Non-Concatenative Morphotactics
- Gregory T. Stump. Inflectional Morphology: A Theory of Paradigm Structure. Cambridge U. Press. 2001. (An excerpt)
- Lauri Karttunen, Computing with Realizational Morphology, Lecture Notes in Computer Science, Volume 2588, Alexander Gelbukh (ed.), 205-216, Springer Verlag. 2003.
- August 3
- Optimality theory
- Readings
- Paul Kiparsky. Finnish Noun Inflection. Generative Approaches to Finnic and Saami Linguistics, Diane Nelson and Satu Manninen (eds.), pp. 109-161, CSLI Publications, 2003.
- Nine Elenbaas and René Kager. "Ternary rhythm and the lapse constraint". Phonology 16. 273-329.
3 Background
- Two old strains of finite-state (morpho)phonology
- rewrite rules (Chomsky & Halle 1968)
- two-level constraints (Koskenniemi 1983)
- Optimality theory (Prince & Smolensky 1993)
- two-level model with ranked, violable constraints
- Formal Power
- OT is not a finite-state system if it involves unlimited counting of constraint violations. (Ellison 1994, Eisner 1997, Frank & Satta 1998)
- But a finite-state model can be useful for OT.
4 Optimality theory
- Prince & Smolensky 1993
- eliminate
- rules
- derivations
- introduce
- violable ranked constraints
- Instant success!
5 Brief Introduction to OT
- Input
- A language of underlying lexical forms.
- GEN
- A function that generates alternate surface realizations for each input form, possibly an infinite set.
- Constraints
- A finite set of principles, preferably universal, that filter out unwanted realizations.
- Ranking
- A language-specific ordering of the constraints.
6 Computational perspective
- Ellison 1994
- OT deals with regular sets and relations: a finite-state system
- constraint transducers mark violations; the marks are sorted and counted
- Tesar 1995
- dynamic algorithm for optimal path computations
- Eisner 1996
- two-level typology of optimality constraints: restrict, prohibit
- "FootForm Decomposed", MIT Working Papers in Linguistics 31:115-143, proposes Primitive Optimality Theory (no generalized alignment)
- Karttunen 1998
- Introduces lenient composition
- Frank & Satta 1998
- Prove that OT is regular if the number of violations is bounded.
7 Comparisons
8 Finnish OT Prosody
- Lauri Karttunen
- CLS-41
- April 7, 2005
9 Finnish Prosody: basic facts
- The nucleus of a Finnish syllable must consist of a short vowel, a long vowel, or a diphthong.
- Main stress is always on the first syllable; secondary stress occurs on non-initial syllables.
- Adjacent syllables are never both stressed.
- The stressed syllable is initial in the foot.
- ilmoittautuminen 'registering' (Nom Sg)
- (íl.moit).(tàu.tu).(mì.nen)
10 Ternary feet in Finnish
- Stress that would fall on a light syllable shifts onto the following heavy syllable, creating a ternary foot.
- (ká.las).te.(lèm.me) 'we are fishing'
- (íl.moit).(tàu.tu).mi.(sès.ta) 'registering' (Ela Sg)
- (rá.kas).ta.(jàt.ta).ri.(àn.sa) 'his mistresses' (Par Pl)
- Can we get these facts to come out for free, from the interaction of independently motivated principles?
- Yes!
- Paul Kiparsky. Finnish Noun Inflection. Generative Approaches to Finnic and Saami Linguistics, Diane Nelson and Satu Manninen (eds.), pp. 109-161, CSLI Publications, 2003.
- Nine Elenbaas and René Kager. "Ternary rhythm and the lapse constraint". Phonology 16. 273-329.
11 Non-OT and OT solutions
- It is possible to define a cascade of replace rules that produces the desired result.
- http://www.stanford.edu/~laurik/fsmbook/examples/FinnishProsody.html
- But, following Kiparsky, we are going to do OT today, and in a more elegant way than is shown at
- http://www.stanford.edu/~laurik/fsmbook/examples/FinnishOTProsody.html
12 Prelude: Built-in Functions in fst
- Case conversion
- UpCase( OptUpCase(
- DownCase( OptDownCase(
- Cap( OptCap(
- AnyCase(
- Cap(hello) is equivalent to Hello
- OptUpCase(ab, L) is equivalent to aB ab
- Symbol manipulation
- Explode( Implode(
- regex Explode("Test") is equivalent to regex Test
13 Functions: User-defined
- The function definition is attached to a symbol ending with (
- The definition is any regular expression.
- There may be any number of arguments.
- define Redup(X) [X X];
- define Apply(X, Y) [X .o. Y].l;
- When the function is used in a regular expression, the arguments are bound and the function is evaluated.
- regex Apply({abc}, a -> x || _ b);
- print words
- xbc
- The definition of a function may contain other functions.
14 Pig Latin
- This script creates a function for translating from English to Pig Latin
- pig -> igpay, brown -> ownbray, script -> iptscray
define C [b|c|d|f|g|h|j|k|l|m|n|p|q|r|s|t|v|w|x|y|z];
define V [a|e|i|o|u];
define Redup(X) [X "." X];
define DelCons(X) [X .o. C+ @-> 0 || .#. _ ];
define TailToAy(X) [X .o. V ?* @-> a y || "." C* _ ];
define DelMiddle(X) [X .o. "." -> 0];
define Pig(X) DelMiddle(TailToAy(DelCons(Redup(X))));
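The four steps of the Pig function can be traced outside fst as well. The following is only a rough Python sketch of the same pipeline using ordinary regular-expression substitution; the function names mirror the script, but the Python regexes are an assumption about what each replace rule does, not a translation produced by the compiler.

```python
import re

CONS = "bcdfghjklmnpqrstvwxyz"   # the script's C (y counts as a consonant)
VOWELS = "aeiou"                 # the script's V

def redup(word):
    # Redup: two copies of the word separated by "."
    return f"{word}.{word}"

def del_cons(s):
    # DelCons: delete the word-initial consonant cluster, if any
    return re.sub(rf"^[{CONS}]+", "", s)

def tail_to_ay(s):
    # TailToAy: after "." and any consonants, replace everything
    # from the first vowel onward with "ay"
    return re.sub(rf"(\.[{CONS}]*)[{VOWELS}].*$", r"\1ay", s)

def del_middle(s):
    # DelMiddle: remove the "." separator
    return s.replace(".", "")

def pig(word):
    return del_middle(tail_to_ay(del_cons(redup(word))))

for w in ("pig", "brown", "script"):
    print(w, "->", pig(w))
```

For "brown" the intermediate strings are brown.brown, own.brown, own.bray, and ownbray.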
15 Demo!
16 Computing with OT
By what finite-state operation?
17 Priority union .P.
All pairs from R and those pairs from Q that do not conflict with the mapping established by R.
R .P. Q = R | [~[R.u] .o. Q]
Kaplan 1987
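For finite relations the definition can be checked by hand. Below is a small Python sketch, assuming a relation is modeled as a set of (upper, lower) string pairs; the function name priority_union and the sample relations are illustrative only.

```python
def priority_union(r, q):
    """All pairs of r, plus the pairs of q whose upper string
    is not already mapped by r."""
    upper_r = {upper for upper, _ in r}
    return set(r) | {(u, l) for (u, l) in q if u not in upper_r}

# Illustrative data: R maps "a"; Q maps "a" and "b".
R = {("a", "x")}
Q = {("a", "y"), ("b", "z")}

# The pair ("a", "y") conflicts with R's mapping for "a" and is dropped.
print(sorted(priority_union(R, Q)))
```

The operation is not symmetric: swapping the arguments gives Q priority instead.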
18 Lenient Composition .O.
- Let R be a relation that maps each input string to one or more outputs.
- Let C be a constraint that eliminates some outputs.
- R .O. C is the relation that maps each input string that can meet the constraint C to the outputs that meet C and leaves the rest of the relation R unchanged. (Karttunen 1998)
- R .O. C = [R .o. C] .P. R
- Is constraint ranking rule ordering in disguise? Yes.
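The same finite-set view shows what "lenient" means. In this Python sketch a relation is a set of (input, output) pairs and a constraint is a predicate on output strings; both modeling choices, and all names and data, are assumptions made for illustration.

```python
def priority_union(r, q):
    # Pairs of r, plus pairs of q whose input r does not map.
    inputs_r = {i for i, _ in r}
    return set(r) | {(i, o) for (i, o) in q if i not in inputs_r}

def compose_with_constraint(r, meets):
    # Ordinary composition with a constraint: keep outputs that meet it.
    return {(i, o) for (i, o) in r if meets(o)}

def lenient_compose(r, meets):
    # Filter where possible; inputs that would lose every output keep them all.
    return priority_union(compose_with_constraint(r, meets), r)

R = {("kala", "ka.la"), ("kala", "(ká.la)"), ("ei", "ei")}
has_foot = lambda output: "(" in output

print(sorted(lenient_compose(R, has_foot)))
```

Here "kala" keeps only its footed output, while "ei" has no output with a foot and therefore keeps the one it has.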
19 Need a prolific GEN
- ka.la
- ka.lá
- ka.là
- ka.(là)
- ka.(lá)
- ká.la
- ká.lá
- ká.là
- ká.(là)
- ká.(lá)
- kà.la
- (kà.la)
- (ká).la
- (ká).lá
- (ká).là
- (ká).(là)
- (ká).(lá)
- (ká.là)
- (ká.lá)
- (ká.la) ?
- (ka.là)
- (ka.lá)
- kà.lá
- kà.là
- kà.(là)
- kà.(lá)
- (kà).la
- (kà).lá
- (kà).là
- (kà).(là)
- (kà).(lá)
- (kà.là)
- (kà.lá)
kala 'fish' (Nom Sg): 33 candidates
20 Basic definitions 1
- Using Parc/XRCE regular expression syntax
- define C [b | c | d | f | g | h | j | k | l | m | n | p | q | r | s | t | v | w | x | z];  # Consonant
- define HighV [u | y | i];  # High vowel
- define MidV [e | o | ö];  # Mid vowel
- define LowV [a | ä];  # Low vowel
- define USV [HighV | MidV | LowV];  # Unstressed vowel
- define MSV [á | é | í | ó | ú | ý | ä́ | ö́];
- define SSV [à | è | ì | ò | ù | ỳ | ä̀ | ö̀];
- define SV [MSV | SSV];  # Stressed vowel
- define V [USV | SV];  # Vowel
21 Basic definitions 2
- define P [V | C];  # Phone
- define B [[\P+] | .#.];  # Boundary
- define E .#. | ".";  # Edge
- define Light [C* V];  # Light syllable
- define Heavy [Light P+];  # Heavy syllable
- define S [Heavy | Light];  # Syllable
- define SS [S & $SV];  # Stressed syllable
- define US [S & ~$SV];  # Unstressed syllable
- define MSS [S & $MSV];  # Syllable with main stress
22 GEN 1
- define MarkNonDiphthongs
- [. .] -> "." || [HighV | MidV] _ LowV,  # i.a, e.a
- LowV _ MidV,  # a.e
- i _ [MidV - e],  # i.o, i.ö
- u _ [MidV - o],  # u.e
- y _ [MidV - ö],  # y.e
- $V i _ e,  # poiki.en
- $V u _ o,
- $V y _ ö
- Insert a syllable boundary between vowels that cannot form a diphthong: i.a, e.a, a.e, i.o, u.e, y.e, etc.
- define Syllabify C* V+ C* @-> ... "." || _ C V
- Insert a syllable boundary after a maximal C* V+ C* pattern that is followed by C V. For example, strukturalismi -> struk.tu.ra.lis.mi.
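The Syllabify step alone can be imitated with a regular-expression substitution. This Python sketch assumes unstressed input with no vowel sequences that need the MarkNonDiphthongs step; the greedy match followed by a lookahead stands in for the left-to-right, longest-match replacement, and the names are illustrative.

```python
import re

C = "bcdfghjklmnpqrstvwxz"   # consonants
V = "aeiouyäö"               # unstressed vowels

# A maximal consonants-vowels-consonants stretch that is followed by
# a consonant and a vowel; the lookahead leaves that onset for the
# next syllable.
SYLLABLE = re.compile(rf"[{C}]*[{V}]+[{C}]*(?=[{C}][{V}])")

def syllabify(word):
    # Insert "." after every match; the final syllable never matches
    # because nothing follows it.
    return SYLLABLE.sub(lambda m: m.group(0) + ".", word)

for w in ("strukturalismi", "kalastelemme", "ilmoittautuminen"):
    print(syllabify(w))
```

Because consonants and vowels are disjoint classes, the backtracking regex engine finds the same longest match here that the replace rule would.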
23 GEN 2
- define Stress a (->) á|à, e (->) é|è, i (->) í|ì,
- o (->) ó|ò, u (->) ú|ù, y (->) "ý"|"ỳ",
- ä (->) "ä́"|"ä̀", ö (->) "ö́"|"ö̀"
- Optionally stress any vowel with a primary or secondary stress.
- define Scan [S ("." S ("." S)) & $SS] (->) "(" ... ")" || E _ E
- Optionally group syllables into unary, binary, or ternary feet when there is at least one stressed syllable.
- define Gen MarkNonDiphthongs .o. Syllabify .o.
- Stress .o. Scan
24 Demo!
- fst -utf8 -l gen.script
- regex {kala} .o. Gen; (compose)
- print lower-words (show output candidates)
- print size (count them)
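The candidate set for a short word is small enough to enumerate directly. The Python sketch below is a brute-force stand-in for Stress and Scan, not the transducer itself: it assumes the input is already syllabified, that each syllable is either unstressed or carries one primary or secondary stress on its first vowel, and that a foot spans one to three syllables of which at least one is stressed. Only the vowels a, e, i, o, u are handled.

```python
from itertools import product

ACUTE = dict(zip("aeiou", "áéíóú"))   # primary stress
GRAVE = dict(zip("aeiou", "àèìòù"))   # secondary stress

def stress_options(syllable):
    """Unstressed, primary-stressed and secondary-stressed variants."""
    for i, ch in enumerate(syllable):
        if ch in ACUTE:
            head, tail = syllable[:i], syllable[i + 1:]
            return [syllable, head + ACUTE[ch] + tail, head + GRAVE[ch] + tail]
    raise ValueError(f"no supported vowel in {syllable!r}")

def footings(syllables, stressed):
    """Yield every grouping into feet of 1-3 syllables; a foot must
    contain at least one stressed syllable."""
    if not syllables:
        yield []
        return
    for rest in footings(syllables[1:], stressed[1:]):
        yield [syllables[0]] + rest            # leave the syllable unfooted
    for k in (1, 2, 3):
        if k <= len(syllables) and any(stressed[:k]):
            foot = "(" + ".".join(syllables[:k]) + ")"
            for rest in footings(syllables[k:], stressed[k:]):
                yield [foot] + rest

def gen(syllabified):
    syllables = syllabified.split(".")
    candidates = set()
    for choice in product(range(3), repeat=len(syllables)):
        marked = [stress_options(s)[k] for s, k in zip(syllables, choice)]
        stressed = [k > 0 for k in choice]
        for parts in footings(marked, stressed):
            candidates.add(".".join(parts))
    return candidates

print(len(gen("ka.la")))
```

Under these assumptions the sketch yields the same number of candidates for kala as the list shown earlier in the talk.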
25 Kiparsky's nine constraints
- Clash
- AlignLeft
- MainStress
- FootBin
- Lapse
- NonFinal
- StressToWeight
- Parse
- AllFeetFirst
26 Counting constraint violations
- We use asterisks to mark constraint violations. We need a way to prefer candidates with the fewest violation marks.
- define Viol $"*";
- define Viol0 ~Viol;  # No violations
- define Viol1 ~[Viol^2];  # At most one violation
- define Viol2 ~[Viol^3];  # At most two violations
- define Viol3 ~[Viol^4];
- This eliminates the violation marks after the candidate set has been pruned by a constraint.
- define Pardon "*" -> 0;
27 Defining OT Constraints
- Three types
- Unviolable constraints
- Primary stress in Finnish
- Ordinary violable constraints
- Lapse
- Gradient alignment constraints
- All-Feet-First
- Strategy
- We define an evaluation template for each of the
three types and then define the individual
constraints with the help of the templates.
28 Evaluation Template for Unviolable Constraints
- define Unviolable(Candidates, Constraint)
- Candidates
- .o.
- Constraint
- Example
- define MainStress(X) Unviolable(X, B MSS ~$MSS)
- B is the left edge of the word or "(".
- MSS is a syllable with a primary stress.
29 Evaluation Template for Ordinary Constraints
- define Eval(Candidates, Violation, Left, Right)
- Candidates
- .o.
- Violation -> "*" ... || Left _ Right
- .O.
- Viol3 .O. Viol2 .O. Viol1 .O. Viol0
- .o.
- Pardon
- where Viol0 is ~Viol, Viol2 is ~[Viol^3], etc., and
- Pardon is "*" -> 0, deleting all violation marks.
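The effect of the lenient cascade Viol3 .O. Viol2 .O. Viol1 .O. Viol0 can be seen on a plain list of candidates. In this Python sketch a constraint is modeled as a function that counts violations, which is an assumption made for illustration; the counting function shown (syllables outside any foot, in the spirit of Parse) and all names are hypothetical.

```python
import re

def lenient_filter(items, keep):
    # Keep the items that pass; if none pass, keep everything.
    kept = [x for x in items if keep(x)]
    return kept if kept else list(items)

def eval_constraint(candidates, count_violations, max_counted=3):
    # Mark each candidate with its violation count.
    marked = [(c, count_violations(c)) for c in candidates]
    # Try "at most 3", then "at most 2", ... down to "no violations".
    for bound in range(max_counted, -1, -1):
        marked = lenient_filter(marked, lambda m, b=bound: m[1] <= b)
    # Dropping the counts plays the role of Pardon.
    return [c for c, _ in marked]

def unparsed_syllables(candidate):
    # Count syllables that are not inside a foot.
    outside = re.sub(r"\([^)]*\)", "", candidate)
    return sum(1 for s in outside.split(".") if s)

print(eval_constraint(["ka.la", "(ká).la", "(ká.la)"], unparsed_syllables))
```

The bound matters: candidates that all have more than max_counted violations are not distinguished from one another, which is why the counting has to be extended for longer words.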
30 Evaluation Template for Left-Oriented Gradient Alignment
- define EvalGradientLeft(Candidates, Violation, Left, Right)
- Candidates .o.
- Violation -> "*" ... || .#. Left _ Right
- .o.
- Violation -> "*"^2 ... || .#. Left^2 _ Right
- .o.
- Violation -> "*"^3 ... || .#. Left^3 _ Right
- .o.
- Violation -> "*"^4 ... || .#. Left^4 _ Right
- .o.
- Violation -> "*"^5 ... || .#. Left^5 _ Right
- .o.
- Violation -> "*"^6 ... || .#. Left^6 _ Right
- .o.
- Violation -> "*"^7 ... || .#. Left^7 _ Right
- .o.
- Violation -> "*"^8 ... || .#. Left^8 _ Right
- .O.
- Viol12 .O. Viol11 .O. Viol10 .O. Viol9 .O. Viol8 .O. Viol7 .O.
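A gradient alignment constraint penalizes a foot more the farther it is from the word edge. The Python sketch below assumes one common way of counting, namely one violation per syllable boundary to the left of each foot, summed over all feet; it does not claim to reproduce the exact number of marks that the template above inserts. Function names are illustrative.

```python
def all_feet_first_violations(candidate):
    # For every "(" count the "." boundaries to its left, then sum.
    return sum(candidate[:i].count(".")
               for i, ch in enumerate(candidate) if ch == "(")

def best_by_alignment(candidates):
    # Keep the candidates with the fewest alignment violations.
    best = min(all_feet_first_violations(c) for c in candidates)
    return [c for c in candidates if all_feet_first_violations(c) == best]

for c in ("(ká.las).te.(lèm.me)", "(ká.las).(tè.le).(mì.nen)"):
    print(c, all_feet_first_violations(c))
```

Unlike an ordinary constraint, the total grows with word length, so a finite-state version needs a fixed upper limit on the count.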
31 Clash, AlignLeft, MainStress
- Clash
- No stress on adjacent syllables.
- define Clash(X) Eval(X, SS, SS B, ?*)
- Align-Left
- The stressed syllable is initial in the foot.
- define AlignLeft(X) Eval(X, SV, .#. ~[?* "(" C*], ?*)
- Main Stress
- The primary stress in Finnish is on the first syllable.
- define MainStress(X) Unviolable(X, B MSS ~$MSS)
32 FootBin, Lapse, NonFinal
- Foot-Bin
- Feet are minimally bimoraic and maximally bisyllabic.
- define FootBin(X) Eval(X, "(" Light ")" | "(" S ["." S]^>1, ?*, ?*)
- Lapse
- Every unstressed syllable must be adjacent to a stressed syllable or to the word edge.
- define Lapse(X) Eval(X, US, B US B, B US B)
- Non-Final
- The final syllable is not stressed.
- define NonFinal(X) Eval(X, SS, ?*, ~$S .#.)
33 StressToWeight, Parse, AllFeetFirst
- Stress-To-Weight
- Stressed syllables are heavy.
- define StressToWeight(X) Eval(X, SS & Light, ?*, ")" | E)
- License-σ
- Syllables are parsed into feet.
- define Parse(X) Eval(X, S, E, E)
- All-Ft-Left
- The left edge of every foot coincides with the left edge of some prosodic word.
- define AllFeetFirst(X)
- EvalGradientLeft(X, "(", $".", ?*)
34 Finnish Prosody
- Kiparsky 2003
- define FinnishProsody(Input)
- AllFeetFirst( Parse( StressToWeight(
- NonFinal( Lapse( FootBin( MainStress(
- AlignLeft( Clash( Input .o. Gen)))))))))
35 FinnWords
- regex FinnishProsody( {kalastelet} | {kalasteleminen} |
- {ilmoittautuminen} | {järjestelmättömyydestänsä} |
- {kalastelemme} | {ilmoittautumisesta} |
- {järjestelmällisyydelläni} | {järjestelmällistämätöntä} |
- {voimisteluttelemasta} | {opiskelija} | {opettamassa} |
- {kalastelet} | {strukturalismi} | {onnittelemanikin} |
- {mäki} | {perijä} | {repeämä} | {ergonomia} |
- {puhelimellani} | {matematiikka} | {puhelimistani} |
- {rakastajattariansa} | {kuningas} | {kainostelijat} |
- {ravintolat} | {merkonomin} );
- Demo!
36 Result
- (ér.go).(nò.mi).a
- (íl.moit).(tàu.tu).mi.(sès.ta)
- (íl.moit).(tàu.tu).(mì.nen)
- (ón.nit).(tè.le).(mà.ni).kin
- (ó.pis).(kè.li).ja
- (ó.pet).ta.(màs.sa)
- (vói.mis).te.(lùt.te).le.(màs.ta)
- (strúk.tu).ra.(lìs.mi)
- (rá.vin).(tò.lat)
- (rá.kas).ta.(jàt.ta).ri.(àn.sa)
- (ré.pe).(ä̀.mä)
- (pé.ri).jä
- (pú.he).li.(mèl.la).ni
- (pú.he).li.(mìs.ta).ni
- (mä́.ki)
- (má.te).ma.(tìik.ka)
- (mér.ko).(nò.min)
- (kái.nos).(tè.li).jat
- (ká.las).te.(lèm.me)
- (ká.las).te.(lè.mi).nen
- (ká.las).(tè.let)
- (kú.nin).gas
- (jä́r.jes).tel.(mä̀l.li).syy.(dèl.lä).ni
- (jä́r.jes).(tèl.mät).tö.(mỳy.des).(tä̀n.sä)
- (jä́r.jes).(tèl.mäl).(lìs.tä).mä.(tö̀n.tä)
37 Two Errors
- (ká.las).te.(lè.mi).nen
- (jä́r.jes).tel.(mä̀l.li).syy.(dèl.lä).ni
- The interaction of Lapse and StressToWeight does
not produce the desired result in these cases.
38 What is wrong?
- define Debug(Input)
- DebugStressToWeight(
- NonFinal( Lapse( FootBin( MainStress( AlignLeft(
- Clash( Input .o. Gen)))))))
- regex Debug({kalasteleminen})
- (ká.las).te.(lè.mi).nen <-- actual winner
- (ká.las).(tè.le).(mì.nen) <-- desired output
- (jä́r.jes).tel.(mä̀l.li).syy.(dèl.lä).ni <-- actual winner
- (jä́r.jes).(tèl.mäl).li.(sỳy.del).(lä̀.ni) <-- desired output
- The StressToWeight constraint eliminates some of the desired winning candidates.
39 Nine Elenbaas
- A unified account of binary and ternary stress. Ph.D. dissertation. University of Utrecht. 1999. Based on Kiparsky & Hanson 1996. The only difference is that Elenbaas has a special constraint (L H), or AntiLStressH(, in place of Kiparsky's more general StressToWeight constraint.
- define FinnishProsody(Input)
- AllFeetFirst( Parse( AntiLStressH(
- NonFinal( Lapse( AlignLeft( FootBin(
- MainStress( Clash( Input .o. Gen)))))))))
- define AntiLStressH(X) Eval(X, SS & Light, "(", "." Heavy)
40 Result
- (ér.go).(nò.mi).a
- (íl.moit).(tàu.tu).mi.(sès.ta)
- (íl.moit).(tàu.tu).(mì.nen)
- (ón.nit).(tè.le).(mà.ni).kin
- (ó.pis).(kè.li).ja
- (ó.pet).ta.(màs.sa)
- (vói.mis).te.(lùt.te).le.(màs.ta)
- (strúk.tu).ra.(lìs.mi)
- (rá.vin).(tò.lat)
- (rá.kas).ta.(jàt.ta).ri.(àn.sa)
- (ré.pe).(ä̀.mä)
- (pé.ri).jä
- (pú.he).li.(mèl.la).ni
- (pú.he).li.(mìs.ta).ni
- (mä́.ki)
- (má.te).ma.(tìik.ka)
- (mér.ko).(nò.min)
- (kái.nos).(tè.li).jat
- (ká.las).te.(lèm.me)
- (ká.las).te.(lè.mi).nen
- (ká.las).(tè.let)
- (kú.nin).gas
- (jä́r.jes).(tèl.mäl).li.(sỳy.del).(lä̀.ni)
- (jä́r.jes).(tèl.mät).tö.(mỳy.des).(tä̀n.sä)
- (jä́r.jes).(tèl.mäl).(lìs.tä).mä.(tö̀n.tä)
41 Did She Know?
Six syllables (Appendix of Elenbaas thesis): X X L L L L
- áterìanàni áteriànani 'meal (Ess 1SG)'
- érgonòmiàna 'ergonomics (Ess)'
- káinostèlijàna 'shy person (Ess)'
- káinostèlijàni 'shy person (Nom 1SG)'
- kúnnallìsenàni 'council (Ess 1SG)'
- kúnnallìsiàni 'councils (Part 1SG)'
- kúnnallìsinàni 'councils (Ess 1SG)'
- mérkonòmiàni 'degree in economics (Part 1SG)'
- mérkonòminàni 'degree in economics (Ess 1SG)'
- ópiskèlijàni 'student (Nom 1SG)'
- púhelìmenàni 'telephone (Ess 1SG)'
- púhelìmiàni 'telephone (Part 1SG)'
Missing pattern: X X L L L H
42 Conclusion
- Can we get ternary feet in Finnish for free, from the interaction of independently motivated principles?
- We don't know.
- We know that the Kiparsky and Elenbaas accounts fail.
- Optimality Prosody is computationally very difficult.
- The number of initial candidates is huge
- kalasteleminen 70653
- järjestelmällisyydelläni 21767579
- Simple tableau methods do not work.
- Finite-state implementation guards against errors made by a human GEN and EVAL.
- But even when an error can be pinpointed, the fix is not obvious.
- Debugging OT constraints is as hard as debugging two-level rules, in practice more difficult than rewrite systems.
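The growth of the candidate set can be estimated without building it. The Python sketch below rests on a simplifying assumption that is not stated in the talk: every syllable has exactly three stress options (unstressed, primary, secondary), and a foot of one, two, or three syllables must contain at least one stressed syllable. A word then ends in an unfooted syllable or a unary foot (3 + 2 = 5 ways), a binary foot (3*3 - 1 = 8 ways), or a ternary foot (3*3*3 - 1 = 26 ways).

```python
def count_candidates(n_syllables):
    """Number of stress-and-footing candidates for a word with
    n_syllables syllables, under the three-options assumption."""
    counts = [1]                               # the empty word
    for n in range(1, n_syllables + 1):
        total = 5 * counts[n - 1]              # unfooted syllable or unary foot
        if n >= 2:
            total += 8 * counts[n - 2]         # binary foot
        if n >= 3:
            total += 26 * counts[n - 3]        # ternary foot
        counts.append(total)
    return counts[n_syllables]

for word, n in (("kala", 2), ("kalasteleminen", 6), ("järjestelmällisyydelläni", 9)):
    print(word, count_candidates(n))
```

With this recurrence the counts for two, six, and nine syllables agree with the figures quoted for kala, kalasteleminen, and järjestelmällisyydelläni, and each additional syllable multiplies the total by roughly six or seven.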
43 Final Thoughts
- Morphology is a regular relation.
- The composition of words (morphosyntax), morphological alternations, and prosody can be described in finite-state terms.
- A complex relation can be decomposed in different ways.
- There are many flavors of finite-state morphology: Item-and-Arrangement, Rewrite rules, Two-level rules, Realizational Morphology, Classical optimality constraints.
- Computing with finite-state tools is fun and easy.
- We have a sophisticated formalism for describing regular relations, efficient compilers and runtime software.
- Pen-and-pencil morphology badly needs computational support.
- It is difficult to get globally correct results relying on a handful of interesting words, rules, and constraints.