Title: Finite-State Methods in Natural Language Processing
1Finite-State Methods in Natural Language
Processing
- Lauri Karttunen
- LSA 2005 Summer Institute
- August 1, 2005
2- August 1
- Non-concatenative morphotactics
- Reduplication, interdigitation
- Realizational morphology
- Readings
- Chapter 8. Non-Concatenative Morphotactics
- Gregory T. Stump. Inflectional Morphology. A
Theory of Paradigm Structure. Cambridge U. Press.
2001. (An excerpt)
- Lauri Karttunen, Computing with Realizational
Morphology, Lecture Notes in Computer Science,
Volume 2588, Alexander Gelbukh (ed.), 205-216,
Springer Verlag. 2003.
- August 3
- Optimality theory
- Readings
- Paul Kiparsky. Finnish Noun Inflection. In
Generative Approaches to Finnic and Saami
Linguistics, Diane Nelson and Satu Manninen
(eds.), pp. 109-161, CSLI Publications, 2003.
- Nine Elenbaas and René Kager. "Ternary rhythm and
the lapse constraint". Phonology 16. 273-329.
3Morphotactics
- Most languages construct words by concatenating
morphemes in a strict order
- books
- unthinkingly
- utamanacapjjasamachaiwa (Aymara)
- (utamancapjjasamachiwa, "it looks like they are
in your house")
- parismutnngauniraqlauqsimanngitjunga
(Inuktitut)
- (parimunngauniralauqsimanngittunga, "I never said
I wanted to go to Paris")
- Many languages also have non-concatenative
processes of word-formation
- reduplication (Malay)
- interdigitation (Arabic)
4Weakness of traditional finite-state
morphotactics
- Two-level and finite-state morphology have been
widely criticized for handling only concatenative
morphotactics.
- "Only restricted infixation and reduplication can
be handled adequately with the present system.
Some extensions or revisions will be necessary
for an adequate description of languages
possessing extensive infixation or
reduplication." (Koskenniemi, 1983, p. 27)
5Interdigitation in Arabic
- Concatenative: kuutib (stem) followed by a (suffix)
- Non-concatenative stem: ktb (root), CVVCVC
(template), ui (vocalization)
- Resulting stem: kuutib
Informally speaking, the root, template and
vocalization morphemes interdigitate into a
stem.
6Full-Stem Reduplication in Malay
- In Malay, the overt plural of bagi (suitcase)
is bagibagi (orthographically bagi-bagi); the
plural of peraturan (rule) is
peraturanperaturan, etc.
- To model such pluralization, you need to copy the
stem, no matter what it is and no matter how long
it is.
- Such full-stem reduplication appears to be far
beyond finite-state power.
- The copy language ww is context-sensitive.
7A new algorithm: compile-replace
- Define networks using concatenation, as before,
but in such a way that the paths in the network
may themselves contain regular expressions.
- Reapply the compiler to its own output, compiling
the regular expression substrings and replacing
them with the result of the compilation.
8A non-linguistic example before compile-replace
Network containing a regular expression,
a*, delimited with ^[ and ^].
9Non-linguistic example after compile-replace
Maps every string in the infinite a* language to
the regular expression from which the language
was compiled.
10Iteration operator
- ^n
- A^2 denotes two concatenations of the language A
with itself, equivalent to A A.
- A = {bagi, pelabuhan}
- A^2 = {bagibagi, bagipelabuhan, pelabuhanbagi,
pelabuhanpelabuhan}
- Finite-state languages and relations are closed
under n-ary concatenation.
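The effect of the iteration operator on a finite language can be sketched in plain Python. This is only an illustration over Python sets, not xfst; the helper name `power` is hypothetical. It shows that squaring the whole language also produces mixed strings, which is why iteration alone does not model reduplication.

```python
from itertools import product


def power(language, n):
    """Return the n-fold concatenation of a finite language with itself."""
    return {"".join(parts) for parts in product(sorted(language), repeat=n)}


A = {"bagi", "pelabuhan"}

# A^2: every pairing of two members of A, in both orders.
A2 = power(A, 2)

# True reduplicates are the strings of the form ww.
reduplicated = {w + w for w in A}

# Everything else in A^2 is an unwanted mixed form.
ill_formed = A2 - reduplicated
```

Squaring each stem separately, rather than the language as a whole, is what avoids the mixed forms.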
11Solution for Malay
- Construct the basic lexicon with paths such as
- Lexical string b a g i Noun Plural
- Surface string ^[ { b a g i } ^ 2 ^]
- Lexical string p e l a b u h a n Noun
Plural
- Surface string ^[ { p e l a b u h a n }
^ 2 ^]
- Apply compile-replace on the lower side of the
network.
12Compile-replace before and after
Before
- Lexical string b a g i Noun Plural
- Surface string ^[ { b a g i } ^ 2 ^]
After
- Lexical string b a g i Noun Plural
- Surface string b a g i b a g i
The compile-replace operation does not create
any ill-formed reduplicates such as pelabuhanbagi.
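A toy version of the before/after step can be written in plain Python. It is a sketch under strong assumptions: the delimited expression is written here as `^[ {stem} ^n ^]`, only that one expression shape is understood, and strings stand in for network paths. The names `DELIMITED`, `compile_replace` and `lexicon` are hypothetical and not part of xfst.

```python
import re

# Assumed markup for an embedded expression: ^[ {stem} ^n ^]
DELIMITED = re.compile(
    r"\^\[\s*\{(?P<stem>[^}]*)\}\s*\^\s*(?P<n>\d+)\s*\^\]"
)


def compile_replace(surface):
    """Replace each delimited expression by the string it denotes."""
    def expand(match):
        stem = match.group("stem").replace(" ", "")
        return stem * int(match.group("n"))
    return DELIMITED.sub(expand, surface)


# Lexical side -> surface side, before compile-replace.
lexicon = {
    "bagi Noun Plural": "^[ {bagi} ^2 ^]",
    "pelabuhan Noun Plural": "^[ {pelabuhan} ^2 ^]",
}

# Surface side after compile-replace; the lexical side is untouched.
after = {lex: compile_replace(surf) for lex, surf in lexicon.items()}
```

Because each path carries its own stem inside the delimiters, the copy is always of that stem, so mixed forms never arise.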
13Caveats
- The Malay solution, the use of ^2 as the
reduplication operator, is for
- full-stem reduplication
- identity between the base and the reduplicate
- There are other types of reduplication
- partial reduplication
- partial non-identity
- This is a hot research topic.
14Partial Reduplication
- Agta (Assignment 3)
- t a k k i Pl
- t a k k i Pl
- .o.
- C V C @-> "" ... "" 2 "" || _ "Pl"
- t a k k i Pl
- t a k 2 k i
- compile-replace lower
- t a k Pl
- t a k t a k k i
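The two steps of the Agta derivation (mark the initial CVC, then compile-replace) can be imitated in plain Python. This is a sketch only: the `^[ {…} ^2 ^]` markup, the consonant and vowel inventories, and all function names are assumptions made for illustration, and the plural tag is left out for brevity.

```python
import re

CONSONANTS = "bdgklmnprstwy"  # hypothetical inventory, for illustration only
VOWELS = "aeiou"

INITIAL_CVC = re.compile(rf"^([{CONSONANTS}][{VOWELS}][{CONSONANTS}])")
DELIMITED = re.compile(r"\^\[\s*\{([^}]*)\}\s*\^\s*(\d+)\s*\^\]")


def mark_initial_cvc(stem):
    """Step 1: wrap the initial CVC in a doubling expression."""
    return INITIAL_CVC.sub(r"^[ {\1} ^2 ^]", stem, count=1)


def compile_replace(surface):
    """Step 2: replace each delimited expression by what it denotes."""
    return DELIMITED.sub(lambda m: m.group(1) * int(m.group(2)), surface)


def pluralize(stem):
    """Partial reduplication: copy only the initial CVC of the stem."""
    return compile_replace(mark_initial_cvc(stem))
```

With these definitions `pluralize("takki")` yields `taktakki`, matching the last line of the slide.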
15Arabic stem interdigitation
- Two lines of computational work
- Kay 1987, Kiraz 1994, 2000
- Inspired by McCarthy 1981
- Separate tiers for root, pattern, and vocalism
- Requires a special mechanism for constructing a
fourth tier for the stem
- Kataja & Koskenniemi 1988, Beesley 1989, 1991, 1996
- Tripartite representation on a single tier
- Use intersection to combine the three components
16Interdigitation Formalized as Intersection
- Let root ktb be formalized as ?* k ?* t ?* b
?*, equivalent to ktb/?.
- Let template CVVCVC be formalized as CVVCVC,
where C is defined as the union of all consonants
and V as the union of all vowels.
- Let vocalization ui be formalized as ui/\V.
- Then stems can be formed via finite-state
intersection rather than concatenation
- ktb/? & CVVCVC & ui/\V = kuutib
- The string kuutib is the only one simultaneously
satisfying all the constraints.
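The intersection view can be checked by brute force in plain Python: enumerate the template language and keep the strings that also satisfy the root and vocalization constraints. This is a sketch, not the xfst computation. The consonant and vowel sets are taken from the list commands in the merge example, and the vocalization is modeled as the pattern `u*i` (u may spread over several V slots). That spreading is an assumption of this sketch, since the text shows only "ui", which read literally would supply two vowels for three V slots.

```python
import re
from itertools import product

C = "ktbdrsmn"  # consonants
V = "aiu"       # vowels


def template_language(template):
    """All strings that instantiate a template such as CVVCVC."""
    slots = [C if slot == "C" else V for slot in template]
    return {"".join(choice) for choice in product(*slots)}


def satisfies_root(word, root):
    """Root symbols occur in order, with anything in between."""
    pattern = ".*" + ".*".join(re.escape(ch) for ch in root) + ".*"
    return re.fullmatch(pattern, word) is not None


def satisfies_vocalization(word, pattern):
    """The vowels of the word, ignoring non-vowels, match the pattern."""
    vowels = "".join(ch for ch in word if ch in V)
    return re.fullmatch(pattern, vowels) is not None


# Intersection of the three constraints (assumed spreading pattern u*i).
stems = {
    w for w in template_language("CVVCVC")
    if satisfies_root(w, "ktb") and satisfies_vocalization(w, "u*i")
}

# Reading the vocalization literally as exactly "ui" leaves nothing.
literal = {
    w for w in template_language("CVVCVC")
    if satisfies_root(w, "ktb") and satisfies_vocalization(w, "ui")
}
```

Under these assumptions `stems` contains exactly one string, kuutib.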
17merge, a faster intersection
- To model the morphotactics of Arabic, you need
union, concatenation and intersection.
- Languages that require only union and
concatenation are just special cases.
- The intersection required for Arabic stems is in
fact another special case: ktb/? & CVVCVC &
ui/\V involves just fitting the consonants of
the root into the C slots and the vowels of the
vocalization into the V slots.
18Merge Operators
- .m>. is the merge to the right operator and
- .<m. is the merge to the left operator.
- xfst[0]: list C k t b d r s m n b t
- xfst[0]: list V a i u
- xfst[0]: read regex ktb .m>. CVVCVC .<m. ui
- xfst[1]: print words
- kuutib
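The slot-filling idea behind merge can be sketched in plain Python. This is not the xfst implementation of the merge operators: the functions `stretch`, `merge` and `interdigitate` are hypothetical, and the spreading convention (repeat the first filler symbol when there are more slots than filler symbols) is an assumption chosen so that the examples from the slides come out as shown.

```python
def stretch(filler, n):
    """Assumed spreading: repeat the first symbol until length n."""
    if not filler or len(filler) > n:
        raise ValueError("filler does not fit the available slots")
    return filler[0] * (n - len(filler)) + filler


def merge(template, slot, filler):
    """Fill every occurrence of `slot` in the template, left to right."""
    symbols = iter(stretch(filler, template.count(slot)))
    return "".join(next(symbols) if ch == slot else ch for ch in template)


def interdigitate(root, template, vocalization):
    """Root consonants go into C slots, vocalization into V slots."""
    return merge(merge(template, "C", root), "V", vocalization)
```

For example, `interdigitate("ktb", "CVVCVC", "ui")` gives `kuutib` and `interdigitate("ktb", "CVCVC", "a")` gives `katab`. Unlike the intersection formulation, nothing is enumerated and filtered: each slot is visited once.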
19Solution for Arabic
- Construct the basic lexicon with paths such as
- Lexical k t b Root C V C V C Template a
Voc
- Surface k t b .m>. C V C V C .<m. a
- Lexical d r s Root C V V C V C Template u
i Voc
- Surface d r s .m>. C V V C V C .<m. u
i
- Apply compile-replace on the lower side of the
network.
20Compile-replace before and after
Before
- Lexical k t b Root C V C V C Template a Voc
- Surface k t b .m>. C V C V C .<m. a
After
- Lexical k t b Root C V C V C Template a Voc
- Surface k a t a b
Alternation rules apply to
the interdigitated stems to produce the real
surface strings.
21Summary
- Flag diacritics make it possible to represent
long-distance constraints without blow-up in size.
- The compile-replace technique allows any finite-state
operation to be used in morphotactic description.
- A special template-filling operation, merge,
allows fast interdigitation in cases such as
Arabic.
22Computing with Realizational Morphology
23Overview
- A Puzzle
- Realizational Morphology (is finite-state)
- Lexical representations
- Realization rules
- Morphophonological rules
- Rules of referral
- Elsewhere principle (Panini's principle)
- Discussion
24 A Puzzle
- The success of computational morphology has not
made any impact within paper-and-pencil
linguistics.
- Computational concerns
- completeness of coverage, physical size, speed of
application, formal power, complexity of
algorithms
- Academic concerns
- explanation, universal principles,
generalizations, theoretical predictions, elegant
formalism
- Theoretical Issues
- tags (Accusative) vs. features (Case
Accusative)
- commitment to morphemes?
25Realizational Morphology
- Gregory Stump, Inflectional Morphology. A Theory
of Paradigm Structure. Cambridge U. Press. 2001.
- No morphemes! (No fixed meaning-sound pairs)
- A rich set of notational conventions designed to
capture important linguistic generalizations.
- Interpretable, precise formalism.
- Computational implementation in DATR (Finkel &
Stump 2002).
- The good news: Realizational morphology is a
finite-state model.
26Finite-state advantage
- Casting Stump's system into a regular expression
formalism that has a compiler has a fundamental
advantage over implementation in systems such as
DATR.
- DATR can be used to generate an inflected surface
form from its lexical representation but it is
not directly usable for recognition. In contrast,
finite-state transducers are bidirectional
generator/recognizers.
- Issues to be addressed
- Lexical representations
- Realization rules (rules of exponence)
- Morphophonological rules
- Rules of referral
- Rule ordering by general principles
27Lexical representation
A phonological representation
A set of morphological properties
28Realization rule
- RR_{n,τ,C}(<X,σ>) =def <Y',σ>
- n: rule block
- τ: features realized by the rule
- C: category
- X: phonological input
- σ: features
- Y': phonological output
29Rule application
- Realization rules are ordered into blocks by the
linguist.
- Within blocks, the ordering is determined by
specificity (Elsewhere rule, Panini's principle).
- The final output of a realization rule may depend
on morphophonological rules.
- X → Y → Y'
30Cascade of rule applications
<bet, SubPer1, NumSg, ObjPer2, NumSg,
TnsPastRec>
31Observations
- The lexical representations of Realizational
Morphology constitute a regular language.
- They can be described by a regular expression.
- All examples of realization rules given in
Stump's book represent regular relations.
- They can be compiled into finite-state
transducers.
- Because regular relations are closed under
composition, the cascade of rule applications
yields a single transducer.
- We can eliminate the features from the surface
side once the composition has been done.
32Literal example
In a real application, one would prefer a more
parsimonious encoding of the feature structure.
33Realization rules
- Stump's realization rules can easily be expressed
in Parc/XRCE regular expression formalism.
- Example
- RR_{3,{ObjPer2, NumSg},V}(<X,σ>) =def <koX,σ>
- define R301 [. .] -> ko || "<" _ ObjAgr 2
Sg
- "Rule R301: Insert (= rewrite the empty string
as) "ko" to the beginning of a phonological form
whose object agreement features contain the
values 2 and Sg."
34Morphophonological rules
- The output of a realization rule may be subject
to a morphophonological rule.
- Stump's morphophonemic rules are simple rewrite
rules, easily expressed in the Parc/XRCE regular
expression formalism.
- If X = W vowel1 and Y = X vowel2 Z, then the
indicated vowel2 is absent from Y'.
- Vowel -> 0 || Vowel "" _
- where "" marks the place where the suffix is
inserted.
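The vowel-deletion rule can be imitated with a Python regular expression. This is a sketch only: the boundary symbol is written here as "+", which is an assumption (the symbol is not legible in the slide text), the vowel inventory is hypothetical, and the test strings are synthetic rather than real Lingala forms.

```python
import re

VOWELS = "aeiou"   # hypothetical vowel inventory
BOUNDARY = "+"     # assumed marker for the place where the suffix is inserted

# A vowel is deleted when it directly follows a vowel plus the boundary.
VOWEL_AFTER_VOWEL = re.compile(
    rf"(?<=[{VOWELS}]{re.escape(BOUNDARY)})[{VOWELS}]"
)


def delete_suffix_vowel(form):
    """Vowel -> 0 after Vowel + boundary; other strings pass unchanged."""
    return VOWEL_AFTER_VOWEL.sub("", form)
```

For the synthetic input `pa+it` the function returns `pa+t`, while `pat+it` is returned unchanged because the stem does not end in a vowel.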
35Rules of referral
- Realization rules may be defined in terms of
other realization rules.
- The same affix can express more than one bundle
of morphological features (syncretism).
- In Lingala, mo expresses class 4 singular 3rd
person agreement for subjects and objects.
- In the Parc/XRCE regular expression formalism, a
rule of referral corresponds to a substitution
operation.
- If R305 is the object agreement rule, the
corresponding subject agreement rule is
- R305, Obj, Sub
- It yields a transducer identical to R305 except
that the insertion of mo is controlled by subject
agreement features.
36Elsewhere principle
- While the rule blocks are ordered by the
linguist, the realization rules within each
block and the morphophonological rules are
ordered by specificity.
- A specific rule takes precedence over a more
general rule in cases where both are applicable.
- This principle is very important for Stump, who
calls it "Panini's principle". But he gives no
precise definition for it within his formalism.
- The Elsewhere Principle is an extremely simple
notion for realization rules and for
symbol-to-symbol morphophonological rules.
37Specific vs. General
38Input/Output languages
- Rule A and Rule B have the same input language:
the universal language.
- Both rules can be applied without failure to any
string. If the context is not met, the output is
the same as the input.
- The output languages are not the same. A
"successful" application of an obligatory rule
removes from the output language the strings to
which it has applied.
- Every string missing from the output language of
Rule B is missing from the output language of
Rule A, but not vice versa.
- The output language of Rule A is a proper subset
of the output language of Rule B.
39Output language of Rule A
Rule A
k -> 0 || Vowel _ Vowel
Rule B
k -> v || u _ u
40Output language of Rule B
Rule A
k -> 0 || Vowel _ Vowel
Rule B
k -> v || u _ u
41Principled rule ordering
- The relationship of any two rules A and B that
have been compiled into transducers can be
determined by the following method
- Extract the output languages (a finite-state
operation).
- Check whether one is the proper subset of the
other (a finite-state operation).
- This determination can be done efficiently and
without any knowledge of how the rules were
expressed.
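The subset test can be imitated in plain Python on a bounded fragment of the universal language. This is a finite approximation for illustration, not the finite-state computation on transducers: the four-symbol alphabet, the length bound and all function names are assumptions, and Rule A and Rule B are rendered as regular-expression substitutions (k deleted between vowels; k rewritten as v between two u's).

```python
import re
from itertools import product

ALPHABET = "akuv"  # small assumed alphabet; the vowels are a and u


def rule_a(s):
    """k -> 0 between vowels (the more general rule)."""
    return re.sub(r"(?<=[au])k(?=[au])", "", s)


def rule_b(s):
    """k -> v between two u's (the more specific rule)."""
    return re.sub(r"(?<=u)k(?=u)", "v", s)


def universe(max_len):
    """All strings over ALPHABET up to max_len, standing in for ?*."""
    return {
        "".join(p)
        for n in range(max_len + 1)
        for p in product(ALPHABET, repeat=n)
    }


def output_language(rule, inputs):
    """The set of strings the rule can produce from the given inputs."""
    return {rule(s) for s in inputs}


def more_general(rule_x, rule_y, inputs):
    """True if the outputs of rule_x are a proper subset of rule_y's."""
    return output_language(rule_x, inputs) < output_language(rule_y, inputs)


U = universe(4)
a_is_more_general = more_general(rule_a, rule_b, U)
```

The comparison inspects only what the rules produce, not how they were written, which is the point made in the last bullet.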
42Discussion
- It is evident that Realizational Morphology is
yet another variant of finite-state morphology.
- Stump could say "Your theory is a notational
variant of mine but mine is better."
- There are many examples where notation matters
- B => A _ C "B must occur between A and C."
- ~[~[?* A] B ?*] & ~[?* B ~[C ?*]]
- Stump's convoluted and cumbersome notation takes
no advantage of the nice formal and computational
properties that it in fact has.
43Reflections
- Computational morphology and paper-and-pencil
morphology have a curious non-relationship going
back at least 30 years.
- Time after time computational knights have
presented themselves at the Court of Linguistics,
rushed up to the Princess of Phonology and
Morphology in great excitement to deliver the
same message.
44At the Court of Linguistics
- Knight
- "Dear Princess. I have wonderful news for you.
You are not like some of your NP-complete sisters.
You are regular. You are rational. You are
finite-state. Please marry me. Together we can do
great things."
- Princess
- "Not interested. You don't understand theory. Go
away, you geek."
- Innocent little boy
- "The Princess has no clothes. The Princess has no
clothes."
45Scheduling (a Princess effect?)
- 450-630 MW LSA.306 Introduction to Morphology
- 450-630 MW LSA.207 Finite-State Methods in
Natural Language Processing