Finite-State Methods in Natural Language Processing - PowerPoint PPT Presentation

About This Presentation
Title:

Finite-State Methods in Natural Language Processing

Description:

Finite-State Methods in Natural Language Processing Lauri Karttunen LSA 2005 Summer Institute August 1, 2005 – PowerPoint PPT presentation

Provided by: ern98
Learn more at: http://web.stanford.edu

Transcript and Presenter's Notes

Title: Finite-State Methods in Natural Language Processing


1
Finite-State Methods in Natural Language
Processing
  • Lauri Karttunen
  • LSA 2005 Summer Institute
  • August 1, 2005

2
  • August 1
  • Non-concatenative morphotactics
  • Reduplication, interdigitation
  • Realizational morphology
  • Readings
  • Chapter 8. Non-Concatenative Morphotactics
  • Gregory T. Stump. Inflectional Morphology. A
    Theory of Paradigm Structure. Cambridge U. Press.
    2001. (An excerpt)
  • Lauri Karttunen, Computing with Realizational
    Morphology, Lecture Notes in Computer Science,
    Volume 2588, Alexander Gelbukh (ed.), 205-216,
    Springer Verlag. 2003.
  • August 3
  • Optimality theory
  • Readings
  • Paul Kiparsky. "Finnish Noun Inflection". In
    Generative Approaches to Finnic and Saami
    Linguistics, Diane Nelson and Satu Manninen
    (eds.), pp. 109-161, CSLI Publications, 2003.
  • Nine Elenbaas and René Kager. "Ternary rhythm and
    the lapse constraint". Phonology 16, 273-329.

3
Morphotactics
  • Most languages construct words by concatenating
    morphemes in a strict order
  • books
  • unthinkingly
  • utamanacapjjasamachaiwa (Aymara)
  • (utamancapjjasamachiwa, "it looks like they are
    in your house")
  • parismutnngauniraqlauqsimanngitjunga
    (Inuktitut)
  • (parimunngauniralauqsimanngittunga, "I never said
    I wanted to go to Paris")
  • Many languages also have non-concatenative
    processes of word-formation
  • reduplication (Malay)
  • interdigitation (Arabic)

4
Weakness of traditional finite-state
morphotactics
  • Two-level and finite-state morphology have been
    widely criticized for handling only concatenative
    morphotactics.
  • "Only restricted infixation and reduplication can
    be handled adequately with the present system.
    Some extensions or revisions will be necessary
    for an adequate description of languages
    possessing extensive infixation or
    reduplication." (Koskenniemi, 1983, p. 27)

5
Interdigitation in Arabic
  • Concatenative: kuutib + a (stem + suffix)
  • Non-concatenative stem: ktb (root), CVVCVC
    (template), ui (vocalization)
  • Result: kuutib

Informally speaking, the root, template and
vocalization morphemes interdigitate into a
stem.
6
Full-Stem Reduplication in Malay
  • In Malay, the overt plural of bagi (suitcase)
    is bagibagi (orthographically bagi-bagi), the
    plural of peraturan (rule) is
    peraturanperaturan, etc.
  • To model such pluralization, you need to copy the
    stem, no matter what it is and no matter how long
    it is.
  • Such full-stem reduplication appears to be far
    beyond finite-state power:
  • The copy language ww is context-sensitive.

7
A new algorithm: compile-replace
  • Define networks using concatenation, as before,
    but in such a way that the paths in the network
    may themselves contain regular expressions.
  • Reapply the compiler to its own output, compiling
    the regular expression substrings and replacing
    them with the result of the compilation.

8
A non-linguistic example before compile-replace
Network containing a regular expression,
a*, delimited with ^[ and ^].
9
Non-linguistic example after compile-replace
Maps every string in the infinite a* language to
the regular expression from which the language
was compiled.
10
Iteration operator
  • ^n
  • A^2 denotes two concatenations of the language A
    with itself, equivalent to A A.
  • A = {bagi, pelabuhan}
  • A^2 = {bagibagi, bagipelabuhan, pelabuhanbagi,
    pelabuhanpelabuhan}
  • Finite-state languages and relations are closed
    under n-ary concatenation.
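
The contrast behind this slide can be sketched in a few lines of Python. This is only an illustration, not xfst: the helper name power is hypothetical, and it assumes that the n-th power of a language means n-fold concatenation, as the slide states. Squaring the whole language also produces mixed forms, while squaring each stem on its own produces only true reduplications.

```python
from itertools import product

def power(language, n):
    """Return the n-fold concatenation of a finite language with itself."""
    return {"".join(parts) for parts in product(sorted(language), repeat=n)}

A = {"bagi", "pelabuhan"}

# Squaring the whole language: every pairing of two stems is produced.
squared = power(A, 2)

# Squaring each stem separately: only genuine reduplications are produced.
reduplicated = set()
for stem in A:
    reduplicated |= power({stem}, 2)

print(sorted(squared))
print(sorted(reduplicated))
```

The second set is what Malay pluralization needs, which is why the operator has to be applied per stem rather than to the lexicon as a whole.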

11
Solution for Malay
  • Construct the basic lexicon with paths such as
  • Lexical string: b a g i Noun Plural
  • Surface string: ^[ { b a g i } ^ 2 ^]
  • Lexical string: p e l a b u h a n Noun Plural
  • Surface string: ^[ { p e l a b u h a n } ^ 2 ^]
  • Apply compile-replace on the lower side of the
    network.

12
Compile-replace before and after
Before
  Lexical string: b a g i Noun Plural
  Surface string: ^[ { b a g i } ^ 2 ^]
After
  Lexical string: b a g i Noun Plural
  Surface string: b a g i b a g i
The compile-replace operation does not create
any ill-formed reduplicates such as pelabuhanbagi.
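
A rough Python sketch of the before/after step follows. It is a toy under stated assumptions, not the actual algorithm: the lexicon is a plain dictionary, the delimiters are assumed to be written ^[ and ^], and the only expression the helper understands is a braced stem followed by ^ and a number. The real operation compiles arbitrary regular expressions found on one side of a network. The names LEXICON, DELIMITED and compile_replace are hypothetical.

```python
import re

# Hypothetical toy lexicon: lexical string -> surface string that may still
# contain a delimited regular expression of the form ^[ {stem} ^ n ^].
LEXICON = {
    "bagi Noun Plural": "^[ {bagi} ^ 2 ^]",
    "pelabuhan Noun Plural": "^[ {pelabuhan} ^ 2 ^]",
    "bagi Noun": "bagi",
}

# Matches one delimited span and captures the stem and the repeat count.
DELIMITED = re.compile(r"\^\[\s*\{(\w+)\}\s*\^\s*(\d+)\s*\^\]")

def compile_replace(surface):
    """Replace each delimited span with the string it denotes."""
    return DELIMITED.sub(lambda m: m.group(1) * int(m.group(2)), surface)

# Apply the operation to the surface (lower) side of every entry.
COMPILED = {lexical: compile_replace(surface)
            for lexical, surface in LEXICON.items()}

for lexical, surface in COMPILED.items():
    print(f"{lexical} -> {surface}")
```

Because each stem is squared inside its own delimited span, mixed forms such as pelabuhanbagi never arise.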
13
Caveats
  • The Malay solution, the use of ^2 as the
    reduplication operator, is for
  • full-stem reduplication
  • identity between the base and the reduplicate
  • There are other types of reduplication
  • partial reduplication
  • partial non-identity
  • This is a hot research topic.

14
Partial Reduplication
  • Agta (Assignment 3)
  • t a k k i Pl
  • t a k k i Pl
  • .o.
  • C V C @-> "" ... "" 2 "" _ "Pl"
  • t a k k i Pl
  • ^[ { t a k } ^ 2 ^] k i
  • compile-replace lower
  • t a k k i Pl
  • t a k t a k k i
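
The two stages of this derivation can be imitated in Python as a hedged sketch. The consonant and vowel inventories are assumed for illustration only, the delimiters are assumed to be ^[ and ^], and the function names are hypothetical. Step 1 plays the role of the replace rule that marks the initial CVC; step 2 plays the role of compile-replace on the lower side.

```python
import re

# Assumed inventories, for illustration only.
CONSONANTS = "bdgklmnprstw"
VOWELS = "aeiou"

INITIAL_CVC = re.compile(rf"^([{CONSONANTS}][{VOWELS}][{CONSONANTS}])")

def mark_reduplicant(stem):
    """Step 1: wrap the initial CVC in a delimited expression with ^ 2."""
    return INITIAL_CVC.sub(r"^[ {\1} ^ 2 ^]", stem, count=1)

def compile_replace(marked):
    """Step 2: evaluate each ^[ {x} ^ n ^] span to x repeated n times."""
    return re.sub(r"\^\[ \{(\w+)\} \^ (\d+) \^\]",
                  lambda m: m.group(1) * int(m.group(2)),
                  marked)

def plural(stem):
    """Run both steps in sequence."""
    return compile_replace(mark_reduplicant(stem))

print(mark_reduplicant("takki"))
print(plural("takki"))
```

Only the marked CVC is copied; the rest of the stem passes through unchanged, which is the difference from the full-stem Malay case.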

15
Arabic stem interdigitation
  • Two lines of computational work
  • Kay 1987, Kiraz 1994, 2000
  • Inspired by McCarthy 1981
  • Separate tiers for root, pattern, and vocalism
  • Requires a special mechanism for constructing a
    fourth tier for the stem
  • Kataja & Koskenniemi 1988, Beesley 1989, 1991, 1996
  • Tripartite representation on a single tier
  • Use intersection to combine the three components

16
Interdigitation Formalized as Intersection
  • Let root ktb be formalized as ?* k ?* t ?* b
    ?*, equivalent to [k t b]/?.
  • Let template CVVCVC be formalized as [C V V C V C],
    where C is defined as the union of all consonants
    and V as the union of all vowels.
  • Let vocalization ui be formalized as [u* i]/\V.
  • Then stems can be formed via finite-state
    intersection rather than concatenation:
  • [k t b]/? & [C V V C V C] & [u* i]/\V = kuutib
  • The string kuutib is the only one simultaneously
    satisfying all the constraints.
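
The claim that exactly one string satisfies all three constraints can be checked by brute force. The Python sketch below is an illustration, not a finite-state implementation: it enumerates every string that fits the template and keeps those that also match the root and vocalization patterns. It assumes the small consonant and vowel inventories shown, and it assumes that the first vowel of the vocalization may repeat (written u* i here), since kuutib contains two u's.

```python
import re
from itertools import product

# Assumed inventories, for illustration only.
CONSONANTS = "ktbdrsmn"
VOWELS = "aiu"

def template_strings(template):
    """Enumerate every string that fits a template such as 'CVVCVC'."""
    slots = [CONSONANTS if slot == "C" else VOWELS for slot in template]
    return {"".join(chars) for chars in product(*slots)}

# Root k t b with any symbols interspersed.
ROOT = re.compile(r".*k.*t.*b.*")

# Vocalization u* i with non-vowels interspersed (assumption: u may repeat).
VOCALIZATION = re.compile(r"[^aiu]*(?:u[^aiu]*)*i[^aiu]*")

def interdigitate(template):
    """Intersect the three constraints by filtering the template language."""
    return {s for s in template_strings(template)
            if ROOT.fullmatch(s) and VOCALIZATION.fullmatch(s)}

print(interdigitate("CVVCVC"))
```

Enumeration works only because the template language is small and finite; a transducer library would intersect the three automata directly.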

17
merge, a faster intersection
  • To model the morphotactics of Arabic, you need
    union, concatenation and intersection.
  • Languages that require only union and
    concatenation are just special cases.
  • The intersection required for Arabic stems is in
    fact another special case: [k t b]/? &
    [C V V C V C] & [u* i]/\V involves just fitting
    the consonants of the root into the C slots and
    the vowels of the vocalization into the V slots.

18
Merge Operators
  • .m>. is the merge-to-the-right operator and
  • .<m. is the merge-to-the-left operator.
  • xfst[0]: list C k t b d r s m n b t
  • xfst[0]: list V a i u
  • xfst[0]: read regex {ktb} .m>. {CVVCVC} .<m.
    [u* i]
  • xfst[1]: print words
  • kuutib
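
The slot-filling idea behind merge can be sketched in Python. This is not the xfst algorithm, only an approximation of its effect on single strings: the function name merge and the CLASSES table are hypothetical, and the repetition of u in the vocalization is spelled out by hand as "uui" rather than derived from a starred expression.

```python
# Assumed class definitions, mirroring the C and V lists on the slide.
CLASSES = {"C": set("ktbdrsmn"), "V": set("aiu")}

def merge(template, filler):
    """Fill class slots in the template with successive filler symbols.

    A slot (a key of CLASSES) is filled only when the next filler symbol
    belongs to that class; every other template symbol is copied unchanged.
    """
    out, pending = [], list(filler)
    for symbol in template:
        if symbol in CLASSES and pending and pending[0] in CLASSES[symbol]:
            out.append(pending.pop(0))
        else:
            out.append(symbol)
    if pending:
        raise ValueError(f"unused filler symbols: {pending}")
    return "".join(out)

with_root = merge("CVVCVC", "ktb")   # consonants go into the C slots
stem = merge(with_root, "uui")       # vowels go into the remaining V slots

print(with_root)
print(stem)
```

Each pass walks the template once, which is why this is cheaper than a general intersection of three automata.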

19
Solution for Arabic
  • Construct the basic lexicon with paths such as
  • Lexical: k t b Root C V C V C Template a+ Voc
  • Surface: ^[ k t b .m>. C V C V C .<m. a+ ^]
  • Lexical: d r s Root C V V C V C Template u* i Voc
  • Surface: ^[ d r s .m>. C V V C V C .<m. u* i ^]
  • Apply compile-replace on the lower side of the
    network.

20
Compile-replace before and after
Before
  Lexical: k t b Root C V C V C Template a+ Voc
  Surface: ^[ k t b .m>. C V C V C .<m. a+ ^]
After
  Lexical: k t b Root C V C V C Template a+ Voc
  Surface: k a t a b
Alternation rules apply to the interdigitated
stems to produce the real surface strings.
21
Summary
  • Flag diacritics make it possible to represent
    long-distance constraints without blow-up in size.
  • The compile-replace technique allows any
    finite-state operation to be used in
    morphotactic description.
  • A special template filling operation, merge,
    allows fast interdigitation in cases such as
    Arabic.

22
Computing with Realizational Morphology
  • Lauri Karttunen

23
Overview
  • A Puzzle
  • Realizational Morphology (is finite-state)
  • Lexical representations
  • Realization rules
  • Morphophonological rules
  • Rules of referral
  • Elsewhere principle (Panini's principle)
  • Discussion

24
A Puzzle
  • The success of computational morphology has not
    made any impact within paper-and-pencil
    linguistics.
  • Computational concerns
  • completeness of coverage, physical size, speed of
    application, formal power, complexity of
    algorithms
  • Academic concerns
  • explanation, universal principles,
    generalizations, theoretical predictions, elegant
    formalism
  • Theoretical Issues
  • tags (Accusative) vs. features (Case
    Accusative)
  • commitment to morphemes?

25
Realizational Morphology
  • Gregory Stump, Inflectional Morphology. A Theory
    of Paradigm Structure. Cambridge U. Press. 2001.
  • No morphemes! (No fixed meaning-sound pairs)
  • A rich set of notational conventions designed to
    capture important linguistic generalizations.
  • Interpretable, precise formalism.
  • Computational implementation in DATR (Finkel &
    Stump 2002).
  • The good news: Realizational morphology is a
    finite-state model.

26
Finite-state advantage
  • Casting Stump's system into a regular expression
    formalism that has a compiler has a fundamental
    advantage over implementation in systems such as
    DATR.
  • DATR can be used to generate an inflected surface
    form from its lexical representation but it is
    not directly usable for recognition. In contrast,
    finite-state transducers are bidirectional
    generator/recognizers.
  • Issues to be addressed
  • Lexical representations
  • Realization rules (rules of exponence)
  • Morphophonological rules
  • Rules of referral
  • Rule ordering by general principles

27
Lexical representation
  • <Stem, Features>
  • Stem: a phonological representation
  • Features: a set of morphological properties
28
Realization rule
  • RR_{n,τ,C}(<X,σ>) =def <Y',σ>
  • X: phonological input
  • Y': phonological output
  • σ: features
  • n: rule block
  • τ: features realized by the rule
  • C: category
29
Rule application
  • Realization rules are ordered into blocks by the
    linguist.
  • Within blocks, the ordering is determined by
    specificity (Elsewhere rule, Panini's principle).
  • The final output of a realization rule may depend
    on morphophonological rules.
  • X -> Y -> Y'

30
Cascade of rule applications
<bet, SubPer1, NumSg, ObjPer2, NumSg, TnsPastRec>
31
Observations
  • The lexical representations of Realizational
    Morphology constitute a regular language.
  • They can be described by a regular expression.
  • All examples of realization rules given in
    Stump's book represent regular relations.
  • They can be compiled into finite-state
    transducers.
  • Because regular relations are closed under
    composition, the cascade of rule applications
    yields a single transducer.
  • We can eliminate the features from the surface
    side once the composition has been done.

32
Literal example
In a real application, one would prefer a more
parsimonious encoding of the feature structure.
33
Realization rules
  • Stump's realization rules can easily be expressed
    in Parc/XRCE regular expression formalism.
  • Example
  • RR_{3, {ObjPer2, NumSg}, V}(<X,σ>) =def <koX,σ>
  • define R301 [. .] -> ko || "<" _ ObjAgr 2 Sg
  • "Rule R301: Insert (= rewrite the empty string
    as) "ko" at the beginning of a phonological form
    whose object agreement features contain the
    values 2 and Sg."
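
As a hedged illustration of what such a rule does, and of how a cascade of rules collapses into one mapping, here is a small Python sketch. It is not a transducer: Python functions run in one direction only, whereas the compiled rules are bidirectional. The dictionary encoding of the feature bundle and the names r301, strip_features, compose and realize are assumptions made for the example, and only rule R301 is modelled; other rule blocks are omitted.

```python
from functools import reduce

def r301(form):
    """Sketch of R301: prefix 'ko' when object agreement is 2nd person Sg."""
    stem, features = form
    if features.get("Obj") == {"Per": "2", "Num": "Sg"}:
        return ("ko" + stem, features)
    return form

def strip_features(form):
    """Drop the feature bundle once every rule has applied."""
    stem, _features = form
    return stem

def compose(*steps):
    """Chain steps left to right into a single callable."""
    return lambda form: reduce(lambda acc, step: step(acc), steps, form)

# The whole cascade becomes one mapping from lexical form to surface string.
realize = compose(r301, strip_features)

lexical = ("bet", {"Sub": {"Per": "1", "Num": "Sg"},
                   "Obj": {"Per": "2", "Num": "Sg"},
                   "Tns": "PastRec"})

print(realize(lexical))
```

Composing the steps once and reusing the result mirrors the point that closure under composition lets the rule cascade be precompiled into a single transducer.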

34
Morphophonological rules
  • The output of a realization rule may be subject
    to a morphophonological rule.
  • Stump's morphophonemic rules are simple rewrite
    rules, easily expressed in the Parc/XRCE regular
    expression formalism.
  • If X = W[vowel1] and Y = X[vowel2]Z, then the
    indicated [vowel2] is absent from Y'.
  • Vowel -> 0 || Vowel "" _
  • where "" marks the place where the suffix is
    inserted.
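
A minimal Python analogue of this deletion rule is sketched below. The boundary symbol is not legible in the transcript, so the sketch assumes "+" purely for illustration; the vowel inventory and the example strings are likewise hypothetical. The lookbehind checks the left context on the input string, so the rule deletes only the vowel that immediately follows a vowel plus the boundary.

```python
import re

VOWELS = "aeiou"   # assumed inventory
BOUNDARY = "+"     # assumed boundary marker; not recoverable from the slide

# A vowel is deleted when it directly follows a vowel and the boundary.
PATTERN = re.compile(rf"(?<=[{VOWELS}]{re.escape(BOUNDARY)})[{VOWELS}]")

def delete_vowel_after_vowel(form):
    """Delete the affix-initial vowel when the base ends in a vowel."""
    return PATTERN.sub("", form)

print(delete_vowel_after_vowel("ta+aki"))   # hypothetical vowel-final base
print(delete_vowel_after_vowel("tak+aki"))  # hypothetical consonant-final base
```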

35
Rules of referral
  • Realization rules may be defined in terms of
    other realization rules.
  • The same affix can express more than one bundle
    of morphological features (syncretism).
  • In Lingala, mo expresses class 4 singular 3rd
    person agreement for subjects and objects.
  • In the Parc/XRCE regular expression formalism, a
    rule of referral corresponds to a substitution
    operation.
  • If R305 is the object agreement rule, the
    corresponding subject agreement rule is
  • `[R305, Obj, Sub]
  • It yields a transducer identical to R305 except
    that the insertion of mo is controlled by subject
    agreement features.
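
The substitution idea can be imitated in Python by treating a rule as data and deriving a second rule from it with one field changed. This is a sketch under assumptions: the class PrefixRule, the feature labels "3", "Sg" and "Class4", and the placeholder stem are all hypothetical, and a real rule of referral would substitute symbols inside a compiled transducer rather than fields of an object.

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class PrefixRule:
    role: str       # which agreement bundle the rule inspects: "Obj" or "Sub"
    values: tuple   # feature values that must all be present
    prefix: str     # affix inserted at the start of the stem

    def apply(self, stem, features):
        """Prefix the stem if the inspected bundle has all required values."""
        if set(self.values) <= set(features.get(self.role, ())):
            return self.prefix + stem
        return stem

# Object agreement rule, then its referral: the same rule with Obj -> Sub.
R305 = PrefixRule(role="Obj", values=("3", "Sg", "Class4"), prefix="mo")
R305_SUBJECT = replace(R305, role="Sub")

features = {"Sub": ("3", "Sg", "Class4")}
print(R305.apply("STEM", features))
print(R305_SUBJECT.apply("STEM", features))
```

The derived rule shares everything with the original except the feature that controls it, which is the sense in which the two are identical apart from the substitution.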

36
Elsewhere principle
  • While the rule blocks are ordered by the
    linguist, the realization rules within each
    block and the morphophonological rules are
    ordered by specificity.
  • A specific rule takes precedence over a more
    general rule in cases where both are applicable.
  • This principle is very important for Stump who
    calls it "Panini's principle". But he gives no
    precise definition for it within his formalism.
  • The Elsewhere Principle is an extremely simple
    notion for realization rules and for
    symbol-to-symbol morphophonological rules.

37
Specific vs. General
38
Input/Output languages
  • Rule A and Rule B have the same input language
    the universal language.
  • Both rules can be applied without failure to any
    string. If the context is not met, the output is
    the same as the input.
  • The output languages are not the same. A
    "successful" application of an obligatory rule
    removes from the output language the strings to
    which it has applied.
  • Every string missing from the output language of
    Rule B is missing from the output language of
    Rule A, but not vice versa.
  • The output language of Rule A is a proper subset
    of the output language of Rule B.

39
Output language of Rule A
Rule A
k -> 0 || Vowel _ Vowel
Rule B
k -> v || u _ u
40
Output language of Rule B
Rule A
k -> 0 || Vowel _ Vowel
Rule B
k -> v || u _ u
41
Principled rule ordering
  • The relationship of any two rules A and B that
    have been compiled into transducers can be
    determined by the following method
  • Extract the output languages (a finite-state
    operation).
  • Check whether one is the proper subset of the
    other (a finite-state operation).
  • This determination can be done efficiently and
    without any knowledge of how the rules were
    expressed.
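
The subset test can be approximated in Python for the two example rules. This sketch is a bounded stand-in, not the finite-state operation itself: it enumerates all inputs up to a small length over an assumed four-symbol alphabet, applies each rule, and compares the resulting output sets. The rule encodings use lookarounds so that contexts are checked on the input string; the function names are hypothetical.

```python
import re
from itertools import product

ALPHABET = "aukv"   # assumed toy alphabet
VOWELS = "au"

def rule_a(s):
    """k -> 0 || Vowel _ Vowel (delete k between any two vowels)."""
    return re.sub(rf"(?<=[{VOWELS}])k(?=[{VOWELS}])", "", s)

def rule_b(s):
    """k -> v || u _ u (rewrite k as v between two u's)."""
    return re.sub(r"(?<=u)k(?=u)", "v", s)

def output_language(rule, max_len=4):
    """Collect the outputs of a rule over all inputs up to max_len."""
    inputs = ("".join(chars)
              for n in range(max_len + 1)
              for chars in product(ALPHABET, repeat=n))
    return {rule(s) for s in inputs}

def more_general(rule_x, rule_y, max_len=4):
    """True if rule_x's bounded output language is a proper subset of rule_y's."""
    return output_language(rule_x, max_len) < output_language(rule_y, max_len)

print(more_general(rule_a, rule_b))
print(more_general(rule_b, rule_a))
```

Under these assumptions Rule A comes out as the more general rule, so Rule B, being more specific, would take precedence where both apply.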

42
Discussion
  • It is evident that Realizational Morphology is
    yet another variant of finite-state morphology.
  • Stump could say "Your theory is a notational
    variant of mine but mine is better."
  • There are many examples where notation matters
  • B => A _ C "B must occur between A and C."
  • ~[~[?* A] B ?*] & ~[?* B ~[C ?*]]
  • Stump's convoluted and cumbersome notation takes
    no advantage of the nice formal and computational
    properties that it in fact has.

43
Reflections
  • Computational morphology and paper-and-pencil
    morphology have a curious non-relationship going
    back at least 30 years.
  • Time after time computational knights have
    presented themselves at the Court of Linguistics,
    rushed up to the Princess of Phonology and
    Morphology in great excitement to deliver the
    same message.

44
At the Court of Linguistics
  • Knight
  • "Dear Princess. I have wonderful news for you.
    You are not like some of your NP-complete sisters.
    You are regular. You are rational. You are
    finite-state. Please marry me. Together we can do
    great things."
  • Princess
  • "Not interested. You don't understand theory. Go
    away you geek."
  • Innocent little boy
  • "The Princess has no clothes. The Princess has no
    clothes"

45
Scheduling (a Princess effect?)
  • 450-630 MW LSA.306 Introduction to Morphology
  • 450-630 MW LSA.207 Finite-State Methods in
    Natural Language Processing