Integrating Finite-state Morphologies with Deep LFG Grammars

1
Integrating Finite-state Morphologies with Deep
LFG Grammars
  • Tracy Holloway King

2
FST and deep grammars
  • Finite state tokenizers and morphologies can be
    integrated into deep processing systems
  • Integrated tokenizers
  • eliminate the need for preprocessing
  • allow the grammar writer more control over the
    input
  • Morphologies
  • eliminate the need to list (multiple) surface
    forms in the lexicon
  • eliminate the need for lexical entries for words
    with predictable subcategorization frames

3
Talk outline
  • Basic integrated system
  • Integrating morphology FSTs
  • Interaction of tokenization and morphology

4
Basic Architecture
5
Example steps through the system
  • Input string: Boys appeared.
  • Tokenizing: boys TB appeared TB . TB
  • Morphology:
  • boy Noun Pl
  • appear Verb PastBoth 123SP
  • . Punct
  • C-structure/F-structure: next slides
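
The first two steps on this slide can be sketched in Python. This is only a toy illustration, not XLE itself: the `tokenize` function, the regular expression, and the `MORPHOLOGY` table are assumptions made for the example, standing in for the finite-state transducers the talk describes.

```python
import re

# Toy stand-in for a finite-state morphology: surface form -> list of analyses.
# The analyses mirror the slide; the table itself is hypothetical.
MORPHOLOGY = {
    "boys": ["boy Noun Pl"],
    "appeared": ["appear Verb PastBoth 123SP"],
    ".": [". Punct"],
}

def tokenize(text):
    """Lower-case the input and split off punctuation (deterministic toy version)."""
    return re.findall(r"\w+|[^\w\s]", text.lower())

def analyze(text):
    """Return (token, analyses) pairs; tokens missing from the table get an empty list."""
    return [(tok, MORPHOLOGY.get(tok, [])) for tok in tokenize(text)]

tokens = tokenize("Boys appeared.")
# Mark each token boundary with TB, as on the slide.
print(" ".join(tok + " TB" for tok in tokens))
for tok, analyses in analyze("Boys appeared."):
    print(tok, "->", analyses)
```

The stems and tags produced this way are what the grammar would then look up in the lexicon before building the c-structure and f-structure.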

6
C-structure tree
7
F-structure AVM
8
The wider system: XLE
  • Handwritten grammars for various languages
  • Substantial for English, German, Japanese,
    Norwegian
  • Also Arabic, Chinese, Urdu, Korean, Welsh,
    Malagasy, Turkish
  • Robustness mechanisms
  • Fragment grammar rules
  • Morphological guessers
  • Skimming when resource limits approached
  • Ambiguity management (packing)
  • Compute all analyses (no aggressive pruning)
  • Propagate packed ambiguities across processing
    modules
  • Stochastic disambiguation
  • MaxEnt models to select from packed
    (f-)structures
  • Other processing available
  • generation, semantics, transfer/rewriting
  • Comparisons to other systems/tasks
  • Parsing WSJ (Riezler et al., ACL 2002)
  • Comparison to Collins model 3 (Riezler et al.,
    NAACL 2004)

9
FST Morphologies
  • Associate surface form with
  • a lemma (stem/canonical form)
  • a set of tags
  • Process is non-deterministic
  • can have many analyses for one surface form
  • grammar has to be able to deal with multiple
    analyses (morphological ambiguity)
  • Issue: can the grammar control rampant
    morphological ambiguity?
  • Arabic vowelless representations

10
Example Morphology Output
  • turnips: turnip Noun Pl
  • Mary: Mary Prop Giv Fem Sg
  • falls: fall Noun Pl
  • falls: fall Verb Pres 3sg
  • broken: break Verb PastPerf 123SP
  • broken: broken Verb PastPart Adj
  • New York:
  • New York Prop Place USAState Prefer
  • New York Prop Place City Prefer
  • plus analyses of New and York
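
To make the non-determinism concrete, here is a minimal Python sketch of a morphology lookup that returns every analysis for a surface form. The `Analysis` type, the `ANALYSES` table, and the helper names are hypothetical; only the lemma and tag values are taken from the slide.

```python
from typing import NamedTuple

class Analysis(NamedTuple):
    lemma: str    # stem / canonical form
    tags: tuple   # morphological tags, in the order the morphology emits them

# Hypothetical table standing in for the FST; values follow the slide.
ANALYSES = {
    "turnips": [Analysis("turnip", ("Noun", "Pl"))],
    "falls": [
        Analysis("fall", ("Noun", "Pl")),
        Analysis("fall", ("Verb", "Pres", "3sg")),
    ],
}

def lookup(surface):
    """Return all analyses of a surface form (possibly none, possibly several)."""
    return ANALYSES.get(surface, [])

def parts_of_speech(surface):
    """Distinct first tags: a rough indication of how ambiguous a form is."""
    return sorted({a.tags[0] for a in lookup(surface)})

for form in ("turnips", "falls"):
    print(form, lookup(form), parts_of_speech(form))
```

A grammar consuming this output has to accept a list of analyses per token rather than a single one, which is the point the slides make about morphological ambiguity.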

11
Morphologies and lexicons
  • Without a morphology, need to list all surface
    forms in the lexicon
  • bad for English
  • horrible for languages like Finnish and Arabic
  • With a morphology, one entry for the stem form
  • go V XLE @(V-INTRANS go).
  • for go, goes, going, gone, went
  • With additional integration, words with
    predictable subcategorization frames need no entry

12
Basic idea
  • Run surface forms of words through the morphology
    to produce stems and tags
  • MorphConfig file specifies which morphologies the
    grammar uses
  • Look up stems and tags in the lexicon
  • Sublexical phrase structure rules build syntactic
    nodes covering the stems and tags
  • Standard grammar rules build larger phrases

13
Lexical entries for tags
  • boys: boy Noun Pl
  • boy N XLE @(NOUN boy).
  • Noun N_SFX XLE @(PERS 3) @(EXISTS NTYPE).
  • Pl NNUM_SFX XLE @(NUM pl).

14
Sublexical rules for tags
  • Build up lexical nodes from stem plus tags
  • Rules are identical to standard phrase structure
    rules
  • Except display can hide the sublexical
    information
  • N --> N_BASE N_SFX_BASE NNUM_SFX_BASE.
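
As a rough illustration of how a sublexical rule licenses a stem-plus-tags sequence, the Python sketch below checks the analysis of "boys" against the rule on this slide. The dictionaries and the `build_lexical_node` helper are hypothetical, and appending `_BASE` to each category is an assumption read off the rule as shown; the sketch ignores the functional annotations that the real entries carry.

```python
# Categories assigned to the stem and tags, following the entries for "boys".
LEXICON = {
    "boy": "N",
    "Noun": "N_SFX",
    "Pl": "NNUM_SFX",
}

# Sublexical rule from the slide: N --> N_BASE N_SFX_BASE NNUM_SFX_BASE.
RULES = {"N": ["N_BASE", "N_SFX_BASE", "NNUM_SFX_BASE"]}

def build_lexical_node(mother, analysis):
    """Return True if the stem+tags sequence matches the sublexical rule for `mother`."""
    items = analysis.split()
    # Unknown items map to "?" so they can never match a rule.
    daughters = [LEXICON.get(item, "?") + "_BASE" for item in items]
    return daughters == RULES[mother]

print(build_lexical_node("N", "boy Noun Pl"))   # stem and both tags present
print(build_lexical_node("N", "boy Pl"))        # missing the Noun tag
```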

15
Resulting structures
16
Lexical entries
  • Stems with unpredictable subcategorization frames
    need entries
  • verbs
  • adjectives with obliques (proud of her)
  • nouns with that complements (the idea that he
    laughed)
  • Most lexical items have predictable frames
    determined by part of speech
  • common and proper nouns
  • adjectives
  • adverbs
  • numbers

17
-unknown lexical entry
  • Match any stem to the entry
  • Provide desired functional information
  • stem will pass in the appropriate surface form
    (i.e., the lemma/stem)
  • Constrain application via morphological tag
    possibilities
  • -unknown N XLE @(NOUN stem)
  • A XLE @(ADJ stem)
  • ADV XLE @(ADVERB stem).

18
-unknown example
  • The box boxes.
  • Lexicon entries
  • box V XLE @(V-INTRANS stem).
  • -unknown N XLE @(NOUN stem) ADV A...
  • Morphology output
  • box: box Noun Sg | box Verb Non3Sg
  • boxes: box Noun Pl | box Verb 3Sg
  • Build up four effective lexical entries
  • 1 noun, 1 verb, 1 adverb, 1 adjective
  • adverb and adjective fail sublexically
  • noun and verb relevant for the sentence
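
The bookkeeping in this example can be sketched in Python. Everything here is a simplification for illustration: the `ACCEPTS` table is a hypothetical stand-in for the sublexical rules (it only records which part-of-speech tag each category can combine with), and the tag names `Adj` and `Adv` are assumed.

```python
# Explicit lexicon: only the verb reading of "box" is listed, as on the slide.
LEXICON = {"box": ["V"]}

# Categories that the -unknown entry supplies for any stem.
UNKNOWN = ["N", "A", "ADV"]

# Hypothetical simplification of the sublexical rules:
# the part-of-speech tag each category needs from the morphology.
ACCEPTS = {"N": "Noun", "V": "Verb", "A": "Adj", "ADV": "Adv"}

def effective_entries(stem):
    """Explicit entries for the stem plus everything -unknown contributes."""
    return LEXICON.get(stem, []) + UNKNOWN

def surviving_categories(stem, tag_analyses):
    """Keep only categories whose required tag appears in some morphological analysis."""
    pos_tags = {analysis.split()[0] for analysis in tag_analyses}
    return [cat for cat in effective_entries(stem) if ACCEPTS[cat] in pos_tags]

# Tags the morphology returns for "boxes" (stem "box").
boxes_tags = ["Noun Pl", "Verb 3Sg"]
print(effective_entries("box"))
print(surviving_categories("box", boxes_tags))
```

With these inputs there are four effective entries, and only the noun and verb survive, matching the slide's account that the adverb and adjective fail sublexically.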

19
Inflectional morphology summary
  • Integrating FST morphologies significantly
    decreases lexicon development
  • Verbs and other unpredictable items are listed
    only under their stem form
  • Predictable items such as nouns are processed via
    -unknown and never listed in the lexicon

20
Guessers
  • Even large industrial FST morphologies are not
    complete
  • Novel words usually have regular morphology
  • Build an FST guesser based on this
  • Words with capital letters are proper nouns
    (Saakashvili)
  • Words ending in -ed are past tense verbs or
    deverbal adjectives
  • Guessed words will go through -unknown
  • no difference from standard morphological output
  • can add Guessed tag for further control
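
The two heuristics named on this slide can be approximated with a few lines of Python. This is a sketch under stated assumptions, not the FST guesser itself: the function name `guess`, the tag names `Past` and `Adj`, the length check, and the naive stripping of "ed" to obtain a stem are all hypothetical choices for the example.

```python
def guess(surface):
    """Return guessed analyses for a word the regular morphology does not know."""
    analyses = []
    # Heuristic 1: a capitalized word is guessed to be a proper noun.
    if surface[:1].isupper():
        analyses.append(f"{surface} Prop Guessed")
    # Heuristic 2: a word ending in -ed is a past tense verb or a deverbal adjective.
    if surface.endswith("ed") and len(surface) > 3:
        stem = surface[:-2]  # naive stem; real guessers handle spelling changes
        analyses.append(f"{stem} Verb Past Guessed")
        analyses.append(f"{surface} Adj Guessed")
    return analyses

print(guess("Saakashvili"))
print(guess("blorked"))
```

The trailing `Guessed` tag is what lets the grammar treat guessed analyses differently from regular ones if further control is wanted.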

21
Guessers: controlling application
  • Apply guesser in the grammar only if there is no
    form in the regular morphology
  • don't guess unless you have to
  • Control this with the MorphConfig
  • use multiple fst morphologies
  • stop looking once an analysis is found

22
Sample MorphConfig
STANDARD ENGLISH MORPHOLOGY (1.0)
TOKENIZE
  english.tok.parse.fst
ANALYZE USEFIRST
  english.infl.fst          (try regular morphology first)
  english.guesser.fst       (if fail, guess)
MULTIWORD
  english.standard.mwe.fst
23
Multiple morphology FSTs
  • In addition to the regular morphology and
    guesser, can have other morphologies
  • morphology for technical terms, part numbers,
    etc.
  • These can be applied in sequence or in parallel
    (cascaded or unioned)
  • ANALYZE USEALL
  • english.infl.fst (try regular morphology)
  • english.eureka.parts.fst (and also part names)
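
The difference between the USEFIRST and USEALL strategies can be sketched as two small Python functions over a list of morphologies. The function names and the toy `regular` and `guesser` morphologies are hypothetical; each morphology is modelled simply as a function from a surface form to a list of analyses.

```python
def use_first(surface, morphologies):
    """Cascade: return the analyses of the first morphology that produces any."""
    for morph in morphologies:
        analyses = morph(surface)
        if analyses:
            return analyses
    return []

def use_all(surface, morphologies):
    """Union: collect the analyses of every morphology."""
    results = []
    for morph in morphologies:
        results.extend(morph(surface))
    return results

# Toy morphologies for the demonstration.
def regular(surface):
    return {"dogs": ["dog Noun Pl"]}.get(surface, [])

def guesser(surface):
    return [f"{surface} Noun Sg Guessed"]

print(use_first("dogs", [regular, guesser]))  # regular analysis only
print(use_first("wugs", [regular, guesser]))  # falls through to the guesser
print(use_all("dogs", [regular, guesser]))    # both sources contribute
```

The cascade is what keeps a guesser from firing on known words, while the union suits supplementary morphologies such as one for part names.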

24
Morphology vs. surface form
  • System always allows surface form through
  • Lexicon can match this form for
  • multiword expressions
  • override/supplement morphological analysis
  • Example: or as an adverb (Or you could leave now.)
  • or ADV @(ADVERB or)
  • CONJ XLE @(CONJ or).

25
Tokenizers
  • Tokenizers break strings (sentences) into tokens
    (words)
  • Need to (for English)
  • break off punctuation
  • Mary laughs. ==> Mary TB laughs TB . TB
  • lower case certain letters
  • The dog ==> the TB dog

26
Tokenization and morphology
  • Linguistic analysis may govern tokenization
  • Are English contracted auxiliaries affixes or
    clitics?
  • affixes: John'll, no tokenization
  • John Noun Proper Fut
  • clitics: John'll ==> John TB 'll TB
  • John Noun Proper; will Fut
  • Arabic determiners and conjunctions
  • both written with adjacent words
  • determiner as an affix giving Def (Albint
    'the-girl')
  • conjunction tokenized separately (wakutub
    'and-books')

27
Non-deterministic tokenizers: Punctuation
  • Cannot just break off punctuation and insert a TB
  • Comma haplology
  • Find the dog, a poodle.
  • find TB the TB dog TB , TB a TB poodle TB , TB .
    TB
  • Period haplology
  • Go to Palm Dr.
  • go TB to TB Palm TB Dr. TB . TB
  • Resulting tokenizer is non-deterministic
  • System must be able to handle multiple inputs
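
Period haplology can be illustrated with a short Python sketch that returns more than one tokenization for the same string. The function names are hypothetical, and the sketch only covers a sentence-final period; it does not lower-case tokens or handle commas.

```python
def tokenize_final_period(sentence):
    """Return all token sequences for a sentence that ends in a period."""
    if not sentence.endswith("."):
        raise ValueError("this sketch only handles sentences ending in a period")
    words = sentence[:-1].split()
    if not words:
        return [["."]]
    # Reading 1: the period is only sentence-final punctuation.
    plain = words + ["."]
    # Reading 2: the period also belongs to an abbreviation (haplology),
    # so it appears both on the last word and as a separate token.
    abbreviated = words[:-1] + [words[-1] + ".", "."]
    return [plain, abbreviated]

def show(tokens):
    """Render tokens with the TB boundary marker used on the slides."""
    return " ".join(tok + " TB" for tok in tokens)

for reading in tokenize_final_period("Go to Palm Dr."):
    print(show(reading))
```

Both readings are passed on; it is left to the morphology and grammar to decide which one yields an analysis.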

28
Capitalization
  • Initial capitals are optionally lower cased
  • The boy left. ==> the boy left.
  • Mary left. ==> Mary left.
  • Example for both types of non-determinism
  • Bush saw them.
  • {Bush | bush} TB saw TB them TB (, TB) . TB
    (alternative first tokens; the comma token is
    optional)
  • Tokenization rules vary from language to language
    and by choice of linguistic analysis
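
Optional lower-casing of the initial capital can be sketched as a function that returns every case variant of a token sequence. The name `case_variants` is hypothetical, and the sketch only considers the first token.

```python
def case_variants(tokens):
    """Return the token list as given plus, if different, a copy with the
    first token's initial letter lower-cased."""
    if not tokens or not tokens[0]:
        return [list(tokens)]
    first = tokens[0]
    lowered = first[0].lower() + first[1:]
    variants = [list(tokens)]
    if lowered != first:
        variants.append([lowered] + list(tokens[1:]))
    return variants

for variant in case_variants(["Bush", "saw", "them", "."]):
    print(variant)
```

Keeping both variants lets the later stages choose between the proper noun and the common noun reading of the first word.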

29
Conclusions
  • System architecture integrates FST techniques
    with deep LFG parsing
  • tokenizers
  • morphologies and guessers
  • Allows generalizations to be factored out
  • properties of words
  • properties of strings
  • Allows use of existing large-scale lexical
    resources
  • avoids redundant specification
  • System is actively in use in ParGram grammars

31
Shallow Markup
  • Preprocessing with shallow markup can reduce
    ambiguity and speed processing
  • Tokenizer must be able to process the markup
  • Part of speech tagging
  • I/PRP_ saw/VBD_ her/PRP_ duck/VB_.
  • Named entities
  • General Mills bought it.

32
POS tagging
  • POS tags are not relevant for tokenizing, but the
    tokenizer must skip them
  • She walks/VBZ_. should be treated like She walks.
  • The morphology must only insert compatible tags
  • A mapping table states allowable combinations
  • /VBZ_ Verb 3sg
  • /NN_ Noun Sg
  • These are encoded into a filtering FST
  • Only compatible tags are passed to the grammar

33
POS tagging example
  • I saw her duck
  • duck Noun Sg
  • duck Verb Pres Non3sg
  • both possibilities passed to the grammar
  • I saw her duck/VB_.
  • only Verb Pres Non3sg possibility is
    compatible with /VB_ POS tag
  • only this possibility is passed to the grammar
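
The filtering step can be sketched in Python as a mapping table plus a subset check. The `COMPATIBLE` table and `filter_analyses` are hypothetical names; the `VBZ` and `NN` rows follow the mapping table on the previous slide, while the `VB` row is an assumption added so that the duck example can run.

```python
# POS tag -> morphological tags an analysis must contain to be compatible.
COMPATIBLE = {
    "VBZ": {"Verb", "3sg"},
    "NN": {"Noun", "Sg"},
    "VB": {"Verb"},  # assumed for this example; not listed on the slides
}

def filter_analyses(analyses, pos_tag=None):
    """Keep only analyses compatible with the POS tag; keep all if untagged."""
    if pos_tag is None:
        return list(analyses)
    required = COMPATIBLE[pos_tag]
    # An analysis is "stem tag tag ..."; compare its tag set to the requirement.
    return [a for a in analyses if required <= set(a.split()[1:])]

duck = ["duck Noun Sg", "duck Verb Pres Non3sg"]
print(filter_analyses(duck))         # untagged input: both readings
print(filter_analyses(duck, "VB"))   # duck/VB_: verb reading only
```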

34
Named Entities
  • Named entities appear in text as XML markup
  • General Mills bought it.
  • Tokenizer
  • creates special tag for these
  • puts literal spaces instead of TBs
  • allows version without markup for fallback
  • General Mills TB NamedEntity TB
  • General TB Title TB Mills Proper TB
  • Lexical entry added for NamedEntity
  • Sublexical N and NAME rules allow the tag

35
Sample Named Entity output