Linguistics 239E Week 6 - PowerPoint PPT Presentation

1
Linguistics 239E Week 6
FSTs and XLE Grammars
  • Ron Kaplan and Tracy King

2
NLP Reading Group
  • Weekly meetings, Fri 2:00 - 3:15, Ventura 17
  • This week:
  • S. Bilac, T. Baldwin, and H. Tanaka. 2003.
    Improving dictionary accessibility by maximizing
    use of available knowledge
  • URL: http://nlp.stanford.edu/nlp/nlpgroup.html
  • Sign up:
  • majordomo@lists.stanford.edu
  • subscribe nlp-reading

3
Office Hours/Homework
  • Monday is a holiday
  • Office hours: Tuesday 4:00-4:30
  • HW assignment will not be posted until tomorrow
    afternoon

4
Issues from Week 5 Homework
  • "Useless existential constraint"
  • S --> NP: (^ SUBJ)=!;
  •       VP.
  • Context free parsing
  • adding the constraints cuts down on the number of
    final analyses
  • the trees are all still there
  • to eliminate trees, you would have to change the
    c-structure in some way

5
Issues from Week 5 Homework
  • Double analysis with "want to"
  • separate out the paths in the functional
    uncertainty so that SUBJ and XCOMP SUBJ do not
    show up together (hard to do in general case)
  • off-path constraint
  • (^ {COMP|XCOMP}* {SUBJ: ~(<- XCOMP) | OBJ
    | OBL OBJ})=!
  • subtract out the XCOMP SUBJ (recommended
    solution)
  • (^ {COMP|XCOMP}* {SUBJ | OBJ | OBL OBJ}
    - XCOMP SUBJ)=!

6
Issues from Week 5 Homework
  • "gap": to create the gap, the constituent in
    question can just be optional; e is not needed
  • optional: (NP: (^ OBJ)=!)
  • with e:   { NP: (^ OBJ)=!
  •           | e }
  • Analysis of "be" (tangential to assignment)
  • "be" should have two arguments
  • it is unclear if the second argument is "open"
    (XCOMP) or "closed" (OBJ/PREDLINK)

7
FSTs in XLE grammars
  • FSTs are used for
  • tokenization
  • morphological analysis
  • Incorporated via the MORPHCONFIG

8
Tokenization
  • Tokenization breaks up a string (sentence) into
    tokens (words)
  • Break off punctuation
  • Break off clitics
  • Lowercasing
  • Allow for markup

9
Punctuation and clitics
  • Simple breaking
  • I see them. ==> I TB see TB them TB . TB
  • The dog, a poodle, ==> The TB dog TB , TB a TB
    poodle TB , TB
  • Haplology
  • Find the dog, Muffy. ==> Find TB the TB dog TB ,
    TB Muffy TB , TB . TB
  • Go to Palm Dr. ==> Go TB to TB Palm TB Dr. TB .
    TB
  • Clitics
  • I'll go. ==> I TB 'll TB go TB . TB
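The simple cases above can be sketched in a few lines of Python. This is only a toy, deterministic illustration and not the FST tokenizer used with XLE: the abbreviation list, the clitic pattern, and the function name tokenize are assumptions made for the example.

```python
import re

TB = "TB"  # token-boundary marker, written as on the slides

# Hypothetical abbreviation list; a real tokenizer would use a much larger one.
ABBREVIATIONS = {"Dr.", "Mr."}

# Hypothetical clitic pattern: a word followed by one apostrophe clitic.
CLITIC = re.compile(r"^(\w+)('ll|'s|'re|'ve)$")


def tokenize(sentence):
    """Break off punctuation and clitics, marking each token with TB."""
    words = sentence.split()
    tokens = []
    for i, word in enumerate(words):
        is_last = i == len(words) - 1
        if word in ABBREVIATIONS:
            tokens.append(word)
            if is_last:
                # Haplology: one period served as both the abbreviation
                # period and the sentence period, so restore the second one.
                tokens.append(".")
            continue
        # Peel trailing punctuation off the word, keeping its order.
        trailing = []
        while word and word[-1] in ".,?!":
            trailing.insert(0, word[-1])
            word = word[:-1]
        match = CLITIC.match(word)
        if match:
            tokens.extend(match.groups())
        elif word:
            tokens.append(word)
        tokens.extend(trailing)
    return " ".join(f"{token} {TB}" for token in tokens)


print(tokenize("Go to Palm Dr."))
```

The comma haplology in "Find the dog, Muffy." is not handled here, because restoring the optional comma requires producing several alternative tokenizations rather than a single one.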

10
Punctuation Problems
  • When to break off punctuation is not always clear
  • Hyphens: part of the word or separate
    punctuation?
  • a six-year-old boy
  • a windshield-wiper blade cleaner
  • The dog - a poodle - barked.

11
Lowercasing
  • Need to (optionally) lowercase in certain
    positions (depends on the language)
  • Sentence initially
  • The boy left. ==> the boy left.
  • Mary left. ==> Mary left.
  • After colons
  • The boy left: He was unhappy. ==> The boy left:
    he was unhappy.
  • All caps
  • Do NOT leave. ==> do not leave.
  • IBM did well. ==> IBM did well.

12
Tokenizers are non-deterministic
  • Allow for multiple tokenizations to guarantee
    that the correct one is included
  • Bush saw them. ==> { Bush | bush } TB saw TB them
    TB (, TB) . TB
  • May include markup
  • All caps lowering marked
  • IBM ==> { IBM | ibm }
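A rough way to picture the non-determinism is to keep every casing alternative and take the cross product over positions. The sketch below is an assumption-laden illustration in Python (the function names and the rule for when to lowercase are invented for the example); a real tokenizer FST encodes the alternatives in a single network.

```python
from itertools import product


def case_alternatives(word, sentence_initial):
    """Return the word plus a lowercased variant where lowering is plausible."""
    alternatives = [word]
    capitalized_start = sentence_initial and word[:1].isupper()
    all_caps = word.isupper() and len(word) > 1
    if capitalized_start or all_caps:
        lowered = word.lower()
        if lowered != word:
            alternatives.append(lowered)
    return alternatives


def tokenizations(words):
    """Enumerate every combination of casing choices for a list of tokens."""
    choices = [case_alternatives(w, i == 0) for i, w in enumerate(words)]
    return [list(combo) for combo in product(*choices)]


for alternative in tokenizations(["Bush", "saw", "them", "."]):
    print(alternative)
```

Both "Bush" and "bush" survive, so the grammar can pick the proper noun; for "Do NOT leave" the fully lowered reading is one of four alternatives.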

13
Allowing for markup
  • Normal rules of tokenization (lowercasing,
    haplology) need to skip markup
  • The markup should not be broken up like regular
    punctuation
  • labeled bracketing
  • I see \[NP the dog, a poodle\].
  • named entities
  • <person>Mr. Smith</person>

14
FST Morphologies
  • Associate a surface form of a word with a
    canonical form (lemma, stem) and a set of tags
  • Tags give grammatical information
  • Part of speech
  • Other information (number, tense, etc)
  • Tags may give additional information
  • Classes of proper nouns (names, locations)

15
Examples: English
  • went: go "Verb" "PastTense" "123SP"
  • boxes: box { "Noun" "Pl"
  •            | "Verb" "Pres" "3sg" }
  • Mary "Prop" "Giv" "Fem" "Sg"
  • him he "Pron" "Pers" "Acc" "3P" "Sg"

16
Examples: French
  • fleur: fleur "Fem" "SG" "Noun"
  • venir: venir "Inf" "Verb"
  • vienne: venir "SubjP" "SG" {"P1"|"P3"}
    "Verb"
  • tour: tour {"Masc"|"Fem"} "SG" "Noun"
  • France: France "Fem" "InvPL" "Country"
    "Proper" "Noun"

17
An example
  • String: Children came.
  • Tokens: { Children | children } TB came TB (, TB)
    . TB
  • Morphology (for the tokens we want):
  • { child Noun Pl | children Token }
  • { come Verb PastTense 123SP | came Token }
  • { . Punct Sent | . Token }
  • Outputs from tokenizer and morphology FSTs can
    multiply out
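To see how the outputs multiply out, here is a small Python sketch. The lookup table MORPHOLOGY is a hypothetical stand-in for the morphology FST, and the convention of adding the surface form with a Token tag simply follows the slide; none of this is XLE's actual implementation.

```python
from itertools import product

# Hypothetical toy table standing in for the morphology FST.
MORPHOLOGY = {
    "children": ["child Noun Pl"],
    "came": ["come Verb PastTense 123SP"],
    ".": [". Punct Sent"],
}


def analyses(token):
    """Known analyses of a token, plus the surface form tagged Token."""
    return MORPHOLOGY.get(token, []) + [f"{token} Token"]


def multiply_out(token_alternatives):
    """token_alternatives holds one list of alternative tokens per position."""
    per_position = []
    for alternatives in token_alternatives:
        options = []
        for token in alternatives:
            options.extend(analyses(token))
        per_position.append(options)
    return list(product(*per_position))


combinations = multiply_out([["Children", "children"], ["came"], ["."]])
print(len(combinations))
```

Three options for the first position and two each for the other two give twelve combinations, of which the grammar will accept only a few.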

18
The process in XLE
  • (diagram not included in the transcript)
19
Viewing the analysis in XLE
  • If an FST tokenizer is loaded with the grammar:
  • tokens I'll try this string.
  • If an FST morphology is loaded with the grammar:
  • morphemes testing
  • These results are also visible in the morph
    window (from the c-structure window options)

20
Using FSTs with the grammar
  • Tokenize the string
  • Children came. ==> children TB came TB . TB
  • Run the tokens through the morphology
  • child Noun Pl come Verb PastTense 123SP .
    Punct Sent
  • Parse the lemmas and the tags
  • sublexical rules build up the words
  • regular rules build the words into phrases
  • each tag has a lexical entry

21
Lexical entries for stems and tags
  • Like the lexical entries you have seen, only with
    XLE instead of *
  • boy N XLE @(NOUN boy).
  • Noun N_SFX XLE @(PERS 3).
  • Sg N_NUM XLE @(NUM sg).
  • Pl N_NUM XLE @(NUM pl).
  • Note: no entry for boys
  • * matches tokens that don't go through the FST; XLE
    matches FST output stems

22
Sublexical rules
  • Want to insert rules between the lexical
    categories (e.g. N) and the same category in the
    lexicon
  • But the lexical category only identifies the stem
    or base
  • Sublexical rules combine the base with the
    inflectional tags
  • So, build a category (N) from the base (N_BASE)

23
Sublexical rules cont.
  • Like lexical rules, only:
  • Add _BASE to the category in the lexicon
  • boy N   Noun N_SFX   Sg N_NUM
  • Example:
  • N --> N_BASE
  •       N_SFX_BASE
  •       N_NUM_BASE.
  • When parsing, the sublexical trees are not shown.
    Right click on the leaf node (e.g., N) and
    choose "show morphemes" to see them.
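As a rough picture of what the example rule does, the Python sketch below looks up each morpheme's lexicon category, adds _BASE, and checks the resulting sequence against the rule's right-hand side. The dictionaries and the function name build_word are hypothetical; XLE itself parses sublexical rules with its ordinary chart parser.

```python
# Morpheme -> sublexical category (the lexicon category with _BASE added).
LEXICON = {
    "boy": "N_BASE",
    "Noun": "N_SFX_BASE",
    "Sg": "N_NUM_BASE",
    "Pl": "N_NUM_BASE",
}

# Mother category -> required sequence of sublexical daughters.
SUBLEXICAL_RULES = {"N": ["N_BASE", "N_SFX_BASE", "N_NUM_BASE"]}


def build_word(morphemes):
    """Return the category built from the morphemes, or None if no rule fits."""
    categories = [LEXICON.get(m) for m in morphemes]
    for mother, daughters in SUBLEXICAL_RULES.items():
        if categories == daughters:
            return mother
    return None


print(build_word(["boy", "Noun", "Pl"]))
```

A morpheme sequence that skips a required tag, such as boy followed directly by Pl, matches no rule and yields None.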

24
NP example tree
25
Sublexical rules cont.
  • A --> A_BASE
  •       A_SFX_BASE
  •       (A_DEG_BASE).        optionality
  • N --> { N_BASE             disjunction
  •         N_SFX_BASE
  •         N_NUM_BASE
  •       | VN_BASE
  •         V_SFX_BASE
  •         V_TYPE_BASE* }.    Kleene star

26
Using the -unknown entry
  • Words with predictable subcat frames can go
    through the special entry -unknown
  • The tags will constrain the distribution
  • This avoids having to list all adverbs,
    adjectives, nouns, etc.
  • %stem picks up the lemma/stem
  • -unknown ADJ XLE @(ADJ %stem);
  •          N XLE @(NOUN %stem);
  •          ADV XLE @(ADVERB %stem).

27
Lexicon and -unknown
  • Verbs ought to be listed due to their subcat
    frames
  • Idiosyncratic entries for nouns, etc. need to be
    listed
  • But avoid duplicating in the lexicon the work done
    by the FST morphology: mapping to categories is
    done in only one place

28
FST guessers
  • The morphologies are good, but don't have all
    words
  • FST guessers can be written
  • work best for languages with lots of morphology
  • for English:
  • -ed can be a verb or adjective
  • -ing can be a verb, noun, or adjective
  • -s can be a plural noun or 3sg verb
  • words starting with capitals can be proper nouns
  • etc.
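The English heuristics listed above can be written down directly. The following Python sketch is only an illustration of the idea: the tag strings are borrowed loosely from the earlier example slides, the function name guess is invented, and a real guesser would be an FST that also proposes a stem.

```python
def guess(word):
    """Return the set of tag guesses suggested by the word's shape."""
    guesses = set()
    if word.endswith("ed"):
        guesses |= {"Verb", "Adj"}
    if word.endswith("ing"):
        guesses |= {"Verb", "Noun", "Adj"}
    if word.endswith("s"):
        guesses |= {"Noun Pl", "Verb 3sg"}
    if word[:1].isupper():
        guesses.add("Prop")
    return guesses


print(sorted(guess("Zorbs")))
```

Because every matching heuristic contributes, a guesser overgenerates by design; the grammar and the tags then constrain which guess survives.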

29
Using multiple FSTs
  • How FSTs are used is declared in the MORPHCONFIG
  • The toy grammars use a default MORPHCONFIG
  • TOKENIZE and ANALYZE sections
  • Sections to specify
  • where the fsts are
  • how to treat multiword expressions

30
Example MORPHCONFIG
  • STANDARD ENGLISH MORPHOLOGY (1.0)
  • TOKENIZE:
  • whitespace.fst tokenizer.fst
  • ANALYZE USEFIRST:
  • main-morphology.fst
  • english-guesser.fst
  • ANALYZE USEALL:
  • eureka-numbers.fst
  • eureka-novel-nouns.txt
  • ----

31
Morphconfig cont.
  • TOKENIZE:
  • whitespace.fst tokenizer.fst
  • The FSTs listed are composed: the output of the
    first is the input to the second, etc.
  • Having multiple fsts
  • may avoid problems with large
    compositions
  • allows for modularity
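Composition can be pictured by treating each transducer as a function from one string to a set of output strings and feeding every output of one stage into the next. The Python sketch below assumes exactly that simplification; the two stage functions are hypothetical stand-ins for whitespace.fst and tokenizer.fst, not their real behavior.

```python
def compose(*transducers):
    """Chain transducers: each output of one stage is input to the next."""
    def composed(text):
        current = {text}
        for transduce in transducers:
            current = {out for item in current for out in transduce(item)}
        return current
    return composed


def whitespace(text):
    """Stand-in for whitespace.fst: normalize runs of whitespace."""
    return {" ".join(text.split())}


def tokenizer(text):
    """Stand-in for tokenizer.fst: split the final period, offer lowercasing."""
    if text.endswith("."):
        text = text[:-1] + " ."
    tokens = " TB ".join(text.split()) + " TB"
    return {tokens, tokens[:1].lower() + tokens[1:]}


pipeline = compose(whitespace, tokenizer)
print(sorted(pipeline("Children   came.")))
```

Keeping the stages separate, as the slide notes, means each one can be tested or replaced on its own without rebuilding one large network.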

32
Morphconfig cont.
  • ANALYZE USEFIRST:
  • main-morphology.fst
  • english-guesser.fst
  • Take as input the individual tokens from the
    tokenizer
  • Apply the analyzers one by one until an analysis
    is found. Once an analysis is found, it stops.
  • Effect of the above example:
  • first try to find the analysis in the main
    morphology
  • if that fails, guess the morphological analysis

33
Morphconfig cont.
  • ANALYZE USEALL:
  • eureka-numbers.fst
  • eureka-novel-nouns.fst
  • Each morphological analyzer is applied to the
    string; the result is the union of their outputs
  • In the example, if a string could be both a
    eureka number and a eureka novel noun, it will
    get both analyses
  • It is not necessary to have both USEALL and
    USEFIRST sections.
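The difference between the two strategies is easy to state in code. In this Python sketch each analyzer is modeled as a function from a token to a set of analyses; the two toy analyzers and their output strings are hypothetical, and the sketch says nothing about how XLE combines the two sections with each other.

```python
def use_first(analyzers, token):
    """Return the results of the first analyzer that finds any analysis."""
    for analyze in analyzers:
        results = set(analyze(token))
        if results:
            return results
    return set()


def use_all(analyzers, token):
    """Return the union of the results of every analyzer."""
    results = set()
    for analyze in analyzers:
        results |= set(analyze(token))
    return results


def main_morphology(token):
    """Toy stand-in for a main morphology with a single known word."""
    return {"walked": {"walk Verb PastTense"}}.get(token, set())


def guesser(token):
    """Toy stand-in for a guesser that only knows the -ed pattern."""
    if token.endswith("ed"):
        return {f"{token[:-2]} Verb PastTense Guessed"}
    return set()


print(use_first([main_morphology, guesser], "walked"))
print(use_all([main_morphology, guesser], "walked"))
```

With USEFIRST the guesser is consulted only for words the main morphology does not know, which keeps guessed analyses from cluttering known words.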

34
FST/XLE main points
  • XLE allows the incorporation of FSTs through the
    MORPHCONFIG
  • Tokenizers, including special markup, and
    morphological analyzers can be included
  • Large morphological analyzers in conjunction with
    sublexical rules and the unknown lexical item
    reduce the need for lexicon development

35
Demo from the large English grammar
  • tokens and morphemes commands
  • morphology window
  • sublexical rules
  • guesser
