Transcript and Presenter's Notes

Title: Linguistics 239E Week 8


1
Linguistics 239E Week 8
Fragments, Performance limits, Shallow markup
  • Ron Kaplan and Tracy King

2
Issues from Week 7 HW
  • Be careful to make your disjunctions
    non-overlapping (unless you really mean it)
  • V3SG  { (^ SUBJ NUM) = sg
            (^ SUBJ PERS) = 3
          | @(OT-MARK BadVAgr) }.
  • he laughs. --> 11 parses
  • V3SG  { (^ SUBJ NUM) = sg
            (^ SUBJ PERS) = 3
          | ~[ (^ SUBJ NUM) = sg
               (^ SUBJ PERS) = 3 ]
            @(OT-MARK BadVAgr) }.
  • you laughs. --> 2 parses
    (a toy illustration of the overlap follows below)
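The effect can be seen outside XLE as well. The following is a minimal sketch in plain Python, not XLE notation, of why an unconstrained OT-marked disjunct also succeeds when agreement succeeds; the branch names and the toy feature dictionary are invented for illustration.

    # Minimal sketch in plain Python (not XLE): why an overlapping disjunction
    # multiplies analyses.  Branch names and the "facts" dictionary are invented.

    def satisfied_branches(branches, facts):
        """Return every disjunct whose constraints all hold of `facts`;
        each satisfied disjunct contributes its own analysis."""
        return [name for name, constraints in branches
                if all(facts.get(attr) == val for attr, val in constraints.items())]

    # Overlapping version: the OT-marked disjunct has no constraints of its own,
    # so it also succeeds when agreement succeeds -> spurious extra analyses.
    overlapping = [
        ("agreement", {("SUBJ", "NUM"): "sg", ("SUBJ", "PERS"): "3"}),
        ("ot-marked", {}),
    ]

    # Non-overlapping version: the OT-marked disjunct applies only when the
    # agreement constraints fail (the negated constraints on the slide).
    def non_overlapping_branches(facts):
        agrees = (facts.get(("SUBJ", "NUM")) == "sg"
                  and facts.get(("SUBJ", "PERS")) == "3")
        return ["agreement"] if agrees else ["ot-marked"]

    he = {("SUBJ", "NUM"): "sg", ("SUBJ", "PERS"): "3"}   # "he laughs"
    print(satisfied_branches(overlapping, he))            # ['agreement', 'ot-marked']: both
    print(non_overlapping_branches(he))                   # ['agreement']: exactly one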

3
New XLE release
  • 8 bug fixes/improvements based on input from the
    class
  • Windows control-m problem
  • S--> problem
  • can parse lexical categories
  • closed sets as arguments to templates
  • interpretation of empty feature declaration
  • lexical entries without morph codes
  • index files on Windows
  • documentation contact updated

4
FRAGMENT grammar
  • What to do when the grammar does not get a parse
  • always want some type of output
  • want the output to be maximally useful
  • Why might it fail?
  • construction not covered yet
  • "bad" input
  • took too long (XLE parsing parameters)

5
Grammar engineering approach
  • First try to get a complete parse
  • If that fails, build up chunks that get complete
    parses (c-structure and f-structure)
  • Have a fall-back for things without even chunk
    parses
  • Link these chunks and fall-backs together in a
    single f-structure

6
Basic idea
  • XLE has a REPARSECAT which it tries if there is
    no complete parse
  • Grammar writer specifies what category the
    possible chunks are
  • OT marks are used to
  • build the fewest chunks possible
  • disprefer using the fall-back over the chunks
    (a toy ranking sketch follows below)
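A minimal sketch in plain Python, not XLE, of the ranking idea: the candidate chunkings and the ranking key are invented for illustration (fewer fall-back tokens first, then fewer chunks), using the example from the next slides.

    # Minimal sketch (plain Python, not XLE): rank candidate fragment analyses
    # by their OT-style marks.  Candidates and the key are illustrative.

    def mark_key(analysis):
        """analysis is a list of chunk categories, e.g. ['TOKEN', 'S'].
        Lower tuples are better: (number of TOKEN fall-backs, number of chunks)."""
        fallbacks = sum(1 for cat in analysis if cat == "TOKEN")
        return (fallbacks, len(analysis))

    # Candidate chunkings for "the the dog appears." when no full parse exists.
    candidates = [
        ["TOKEN", "TOKEN", "TOKEN", "TOKEN", "TOKEN"],  # everything as fall-back tokens
        ["TOKEN", "S", "TOKEN"],                        # "the" + S("the dog appears") + "."
        ["TOKEN", "S"],                                 # same, with the period ignored
    ]

    print(min(candidates, key=mark_key))                # ['TOKEN', 'S']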

7
Sample output
  • the the dog appears.
  • Split into
  • "token" the
  • sentence "the dog appears"
  • ignore the period

8
C-structure
9
F-structure
10
How to get this
FRAGMENTS -->  { NP: (^ FIRST) = !
                     @(OT-MARK Fragment)
               | S: (^ FIRST) = !
                    @(OT-MARK Fragment)
               | TOKEN: (^ FIRST) = !
                        @(OT-MARK Fragment) }
               (FRAGMENTS: (^ REST) = !).

Lexicon:  -token  TOKEN  (^ TOKEN) = %stem
                         @(OT-MARK Token).
11
Why First-Rest?
  • FIRST-REST:
      [ FIRST [ PRED ... ]
        REST  [ FIRST [ PRED ... ]
                REST  ... ] ]
  • Efficient
  • Encodes order
  • Possible alternative: a set
      { [ PRED ... ], [ PRED ... ] }
  • Not as efficient (copying)
  • Even less efficient if scope facts must be marked
    (a cons-list sketch follows below)
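A minimal sketch in plain Python, not f-structure notation: FIRST/REST is essentially a cons list, so chunk order is encoded and extending the list never copies earlier cells. The PRED values are placeholders.

    def cons(first, rest=None):
        """One FIRST/REST cell, represented as a plain dict."""
        return {"FIRST": first, "REST": rest}

    # Fragments for: "the" | "the dog appears"
    chunks = cons({"PRED": "the"},
                  cons({"PRED": "appear<dog>"}))

    node = chunks
    while node is not None:             # walking the list recovers the order
        print(node["FIRST"]["PRED"])
        node = node["REST"]

    # The set-style alternative { PRED ..., PRED ... } loses the order, and
    # recording order (or scope) separately would force copying.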

12
Accuracy?
  • Evaluation against a gold standard
  • PARC 700 f-structure bank for the Wall Street
    Journal
  • Measure F-score on dependency triples
  • F-score: harmonic mean of precision and recall
    (computation sketched below)
  • Dependency triples: separate f-structure
    features, e.g.
  • Subj(run, dog)  Tense(run, past)
  • Results for the best-matching f-structure
  • Full parses: F = 88.5
  • Fragment parses: F = 76.7

(Riezler et al., 2002)
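For concreteness, a minimal sketch in plain Python of the dependency-triple F-score. The triples below are invented toy data, not the PARC 700 gold standard.

    def f_score(gold, predicted):
        """F-score = harmonic mean of precision and recall over triple sets."""
        gold, predicted = set(gold), set(predicted)
        if not gold or not predicted:
            return 0.0
        tp = len(gold & predicted)                  # triples both analyses agree on
        precision = tp / len(predicted)
        recall = tp / len(gold)
        if precision + recall == 0:
            return 0.0
        return 2 * precision * recall / (precision + recall)

    gold = {("Subj", "run", "dog"), ("Tense", "run", "past"), ("Num", "dog", "sg")}
    pred = {("Subj", "run", "dog"), ("Tense", "run", "pres")}
    print(round(f_score(gold, pred), 3))            # 0.4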
13
Fragments summary
  • XLE has a chunking strategy for when the grammar
    does not provide a full analysis
  • Each chunk gets full c-str and f-str
  • The grammar writer defines the chunks based on
    what will be best for that grammar and
    application
  • Quality
  • Fragments have reasonable but degraded f-scores
  • Usefulness in applications is being tested

14
Resource limitations: Time and space
15
Exceeding available resources
16
Hard limits: Time and storage
  • For some applications:
  • No output on a few hard sentences is better than
    getting hung up and never getting to the easy ones
  • E.g.
  • Search applications: you never find everything
    anyway
  • Grammar testing/debugging: no surprise, move on
  • XLE commands:
  • set timeout 60                       (abort after 60
    seconds)
  • set max_xle_scratch_storage 50       (abort after 50
    megabytes)
    (a toy timeout sketch follows below)
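A minimal sketch in plain Python, not the XLE implementation, of what a hard time limit buys: give up on one hard sentence instead of hanging. The edge enumerator and the budget value are toy stand-ins.

    import itertools
    import time

    def enumerate_edges(sentence):
        """Toy stand-in for chart-edge enumeration: every span of the sentence."""
        words = sentence.split()
        for i, j in itertools.combinations(range(len(words) + 1), 2):
            yield (i, j, tuple(words[i:j]))

    def parse_with_timeout(sentence, timeout_s=60.0):
        start = time.monotonic()
        chart = []
        for edge in enumerate_edges(sentence):
            if time.monotonic() - start > timeout_s:
                return None             # limit hit: no output, move on
            chart.append(edge)
        return chart

    chart = parse_with_timeout("the dog appears .", timeout_s=60.0)
    print(len(chart))                   # finishes well inside the budget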

17
Soft limits: Skimming
  • Bound the f-structure effort per subtree
  • Compute normally until a threshold is reached
  • set start_skimming_when_scratch_storage_exceeds
    700 (megabytes)
  • set start_skimming_when_total_events_exceed XX
    (some number)
  • (XX estimated from timeouts in test runs)
  • Then limit the number of solutions per edge
    (a toy per-edge cap sketch follows below)
  • set max_new_events_per_graph_when_skimming XX
  • Bounded computation per edge ⇒ cubic
  • Result in reasonable time/space
  • At least one solution for every sentence
  • But some solutions will be missed
  • Suppress weighty constituents
  • Limit the length of medial constituents
  • set max_medial_constituent_weight 20
  • Don't allow medial edges that span more than 20
    terminals
  • (approximation to avoiding center embedding)
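The per-edge bound can be pictured with a toy sketch in plain Python; nothing here is XLE's actual skimming code, and the cap value and data are invented. Capping the number of solutions kept when combining sub-analyses bounds the work per edge at the cost of missing some solutions.

    def combine(left_solutions, right_solutions, cap=None):
        """Cross-product of sub-solutions, optionally truncated per edge."""
        out = []
        for l in left_solutions:
            for r in right_solutions:
                out.append(l + r)
                if cap is not None and len(out) >= cap:
                    return out          # skimming: some solutions are silently dropped
        return out

    left = [("a%d" % i,) for i in range(50)]
    right = [("b%d" % j,) for j in range(50)]
    print(len(combine(left, right)))            # 2500: full computation
    print(len(combine(left, right, cap=20)))    # 20:   skimmed, bounded per edge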

18
Accuracy?
  • Again, evaluation against the gold standard
  • PARC 700 f-structure bank for the Wall Street
    Journal
  • Results for the best-matching f-structure
  • Full parses: F = 88.5
  • Fragment parses: F = 76.7
  • Skimmed parses: F = 70.3
  • Skimmed/Fragments: F = 61.3

(Riezler et al., 2002)
19
Integrating Shallow Mark-up: Part-of-speech tags,
Named entities, Syntactic brackets
20
Shallow mark-up of input strings
  • Part-of-speech tags (tagger?)
  • I/PRP saw/VBD her/PRP duck/VB.
  • I/PRP saw/VBD her/PRP duck/NN.
  • Named entities (named-entity recognizer)
  • <person>General Mills</person> bought it.
  • <company>General Mills</company> bought it.
  • Syntactic brackets (chunk parser?)
  • [NP-S I] saw [NP-O the girl] with the
    telescope.
  • [NP-S I] saw [NP-O the girl with the
    telescope].

21
Hypothesis
  • Shallow mark-up
  • Reduces ambiguity
  • Increases speed
  • Without decreasing accuracy
  • (Helps development)
  • Issues
  • Markup errors may eliminate correct analyses
  • Markup process may be slow
  • Markup may interfere with existing robustness
    mechanisms (optimality, fragments, guessers)
  • Backoff may restore robustness but decrease speed
    in 2-pass system (STOPPOINT)

22
Implementation in XLE
How to integrate with minimal changes to existing
system/grammar?
23
XLE String Processing
(processing pipeline, bottom to top)
  lexical forms
  Multiwords        (modify sequences)
  token morphemes
  Analyze           (Morph, Guess, Tok)
  tokens            Tthe TB oil TB filter TB s TB gone TB
  Tokenize          (decap, split, commas)
  string            The oil filter's gone
24
Part of speech tags
(same pipeline: string → Tokenize → tokens → Analyze → token morphemes → Multiwords → lexical forms)
  string            The/DET_ oil/NN_ filter/NN_s/VBZ_ gone/VBN_
  • How do tags pass through Tokenize/Analyze?
  • Which tags constrain which morphemes?
  • How?
25
Passing tags through Tokenizer
  • The tokenizer must treat tag characters specially
  • Must recognize them, e.g. xxx/TAG_
  • Must not transform them, e.g. x/NN_ → x/nn_
  • Must not let tags interrupt other patterns
  • e.g. wo/MD_nt/RB_ should behave like
    won't
  • Must split tags off as separate tokens, for the
    existing Token path through the Analyzer
    (a toy splitting sketch follows below)
  • How to do this with minimal changes to the existing
    tokenizer FST?

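A minimal sketch using a Python regular expression, not the xfst tokenizer: recognize the xxx/TAG_ mark-up, leave the tag untouched, and split it off as a separate token. The tag shape [A-Z$]+ is an assumption based on the Penn tags used in the slides.

    import re

    TAG = re.compile(r"/([A-Z$]+_)")    # assumed tag shape: uppercase letters (or $) plus "_"

    def split_tags(marked_up):
        """'wo/MD_nt/RB_' -> ['wo', 'MD_', 'nt', 'RB_']; untagged words pass through."""
        tokens = []
        for unit in marked_up.split():
            # re.split with a capturing group alternates plain text and tag captures
            tokens.extend(piece for piece in TAG.split(unit) if piece)
        return tokens

    print(split_tags("The/DT_ oil/NN_ filter/NN_s/VBZ_ gone/VBN_ ."))
    # ['The', 'DT_', 'oil', 'NN_', 'filter', 'NN_', 's', 'VBZ_', 'gone', 'VBN_', '.']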
26
Modifying an existing tokenizer
  • Tags shouldn't be transformed
  • Tags shouldn't disrupt any other patterns

Script for the xfst program:
  Tokenizer  Tag         (don't transform)
  .o.  Tokenizer / Tag   (don't disrupt)

Glitch: Ignore (/) introduces unwanted ambiguity
around insertions.
Solution (a little less modularity): construct the
Tokenizer using a cover symbol for the tags, placing
them with respect to insertions; then substitute the
actual tag strings for the cover symbol.
27
Specifying morpheme/pos-tag constraints
  • For each pos-tag, the grammar/morphology writer
    specifies by hand the set of compatible morph-tag
    sequences
  • Inputs: a description of the pos-tag interpretation
    (from the Penn documentation)
  • A list of all possible morph-tag sequences from the
    analyzer (from a program run on the Morph/Guesser
    FSTs)
  • Output: a text file that characterizes the
    relationship
  • E.g. NNS is a plural noun, so the text file has
  • (NNS (+Noun +Pl) (+Noun +SP) (+Abbr))
  • PRP is a personal pronoun, so the text file has
  • (PRP (+Pron +Pers +Gen) (+Pron +Poss))
  • A Lisp program reads the file and produces the
    POSFilter transducer
  • It allows an NNS_ Token sequence only if preceded by
    strings that contain
  • +Noun and +Pl tags, or +Noun and +SP tags, etc.
  • The POSFilter FST is put in the MULTIWORD section
    and knocks out undesired morpheme sequences
    (a toy filtering sketch follows below)
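A minimal sketch in plain Python of the filtering idea; XLE does this with the POSFilter FST in the MULTIWORD section, not with code like this. The two table entries follow the slides, while the filtering function and the "ducks" analyses are invented for illustration.

    POS_TABLE = {
        "NNS": [{"+Noun", "+Pl"}, {"+Noun", "+SP"}, {"+Abbr"}],
        "VBZ": [{"+Verb", "+Pres", "+3sg"}, {"+Aux", "+Pres", "+3sg"}],
    }

    def filter_analyses(pos_tag, analyses):
        """Keep only the morph-tag sequences compatible with the POS tag."""
        allowed = POS_TABLE.get(pos_tag)
        if allowed is None:
            return analyses                      # unknown tag: do not filter
        return [a for a in analyses
                if any(required <= set(a) for required in allowed)]

    ducks = [["duck", "+Noun", "+Pl"],           # plural-noun reading
             ["duck", "+Verb", "+Pres", "+3sg"]] # 3sg-verb reading
    print(filter_analyses("NNS", ducks))         # only the noun analysis survives
    print(filter_analyses("VBZ", ducks))         # only the verb analysis survives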

28
Excerpts from file
  Determiner
    (DT  (+Det ~+Interrog) (+DetPron ~+Interrog))
  Adjectives
    (JJ  (+Adj ~+Comp ~+Sup ~+Interrog ~+IntRel) (+Num +Ord) (+Dig +Ord) (+Verb +Prog))
    (JJR (+Adj +Comp))                                                            comparative
    (JJS (+Adj +Sup))                                                             superlative
  Verbs
    (VB  (+Verb +Pres ~+3sg) (+Aux +Pres ~+3sg) (+Verb +Inf) (+Aux +Inf))         base form
    (VBD (+Verb +PastTense) (+Verb +PastBoth) (+Aux +PastTense) (+Aux +PastBoth)) past
    (VBG (+Verb +PresPart) (+Verb +Prog) (+Aux +Prog))                            gerund, present participle
    (VBN (+Verb +PastPart) (+Verb +PastPerf) (+Aux +PastPerf)
         (+Verb +PastBoth) (+Aux +PastBoth))                                      past participle
    (VBP (+Verb +Pres ~+3sg) (+Aux +Pres ~+3sg))                                  non-3sg
    (VBZ (+Verb +Pres +3sg) (+Aux +Pres +3sg))                                    3sg
29
All together
(pipeline, bottom to top)
  lexical forms
  Multiwords        (POSFilter FST)
  token morphemes
  Analyze
  tokens
  Tokenize          (POSStringFST + the ordinary tokenizer FST)
  string
30
MORPHCONFIG
  STANDARD ENGLISH MORPHOLOGY (1.0)

  TOKENIZE:
  ../common/englishpostags.stringfst  ../common/english.tok.parse.fst

  ANALYZE:
  ../common/english.infl.fst
  ../common/english.morph.guesser.fst

  MULTIWORD:
  ../common/eng-infl-final.posfilterfst
  BuildMultiwordsFromLexicon
  Tag Prefer
  BuildMultiwordsFromMorphology
  Tag Prefer

31
Embellishments
  • Alternative POS tags in the string, if the tagger is unsure
  • walks/NNS|VBZ
  • Can specify that some tags are ignored
  • Redundant with other information: ./._
  • Constraints would cross-classify too many morph-tag
    sequences
  • Too hard to specify
  • Tags are optional: they may appear on some but not
    all words

32
Named entities Example input
  • parse <person>Mr. Thejskt Thejs</person>
    arrived.
  • tokenized string (the whole entity as a single token
    with its NEperson tag, alongside the ordinary
    word-by-word tokenization; a toy mark-up conversion
    sketch follows below):

    Mr. Thejskt Thejs TB NEperson Mr(TB). TB
    Thejskt TB Thejs . (.) TB (, TB) . TB arrived TB
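A minimal sketch using a Python regular expression, not the XLE tokenizer: turn <person>...</person> mark-up into one multiword token plus an NEperson tag token. The real tokenizer also keeps the ordinary word-by-word tokenization as an alternative; that part is omitted here for brevity.

    import re

    NE = re.compile(r"<(?P<label>\w+)>(?P<body>.*?)</(?P=label)>")

    def tokenize_ne(text):
        tokens = []
        pos = 0
        for m in NE.finditer(text):
            tokens.extend(text[pos:m.start()].split())
            tokens.extend([m.group("body"), "NE" + m.group("label")])  # entity + tag token
            pos = m.end()
        tokens.extend(text[pos:].split())
        return tokens

    print(tokenize_ne("<person>Mr. Thejskt Thejs</person> arrived."))
    # ['Mr. Thejskt Thejs', 'NEperson', 'arrived.']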
33
Lexicon
  • Lexical entries for tags:
  • NEperson  NE_SFX  @(PROPER name).
  • Lexical entry for the token:
  • -token  TOKEN  (^ TOKEN) = %stem
            NE     @(NOUN %stem)
                   @(GRAIN proper)
                   @(SOURCE entity-finder)
                   @(OT-MARK NamedEntity).

34
Grammar Rules
  • Rules:
  • NOUN-ENTITY --> NE NE_SFX.
  • NOUN --> @NOUN-ENTITY.
  • Config OT mark:
  • (MWE NamedEntity) STOPPOINT

35
Resulting C-structure
36
Resulting F-structure
37
Overriding Bad NE Bracketing
  • Override if no parse is found with the bracketing
  • For example, a verb accidentally bracketed:
  • parse Mr. <person>Atbeu Thes
    arrived</person>.

38
Result: Normal C-structure
39
Result: Normal F-structure
40
Syntactic brackets
  • Chunker: labelled bracketing
  • [NP-SBJ Mary and John] saw [NP-OBJ the girl with
    the telescope].
  • They [V pushed and pulled] the cart.
  • Implementation:
  • The tokenizing FST identifies and tokenizes the labels
    without interrupting other patterns
  • Bracketing constraints are enforced by the
    METARULEMACRO (its expansion is sketched below):

    METARULEMACRO(_CAT _BASECAT _RHS) =
       { _RHS
       | LSB
         CAT-LB[_BASECAT]
         _CAT
         RSB }.
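A minimal sketch in plain Python, not XLE rule syntax, of what the metarule does: every category can also be parsed as LSB, a label token, the category itself, and RSB. The rule representation here is invented for illustration.

    def metarule(cat, base_cat, rhs):
        """Return both expansions: the original right-hand side, and the
        bracketed variant LSB CAT-LB[base_cat] cat RSB."""
        return [rhs, ["LSB", f"CAT-LB[{base_cat}]", cat, "RSB"]]

    for expansion in metarule("NP", "NP", ["D", "N"]):
        print("NP -->", " ".join(expansion))
    # NP --> D N
    # NP --> LSB CAT-LB[NP] NP RSB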

41
Syntactic brackets
  • [NP-SBJ Mary] appeared.
  • Lexicon:  NP-SBJ  CAT-LB[NP]  (SUBJ ^).

  C-structure:
    (S (NP (LSB [)
           (CAT-LB[NP] NP-SBJ)
           (NP (N Mary))
           (RSB ]))
       (VP (V appeared)))
42
Experimental test
  • Again, F-scores on the PARC 700 f-structure bank
  • Upper bound: sentences with the best-available
    markup
  • POS tags from the Penn Treebank
  • Some noise from incompatible coding:
  •   Werner is president of the
      parent/JJ company/NN.   (their Adj-Noun
      vs. our Noun-Noun)
  • Some noise from the multi-word treatment:
  •   Kleinword/NNP Benson/NNP &/CC Co./NNP
      vs. Kleinword_Benson_&_Co./NNP
  • Named entities hand-coded by us
  • Labeled brackets also approximated from the Penn
    Treebank
  • Keep core-GF brackets: S, NP, VP-under-VP
  • Others are incompatible or unreliable, so discarded

43
Results
44
(No Transcript)
45
Motivation for Part of Speech Tags
  • An extra source of information for reducing
    ambiguity
  • Online parsing: less confusing, more useful
    results
  • Grammar development: a heuristic for determining
    whether the grammar gets the correct analyses; help
    in building an f-structure bank
  • Note: recall is more important than precision
  • Don't want local, probabilistic decisions to
    eliminate the globally correct analysis
  • Reducing ambiguity in the initial parse chart might
    drastically improve speed if:
  • Much c-structure ambiguity comes from POS
    ambiguity
  • So the chart is more linear than cubic
  • And total time is (more or less) proportional to
    chart size