Data Elicitation for AVENUE

Learn more at: http://www.cs.cmu.edu
1
Data Elicitation for AVENUE
  • By Alison Alvarez
  • Lori Levin
  • Bob Frederking
  • Jeff Good (MPI Leipzig)
  • Erik Peterson

2
Avenue System Diagram
3
Goals for Corpus Creation and Elicitation
  • Parallel corpus with high-quality word alignment
  • For a language with little or no digitized
    language resources
  • Use a bilingual informant with no linguistic
    expertise

4
Outline
  • Elicitation
  • Feature Detection
  • The Functional-Typological Corpus
  • Corpus Creation and Elicitation
  • Corpus Navigation

5
The Elicitation Tool
6
Input to the Elicitation Tool
  • Eliciting from English
  • 1,2,3 Sg,pl person pronouns
  • newpair
  • srcsent: I sing
  • context:
  • comment:
  • newpair
  • srcsent: I sang
  • context:
  • comment:
  • newpair
  • srcsent: I am singing
  • context:
  • comment:
  • newpair
  • srcsent: You sang
  • Eliciting from Spanish
  • 1,2,3 Sg,pl person pronouns
  • newpair
  • srcsent: Canto
  • context:
  • comment:
  • newpair
  • srcsent: Canté
  • context:
  • comment:
  • newpair
  • srcsent: Estoy cantando
  • context:
  • comment:
  • newpair
  • srcsent: Cantaste

7
Output of the elicitation process
  • newpair
  • srcsent: Tú caíste
  • tgtsent: eymi ütrünagimi
  • aligned: ((1,1),(2,2))
  • context: tú Juan masculino, 2a persona del singular
  • comment: You (John) fell
  • newpair
  • srcsent: Tú estás cayendo
  • tgtsent: eymi petu ütünagimi
  • aligned: ((1,1),(2 3,2 3))
  • context: tú Juan masculino, 2a persona del singular
  • comment: You (John) are falling
  • newpair
  • srcsent: Tú caíste
  • tgtsent: eymi, ütrunagimi
  • aligned: ((1,1),(2,2))
  • context: tú María femenino, 2a persona del singular
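
The record format above is regular enough to be read mechanically. The following is a minimal Python sketch, not the Elicitation Tool's own reader: the function names `parse_alignment` and `parse_records` are hypothetical, and the parser assumes one field per line, with or without a colon after the field name.

```python
import re

SAMPLE = """\
newpair
srcsent: Tú caíste
tgtsent: eymi ütrünagimi
aligned: ((1,1),(2,2))
comment: You (John) fell
newpair
srcsent: Tú estás cayendo
tgtsent: eymi petu ütünagimi
aligned: ((1,1),(2 3,2 3))
comment: You (John) are falling
"""

def parse_alignment(text):
    """Turn '((1,1),(2 3,2 3))' into [([1], [1]), ([2, 3], [2, 3])]."""
    pairs = []
    for src, tgt in re.findall(r"\(([\d ]+),([\d ]+)\)", text):
        pairs.append(([int(i) for i in src.split()],
                      [int(i) for i in tgt.split()]))
    return pairs

def parse_records(text):
    """Split elicitation text into one dict per 'newpair' block."""
    records = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line == "newpair":
            records.append({})
            continue
        # Field name is the first word; a trailing colon is optional (assumption).
        key, _, value = line.partition(" ")
        key = key.rstrip(":")
        value = value.strip()
        records[-1][key] = parse_alignment(value) if key == "aligned" else value
    return records

records = parse_records(SAMPLE)
```

Each alignment entry pairs a group of source word positions with a group of target word positions, so a many-to-many link such as `(2 3,2 3)` stays a single unit.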

8
Elicitation Corpus
  • Elicitation Corpus refers to the list of
    sentences in the major language.
  • Not yet translated or aligned
  • Field workers call it a questionnaire.

9
Feature Detection
  • Identify meaning components that have
    morpho-syntactic consequences in the language
    that is being elicited.
  • The gender of the subject is marked on the verb
    in Hebrew.
  • The gender of the subject has no morpho-syntactic
    realization in Mapudungun.

10
Feature detection feeds into
  • Corpus Navigation: which minimal pairs to pursue
    next.
  • Don't pursue gender in Mapudungun.
  • Do pursue definiteness in Hebrew.
  • Morphology Learning
  • Morphological rule learner identifies the forms
    of the morphemes
  • Feature detection identifies the functions
  • Rule learning
  • Rule learner will have to learn constraints
    corresponding to fact records.
  • E.g., Adjectives and nouns agree in gender,
    number, and definiteness in Hebrew.

11
Other uses of Feature Detection
  • A human-readable reference grammar can be
    generated from fact records.
  • A human analyst knows Northern Ostyak, and then
    has to translate a document in Eastern Ostyak.
    The only reference grammar of Eastern Ostyak is
    written in Hungarian, which the analyst does not
    speak. An Eastern Ostyak consultant who speaks
    Russian translates the Elicitation Corpus from
    Russian to Eastern Ostyak. The analyst learns
    about Eastern Ostyak from the automatically
    generated fact records.

12
Other uses of Feature Detection
  • I'm not really sure whether the only grammar of
    Eastern Ostyak is written in Hungarian. There is
    one reference grammar of Northern Ostyak written
    in English. All other Ostyak materials are in
    Hungarian, Russian, and German.
  • The Ostyaks are subsistence hunters, and Eastern
    Ostyak is nearly extinct, so there is no real
    need for government translators.
  • Other Siberian and Central Asian languages with
    similar scarcity of resources may be important.

13
Other uses of Feature Detection
  • Help a field worker
  • Instead of "elicit by day, analyze by night" (in
    order to know what to elicit the next day), go to
    sleep and look at the fact records in the
    morning.
  • We have been working with people at EMELD and MPI
    Leipzig.

14
Feature Detection: Spanish
  • The girl saw a red book.
  • ((1,1)(2,2)(3,3)(4,4)(5,6)(6,5))
  • La niña vió un libro rojo
  • A girl saw a red book.
  • ((1,1)(2,2)(3,3)(4,4)(5,6)(6,5))
  • Una niña vió un libro rojo
  • I saw the red book.
  • ((1,1)(2,2)(3,3)(4,5)(5,4))
  • Yo vi el libro rojo
  • I saw a red book.
  • ((1,1)(2,2)(3,3)(4,5)(5,4))
  • Yo vi un libro rojo
  • Feature: definiteness
  • Values: definite, indefinite
  • Function-of-: subj, obj
  • Marked-on-head-of-: no
  • Marked-on-dependent: yes
  • Marked-on-governor: no
  • Marked-on-other: no
  • Add/delete-word: no
  • Change-in-alignment: no
15
Feature Detection: Chinese
  • A girl saw a red book.
  • ((1,2)(2,2)(3,3)(3,4)(4,5)(5,6)(5,7)(6,8))
  • [Chinese sentence lost in extraction]
  • The girl saw a red book.
  • ((1,1)(2,1)(3,3)(3,4)(4,5)(5,6)(6,7))
  • [Chinese sentence lost in extraction]
  • Feature: definiteness
  • Values: definite, indefinite
  • Function-of-: subject
  • Marked-on-head-of-: no
  • Marked-on-dependent: no
  • Marked-on-governor: no
  • Add/delete-word: yes
  • Change-in-alignment: no

16
Feature Detection: Chinese
  • I saw the red book.
  • ((1,3)(2,4)(2,5)(4,1)(5,2))
  • [Chinese sentence lost in extraction]
  • I saw a red book.
  • ((1,1)(2,2)(2,3)(2,4)(4,5)(5,6))
  • [Chinese sentence lost in extraction]
  • Feature: definiteness
  • Values: definite, indefinite
  • Function-of-: object
  • Marked-on-head-of-: no
  • Marked-on-dependent: no
  • Marked-on-governor: no
  • Add/delete-word: yes
  • Change-in-alignment: yes

17
Feature Detection: Hebrew
  • A girl saw a red book.
  • ((2,1)(3,2)(5,4)(6,3))
  • [Hebrew sentence lost in extraction]
  • The girl saw a red book.
  • ((1,1)(2,1)(3,2)(5,4)(6,3))
  • [Hebrew sentence lost in extraction]
  • I saw a red book.
  • ((2,1)(4,3)(5,2))
  • [Hebrew sentence lost in extraction]
  • I saw the red book.
  • ((2,1)(3,3)(3,4)(4,4)(5,3))
  • [Hebrew sentence lost in extraction]
  • Feature: definiteness
  • Values: definite, indefinite
  • Function-of-: subj, obj
  • Marked-on-head-of-: yes
  • Marked-on-dependent: yes
  • Marked-on-governor: no
  • Add-word: no
  • Change-in-alignment: no

18
AVENUE Elicitation Corpora
  • The Functional-Typological Corpus
  • Based on microtheories of meanings that may have
    morpho-syntactic realization
  • The Structural Elicitation Corpus
  • Based on sentence structures from the Penn
    TreeBank

19
The Functional-Typological Corpus
  • </feature>
  • <feature>
  • <feature-name>c-my-polarity</feature-name>
  • <value>
  • <value-name>polarity-positive</value-name>
  • </value>
  • <value>
  • <value-name>polarity-negative</value-name>
  • </value>
  • <note>Stick to the two obvious values of polarity
    for now.</note>
  • </feature>
  • Feature Name: c-my-polarity
  • Values: positive, negative
  • Note: Stick to the two obvious values of polarity
    for now.
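
The mapping from the XML feature definition to its human-readable summary can be sketched in a few lines. The slides use XSLT for this step; the Python below is only a stand-in using the standard library's `xml.etree.ElementTree`, and `summarize_feature` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

FEATURE_XML = """\
<feature>
  <feature-name>c-my-polarity</feature-name>
  <value>
    <value-name>polarity-positive</value-name>
  </value>
  <value>
    <value-name>polarity-negative</value-name>
  </value>
  <note>Stick to the two obvious values of polarity for now.</note>
</feature>
"""

def summarize_feature(xml_text):
    """Return the feature name, its value names, and the note as a dict."""
    feature = ET.fromstring(xml_text)
    return {
        "name": feature.findtext("feature-name"),
        "values": [v.findtext("value-name") for v in feature.findall("value")],
        "note": feature.findtext("note"),
    }

summary = summarize_feature(FEATURE_XML)
```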

20
Functional-Typological Corpus
  • In XML
  • XSLT scripts can format it into human-readable
    text or into data structures.
  • Currently contains around 50 features and a few
    hundred values.
  • Still under development.

21
Functional-Typological Corpus Representation of
"Who is at the meeting"
  • ((subj ((np-my-general-type pronoun-type)
  • (np-my-person person-unk)
  • (np-my-number num-sg)(np-my-animacy anim-human)
  • (np-my-function fn-predicatee)
  • (np-d-my-distance-from-speaker distance-neutral)
  • (np-my-emphasis emph-no-emph)
  • (np-my-info-function info-neutral)
  • (np-pronoun-exclusivity exclusivity-n/a)
  • (np-pronoun-antecedent-function antecedent-n/a)
  • (np-pronoun-reflexivity reflexivity-n/a)))
  • (predicate ((loc-roles loc-general-at)))
  • Continued on next slide

22
Continued: "Who is at the meeting"
  • (c-my-copula-type locative)
  • (c-my-secondary-type secondary-copula)
  • (c-my-polarity polarity-positive)
  • (c-my-function fn-main-clause)
  • (c-my-general-type open-question)
  • (gap-function gap-copula-subject)
  • (c-my-sp-act sp-act-request-information)
  • (c-v-my-grammatical-aspect gram-aspect-neutral)
  • (c-v-my-absolute-tense present)
  • (c-v-my-phase-aspect durative)
  • (c-my-headedness-rc rc-head-n/a)
  • (c-my-minor-type minor-n/a)
  • (c-my-restrictivess-rc rc-restrictive-n/a)
  • (c-my-answer-type ans-n/a)
  • (c-my-imperative-degree imp-degree-n/a)
  • (c-my-actor's-status actor-neutral)
  • (c-my-focus-rc focus-n/a)
  • (c-my-gaps-function gap-n/a)
  • (c-my-relative-tense relative-n/a)
  • (c-my-ynq-type ynq-n/a)
  • (c-my-actor's-sem-role actor-sem-role-neutral)
  • (c-v-my-lexical-aspect state))
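
Because the feature structure is written as nested parenthesized lists, a small reader is enough to turn it into a dictionary for inspection. This is a minimal Python sketch under that assumption; `tokenize`, `read`, and `to_dict` are hypothetical helpers, and the sample string is an abbreviated version of the structure on these two slides.

```python
import re

def tokenize(text):
    """Split into '(' , ')' and atoms such as 'np-my-number' or 'ans-n/a'."""
    return re.findall(r"[()]|[^\s()]+", text)

def read(tokens):
    """Recursively read one atom or one parenthesized list from the token list."""
    token = tokens.pop(0)
    if token == "(":
        items = []
        while tokens[0] != ")":
            items.append(read(tokens))
        tokens.pop(0)  # drop the closing ')'
        return items
    return token

def to_dict(node):
    """Convert ((name value) ...) into a dict, recursing into nested structures."""
    result = {}
    for name, *values in node:
        if len(values) == 1 and isinstance(values[0], list):
            result[name] = to_dict(values[0])
        else:
            result[name] = values[0] if len(values) == 1 else values
    return result

FS_TEXT = """
((subj ((np-my-general-type pronoun-type) (np-my-person person-unk)
        (np-my-number num-sg) (np-my-animacy anim-human)))
 (predicate ((loc-roles loc-general-at)))
 (c-my-copula-type locative)
 (c-my-general-type open-question)
 (c-v-my-absolute-tense present))
"""

fs = to_dict(read(tokenize(FS_TEXT)))
```

A feature with several values, as in a Multiply, comes back as a list of values rather than a single string.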

23
Why is the corpus represented as a set of feature
structures?
  • Multiple elicitation languages
  • Generate the English and Spanish elicitation
    corpora from the same internal representation
  • Easy to add a new elicitation language
  • Write a GenKit grammar to generate sentences from
    the same internal representation

24
Why is the corpus represented as a set of feature
structures?
  • Feature structure represents things that are not
    expressed in the major language
  • These things show up as comments in the
    elicitation corpus
  • I am singing (comment: female)
  • May eventually use pictures and discourse context
  • We actually want to elicit the meaning associated
    with the feature structure. English and Spanish
    are just vehicles for getting at the meaning.

25
Corpus Creation Tools
  • The elicitation corpus can be changed and new
    corpora can be created.

26
Motivation for Corpus Creation Tools
  • Make new corpora easily
  • Add a new tense (e.g., remote past) and
    automatically get all the combinations with other
    features
  • Make a specialized corpus for a limited semantic
    domain or a specific language family

27
Motivation for Corpus Creation Tools
  • Combinatorics
  • For example, all combinations of person, number,
    gender, tense, etc.
  • Too much bookkeeping for a human corpus creator,
    and too time consuming

28
Where do the feature structures come from?
  • A linguist formulates a Multiply
  • The multiply specifies a set of feature structures

29
A Multiply
  • ((subj ((np-my-general-type pronoun-type common-noun-type)
  • (np-my-person person-first person-second person-third)
  • (np-my-number num-sg num-pl)
  • (np-my-biological-gender bio-gender-male bio-gender-female)
  • (np-my-function fn-predicatee)))
  • (predicate ((np-my-general-type common-noun-type)
  • (np-my-definiteness definiteness-minus)
  • (np-my-person person-third)
  • (np-my-function predicate))) (c-my-copula-type role)
  • (predicate ((adj-my-general-type quality-type)))
    (c-my-copula-type attributive)
  • (predicate ((np-my-general-type common-noun-type)
  • (np-my-person person-third)
  • (np-my-definiteness definiteness-plus)
  • (np-my-function predicate))) (c-my-copula-type identity)
  • (c-my-secondary-type secondary-copula)
  • (c-my-polarity all)
  • (c-my-function fn-main-clause)
  • (c-my-general-type declarative)
  • (c-my-speech-act sp-act-state)
  • (c-v-my-grammatical-aspect gram-aspect-neutral)
  • (c-v-my-lexical-aspect state)
  • (c-v-my-absolute-tense past present future)
  • (c-v-my-phase-aspect durative))
  • This multiply expands to 288 feature structures.
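
The expansion itself is a Cartesian product over the multi-valued features, filtered by exclusions. The Python sketch below is an illustration, not the AVENUE tool. It rests on three assumptions that are not stated on the slide: the three predicate alternatives are counted as three copula types, `all` for polarity expands to the two polarity values defined earlier, and a single exclusion says that common nouns can only be third person. Under those assumptions the count comes out to the 288 quoted above; without the exclusion the raw product is 432.

```python
from itertools import product

# Only the features that take more than one value in the Multiply are listed.
MULTIPLY = {
    "np-my-general-type": ["pronoun-type", "common-noun-type"],
    "np-my-person": ["person-first", "person-second", "person-third"],
    "np-my-number": ["num-sg", "num-pl"],
    "np-my-biological-gender": ["bio-gender-male", "bio-gender-female"],
    "c-my-copula-type": ["role", "attributive", "identity"],
    "c-my-polarity": ["polarity-positive", "polarity-negative"],
    "c-v-my-absolute-tense": ["past", "present", "future"],
}

def expand(multiply):
    """Yield one flat feature structure (dict) per combination of values."""
    names = list(multiply)
    for combo in product(*(multiply[name] for name in names)):
        yield dict(zip(names, combo))

def allowed(fs):
    """Hypothetical exclusion: a common-noun subject must be third person."""
    return not (fs["np-my-general-type"] == "common-noun-type"
                and fs["np-my-person"] != "person-third")

all_structures = list(expand(MULTIPLY))
feature_structures = [fs for fs in all_structures if allowed(fs)]
```

Adding a new tense value to `c-v-my-absolute-tense` would automatically produce every combination with the other features, which is the bookkeeping the corpus creation tools are meant to take off the linguist.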

30
There is a GUI for making Multiplies
  • Demo available on request

31
GenKit Grammar
  • Use GenKit for generation
  • declarative
  • (<s> ==> (<np> <vp> <np> <sc>)
  • (((x0 c-my-general-type) =c declarative)
  • ((x2 verb-form) = fin)
  • ((x3 c-my-copula-type) = (x0 c-my-copula-type))
  • ((x4 d-speaker-gender) = (x0 d-speaker-gender))
  • ((x4 d-hearer-gender) = (x0 d-hearer-gender))
  • ((x4 d-my-formality) = (x0 d-my-formality))
  • ((x3 np-my-number) = (x0 np-my-number))
  • ((x3 np-my-animacy) = (x0 np-my-animacy))
  • ((x3 np-my-biological-gender) = (x0
    np-my-biological-gender))
  • (x3 = (x0 predicate))
  • (x1 = (x0 subj))
  • (x2 = x0)))

32
GenKit Lexicon
  • Pronouns
  • (word ((cat n) (root you) (pred pro)
    (np-my-person person-second)
    (np-my-animacy anim-human)
    (np-my-general-type pronoun-type)))
  • (word ((cat n) (root I) (pred pro)
    (np-my-person person-first) (np-my-number num-sg)
    (np-my-animacy anim-human)
    (np-my-general-type pronoun-type)))
  • (word ((cat n) (root we) (pred pro)
    (np-my-person person-first) (np-my-number num-pl)
    (np-my-animacy anim-human)
    (np-my-general-type pronoun-type)))
  • (word ((cat n) (root we) (pred pro)
    (np-my-person person-first) (np-my-number num-dual)
    (np-my-animacy anim-human)
    (np-my-general-type pronoun-type)))
  • (word ((cat n) (root she) (pred pro)
    (np-my-person person-third) (np-my-number num-sg)
    (np-my-biological-gender bio-gender-female)
    (np-my-animacy anim-human)
    (np-my-general-type pronoun-type)))

33
Comments are also generated
  • I one female sang
  • Use comments for things that are not expressed in
    English.

34
Convert to Elicitation Format (input to
Elicitation Tool)
  • original: WHO IS AT THE BOX
  • full comment:
  • Sentence: WHO IS AT THE BOX
  • original: I ONE-WOMAN AM PN_FEMALE ONE-WOMAN
  • full comment: NP1 ONE-WOMAN
  • Sentence: I AM PN_FEMALE
  • original: WILL I ONE-WOMAN BE THE TEACHER
  • full comment: NP1 ONE-WOMAN
  • Sentence: WILL I BE THE TEACHER
35
Eight Basic Steps for Corpus Creation
  1. Write FVD and format into data structure
  2. Gather Exclusions (restrictions on co-occurrence
    of features)
  3. Design the Multiply
  4. Get a full set of Feature Structures
  5. Design Grammar and Comments
  6. Design Lexicon
  7. Generate Sentences from Feature Structures
  8. Convert to Elicitation Format

36
Can make other types of corpora
  • The Elicitation Corpus does not have to be
    functional-typological

37
Alternative Corpora: The Medical Corpus
((subj ((body-parts all)
(Poss ((np-my-general-type pronoun-type)
(np-my-person all)
(np-my-number num-sg num-pl)
(np-my-animacy anim-human)
(np-my-use possessive)))
(Pred ((symptoms all))
(c-my-general-type declarative)
(c-my-speech-act sp-act-state)
(c-v-my-grammatical-aspect gram-aspect-neutral)
(c-v-my-lexical-aspect state)
(c-v-my-absolute-tense present))
  • Feature: Body-Parts
  • Values, each with its restrictions:
  • part-hand: no restrictions listed
  • part-finger: no restrictions listed
  • part-tooth: symptom_redness, symptom_scratch,
    symptom_numbness, symptom_cut, symptom_lump,
    symptom_rash, symptom_puncture, symptom_bruise,
    symptom_frozen
  • part-eye: symptom_rash
  • part-arm: no restrictions listed

The Result:
  • YOUR ARM IS RED
  • YOUR ARM IS SCRATCHED
  • YOUR ARM IS NUMB
  • YOUR ARM IS NIL
  • YOUR ARM HAS A/N INFECTION

38
Corpus Navigation
  • While the Elicitation Corpus for any one target
    language (TL) can be kept to a reasonable size,
    the universal Elicitation Corpus must check for
    all phenomena that might occur in any language.
  • Since the universal corpus cannot be kept to a
    reasonable size, Corpus Navigation is necessary.
  • Facts discovered about a particular TL early in
    the process constrain what needs to be looked for
    later in the process for that TL. Thus this is a
    dynamic process, different for each TL.

39
Corpus Navigation: search
  • Navigation is a search process with the informant
    in the inner loop: the informant expands each
    search state, presented as an SL sentence, by
    supplying the corresponding TL sentence and
    alignments.
  • Analogously to game search, there is an "opening
    book" of moves (SL sentences to check for all
    languages), used until enough information has been
    gathered to make intelligent search choices.
  • The heuristic function driving the search process
    is Relative Info Gain (RIG):
  • $\mathrm{RIG}(Y \mid X) = \dfrac{H(Y) - H(Y \mid X)}{H(Y)}$
  • The system reduces the remaining entropy in its
    knowledge of the language as much as possible.
  • There should also be a cost factor, estimating
    the human effort required to expand the node.
  • To make the process efficient enough, we will
    create "decision graphs", similar to RETE
    networks, that cache information so only the
    information that changes needs to be recomputed.
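
Relative information gain can be computed directly from observed pairs. The sketch below uses the standard definitions of entropy and conditional entropy. The toy data, which treats a tested feature value as X and an observed TL form as Y, is invented for illustration and is not taken from the AVENUE system.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy H(Y), in bits, of a list of observed labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def conditional_entropy(xs, ys):
    """H(Y|X): entropy of Y within each X group, weighted by group size."""
    groups = {}
    for x, y in zip(xs, ys):
        groups.setdefault(x, []).append(y)
    total = len(xs)
    return sum(len(g) / total * entropy(g) for g in groups.values())

def relative_info_gain(xs, ys):
    """RIG(Y|X) = (H(Y) - H(Y|X)) / H(Y); defined as 0 when H(Y) is 0."""
    h_y = entropy(ys)
    if h_y == 0:
        return 0.0
    return (h_y - conditional_entropy(xs, ys)) / h_y

# Toy data: the TL determiner is fully predicted by definiteness, not by number.
determiner = ["el", "el", "un", "un"]
definiteness = ["def", "def", "indef", "indef"]
number = ["sg", "pl", "sg", "pl"]

rig_definiteness = relative_info_gain(definiteness, determiner)
rig_number = relative_info_gain(number, determiner)
```

In this toy case `rig_definiteness` is 1 and `rig_number` is 0, so a navigator driven by RIG alone would pursue definiteness first; the cost factor mentioned above would then weigh that gain against informant effort.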