Building NLP Systems for Two Resource Scarce Indigenous Languages: Mapudungun and Quechua, and some - PowerPoint PPT Presentation

Provided by: lsl
1
Building NLP Systems for Two Resource Scarce
Indigenous Languages: Mapudungun and Quechua,
and some other languages
  • Christian Monson, Ariadna Font Llitjós, Roberto
    Aranovich, Lori Levin, Ralf Brown, Erik Peterson,
    Jaime Carbonell, and Alon Lavie

2
Omnivorous MT
  • Eat whatever resources are available
  • Eat large or small amounts of data

Mapusaurus Roseae
Mapu: land. Mapuche: land people. Mapudungun: land speech.
3
AVENUE's Inventory
  • Resources:
  • Parallel corpus
  • Monolingual corpus
  • Lexicon
  • Morphological Analyzer (lemmatizer)
  • Human Linguist
  • Human non-linguist
  • Techniques:
  • Rule based transfer system
  • Example Based MT
  • Morphology Learning
  • Rule Learning
  • Interactive Rule Refinement
  • Multi-Engine MT

This research was funded in part by NSF grant
number IIS-0121-631.
4
Startup without corpus or linguist
Requires someone who is bilingual and literate
5
The Elicitation Tool has been used with these
languages
  • Mapudungun
  • Hindi
  • Hebrew
  • Quechua
  • Aymara
  • Thai
  • Japanese
  • Chinese
  • Dutch
  • Arabic

6
Purpose of Elicitation
  • srcsent: Tú caíste
  • tgtsent: eymi ütrünagimi
  • aligned: ((1,1),(2,2))
  • context: tú = Juan (masculino, 2a persona del
    singular)
  • comment: You (John) fell
  • srcsent: Tú estás cayendo
  • tgtsent: eymi petu ütünagimi
  • aligned: ((1,1),(2 3,2 3))
  • context: tú = Juan (masculino, 2a persona del
    singular)
  • comment: You (John) are falling
  • srcsent: Tú caíste
  • tgtsent: eymi ütrunagimi
  • aligned: ((1,1),(2,2))
  • context: tú = María (femenino, 2a persona del
    singular)
  • comment: You (Mary) fell
  • Provide a small but highly targeted corpus of
    hand aligned data
  • To support machine learning from a small data set
  • To discover basic word order
  • To discover how syntactic dependencies are
    expressed
  • To discover which grammatical meanings are
    reflected in the morphology or syntax of the
    language
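
An elicited record like the ones on this slide can be held in a small data structure. The following is a minimal sketch, not the Elicitation Tool's actual file format or code: the field names follow the slide, while `ElicitedPair` and `parse_alignment` are hypothetical names introduced here for illustration.

```python
import re
from dataclasses import dataclass


def parse_alignment(text):
    """Parse '((1,1),(2 3,2 3))' into [([1], [1]), ([2, 3], [2, 3])]."""
    pairs = []
    # Each inner group has the shape '(source positions,target positions)'.
    for src, tgt in re.findall(r"\(([\d ]+),([\d ]+)\)", text):
        pairs.append(([int(n) for n in src.split()],
                      [int(n) for n in tgt.split()]))
    return pairs


@dataclass
class ElicitedPair:
    srcsent: str   # sentence shown to the bilingual informant
    tgtsent: str   # the informant's translation
    aligned: str   # hand alignment, 1-based word positions

    def aligned_words(self):
        """Return (source words, target words) for every alignment link."""
        src = self.srcsent.split()
        tgt = self.tgtsent.split()
        return [([src[i - 1] for i in s], [tgt[j - 1] for j in t])
                for s, t in parse_alignment(self.aligned)]


pair = ElicitedPair("Tú estás cayendo", "eymi petu ütünagimi",
                    "((1,1),(2 3,2 3))")
links = pair.aligned_words()
```

Here `links` pairs "Tú" with "eymi" and the two-word group "estás cayendo" with "petu ütünagimi", which is the kind of many-to-many link a rule learner would start from.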

7
Feature Structures
  • srcsent: Mary was not a leader.
  • context: Translate this as though it were spoken
    to a peer co-worker
  • ((actor ((np-function fn-actor)
    (np-animacy anim-human)
    (np-biological-gender bio-gender-female)
    (np-general-type proper-noun-type)
    (np-identifiability identifiable)
    (np-specificity specific)))
  • (pred ((np-function fn-predicate-nominal)
    (np-animacy anim-human)
    (np-biological-gender bio-gender-female)
    (np-general-type common-noun-type)
    (np-specificity specificity-neutral)))
  • (c-v-lexical-aspect state)
    (c-copula-type copula-role)
    (c-secondary-type secondary-copula)
    (c-solidarity solidarity-neutral)
    (c-v-grammatical-aspect gram-aspect-neutral)
    (c-v-absolute-tense past)
    (c-v-phase-aspect phase-aspect-neutral)
    (c-general-type declarative-clause)
    (c-polarity polarity-negative)
    (c-my-causer-intentionality intentionality-n/a)
    (c-comparison-type comparison-n/a)
    (c-relative-tense relative-n/a)
    (c-our-boundary boundary-n/a))
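
The parenthesized feature structure on this slide is easy to read into nested dictionaries. The sketch below is an illustration only, assuming the structure is a list of `(feature value)` pairs whose values may nest; the function names are hypothetical and this is not the AVENUE project's own reader.

```python
def tokenize(text):
    """Split a parenthesized feature structure into tokens."""
    return text.replace("(", " ( ").replace(")", " ) ").split()


def parse(tokens):
    """Recursively build nested lists from a token list (consumed in place)."""
    token = tokens.pop(0)
    if token != "(":
        return token
    items = []
    while tokens[0] != ")":
        items.append(parse(tokens))
    tokens.pop(0)  # discard the closing ")"
    return items


def to_dict(node):
    """Turn a list of (feature value) pairs into a dict; values may nest."""
    result = {}
    for feature, value in node:
        result[feature] = to_dict(value) if isinstance(value, list) else value
    return result


# A shortened version of the structure for "Mary was not a leader."
fs_text = """((actor ((np-function fn-actor)(np-animacy anim-human)))
 (c-polarity polarity-negative)(c-v-absolute-tense past))"""
fs = to_dict(parse(tokenize(fs_text)))
```

After parsing, `fs["c-polarity"]` holds `polarity-negative` and `fs["actor"]["np-animacy"]` holds `anim-human`, so two structures that differ in one feature can be compared key by key.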

8
Current Work
  • Search space:
  • Elements of meaning that might be expressed by
    syntax or morphology: tense, aspect, person,
    number, gender, causation, evidentiality, etc.
  • Syntactic dependencies: subject, object
  • Interactions of features:
  • Tense and person
  • Tense and interrogative mood
  • Etc.

9
Current Work
  • For a new language:
  • For each item of the search space:
  • Eliminate it as irrelevant, or
  • Explore it
  • Using as few sentences as possible

10
Tools for Creating Elicitation Corpora
[Pipeline diagram]
Feature Specification (Tense & Aspect, Clause-Level, Noun-Phrase, Modality): list of semantic features and values
Feature Maps: which combinations of features and values are of interest
Feature Structure Sets
Reverse Annotated Feature Structure Sets: add English sentences
The Corpus
Sampling
Smaller Corpus
Tools labeled on this slide: XML Schema, XSLT Script
11
Tools for Creating Elicitation Corpora
[Pipeline diagram repeated from slide 10. Tool labeled on this slide: Combination Formalism]
12
Tools for Creating Elicitation Corpora
[Pipeline diagram repeated from slide 10. Tool labeled on this slide: Feature Structure Viewer]
13
Tools for Creating Elicitation Corpora
[Pipeline diagram repeated from slide 10, with no tool labels]
14
Outline
  • Two ideas:
  • Omnivorous MT
  • Startup for low resource situation
  • Four languages:
  • Mapudungun
  • Quechua
  • Hindi
  • Hebrew

15
The Avenue Low Resource Scenario
[Architecture diagram. Panels: Elicitation, Rule Learning, Run-Time System, Rule Refinement, Morphology. Components shown: Elicitation Tool, Elicitation Corpus, Word-Aligned Parallel Corpus, Translation Correction Tool, Rule Refinement Module, Run Time Transfer System, Decoder, Lexical Resources. The system maps INPUT TEXT to OUTPUT TEXT.]
19
Mapudungun Language
  • 900,000 Mapuche people
  • At least 300,000 speakers of Mapudungun
  • Polysynthetic
  • sl: pe-rke-fi-ñ Maria
  • ver-REPORT-3pO-1pSgS/IND
  • tl: DICEN QUE LA VI A MARÍA
  • (They say that) I saw Maria.

20
AVENUE Mapudungun
  • Joint project between Carnegie Mellon University,
    the Chilean Ministry of Education, and
    Universidad de la Frontera.

21
Mapudungun to Spanish Resources
  • Initially:
  • Large team of native speakers at Universidad de
    la Frontera, Temuco, Chile
  • Some knowledge of linguistics
  • No knowledge of computational linguistics
  • No corpus
  • A few short word lists
  • No morphological analyzer
  • Later: computational linguists with non-native
    knowledge of Mapudungun
  • Other considerations:
  • Produce something that is useful to the
    community, especially for bilingual education
  • Experimental MT systems are not useful

22
Mapudungun
[AVENUE architecture diagram, as on slide 15, with these additions for Mapudungun:]
Corpus: 170 hours of spoken Mapudungun
Example Based MT
Spelling checker
Spanish Morphology from UPC, Barcelona
23
Mapudungun Products
  • http://www.lenguasamerindias.org/
  • Click "traductor mapudungún"
  • Dictionary lookup (Mapudungun to Spanish)
  • Morphological analysis
  • Example Based MT (Mapudungun to Spanish)

24
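I didn't see Maria
[Paired parse trees. Mapudungun side: S, VP, NP, V, VSuffG, VSuff and N nodes over the morphemes pe, la, fi, ñ and the noun Maria. Spanish side: S, VP, NP, V and N nodes over no, vi, a, María.]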
25
Transfer to Spanish: Top-Down
VP::VP [VBar NP] -> [VBar "a" NP]
( (X1::Y1) (X2::Y3)
  ((X2 type) = (NOT personal))
  ((X2 human) =c +)
  (X0 = X1)
  ((X0 object) = X2)
  (Y0 = X0)
  ((Y0 object) = (X0 object))
  (Y1 = Y0)
  (Y3 = (Y0 object))
  ((Y1 objmarker person) = (Y3 person))
  ((Y1 objmarker number) = (Y3 number))
  ((Y1 objmarker gender) = (Y3 gender)))
[Tree diagram: the Mapudungun source tree from slide 24 (S, VP, NP, V, VSuffG, VSuff, N over pe, la, fi, ñ, Maria). On the Spanish side only S, VP, the literal "a" and NP are shown.]
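
The effect of the VP rule on this slide can be imitated in a few lines. This is a rough sketch under stated assumptions, not the AVENUE transfer engine: the `Constituent` class, the feature names `human`, `type`, `person`, `number` and `gender`, and their values are illustrative choices, and the constraint handling is reduced to a single check.

```python
from dataclasses import dataclass, field


@dataclass
class Constituent:
    label: str
    text: str
    features: dict = field(default_factory=dict)


def apply_vp_rule(vbar, obj):
    """Sketch of VP -> VBar "a" NP: source child 1 goes to target slot 1,
    source child 2 goes to target slot 3, and a literal "a" fills slot 2."""
    # Constraint (simplified): the object must be human and must not be
    # a personal pronoun; otherwise this rule does not apply.
    if not obj.features.get("human") or obj.features.get("type") == "personal":
        return None
    # Agreement: the verb's object marker copies person, number and gender
    # from the object noun phrase.
    objmarker = {k: obj.features.get(k) for k in ("person", "number", "gender")}
    verb = Constituent(vbar.label, vbar.text,
                       {**vbar.features, "objmarker": objmarker})
    return [verb, Constituent("Lit", "a"), obj]


vbar = Constituent("VBar", "vi")
maria = Constituent("NP", "María",
                    {"human": True, "person": 3, "number": "sg", "gender": "f"})
out = apply_vp_rule(vbar, maria)
words = [c.text for c in out]
```

With these inputs `words` is `["vi", "a", "María"]`, and the copied object-marker features are what a later generation step would need in order to choose a clitic such as "la".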
26
AVENUE Hebrew
  • Joint project of Carnegie Mellon University and
    University of Haifa

27
Hebrew Language
  • Native language of about 3-4 million in Israel
  • Semitic language, closely related to Arabic and
    with similar linguistic properties
  • Root+Pattern word formation system
  • Rich verb and noun morphology
  • Particles attach as prefixes to the following
    word: definite article (H), prepositions
    (B,K,L,M), coordinating conjunction (W),
    relativizers (,K)
  • Unique alphabet and writing system:
  • 22 letters represent (mostly) consonants
  • Vowels represented (mostly) by diacritics
  • Modern texts omit the diacritic vowels, adding
    a level of ambiguity: bare word → word
  • Example: MHGR → mehager, mhagar, mhger
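
One source of the ambiguity described on this slide is that any leading particle letter might be a prefix or part of the stem. The sketch below only enumerates candidate splits for a transliterated word; it is a simplification that uses just the single-letter particles H, B, K, L, M and W from the slide, and it is not the Technion analyzer. A real analyzer would filter the candidates against a lexicon.

```python
PREFIX_LETTERS = set("HBKLMW")  # single-letter particles listed on the slide


def prefix_splits(word):
    """Enumerate every way to peel leading particle letters off `word`,
    always leaving at least one letter for the stem."""
    splits = [([], word)]
    prefixes = []
    rest = word
    while len(rest) > 1 and rest[0] in PREFIX_LETTERS:
        prefixes = prefixes + [rest[0]]
        rest = rest[1:]
        splits.append((prefixes, rest))
    return splits


analyses = prefix_splits("MHGR")
```

For MHGR this yields three candidates: the bare word, M + HGR, and M + H + GR. Each candidate would still need its vowels restored before it corresponds to a pronounceable reading.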

28
Hebrew Resources
  • Morphological analyzer developed at Technion
  • Constructed our own Hebrew-to-English lexicon,
    based primarily on existing Dahan H-to-E and
    E-to-H dictionary
  • Human Computational Linguists
  • Native Speakers

29
Hebrew
[AVENUE architecture diagram, as on slide 15, shown for Hebrew]
30
Flat Seed Rule Generation
31
Compositionality Learning
32
Constraint Learning
33
Quechua facts
  • Agglutinative language
  • A stem can often have 10 to 12 suffixes, but it
    can have up to 28 suffixes
  • Supposedly clear cut boundaries, but in reality
    several suffixes change when followed by certain
    other suffixes
  • No irregular verbs, nouns or adjectives
  • Does not mark for gender
  • No adjective agreement
  • No definite or indefinite articles (topic and
    focus markers perform a task similar to that of
    articles and intonation in English or Spanish)

34
Quechua examples
  • takini (also written takiniy)
  • sing 1sg (I sing) → canto
  • takishani (takishaniy)
  • sing progr 1sg (I am singing) → estoy
    cantando
  • takipakuqchu?
  • taki: sing
  • -paku: to join a group to do something
  • -q: agentive
  • -chu: interrogative
  • → (para) cantar con la gente (del pueblo)?
  • (to sing with the people (of the village)?)
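
The examples on this slide can be segmented with a very small stem and suffix table. This is a toy sketch, not the project's morphological analyzer: the suffix shapes `sha` and `ni` are inferred here from the glosses of takini and takishani, the short gloss labels are invented for readability, and the suffix changes mentioned on the previous slide are ignored.

```python
STEMS = {"taki": "sing"}
SUFFIXES = {
    "sha": "progr",
    "ni": "1sg",
    "paku": "join-group",
    "q": "agentive",
    "chu": "interrogative",
}


def _suffixes(rest):
    """Segment `rest` into known suffixes, or return None if impossible."""
    if not rest:
        return []
    for suffix, gloss in SUFFIXES.items():
        if rest.startswith(suffix):
            tail = _suffixes(rest[len(suffix):])
            if tail is not None:
                return [(suffix, gloss)] + tail
    return None


def segment(word):
    """Return [(morpheme, gloss), ...] or None if no segmentation is found."""
    for stem, gloss in STEMS.items():
        if word.startswith(stem):
            rest = _suffixes(word[len(stem):])
            if rest is not None:
                return [(stem, gloss)] + rest
    return None


singing = segment("takishani")
question = segment("takipakuqchu")
```

`singing` comes out as taki + sha + ni and `question` as taki + paku + q + chu, matching the glosses listed above.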

35
Quechua Resources
  • A few native speakers, not linguists
  • A computational linguist learning Quechua
  • Two fluent, but non-native linguists

36
Quechua
[AVENUE architecture diagram, as on slide 15, with this addition for Quechua:]
Parallel Corpus: OCR with correction
37
Grammar rules
  • takishani -> estoy cantando (I am singing)
  • {VBar,3}
  • VBar::VBar [V VSuff VSuff] -> [V V]
  • ( (X1::Y2)
  • ((x0 person) = (x3 person))
  • ((x0 number) = (x3 number))
  • ((x2 mood) =c ger)
  • ((y2 mood) = (x2 mood))
  • ((y1 form) =c estar)
  • ((y1 person) = (x3 person))
  • ((y1 number) = (x3 number))
  • ((y1 tense) = (x3 tense))
  • ((x0 tense) = (x3 tense))
  • ((y1 mood) = (x3 mood))
  • ((x3 inflected) =c +)
  • ((x0 inflected) = +))

Spanish Morphology Generation
lex = cantar, mood = ger -> cantando
lex = estar, person = 1, number = sg, tense = pres, mood = ind -> estoy
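
The generation step at the end of this slide maps a lemma plus features to a surface form. The lookup table below is a stand-in for a real Spanish morphology generator and contains only the two forms shown on the slide; the `generate` function and its keyword interface are hypothetical.

```python
# (lemma, features) -> surface form; only the two entries from the slide.
FORMS = {
    ("cantar", frozenset({("mood", "ger")})): "cantando",
    ("estar", frozenset({("person", "1"), ("number", "sg"),
                         ("tense", "pres"), ("mood", "ind")})): "estoy",
}


def generate(lex, **features):
    """Look up the inflected form for a lemma and a set of feature values."""
    key = (lex, frozenset((name, str(value)) for name, value in features.items()))
    return FORMS[key]


phrase = " ".join([
    generate("estar", person=1, number="sg", tense="pres", mood="ind"),
    generate("cantar", mood="ger"),
])
```

Joining the two generated forms gives "estoy cantando", the target side of the takishani example.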
38
Hindi Resources
  • Large statistical lexicon from the Linguistic
    Data Consortium (LDC)
  • Parallel Corpus from LDC
  • Morphological Analyzer-Generator from LDC
  • Lots of native speakers
  • Computational linguists with little or no
    knowledge of Hindi
  • Experimented with the size of the parallel corpus
  • Miserly and large scenarios

39
Hindi
[AVENUE architecture diagram, as on slide 15, with these additions for Hindi:]
EBMT
Parallel Corpus
SMT
15,000 Noun Phrases from Penn TreeBank
Supported by DARPA TIDES
40
Manual Transfer Rules Example
[Tree diagrams.
Hindi: [NP [PP [NP [N1 [N jIvana]]] [P ke]] [NP1 [Adj eka] [N aXyAya]]]
English: [NP [NP1 [Adj one] [N chapter]] [PP [P of] [NP [N1 [N life]]]]]]
NP1 ke NP2 -> NP2 of NP1
Ex: jIvana ke eka aXyAya
    life of (one) chapter
    -> a chapter of life
{NP,12}
NP::NP [PP NP1] -> [NP1 PP]
( (X1::Y2) (X2::Y1)
  ((x2 lexwx) = 'kA') )
{NP,13}
NP::NP [NP1] -> [NP1]
( (X1::Y1) )
{PP,12}
PP::PP [NP Postp] -> [Prep NP]
( (X1::Y2) (X2::Y1) )
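
The reordering done by these three rules can be traced with a small recursive function. The sketch below makes several assumptions: it keeps only the constituent alignments and ignores the lexical constraint, the tree labels are adjusted to match the labels used in the rules, and the four-word lexicon is taken from the example glosses. It illustrates the idea and is not the project's transfer engine.

```python
# For each (parent label, child labels), list which 1-based source child
# fills each target position, read off the X::Y alignments in the rules.
RULES = {
    ("NP", ("PP", "NP1")): [2, 1],    # NP,12: X1 -> Y2, X2 -> Y1
    ("NP", ("NP1",)): [1],            # NP,13: X1 -> Y1
    ("PP", ("NP", "Postp")): [2, 1],  # PP,12: X1 -> Y2, X2 -> Y1
}
LEXICON = {"jIvana": "life", "ke": "of", "eka": "one", "aXyAya": "chapter"}


def transfer(node):
    """Translate a (label, children) tree; leaves are (label, word)."""
    label, body = node
    if isinstance(body, str):
        return [LEXICON[body]]
    child_labels = tuple(child[0] for child in body)
    # Default when no rule matches: keep the source order.
    order = RULES.get((label, child_labels), range(1, len(body) + 1))
    words = []
    for index in order:
        words.extend(transfer(body[index - 1]))
    return words


tree = ("NP", [("PP", [("NP", [("NP1", [("N", "jIvana")])]),
                       ("Postp", "ke")]),
               ("NP1", [("Adj", "eka"), ("N", "aXyAya")])])
english = " ".join(transfer(tree))
```

The result is "one chapter of life": the postpositional phrase moves after the head noun phrase and the postposition becomes a preposition. Choosing "a" instead of "one" is left to the lexicon or decoder.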
41
Hindi-English
Very miserly training data.
Seven combinations of components.
Strong decoder allows re-ordering.
Three automatic scoring metrics.
42
Extra Slides
44
Feature Specification
  • Defines Features and their values
  • Sets default values for features
  • Specifies feature requirements and restrictions
  • Written in XML

45
Feature Specification
  • Feature: c-copula-type (a copula is a verb like
    "be"; some languages do not have copulas)
  • Values:
  • copula-n/a. Restrictions: 1.
    (c-secondary-type secondary-copula).
  • copula-role. Restrictions: 1.
    (c-secondary-type secondary-copula). Notes: 1. A
    role is something like a job or a function. "He
    is a teacher." "This is a vegetable peeler."
  • copula-identity. Restrictions: 1.
    (c-secondary-type secondary-copula). Notes: 1.
    "Clark Kent is Superman." "Sam is the teacher."
  • copula-location. Restrictions: 1.
    (c-secondary-type secondary-copula). Notes: 1.
    "The book is on the table." There is a long list
    of locative relations later in the feature
    specification.
  • copula-description. Restrictions: 1.
    (c-secondary-type secondary-copula). Notes: 1. A
    description is an attribute. "The children are
    happy." "The books are long."

46
Feature Maps
  • Some features interact in the grammar
  • English -s reflects person and number of the
    subject and tense of the verb.
  • In expressing the English present progressive
    tense, the auxiliary verb is in a different place
    in a question and a statement
  • He is running.
  • Is he running?
  • We need to check many, but not all combinations
    of features and values.
  • Using unlimited feature combinations leads to an
    unmanageable number of sentences
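
The arithmetic behind the last two bullets can be made concrete. In this sketch the feature inventory and its values are a small illustrative subset chosen here, not the project's Feature Specification, and a feature map is modeled simply as a list of feature groups whose members are crossed with each other.

```python
from itertools import product
from math import prod

# Illustrative inventory only; the real specification is much larger.
FEATURES = {
    "tense": ["past", "present", "future"],
    "person": ["1", "2", "3"],
    "number": ["sg", "pl"],
    "polarity": ["positive", "negative"],
    "clause-type": ["declarative", "interrogative"],
}


def count_all(features):
    """Number of sentences if every feature is crossed with every other."""
    return prod(len(values) for values in features.values())


def count_mapped(features, groups):
    """Number of sentences if only the features in each group are crossed."""
    return sum(prod(len(features[name]) for name in group) for group in groups)


def combinations(features, group):
    """Enumerate the value combinations for one group of interacting features."""
    return [dict(zip(group, values))
            for values in product(*(features[name] for name in group))]


feature_map = [("tense", "person", "number"), ("tense", "clause-type")]
full = count_all(FEATURES)
mapped = count_mapped(FEATURES, feature_map)
```

Even with five small features the full cross product needs 72 sentences, while the two groups in this hypothetical map need 24; the gap widens quickly as features are added.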

48
Evidentiality Map
Lexical Aspect: activity-accomplishment
Assertiveness: Assertiveness-asserted, Assertiveness-neutral
Polarity: Polarity-positive, Polarity-negative
Source: Hearsay, quotative, inferred, assumption | Visual, Auditory, non-visual-or-auditory
Tense: Past | Present, Future | Past | Present
Gram. Aspect: Perfective, progressive, habitual, neutral | Perfective, progressive, habitual, neutral | habitual, neutral, progressive | habitual, neutral, progressive
49
Current Work
  • Navigation:
  • Start: large search space of all possible feature
    combinations
  • Finish: each feature has been eliminated as
    irrelevant or has been explored
  • Goal: dynamically find the most efficient path
    through the search space for each language.

50
Current Work
  • Feature Detection
  • Which features have an effect on morphosyntax?
  • What is the effect?
  • Drives the Navigation process

51
Feature Detection: Spanish
  • The girl saw a red book.
  • ((1,1)(2,2)(3,3)(4,4)(5,6)(6,5))
  • La niña vió un libro rojo
  • A girl saw a red book
  • ((1,1)(2,2)(3,3)(4,4)(5,6)(6,5))
  • Una niña vió un libro rojo
  • I saw the red book
  • ((1,1)(2,2)(3,3)(4,5)(5,4))
  • Yo vi el libro rojo
  • I saw a red book.
  • ((1,1)(2,2)(3,3)(4,5)(5,4))
  • Yo vi un libro rojo
  • Feature: definiteness
  • Values: definite, indefinite
  • Function-of-: subj, obj
  • Marked-on-head-of-: no
  • Marked-on-dependent: yes
  • Marked-on-governor: no
  • Marked-on-other: no
  • Add/delete-word: no
  • Change-in-alignment: no
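
A minimal pair from this slide shows how such a summary could be derived. The sketch below compares two target sentences that differ in one feature value and uses the hand alignment to find which source word the change is attached to. The function names are hypothetical, and the real feature detection component is certainly more involved.

```python
def differing_positions(tgt_a, tgt_b):
    """1-based positions where two equal-length target sentences differ."""
    a, b = tgt_a.split(), tgt_b.split()
    if len(a) != len(b):
        return None  # a word was added or deleted; that case is not handled here
    return [i + 1 for i, (x, y) in enumerate(zip(a, b)) if x != y]


def source_words_for(positions, alignment, src):
    """Source words aligned to the given target positions."""
    words = src.split()
    return [words[s - 1] for s, t in alignment if t in positions]


alignment = [(1, 1), (2, 2), (3, 3), (4, 4), (5, 6), (6, 5)]
changed = differing_positions("La niña vió un libro rojo",
                              "Una niña vió un libro rojo")
marked_on = source_words_for(changed, alignment, "The girl saw a red book.")
```

Only target position 1 changes, and it aligns to the English determiner, so for the subject the definiteness contrast shows up on a dependent of the noun rather than on the head noun "niña".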

52
Feature Detection: Chinese
  • Feature: definiteness
  • Values: definite, indefinite
  • Function-of-: subject
  • Marked-on-head-of-: no
  • Marked-on-dependent: no
  • Marked-on-governor: no
  • Add/delete-word: yes
  • Change-in-alignment: no
  • A girl saw a red book.
  • ((1,2)(2,2)(3,3)(3,4)(4,5)(5,6)(5,7)(6,8))
  • ? ?? ?? ?? ? ?? ?? ? ? ?
  • The girl saw a red book.
  • ((1,1)(2,1)(3,3)(3,4)(4,5)(5,6)(6,7))
  • ?? ?? ? ?? ??? ?

53
Feature Detection: Chinese
  • I saw the red book
  • ((1, 3)(2, 4)(2, 5)(4, 1)(5, 2))
  • ??? ?, ? ?? ?
  • I saw a red book.
  • ((1,1)(2,2)(2,3)(2, 4)(4,5)(5,6))
  • ? ?? ? ?? ??? ? ?
  • Feature: definiteness
  • Values: definite, indefinite
  • Function-of-: object
  • Marked-on-head-of-: no
  • Marked-on-dependent: no
  • Marked-on-governor: no
  • Add/delete-word: yes
  • Change-in-alignment: yes

54
Feature Detection: Hebrew
  • Feature: definiteness
  • Values: definite, indefinite
  • Function-of-: subj, obj
  • Marked-on-head-of-: yes
  • Marked-on-dependent: yes
  • Marked-on-governor: no
  • Add-word: no
  • Change-in-alignment: no
  • A girl saw a red book.
  • ((2,1) (3,2)(5,4)(6,3))
  • ???? ???? ??? ????
  • The girl saw a red book
  • ((1,1)(2,1)(3,2)(5,4)(6,3))
  • ????? ???? ??? ????
  • I saw a red book.
  • ((2,1)(4,3)(5,2))
  • ????? ??? ????
  • I saw the red book.
  • ((2,1)(3,3)(3,4)(4,4)(5,3))
  • ????? ?? ???? ?????

55
Feature Detection Feeds into
  • Corpus Navigation: which minimal pairs to pursue
    next.
  • Don't pursue gender in Mapudungun
  • Do pursue definiteness in Hebrew
  • Morphology Learning:
  • Morphological learner identifies the forms of the
    morphemes
  • Feature detection identifies the functions
  • Rule learning:
  • Rule learner will have to learn a constraint for
    each morpho-syntactic marker that is discovered
  • E.g., adjectives and nouns agree in gender,
    number, and definiteness in Hebrew.