Title: Building NLP Systems for Two Resource Scarce Indigenous Languages: Mapudungun and Quechua, and some
1Building NLP Systems for Two Resource Scarce
Indigenous Languages: Mapudungun and Quechua,
and some other languages
- Christian Monson, Ariadna Font Llitjós, Roberto
Aranovich, Lori Levin, Ralf Brown, Erik Peterson,
Jaime Carbonell, and Alon Lavie
2Omnivorous MT
- Eat whatever resources are available
- Eat large or small amounts of data
- Mapusaurus roseae
- Mapu: land
- Mapuche: land people
- Mapudungun: land speech
3AVENUE's Inventory
- Resources
- Parallel corpus
- Monolingual corpus
- Lexicon
- Morphological Analyzer (lemmatizer)
- Human Linguist
- Human non-linguist
- Techniques
- Rule based transfer system
- Example Based MT
- Morphology Learning
- Rule Learning
- Interactive Rule Refinement
- Multi-Engine MT
This research was funded in part by NSF grant
number IIS-0121-631.
4Startup without corpus or linguist
Requires someone who is bilingual and literate
5The Elicitation Tool has been used with these
languages
- Mapudungun
- Hindi
- Hebrew
- Quechua
- Aymara
- Thai
- Japanese
- Chinese
- Dutch
- Arabic
6Purpose of Elicitation
- srcsent: Tú caíste
- tgtsent: eymi ütrünagimi
- aligned: ((1,1),(2,2))
- context: tú = Juan (masculino, 2a persona del singular)
- comment: You (John) fell
- srcsent: Tú estás cayendo
- tgtsent: eymi petu ütünagimi
- aligned: ((1,1),(2 3,2 3))
- context: tú = Juan (masculino, 2a persona del singular)
- comment: You (John) are falling
- srcsent: Tú caíste
- tgtsent: eymi ütrunagimi
- aligned: ((1,1),(2,2))
- context: tú = María (femenino, 2a persona del singular)
- comment: You (Mary) fell
- Provide a small but highly targeted corpus of hand-aligned data
- To support machine learning from a small data set
- To discover basic word order
- To discover how syntactic dependencies are expressed
- To discover which grammatical meanings are reflected in the morphology or syntax of the language
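The aligned field of the elicitation records on this slide can be read mechanically. The following is a minimal sketch, assuming each link has the form (source positions,target positions) with 1-based, space-separated word positions. The function names parse_alignment and aligned_words are hypothetical and are not part of the Elicitation Tool.

```python
import re

def parse_alignment(text):
    """Parse '((1,1),(2 3,2 3))' into [([1], [1]), ([2, 3], [2, 3])]."""
    pairs = []
    # Each inner group looks like "(source positions,target positions)".
    for src, tgt in re.findall(r"\(([\d ]+),([\d ]+)\)", text):
        pairs.append(([int(i) for i in src.split()],
                      [int(j) for j in tgt.split()]))
    return pairs

def aligned_words(src_sent, tgt_sent, alignment):
    """Return (source words, target words) for every alignment link."""
    src, tgt = src_sent.split(), tgt_sent.split()
    return [(" ".join(src[i - 1] for i in s), " ".join(tgt[j - 1] for j in t))
            for s, t in parse_alignment(alignment)]

links = aligned_words("Tú estás cayendo", "eymi petu ütünagimi",
                      "((1,1),(2 3,2 3))")
print(links)
```

Here the second link pairs the two-word Spanish progressive with the two-word Mapudungun form, which is the kind of many-to-many evidence a rule learner can use.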
7Feature Structures
- srcsent: Mary was not a leader.
- context: Translate this as though it were spoken to a peer co-worker
- ((actor ((np-function fn-actor)(np-animacy anim-human)(np-biological-gender bio-gender-female)(np-general-type proper-noun-type)(np-identifiability identifiable)(np-specificity specific)))
- (pred ((np-function fn-predicate-nominal)(np-animacy anim-human)(np-biological-gender bio-gender-female)(np-general-type common-noun-type)(np-specificity specificity-neutral)))
- (c-v-lexical-aspect state)(c-copula-type copula-role)(c-secondary-type secondary-copula)(c-solidarity solidarity-neutral)(c-v-grammatical-aspect gram-aspect-neutral)(c-v-absolute-tense past)(c-v-phase-aspect phase-aspect-neutral)(c-general-type declarative-clause)(c-polarity polarity-negative)(c-my-causer-intentionality intentionality-n/a)(c-comparison-type comparison-n/a)(c-relative-tense relative-n/a)(c-our-boundary boundary-n/a))
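A parenthesized feature structure like the one on this slide can be loaded into nested dictionaries with a small recursive parser. This is only a sketch: the example string is an abbreviated version of the slide's structure, and the function names are hypothetical rather than taken from the AVENUE tools.

```python
def tokenize(text):
    """Split a parenthesized feature structure into tokens."""
    return text.replace("(", " ( ").replace(")", " ) ").split()

def parse(tokens):
    """Recursively build nested lists from a token list (consumed in place)."""
    token = tokens.pop(0)
    if token != "(":
        return token
    items = []
    while tokens[0] != ")":
        items.append(parse(tokens))
    tokens.pop(0)  # drop the closing ")"
    return items

def to_dict(node):
    """Turn [[name, value], ...] into a dict; nested lists become nested dicts."""
    result = {}
    for name, value in node:
        result[name] = to_dict(value) if isinstance(value, list) else value
    return result

# Abbreviated version of the "Mary was not a leader." structure.
fs_text = """((actor ((np-function fn-actor)(np-animacy anim-human)))
 (pred ((np-function fn-predicate-nominal)))
 (c-polarity polarity-negative)(c-v-absolute-tense past))"""
fs = to_dict(parse(tokenize(fs_text)))
print(fs["actor"]["np-animacy"], fs["c-polarity"])
```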
8Current Work
- Search space
- Elements of meaning that might be expressed by syntax or morphology: tense, aspect, person, number, gender, causation, evidentiality, etc.
- Syntactic dependencies: subject, object
- Interactions of features
- Tense and person
- Tense and interrogative mood
- Etc.
9Current Work
- For a new language
- For each item of the search space
- Eliminate it as irrelevant or
- Explore it
- Using as few sentences as possible
10Tools for Creating Elicitation Corpora
[Pipeline diagram]
- Tense & Aspect, Clause-Level, Noun-Phrase, Modality
- Feature Specification: list of semantic features and values
- Feature Maps: which combinations of features and values are of interest
- XML Schema, XSLT Script
- Feature Structure Sets
- Reverse Annotated Feature Structure Sets: add English sentences
- The Corpus
- Sampling
- Smaller Corpus
11Tools for Creating Elicitation Corpora
[Diagram as on slide 10, with: Combination Formalism]
12Tools for Creating Elicitation Corpora
[Diagram as on slide 10, with: Feature Structure Viewer]
13Tools for Creating Elicitation Corpora
[Diagram as on slide 10]
14Outline
- Two ideas
- Omnivorous MT
- Startup for low resource situation
- Four Languages
- Mapudungun
- Quechua
- Hindi
- Hebrew
15The Avenue Low Resource Scenario
[Architecture diagram]
Panels: Elicitation, Rule Learning, Run-Time System, Rule Refinement, Morphology
Components: Elicitation Tool, Elicitation Corpus, Word-Aligned Parallel Corpus, Translation Correction Tool, Rule Refinement Module, Lexical Resources, Run Time Transfer System, Decoder
Input and output: INPUT TEXT, OUTPUT TEXT
19Mapudungun Language
- 900,000 Mapuche people
- At least 300,000 speakers of Mapudungun
- Polysynthetic
- sl: pe-rke-fi-ñ Maria
- ver-REPORT-3pO-1pSgS/IND
- tl: DICEN QUE LA VI A MARÍA
- (They say that) I saw Maria.
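To make the polysynthetic example concrete, the segmented verb can be paired morpheme by morpheme with its gloss line. This is a minimal sketch; pair_gloss is a hypothetical helper, and it assumes hyphens separate morphemes on both lines.

```python
def pair_gloss(segmented_word, gloss):
    """Pair each hyphen-separated morpheme with its gloss label."""
    morphemes = [m.strip() for m in segmented_word.split("-") if m.strip()]
    labels = gloss.split("-")
    if len(morphemes) != len(labels):
        raise ValueError("morpheme and gloss counts differ")
    return list(zip(morphemes, labels))

pairs = pair_gloss("pe-rke-fi-ñ", "ver-REPORT-3pO-1pSgS/IND")
for morpheme, label in pairs:
    print(f"{morpheme}\t{label}")
```

A single Mapudungun verb here carries the stem, the reportative marker, the object agreement, and the subject agreement with mood, which Spanish spreads over several words.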
20AVENUE Mapudungun
- Joint project between Carnegie Mellon University,
the Chilean Ministry of Education, and
Universidad de la Frontera.
21Mapudungun to Spanish Resources
- Initially
- Large team of native speakers at Universidad de la Frontera, Temuco, Chile
- Some knowledge of linguistics
- No knowledge of computational linguistics
- No corpus
- A few short word lists
- No morphological analyzer
- Later: Computational linguists with non-native knowledge of Mapudungun
- Other considerations
- Produce something that is useful to the community, especially for bilingual education
- Experimental MT systems are not useful
22Mapudungun
[Avenue Low Resource Scenario diagram as on slide 15, annotated with:]
- Corpus: 170 hours of spoken Mapudungun
- Example Based MT
- Spelling checker
- Spanish Morphology from UPC, Barcelona
23Mapudungun Products
- http://www.lenguasamerindias.org/
- Click "traductor mapudungún"
- Dictionary lookup (Mapudungun to Spanish)
- Morphological analysis
- Example Based MT (Mapudungun to Spanish)
24I Didn't see Maria
[Parse trees, flattened in extraction]
Mapudungun: pe-la-fi-ñ Maria (nodes: S, VP, NP, N, V, VSuffG, VSuff)
Spanish: no vi a María (nodes: S, VP, NP, N, V)
25Transfer to Spanish: Top-Down
VP::VP [VBar NP] -> [VBar "a" NP]
(
  (X1::Y1)
  (X2::Y3)
  ((X2 type) = (*NOT* personal))
  ((X2 human) =c +)
  (X0 = X1)
  ((X0 object) = X2)
  (Y0 = X0)
  ((Y0 object) = (X0 object))
  (Y1 = Y0)
  (Y3 = (Y0 object))
  ((Y1 objmarker person) = (Y3 person))
  ((Y1 objmarker number) = (Y3 number))
  ((Y1 objmarker gender) = (Y3 gender))
)
[Tree diagram, flattened in extraction: the Mapudungun tree (S, VP, NP, N, V, VSuffG, VSuff; leaves pe, la, fi, ñ, Maria) and the partial Spanish tree (S, VP, NP, "a")]
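The effect of the VP rule on this slide can be illustrated with a toy function: it keeps the verb group, inserts "a" before the object NP, and copies the object's agreement features onto the verb's object marker, but only when the object is human and not a personal pronoun. This is a sketch under strong assumptions. The flat dictionary representation, the feature values, and the function name transfer_vp are all hypothetical, and the real transfer engine unifies feature structures rather than checking dictionary keys.

```python
def transfer_vp(vbar, obj_np):
    """Toy version of the rule VP -> VBar "a" NP.

    vbar and obj_np are dicts with a 'words' list plus feature keys
    (an illustrative representation, not the AVENUE data format).
    Returns (target words, object-marker features), or None if the
    rule's constraints are not met.
    """
    # Constraints on the object: not a personal pronoun, and human.
    if obj_np.get("type") == "personal" or obj_np.get("human") != "+":
        return None
    # Agreement features of the object are copied to the verb's object marker.
    objmarker = {k: obj_np.get(k) for k in ("person", "number", "gender")}
    return vbar["words"] + ["a"] + obj_np["words"], objmarker

verb_group = {"words": ["no", "vi"]}
maria = {"words": ["María"], "type": "proper", "human": "+",
         "person": "3", "number": "sg", "gender": "f"}
book = {"words": ["el", "libro"], "type": "common", "human": "-"}

print(transfer_vp(verb_group, maria))
print(transfer_vp(verb_group, book))
```

The second call returns None because the rule does not apply to a non-human object; a separate rule without "a" would cover that case.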
26AVENUE Hebrew
- Joint project of Carnegie Mellon University and
University of Haifa
27Hebrew Language
- Native language of about 3-4 million people in Israel
- Semitic language, closely related to Arabic and with similar linguistic properties
- Root+Pattern word formation system
- Rich verb and noun morphology
- Particles attach as prefixes to the following word: definite article (H), prepositions (B,K,L,M), coordinating conjunction (W), relativizers (,K)
- Unique alphabet and writing system
- 22 letters represent (mostly) consonants
- Vowels represented (mostly) by diacritics
- Modern texts omit the diacritic vowels, adding a level of ambiguity: bare word -> word
- Example: MHGR -> mehager, mhagar, mhger
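Because particles attach as prefixes and vowels are unwritten, a single written form can have several candidate segmentations. The sketch below enumerates ways to peel the single-letter particles listed on this slide off a transliterated word. It is an illustration only: it ignores ordering restrictions between particles, the min_stem cutoff is an arbitrary assumption, and the candidates are not checked against a lexicon as a real morphological analyzer would do.

```python
PREFIXES = {"H", "B", "K", "L", "M", "W"}  # particles listed on the slide

def candidate_splits(word, min_stem=2):
    """Enumerate ways to peel prefix particles off a transliterated word."""
    results = [([], word)]  # the unsegmented reading is always a candidate
    prefixes, rest = [], word
    while len(rest) > min_stem and rest[0] in PREFIXES:
        prefixes = prefixes + [rest[0]]
        rest = rest[1:]
        results.append((prefixes, rest))
    return results

for prefixes, stem in candidate_splits("MHGR"):
    print("+".join(prefixes + [stem]))
```

Each candidate would then still be ambiguous in its vowels, which is the second source of ambiguity the slide describes.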
28Hebrew Resources
- Morphological analyzer developed at Technion
- Constructed our own Hebrew-to-English lexicon, based primarily on existing Dahan H-to-E and E-to-H dictionary
- Human Computational Linguists
- Native Speakers
29Hebrew
[Avenue Low Resource Scenario diagram as on slide 15]
30Flat Seed Rule Generation
31Compositionality Learning
32Constraint Learning
33Quechua facts
- Agglutinative language
- A stem can often have 10 to 12 suffixes, but it can have up to 28 suffixes
- Supposedly clear-cut boundaries, but in reality several suffixes change when followed by certain other suffixes
- No irregular verbs, nouns or adjectives
- Does not mark for gender
- No adjective agreement
- No definite or indefinite articles (topic and focus markers perform a task similar to that of articles and intonation in English or Spanish)
34Quechua examples
- takini (also written takiniy)
- sing 1sg (I sing) -> canto
- takishani (takishaniy)
- sing progr 1sg (I am singing) -> estoy cantando
- takipakuqchu?
- taki: sing
- -paku: to join a group to do something
- -q: agentive
- -chu: interrogative
- -> (para) cantar con la gente (del pueblo)?
- (to sing with the people (of the village)?)
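The examples on this slide can be segmented with a simple longest-match suffix stripper. This is a toy sketch: the stem and suffix tables contain only the morphemes shown on the slide, and the greedy strategy does not handle suffixes that change shape before other suffixes, which the previous slide notes does happen in practice.

```python
STEMS = {"taki": "sing"}
SUFFIXES = {
    "sha": "progr",
    "ni": "1sg",
    "paku": "to join a group to do something",
    "q": "agentive",
    "chu": "interrogative",
}

def segment(word):
    """Split a word into (morpheme, gloss) pairs, or return None on failure."""
    word = word.rstrip("?")
    for stem, stem_gloss in STEMS.items():
        if not word.startswith(stem):
            continue
        rest, morphs = word[len(stem):], [(stem, stem_gloss)]
        while rest:
            # Try the longest suffixes first.
            match = next((s for s in sorted(SUFFIXES, key=len, reverse=True)
                          if rest.startswith(s)), None)
            if match is None:
                return None
            morphs.append((match, SUFFIXES[match]))
            rest = rest[len(match):]
        return morphs
    return None

for w in ["takini", "takishani", "takipakuqchu?"]:
    print(w, segment(w))
```

The alternative spellings takiniy and takishaniy are not covered by this table and return None, which shows how quickly spelling variation becomes an issue for even a small analyzer.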
35Quechua Resources
- A few native speakers, not linguists
- A computational linguist learning Quechua
- Two fluent, but non-native linguists
36Quechua
[Avenue Low Resource Scenario diagram as on slide 15, annotated with:]
- Parallel Corpus: OCR with correction
37Grammar rules
- takishani -> estoy cantando (I am singing)
- {VBar,3}
- VBar::VBar : [V VSuff VSuff] -> [V V]
- ( (X1::Y2)
- ((x0 person) = (x3 person))
- ((x0 number) = (x3 number))
- ((x2 mood) =c ger)
- ((y2 mood) = (x2 mood))
- ((y1 form) =c estar)
- ((y1 person) = (x3 person))
- ((y1 number) = (x3 number))
- ((y1 tense) = (x3 tense))
- ((x0 tense) = (x3 tense))
- ((y1 mood) = (x3 mood))
- ((x3 inflected) =c +)
- ((x0 inflected) = +))
Spanish Morphology Generation
lex = cantar, mood = ger -> cantando
lex = estar, person = 1, number = sg, tense = pres, mood = ind -> estoy
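The generation step at the end of this slide maps a lemma plus a feature bundle to a surface form. The sketch below stands in for that step with a two-entry lookup table covering only the slide's example; the real system uses a Spanish morphology generator, and the function name generate and the table layout are assumptions made for illustration.

```python
# Toy lookup table: (lemma, feature bundle) -> surface form.
FORMS = {
    ("cantar", frozenset({("mood", "ger")})): "cantando",
    ("estar", frozenset({("person", "1"), ("number", "sg"),
                         ("tense", "pres"), ("mood", "ind")})): "estoy",
}

def generate(lex, **features):
    """Return the surface form for a lemma and features, or None if unknown."""
    return FORMS.get((lex, frozenset(features.items())))

# The transfer rule outputs two verbs in the order [estar, cantar].
phrase = [generate("estar", person="1", number="sg", tense="pres", mood="ind"),
          generate("cantar", mood="ger")]
print(" ".join(phrase))
```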
38Hindi Resources
- Large statistical lexicon from the Linguistic Data Consortium (LDC)
- Parallel Corpus from LDC
- Morphological Analyzer-Generator from LDC
- Lots of native speakers
- Computational linguists with little or no knowledge of Hindi
- Experimented with the size of the parallel corpus
- Miserly and large scenarios
39Hindi
[Avenue Low Resource Scenario diagram as on slide 15, annotated with:]
- EBMT
- Parallel Corpus
- SMT
- 15,000 Noun Phrases from Penn TreeBank
- Supported by DARPA TIDES
40Manual Transfer Rules Example
Hindi: [NP [PP [NP [N1 [N jIvana]]] [P ke]] [NP1 [Adj eka] [N aXyAya]]]
English: [NP [NP1 [Adj one] [N chapter]] [PP [P of] [NP [N1 [N life]]]]]
NP1 ke NP2 -> NP2 of NP1
Ex: jIvana ke eka aXyAya
life of (one) chapter
-> a chapter of life
{NP,12}
NP::NP : [PP NP1] -> [NP1 PP]
( (X1::Y2) (X2::Y1) ((x2 lexwx) = 'kA') )
{NP,13}
NP::NP : [NP1] -> [NP1]
( (X1::Y1) )
{PP,12}
PP::PP : [NP Postp] -> [Prep NP]
( (X1::Y2) (X2::Y1) )
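The reordering that these manual rules perform can be sketched as a recursive walk over the Hindi tree: each rule swaps the children of a matching node, and leaves are translated through a word list. This is an illustration only. The tuple-based tree, the REORDER table, and the four-word lexicon are assumptions built from the slide's example, and the lexical constraint on the first rule is not modeled.

```python
LEXICON = {"jIvana": "life", "ke": "of", "eka": "one", "aXyAya": "chapter"}

# English child order, given as indices into the Hindi children.
REORDER = {
    ("NP", ("PP", "NP1")): [1, 0],  # NP: PP NP1 -> NP1 PP
    ("PP", ("NP", "P")): [1, 0],    # PP: NP postposition -> preposition NP
}

def transfer(node):
    """node is (label, word) for a leaf or (label, [children]) otherwise."""
    label, body = node
    if isinstance(body, str):
        return [LEXICON[body]]
    child_labels = tuple(child[0] for child in body)
    # Nodes with no matching rule keep their original child order.
    order = REORDER.get((label, child_labels), range(len(body)))
    words = []
    for i in order:
        words.extend(transfer(body[i]))
    return words

hindi = ("NP", [("PP", [("NP", [("N1", [("N", "jIvana")])]), ("P", "ke")]),
                ("NP1", [("Adj", "eka"), ("N", "aXyAya")])])
print(" ".join(transfer(hindi)))
```

Producing "a chapter of life" from this literal output would additionally require choosing the article, which is outside what the reordering rules themselves do.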
41Hindi-English
- Very miserly training data
- Seven combinations of components
- Strong decoder allows re-ordering
- Three automatic scoring metrics
42Extra Slides
44Feature Specification
- Defines Features and their values
- Sets default values for features
- Specifies feature requirements and restrictions
- Written in XML
45Feature Specification
- Feature: c-copula-type (a copula is a verb like "be"; some languages do not have copulas)
- Values:
- copula-n/a. Restrictions: 1. (c-secondary-type secondary-copula). Notes:
- copula-role. Restrictions: 1. (c-secondary-type secondary-copula). Notes: 1. A role is something like a job or a function. "He is a teacher" "This is a vegetable peeler"
- copula-identity. Restrictions: 1. (c-secondary-type secondary-copula). Notes: 1. "Clark Kent is Superman" "Sam is the teacher"
- copula-location. Restrictions: 1. (c-secondary-type secondary-copula). Notes: 1. "The book is on the table" There is a long list of locative relations later in the feature specification.
- copula-description. Restrictions: 1. (c-secondary-type secondary-copula). Notes: 1. A description is an attribute. "The children are happy." "The books are long."
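Since the feature specification is written in XML, an entry like c-copula-type could be loaded with a standard XML parser. The sketch below is only an illustration: the element and attribute names (feature, value, restriction, note, name) are hypothetical, because the slides do not show the actual schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML encoding of part of the c-copula-type entry.
SPEC = """
<feature name="c-copula-type">
  <value name="copula-role">
    <restriction>(c-secondary-type secondary-copula)</restriction>
    <note>A role is something like a job or a function.</note>
  </value>
  <value name="copula-identity">
    <restriction>(c-secondary-type secondary-copula)</restriction>
  </value>
</feature>
"""

def load_feature(xml_text):
    """Return the feature name and, per value, its list of restrictions."""
    root = ET.fromstring(xml_text)
    return {
        "name": root.get("name"),
        "values": {
            v.get("name"): [r.text for r in v.findall("restriction")]
            for v in root.findall("value")
        },
    }

feature = load_feature(SPEC)
print(feature["name"], sorted(feature["values"]))
```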
46Feature Maps
- Some features interact in the grammar
- English -s reflects person and number of the subject and tense of the verb.
- In expressing the English present progressive tense, the auxiliary verb is in a different place in a question and a statement
- He is running.
- Is he running?
- We need to check many, but not all, combinations of features and values.
- Using unlimited feature combinations leads to an unmanageable number of sentences
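The growth in the number of sentences can be made concrete by counting combinations. The sketch below compares the full cross product of five small features with a feature map that crosses only the features expected to interact. The feature names, values, and the two groups in feature_map are illustrative assumptions, not the contents of the actual feature specification.

```python
from itertools import product

FEATURES = {
    "person": ["1", "2", "3"],
    "number": ["sg", "pl"],
    "tense": ["past", "present", "future"],
    "polarity": ["positive", "negative"],
    "clause-type": ["declarative", "interrogative"],
}

def combinations(features, names):
    """All value combinations for the named features, as dicts."""
    return [dict(zip(names, values))
            for values in product(*(features[n] for n in names))]

# Unrestricted: every feature crossed with every other feature.
all_combos = combinations(FEATURES, list(FEATURES))

# Hypothetical feature map: cross only the features believed to interact.
feature_map = [["person", "number", "tense"], ["tense", "clause-type"]]
mapped = [c for names in feature_map for c in combinations(FEATURES, names)]

print(len(all_combos), len(mapped))
```

With only five features the unrestricted product already yields 72 combinations against 24 for the map, and the gap widens multiplicatively as features are added.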
48Evidentiality Map
Columns: Lexical Aspect, Assertiveness, Polarity, Source, Tense, Gram. Aspect
activity-accomplishment
Assertiveness-asserted, Assertiveness-neutral
Polarity-positive, Polarity-negative
Hearsay, quotative, inferred, assumption
Visual, Auditory, non-visual-or-auditory
Past
Present, Future
Past
Present
Perfective, progressive, habitual, neutral
Perfective, progressive, habitual, neutral
habitual, neutral, progressive
habitual, neutral, progressive
49Current Work
- Navigation
- Start: large search space of all possible feature combinations
- Finish: each feature has been eliminated as irrelevant or has been explored
- Goal: dynamically find the most efficient path through the search space for each language.
50Current Work
- Feature Detection
- Which features have an effect on morphosyntax?
- What is the effect?
- Drives the Navigation process
51Feature Detection Spanish
- The girl saw a red book.
- ((1,1)(2,2)(3,3)(4,4)(5,6)(6,5))
- La niña vió un libro rojo
- A girl saw a red book
- ((1,1)(2,2)(3,3)(4,4)(5,6)(6,5))
- Una niña vió un libro rojo
- I saw the red book
- ((1,1)(2,2)(3,3)(4,5)(5,4))
- Yo vi el libro rojo
- I saw a red book.
- ((1,1)(2,2)(3,3)(4,5)(5,4))
- Yo vi un libro rojo
- Feature: definiteness
- Values: definite, indefinite
- Function-of-: subj, obj
- Marked-on-head-of-: no
- Marked-on-dependent: yes
- Marked-on-governor: no
- Marked-on-other: no
- Add/delete-word: no
- Change-in-alignment: no
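Part of this report can be computed directly from a minimal pair: whether a word was added or deleted, whether the alignment changed, and which target positions differ. The sketch below does only that. Deciding whether a changed word is a head, dependent, or governor needs syntactic information that this sketch does not model, and the function name compare_pair and the report keys are hypothetical.

```python
def compare_pair(tgt_a, align_a, tgt_b, align_b):
    """Compare the target sides of a minimal pair of elicited sentences."""
    a, b = tgt_a.split(), tgt_b.split()
    same_length = len(a) == len(b)
    # 1-based positions whose words differ (only meaningful at equal length).
    changed = ([i + 1 for i, (x, y) in enumerate(zip(a, b)) if x != y]
               if same_length else [])
    return {
        "add/delete-word": not same_length,
        "change-in-alignment": align_a != align_b,
        "changed-positions": changed,
    }

alignment = [(1, 1), (2, 2), (3, 3), (4, 4), (5, 6), (6, 5)]
report = compare_pair("La niña vió un libro rojo", alignment,
                      "Una niña vió un libro rojo", alignment)
print(report)
```

For the Spanish subject pair, only position 1 (the determiner) changes, with no added word and no alignment change, which is consistent with the values listed on the slide.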
52Feature Detection Chinese
- Feature: definiteness
- Values: definite, indefinite
- Function-of-: subject
- Marked-on-head-of-: no
- Marked-on-dependent: no
- Marked-on-governor: no
- Add/delete-word: yes
- Change-in-alignment: no
- A girl saw a red book.
- ((1,2)(2,2)(3,3)(3,4)(4,5)(5,6)(5,7)(6,8))
- ? ?? ?? ?? ? ?? ?? ? ? ?
- The girl saw a red book.
- ((1,1)(2,1)(3,3)(3,4)(4,5)(5,6)(6,7))
- ?? ?? ? ?? ??? ?
53Feature Detection Chinese
- I saw the red book
- ((1, 3)(2, 4)(2, 5)(4, 1)(5, 2))
- ??? ?, ? ?? ?
- I saw a red book.
- ((1,1)(2,2)(2,3)(2, 4)(4,5)(5,6))
- ? ?? ? ?? ??? ? ?
- Feature: definiteness
- Values: definite, indefinite
- Function-of-: object
- Marked-on-head-of-: no
- Marked-on-dependent: no
- Marked-on-governor: no
- Add/delete-word: yes
- Change-in-alignment: yes
54Feature Detection Hebrew
- Feature: definiteness
- Values: definite, indefinite
- Function-of-: subj, obj
- Marked-on-head-of-: yes
- Marked-on-dependent: yes
- Marked-on-governor: no
- Add-word: no
- Change-in-alignment: no
- A girl saw a red book.
- ((2,1) (3,2)(5,4)(6,3))
- ???? ???? ??? ????
- The girl saw a red book
- ((1,1)(2,1)(3,2)(5,4)(6,3))
- ????? ???? ??? ????
- I saw a red book.
- ((2,1)(4,3)(5,2))
- ????? ??? ????
- I saw the red book.
- ((2,1)(3,3)(3,4)(4,4)(5,3))
- ????? ?? ???? ?????
55Feature Detection Feeds into
- Corpus Navigation: which minimal pairs to pursue next.
- Don't pursue gender in Mapudungun
- Do pursue definiteness in Hebrew
- Morphology Learning
- Morphological learner identifies the forms of the morphemes
- Feature detection identifies the functions
- Rule learning
- Rule learner will have to learn a constraint for each morpho-syntactic marker that is discovered
- E.g., Adjectives and nouns agree in gender, number, and definiteness in Hebrew.