NICE: Native Language Interpretation and Communication Environment

Slides: 53
Provided by: loril8
Learn more at: http://www.cs.cmu.edu
Transcript and Presenter's Notes

1
NICE: Native Language Interpretation and
Communication Environment
  • Lori Levin, Jaime Carbonell, Alon Lavie, Ralf
    Brown, Erik Peterson, Katharina Probst, Rodolfo
    Vega, Hal Daume
  • Language Technologies Institute
  • Carnegie Mellon University
  • April 12, 2001

2
NICE
  • Rapid development of machine translation for low
    and very low density languages

3
Classification of MT by Language Density
  • High density pairs (E-F, E-S, E-J, ...)
  • Statistical or traditional MT approaches are O.K.
  • Medium density (E-Czech, E-Croatian, ...)
  • Example-based MT (success with Croatian, Korean)
  • JHU initial success with stat-MT (Czech)
  • Low density (S-Mapudungun, E-Iñupiaq, ...)
  • 10,000 to 1 million speakers
  • Insufficient bilingual corpora for SMT, EBMT
  • Partial corpus-based resources
  • Insufficient trained computational linguists

4
Machine Translation of Very Low Density Languages
  • No text in electronic form
  • Can't apply current methods for statistical MT
  • No standard spelling or orthography
  • Few literate native speakers
  • Few linguists familiar with the language
  • Nobody is available to do rule-based MT
  • Not enough money or time for years of linguistic
    information gathering/analysis
  • E.g., Siona (Colombia)

5
Motivation for LDMT
  • Methods developed for languages with very scarce
    resources will generalize to all MT.
  • Policy makers can get input from indigenous
    people.
  • E.g., has there been an epidemic or a crop
    failure?
  • Indigenous people can participate in government,
    education, and internet without losing their
    language.
  • First MT of polysynthetic languages

6
New Ideas
  • MT without large amounts of text and without
    trained linguists
  • Machine learning of rule-based MT
  • Multi-Engine architecture can flexibly take
    advantage of whatever resources are available.
  • Research partnerships with indigenous communities
  • (Future: exponential models for data-miserly SMT)

7
History of NICE
  • Arose from a series of joint workshops of NSF and
    OAS-CICAD.
  • Workshop recommendations:
  • Create multinational projects using information
    technology to:
  • provide immediate benefits to governments and
    citizens
  • develop critical infrastructure for communication
    and collaborative research
  • train researchers and engineers
  • advance science and technology

8
Approach
  • Machine learning
  • Uncontrolled corpus (Generalized Example-Based
    MT)
  • Controlled corpus elicited from native speakers
    (Version Space Learning)
  • Multi-Engine MT
  • Flexibly adapt to whatever resources are
    available
  • Take advantage of the strengths of different MT
    approaches

9
Evaluation Objective
  • To achieve a given level of translation quality
    for a series of languages L1 to Ln
  • Reduce the amount of training data required
  • Reduce the amount of language-specific
    development time after language-independent
    software has been developed

10
Evaluation Baseline: From Previous Work
(Generalized EBMT)
  • High density languages (French, Spanish)
  • 1MW parallel corpora (e.g., subset of Hansards)
  • Consistent spelling, grammatically correct
  • High coverage, gisting-quality translation

11
Evaluation Baseline: GEBMT French Hansards
[Figure: coverage (in percent) as a function of corpus size (in millions of words)]

12
Long-Term Target: Reduction in Linguistic and
Human Resources

13
Work Completed

14
Establishing Partnerships

15
NICE Partners

16
NICE/Mapudungun: Current Products
  • Writing conventions (Grafemario)
  • Glossary Mapudungun/Spanish
  • Bilingual newspaper, 4 issues
  • Ultimas Familias memoirs
  • Memorias de Pascual Coña
  • 6 hours transcribed speech
  • 40 hours recorded speech

17
Instructible Rule-Based MT

18
iRBMT: Instructible Rule-Based MT

19
Elicitation Process
  • Purpose: controlled elicitation of data that will
    be input to machine learning of translation rules

20
Elicitation Interface Example

21
Elicitation Interface
  • Native informant sees source language sentence
    (in English or Spanish)
  • Native informant types in translation, then uses
    mouse to add word alignments
  • Informant is:
  • Literate
  • Bilingual
  • Not an expert in linguistics or computation

22
The Learning Process
  • Learning Instance
  • English: the big boy; Hebrew: ha-yeled ha-gadol
  • Acquired Transfer Rule
  • Hebrew NP: N ADJ <=> English NP: the ADJ N
  • where (Hebrew:N <=> English:N)
  • (Hebrew:ADJ <=> English:ADJ)
  • (Hebrew:N has ((def +)))
  • (Hebrew:ADJ has ((def +)))
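
The acquired rule above can be read as a small reordering program with feature checks. The following is a minimal sketch of that reading, not the NICE transfer-rule formalism: the `TransferRule` class, the `analyze` helper, the two-word `LEXICON` and the boolean `def` feature are all assumptions made for illustration.

```python
from dataclasses import dataclass

# Toy lexicon (hypothetical): Hebrew stem -> (part of speech, English word)
LEXICON = {
    "yeled": ("N", "boy"),
    "gadol": ("ADJ", "big"),
}


def analyze(token):
    """Split off the definite prefix 'ha-' and look the stem up."""
    definite = token.startswith("ha-")
    stem = token[3:] if definite else token
    pos, english = LEXICON[stem]
    return {"pos": pos, "english": english, "def": definite}


@dataclass
class TransferRule:
    source_pattern: tuple   # source-side categories, e.g. ("N", "ADJ")
    target_template: tuple  # literal words or source positions, e.g. ("the", 1, 0)
    constraints: tuple      # (source position, feature name, required value)

    def apply(self, tokens):
        """Return the English string, or None if the rule does not match."""
        words = [analyze(t) for t in tokens]
        if tuple(w["pos"] for w in words) != self.source_pattern:
            return None
        if any(words[i][feat] != val for i, feat, val in self.constraints):
            return None
        return " ".join(
            item if isinstance(item, str) else words[item]["english"]
            for item in self.target_template
        )


# Hebrew NP: N ADJ  ->  English NP: the ADJ N, both source words definite
NP_RULE = TransferRule(
    source_pattern=("N", "ADJ"),
    target_template=("the", 1, 0),
    constraints=((0, "def", True), (1, "def", True)),
)

print(NP_RULE.apply(["ha-yeled", "ha-gadol"]))  # the big boy
```

The rule returns None for input that breaks either the category pattern or the definiteness constraints, which is the kind of negative evidence a learner could use to tighten a rule.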

23
Standard Version Space Learning
  • Hypothesis Space of all possible rules consistent
    with data seen so far
  • Represented by a generalization lattice bounded
    by S (most specific) and G (most general)
    boundaries
  • New positive instances (translation pairs)
    generalize S
  • New negative instances (incorrect translations)
    specialize G
  • Converge when S and G intersect
  • Problem: worst-case exponential blow-up
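
The S and G updates described above can be sketched with the textbook candidate-elimination procedure over conjunctive hypotheses. This is an illustrative simplification, not the learner used in NICE: the attribute tuples (head category, modifier category, noun definiteness, adjective definiteness, number) are hypothetical, "?" stands for "any value", S is kept as a single hypothesis, and the first example is assumed to be positive.

```python
WILDCARD = "?"


def covers(hypothesis, instance):
    """True if every slot is a wildcard or matches the instance."""
    return all(h == WILDCARD or h == x for h, x in zip(hypothesis, instance))


def more_general_or_equal(h1, h2):
    """True if h1 covers everything that h2 covers."""
    return all(a == WILDCARD or a == b for a, b in zip(h1, h2))


def generalize(s, instance):
    """Minimally generalize S so that it covers a positive instance."""
    if s is None:
        return tuple(instance)
    return tuple(a if a == b else WILDCARD for a, b in zip(s, instance))


def specialize(g, instance, s):
    """Minimal specializations of g that exclude a negative instance
    while staying more general than S."""
    return [
        g[:i] + (s[i],) + g[i + 1:]
        for i, value in enumerate(g)
        if value == WILDCARD and s[i] != WILDCARD and s[i] != instance[i]
    ]


def candidate_elimination(examples, n_attributes):
    s = None
    g = {(WILDCARD,) * n_attributes}
    for instance, positive in examples:
        if positive:
            s = generalize(s, instance)
            g = {h for h in g if covers(h, instance)}
        else:
            if s is None:
                raise ValueError("this sketch assumes the first example is positive")
            new_g = set()
            for h in g:
                if covers(h, instance):
                    new_g.update(specialize(h, instance, s))
                else:
                    new_g.add(h)
            # Keep only the maximally general hypotheses.
            g = {
                h for h in new_g
                if not any(o != h and more_general_or_equal(o, h) for o in new_g)
            }
    return s, g


EXAMPLES = [
    (("N", "ADJ", "def", "def", "sg"), True),
    (("N", "ADJ", "def", "indef", "sg"), False),
    (("N", "ADJ", "def", "def", "pl"), True),
    (("N", "ADJ", "indef", "def", "pl"), False),
]
S, G = candidate_elimination(EXAMPLES, 5)
print(S)  # ('N', 'ADJ', 'def', 'def', '?')
print(G)  # {('?', '?', 'def', 'def', '?')}
```

After four examples S and G have not yet met: the data do not say whether the category slots matter, so more examples would be needed to converge.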

24
Locally-Constrained, Seeded Version Spaces
  • Preferred generalization level (e.g.,
    parts of speech, linguistic features, semantic
    features)
  • First translation pair generalized to preferred
    level => seeds the VS
  • Define P: the maximum number of levels of seed
    generalization or specialization (i.e., how
    close the initial guess is)
  • Generate S/P and G/P boundaries, and apply VS
    learning
  • Allow mutation operator if S/P and G/P prove
    incorrect

25
Advantages of Seeded Version Spaces
  • Worst case: polynomial with degree P =>
    "tractable"
  • Generalization level can be estimated reasonably
    well for MT transfer rules => good seeds
  • Faster convergence, requiring less training data
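
The "polynomial with degree P" claim can be illustrated with a back-of-the-envelope count. The model below is an assumption made for this example, not the analysis from the talk: a rule has `n_slots` constraint slots, each slot is either left at the seed's level or moved one level, and a hypothesis within distance P of the seed changes at most P slots.

```python
from math import comb


def seeded_space_size(n_slots, p):
    """Hypotheses reachable by changing at most p of n_slots slots: O(n_slots ** p)."""
    return sum(comb(n_slots, k) for k in range(p + 1))


def full_space_size(n_slots):
    """All subsets of slots may change: 2 ** n_slots hypotheses."""
    return 2 ** n_slots


print(seeded_space_size(20, 2))  # 211
print(full_space_size(20))       # 1048576
```

Under these assumptions a seed with P = 2 restricts a 20-slot rule to 211 candidates instead of about a million, which is why a good initial guess matters.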

26
Version Space Abstraction Lattice

27
The Elicitation Corpus
  • List of sentences in a major language
  • English
  • Spanish
  • Dynamically adaptable
  • Different sentences are presented depending on
    what was previously elicited
  • Compositional
  • Joe; Joe's brother; I saw Joe's brother; I told
    you that I saw Joe's brother; etc.
  • Aim for typological completeness
  • Cover all types of languages

28
Pilot Version of Elicitation Corpus
  • Approximately 800 sentences
  • Tested on Swahili
  • Vocabulary
  • Include a variety of semantic classes, e.g.,
    animate, inanimate, man-made objects, natural
    objects, etc.
  • Noun phrases
  • Detect number, gender, types of possessives,
    classifiers, etc.
  • Basic sentences
  • Detect agreement between verb and subject and/or
    object, basic word order, problems with
    indefinite or inanimate subjects, etc.
  • Complex constructions
  • Currently: relative clauses. Later: comparatives,
    questions, embedded clauses, etc.

29
Detection of Grammatical Features
  • Each language uses a different inventory of
    grammatical features: tense, number, person,
    agreement.

Swahili: The hunter kill-ed the animal
Mwindaji a-li-mu-ua mnyama
a = class-one subject, li = past tense, mu = class-one object, ua = kill

Fox (Algonquian):
Ne-waapam-aa-wa = I-see-direct-him
Ne-waapam-ek-wa = me-see-indirect-he
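
The examples on this slide pair each morpheme with one label. A minimal sketch of that pairing follows; the `align` helper is hypothetical, and it assumes hyphen-delimited morphemes with exactly one label per morpheme, which holds for the words shown here but not for every gloss.

```python
def align(word, labels):
    """Pair each hyphen-delimited morpheme of `word` with its gloss label."""
    morphemes = word.split("-")
    if len(morphemes) != len(labels):
        raise ValueError(
            f"{word!r} has {len(morphemes)} morphemes but {len(labels)} labels"
        )
    return list(zip(morphemes, labels))


# Swahili verb from the slide
swahili = align(
    "a-li-mu-ua",
    ["class-one subject", "past tense", "class-one object", "kill"],
)

# Fox (Algonquian) verbs from the slide; the glosses are themselves hyphenated
fox_direct = align("Ne-waapam-aa-wa", "I-see-direct-him".split("-"))
fox_other = align("Ne-waapam-ek-wa", "me-see-indirect-he".split("-"))

for morpheme, label in swahili:
    print(f"{morpheme:8}{label}")

# The two Fox forms differ in a single morpheme, which isolates the
# marker that distinguishes them.
diff = [(a, b) for a, b in zip(fox_direct, fox_other) if a[0] != b[0]]
print(diff)  # [(('aa', 'direct'), ('ek', 'indirect'))]
```

Comparing minimal pairs in this way is one simple route to spotting which morpheme carries a feature.
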
30
Organization of Tests
[Figure: organization of diagnostic tests; surviving labels: Subj-V Agr, Dual, Plural, Paucal]

31
Demo of Elicitation Interface and Feature
Detection

32
Data Collection

33
Mapudungun Data
  • Spanish-Mapudungun parallel corpora
  • Total words 223,366
  • Spanish-Mapudungun glossary
  • About 5500 entries
  • 40 hours of speech recorded
  • 6 hours of speech transcribed
  • Speech data will be translated into Spanish

34
Progress and Plans

35
Summary of Year 1: Partnerships
  • Establishment of a partnership with the Institute
    for Indigenous Studies at the Universidad de la
    Frontera (UFRO) in Chile.
  • Establishment of a partnership with the Chilean
    Ministry of Education.
  • Identified partners in Alaska and Colombia.
    Details of the partnership are being discussed.

36
Summary of Year 1: Data
  • Spanish-Mapudungun parallel corpus: over 200,000
    words
  • Standardization of orthography: linguists at UFRO
    have evaluated the competing orthographies for
    Mapudungun and written a report detailing their
    recommendations for a standardized orthography
    for NICE.
  • Training for spoken language collection: in
    January 2001, native speakers of Mapudungun were
    trained in the recording and transcription of
    spoken data.
  • Mapudungun spoken language corpus: 40 hours
    recorded, 6 hours transcribed (as of end of
    February).

37
Summary of Year 1: iRBMT
  • Preliminary design of transfer rule formalism for
    machine translation.
  • Design and pilot testing of prototype elicitation
    corpus.
  • First prototype of feature detection.
  • Morphological processing in PC-KIMMO covering
    about 40 Mapudungun morphemes.
  • Preliminary version of new parser for run-time
    translation component.

38
Goals for Year 2: Data
  • Continue collection, transcription, and
    translation of Mapudungun data.
  • Take inventory of existing Inupiaq data available
    from the Alaska Native Language Center and the
    Inupiaq community.
  • Focus on the North Slope dialect and other
    dialects that are easily intelligible to North
    Slope speakers. Type and record additional
    Inupiaq data as needed.
  • Plans for Siona data collection will be discussed
    at a meeting in Bogota in May.

39
Goals for Year 2: Elicitation Corpus
  • Extend the elicitation corpus with more complex
    constructions (such as causatives and
    comparatives) and add diagnostics for complex
    features such as the tense and aspect system.
  • Refine elicitation interface based on preliminary
    experiments.
  • Preliminary user studies with the corpus and
    interface using at least two languages.
  • Refine the linguistic corpus so as to accelerate
    learning of the more common and useful structures
    first.

40
Goals for Year 2: EBMT
  • Baseline EBMT systems for Mapudungun and Inupiaq.
  • Extend baseline systems with preliminary version
    of linguistic generalization.

41
Goals for Year 2: MT Run-time System
  • Develop learnable transfer-rule structure and
    interpreter.
  • Unlike existing hand-coded transfer systems for
    machine translation, a learnable structure
    requires full compositionality and
    component-wise generalizability/specializability
    for data-driven inductive learning.
  • Develop morphological processors and part of
    speech taggers for Mapudungun and Spanish.

42
Goals for Year 2: Version Space Learning
  • Develop baseline Seeded-Version-Space (SVS)
    inductive learning method
  • Extend the elicitation interface to enable the
    SVS system to generate questions for the native
    informant, so as to speed the transfer-rule
    learning process

43
Future Projects
  • Discussion

44
Appendix

45
The IEI Team
  • Coordinator (leader of a bilingual and
    multicultural education project)
  • Distinguished native speaker
  • Linguists (one native speaker, one near-native)
  • Typists/Transcribers
  • Recording assistants
  • Translators
  • Native speaker linguistic informants

46
Agreement Between LTI and Institute of Indigenous
Studies (IEI), Universidad De La Frontera, Chile
  • Contributions of IEI
  • Socio-linguistic knowledge
  • Linguistic knowledge
  • Experience in multicultural bilingual education
  • The use of IEI facilities, faculty/researchers
    and staff for the project
  • electronic network support and computer technical
    support

47
Agreement between LTI and Institute of Indigenous
Studies (IEI), Universidad de la Frontera, Chile
  • Contributions of LTI
  • Equipment: four computers and four DAT recorders
  • Payment of consulting fees pending funding from
    the Chilean Ministry of Education
  • Expertise in language technologies

48
LTI/IEI Agreement
  • Cooperate in expanding the project to convergent
    areas, such as bilingual education, as well as in
    pursuing additional funding

49
MINEDUC/IEI Agreement: Highlights
  • Based on the LTI/IEI agreement, the Chilean
    Ministry of Education became involved in funding
    the data collection and processing team for the
    year 2001. This agreement will be renewed each
    year, as needed.

50
MINEDUC/IEI Agreement
  • Objectives
  • To evaluate the NICE/Mapudungun proposal for
    orthography and spelling
  • To collect an oral corpus that represents the four
    Mapudungun dialects spoken in Chile. The main
    domain is primary health, both traditional and
    Western.

51
MINEDUC/IEI Agreement
  • Deliverables
  • An oral corpus of 800 hours recorded,
    proportional to the demography of each currently
    spoken dialect
  • 120 hours transcribed and translated from
    Mapudungun to Spanish
  • A refined proposal for writing Mapudungun

52
Mapudungun Morphology
  • kudu.le.me.we.la.n
  • lay_down.st.Hh.rem.neg.ind.1S
  • I am not going to lie down there any more
  • illku.faluw.kUle.n
  • get_angry.SIM.ST.IND.1s
  • I am pretending to be angry
  • antU.kUdaw.kiaw.ke.rke.fu.y
  • day.work.CIRC.CF.REP.IPD.IND.3s
  • He used to work here and there as a day laborer,
    I am told
  • wisa.ka.dungu.fe.nge.y.mi
  • bad.VERB.FAC.speak.NOM.VERB.IND.2s
  • You are someone who always does and says nasty
    things