MT For Low-Density Languages

Transcript and Presenter's Notes

1
MT For Low-Density Languages
  • Ryan Georgi
  • Ling 575 MT Seminar
  • Winter 2007

2
What is Low Density?
3
What is Low Density?
  • In NLP, languages are usually chosen for
  • Economic Value
  • Ease of development
  • Funding (NSA, anyone?)

4
What is Low Density?
  • As a result, NLP work until recently has focused
    on a rather small set of languages.
  • e.g. English, German, French, Japanese, Chinese

5
What is Low Density?
  • Density refers to the availability of resources
    (primarily digital) for a given language.
  • Parallel text
  • Treebanks
  • Dictionaries
  • Chunked, semantically tagged, or other annotation

6
What is Low Density?
  • Density not necessarily linked to speaker
    population
  • Our favorite example: Inuktitut (relatively few
    speakers, yet the parallel Nunavut Hansard makes
    it comparatively resource-rich)

7
So, why study LDL?
8
So, why study LDL?
  • Preserving endangered languages
  • Spreading benefits of NLP to other populations
  • (Tegic has T9 for Azerbaijani now)
  • Benefits of wide typological coverage for
    cross-linguistic research
  • (?)

9
Problem of LDL?
10
Problem of LDL?
  • "The fundamental problem for annotation of
    lower-density languages is that they are lower
    density" (Maxwell & Hughes)
  • Easiest NLP development (and often best) done
    with statistical methods
  • Training requires lots of resources
  • Resources require lots of money
  • Cost/benefit: a chicken-and-egg problem

11
What are our options?
  • Create corpora by hand
  • Very time-consuming (and expensive)
  • Requires trained native speakers
  • Digitize printed resources
  • Time-consuming
  • May require trained native speakers
  • e.g. an orthography without Unicode entries

12
What are our options?
  • Traditional requirements are going to be
    difficult to satisfy, no matter how we slice it.
  • We need, then, to:
  • Maximize information extracted from resources we
    can get
  • Reduce requirements for building a system

13
Maximizing Information with IGT
14
Maximizing Information with IGT
  • Interlinear Glossed Text
  • Traditional form of transcription for linguistic
    field researchers and grammarians
  • Example
  • Rhoddodd yr athro lyfr i'r bachgen ddoe
  • gave-3sg the teacher book to-the boy
    yesterday
  • "The teacher gave a book to the boy yesterday"
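
A minimal sketch (mine, not from the talk) of how an IGT entry might be represented in code; the IGTEntry class and its field names are illustrative, not from any existing toolkit:

from dataclasses import dataclass

@dataclass
class IGTEntry:
    source: str       # language line (the low-density language)
    gloss: str        # word-by-word gloss, positionally aligned with source
    translation: str  # free translation into a high-density language

    def aligned_tokens(self):
        """Pair each source token with its positionally aligned gloss token."""
        return list(zip(self.source.split(), self.gloss.split()))

entry = IGTEntry(
    source="Rhoddodd yr athro lyfr i'r bachgen ddoe",
    gloss="gave-3sg the teacher book to-the boy yesterday",
    translation="The teacher gave a book to the boy yesterday",
)
print(entry.aligned_tokens())
# [('Rhoddodd', 'gave-3sg'), ('yr', 'the'), ('athro', 'teacher'), ...]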

15
Benefits of IGT
  • As IGT is frequently used in fieldwork, it is
    often available for low-density languages
  • IGT provides information about syntax,
    morphology, and more
  • The translation line is usually a high-density
    language that we can use as a pivot language.

16
Drawbacks of IGT
  • Data can be abnormal in a number of ways
  • Usually quite short
  • May be used by grammarians to illustrate fringe
    usages
  • Often purposely limited vocabularies
  • Still, in working with LDL it might be all we've
    got

17
Utilizing IGT
  • First, a big nod to Fei (this is her paper!)
  • As we saw in HW2, word alignment is hard.
  • IGT, however, often gets us halfway there!

18-22
Utilizing IGT
  • Take the previous example
  • Rhoddodd yr athro lyfr i'r bachgen ddoe
  • gave-3sg the teacher book to-the boy
    yesterday
  • "The teacher gave a book to the boy yesterday"

23
Utilizing IGT
  • Take the previous example
  • Rhoddodd yr athro lyfr i'r bachgen ddoe
  • gave-3sg the teacher book to-the boy
    yesterday
  • "The teacher gave a book to the boy yesterday"
  • The interlinear already aligns the source with
    the gloss
  • Often, the gloss uses words found in the
    translation already (see the sketch below)
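
A hedged sketch of that "free" alignment: source words pair with gloss tokens by position, and gloss pieces are matched to translation words by plain string equality. The function and its details are mine, not the paper's implementation; note that ambiguous matches (both "the"s here) still need disambiguation:

def align_via_gloss(source, gloss, translation):
    src_toks = source.split()
    gloss_toks = gloss.split()
    trans_words = translation.split()
    trans_norm = [w.lower().strip(".,!?") for w in trans_words]
    links = []
    # Source and gloss lines are aligned 1:1 by position in IGT.
    for src, gl in zip(src_toks, gloss_toks):
        # A gloss token like "to-the" covers several pieces; try each.
        for piece in gl.lower().split("-"):
            for j, tr in enumerate(trans_norm):
                if piece == tr:
                    links.append((src, trans_words[j]))
    return links

print(align_via_gloss(
    "Rhoddodd yr athro lyfr i'r bachgen ddoe",
    "gave-3sg the teacher book to-the boy yesterday",
    "The teacher gave a book to the boy yesterday",
))
# [('Rhoddodd', 'gave'), ('yr', 'The'), ('yr', 'the'), ('athro', 'teacher'), ...]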

24-26
Utilizing IGT
  • Alignment isn't always this easy
  • xaraju mina lgurfati wa nah.nu nadxulu
  • xaraj-u mina ?al-gurfat-i wa nah.nu na-dxulu
  • exited-3MPL from DEF-room-GEN and we 1PL-enter
  • "They left the room as we were entering it"
  • (Source: Modern Arabic: Structures, Functions,
    and Varieties, Clive Holes)
  • We can get a little more by stemming (see the
    sketch below)
  • but we're going to need more.
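
A sketch of the stemming idea: strip common English suffixes so a gloss piece like "enter" (from the gloss token "1PL-enter") can match "entering" in the translation. The toy stemmer below is mine; a real system might use an off-the-shelf one such as NLTK's PorterStemmer:

def crude_stem(word):
    # Deliberately naive suffix stripping, for illustration only.
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

gloss_piece = "enter"          # from the gloss token "1PL-enter"
translation_word = "entering"  # from "as we were entering it"
print(crude_stem(gloss_piece) == crude_stem(translation_word))  # True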

27-28
Utilizing IGT
  • Thankfully, with an English translation, we
    already have tools to get phrase and dependency
    structures that we can project (see the sketch
    below)
  • (Source: Will & Fei's NAACL 2007 paper!)
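
A bare-bones sketch of the projection idea (my simplification; the actual Xia & Lewis algorithm also has to handle unaligned and many-to-many words): copy each head-dependent edge of the English dependency tree across the word alignment into the source language:

def project_dependencies(eng_edges, align):
    """eng_edges: (head, dependent) index pairs over translation words.
    align: dict mapping a translation word index to a source word index."""
    projected = []
    for head, dep in eng_edges:
        if head in align and dep in align:
            projected.append((align[head], align[dep]))
    return projected

# Toy fragment of "The(0) teacher(1) gave(2) a(3) book(4) ...":
eng_edges = [(2, 1), (1, 0), (2, 4), (4, 3)]
# Alignment into "Rhoddodd(0) yr(1) athro(2) lyfr(3) ...";
# English "a"(3) has no Welsh counterpart, so its edge is dropped.
align = {2: 0, 1: 2, 0: 1, 4: 3}
print(project_dependencies(eng_edges, align))
# [(0, 2), (2, 1), (0, 3)]  i.e. Rhoddodd->athro, athro->yr, Rhoddodd->lyfr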

29
Utilizing IGT
  • What can we get from this?
  • Automatically generated CFGs (a toy extraction
    sketch follows)
  • Can infer word order from these CFGs
  • Can infer possible constituents
  • suggestions?
  • From a small amount of data, this is a lot of
    information, but what about…
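
A toy sketch of reading such rules off a projected tree; the tree shape and labels here are assumed for illustration, not taken from the paper. Note how Welsh's verb-initial order surfaces directly in the S rule:

from collections import Counter

def extract_rules(tree, rules):
    """tree: (label, children) where each child is a subtree or a word."""
    label, children = tree
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    rules[(label, rhs)] += 1  # counts can later seed PCFG probabilities
    for child in children:
        if not isinstance(child, str):
            extract_rules(child, rules)

tree = ("S", [("VB", ["Rhoddodd"]),
              ("NP", [("DT", ["yr"]), ("NN", ["athro"])]),
              ("NP", [("NN", ["lyfr"])])])
rules = Counter()
extract_rules(tree, rules)
for (lhs, rhs), n in rules.items():
    print(f"{lhs} -> {' '.join(rhs)}  x{n}")
# S -> VB NP NP  x1
# NP -> DT NN  x1  ... plus the lexical rules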

30
Reducing Data Requirements with Prototyping
31
Grammar Induction
  • So, we have a way to get production rules from a
    small amount of data.
  • Is this enough?
  • Probably not.
  • CFGs aren't known for their robustness
  • How about using what we have as a bootstrap?

32
Grammar Induction
  • Given unannotated text, we can derive PCFGs
  • Without annotation, though, we just have
    unlabelled trees
  • (ROOT (C2 (X0 the) (X1 dog) (Y2 (Z3 fell) (N4
    asleep))))
  • (each node in the original figure carried a rule
    probability: p=0.02, p=0.45e-4, p=0.003,
    p=5.3e-2, p=0.09)
  • Such an unlabelled parse doesn't give us S → NP
    VP, though.
33
Grammar Induction
  • Can we get labeled trees without annotated text?
  • Haghighi & Klein (2006)
  • Propose a way in which production rules can be
    passed to a PCFG induction algorithm as
    prototypical constituents
  • Think of these prototypes as a rubric that could
    be given to a human annotator
  • e.g. for English, NP → DT NN

34
Grammar Induction
  • Let's take the possible constituent DT NN
  • We could tell our PCFG algorithm to apply this as
    a constituent everywhere it occurs
  • But what about DT NN NN ("the train station")?
  • We would like to catch this as well (see the
    sketch below)
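
A toy illustration of the problem: an exact matcher marks every span whose POS sequence equals a prototype, so it finds "the train" but misses "the train station". The prototype table and function are mine, for illustration:

prototypes = {("DT", "NN"): "NP"}

def mark_prototype_spans(pos_tags):
    spans = []
    for i in range(len(pos_tags)):
        for j in range(i + 1, len(pos_tags) + 1):
            label = prototypes.get(tuple(pos_tags[i:j]))
            if label:
                spans.append((i, j, label))
    return spans

# "the train station closed"
tags = ["DT", "NN", "NN", "VBD"]
print(mark_prototype_spans(tags))
# [(0, 2, 'NP')] -- catches "the train" but not "the train station"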

35
Grammar Induction
  • H&K's solution?
  • Distributional clustering:
  • a similarity measure between two items on the
    basis of their immediate left and right contexts
    (sketched below)
  • (to be honest, I lose them in the math here)
  • Importantly, however, weighting the probability
    of a constituent with the right measure improves
    the unsupervised baseline F-measure from 35.3 to
    62.2
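
A hedged sketch of the distributional idea: characterize a candidate tag sequence by the counts of tags immediately to its left and right, then compare candidates to prototypes with a similarity measure (plain cosine below; the paper's actual machinery differs):

from collections import Counter
from math import sqrt

def context_signature(corpus_tags, sequence):
    """Count the (left_tag, right_tag) contexts of a tag sequence."""
    sig = Counter()
    n = len(sequence)
    for sent in corpus_tags:
        padded = ["<s>"] + sent + ["</s>"]
        for i in range(1, len(padded) - n):
            if padded[i:i + n] == list(sequence):
                sig[(padded[i - 1], padded[i + n])] += 1
    return sig

def cosine(a, b):
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

corpus = [["DT", "NN", "VBD"],
          ["DT", "NN", "NN", "VBD"],
          ["PRP", "VBD", "DT", "NN"]]
proto = context_signature(corpus, ("DT", "NN"))
cand = context_signature(corpus, ("DT", "NN", "NN"))
# Shared contexts suggest DT NN NN fills NP-like slots too.
print(cosine(proto, cand))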

36
So what now?
37
Next Steps
  • By extracting production rules from a very small
    amount of data using IGT, and applying Haghighi &
    Klein's unsupervised methods, it may be possible
    to bootstrap an effective language model from
    very little data!

38
Next Steps
  • Possible applications
  • Automatic generation of language resources
  • (While a system built on its own output would
    only compound error, automatically annotated
    data could be easier for a human to correct
    than to generate by hand)
  • Assist linguists in the field
  • (Better model performance could imply better
    grammar coverage)
  • you tell me!