Title: Towards automatic enrichment and analysis of linguistic data for low-density languages

Transcript and Presenter's Notes
1
Towards automatic enrichment and analysis of
linguistic data for low-density languages
  • Fei Xia
  • University of Washington
  • Joint work with William Lewis and Dan Jinguji

2
Motivation: theoretical linguistics
  • For a particular language (e.g., Yaqui), find the
    answers to the following questions:
  • What is the word order: SVO, SOV, VSO, ...?
  • Does it have a double-object construction?
  • Can a coordinated phrase be discontinuous? (e.g.,
    NP1 Verb and NP2)
  • ...
  • We want to know the answers for hundreds of
    languages.

3
Motivation: computational linguistics
  • For a particular language, we want to build
  • a part-of-speech tagger and a parser
  • Common approach: create a treebank
  • an MT system
  • Common approach:
  • collect parallel data
  • test translation divergence (Dorr, 1994; Fox,
    2002; Hwa et al., 2002)

4
Main ideas
  • Projecting structures from a resource-rich
    language (e.g., English) to a low-density
    language.
  • Tapping the large body of Web-based linguistic
    data → using the ODIN dataset

5
Structure projection
  • Previous work
  • (Yarowsky & Ngai, 2001): POS tags and NP
    boundaries
  • (Xi & Hwa, 2005): POS tags
  • (Hwa et al., 2002): dependency structures
  • (Quirk et al., 2005): dependency structures
  • Our work
  • Projecting both dependency structures and phrase
    structures
  • It does not require a large amount of parallel
    data or hand-aligned data.
  • It can be applied to hundreds of languages.

6
Outline
  • Background: IGT and ODIN
  • Data enrichment
  • Word alignment
  • Structure projection
  • Grammar extraction
  • Experiments
  • Conclusion and future work

7
Background: IGT and ODIN
8
Interlinear Glossed Text (IGT)
  • Rhoddodd yr athro lyfr i'r
    bachgen ddoe
  • Gave-3sg the teacher book to-the boy
    yesterday
  • The teacher gave a book to the boy yesterday
  • (Bailyn, 2001)
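
For concreteness, here is a minimal sketch of how one IGT instance might be represented in code. The class and field names are illustrative assumptions, not ODIN's actual schema.

```python
from dataclasses import dataclass

@dataclass
class IGT:
    """One Interlinear Glossed Text instance: three aligned lines."""
    source: str        # language line, e.g., Welsh
    gloss: str         # word-by-word gloss line
    translation: str   # free English translation
    citation: str = ""

welsh = IGT(
    source="Rhoddodd yr athro lyfr i'r bachgen ddoe",
    gloss="Gave-3sg the teacher book to-the boy yesterday",
    translation="The teacher gave a book to the boy yesterday",
    citation="(Bailyn, 2001)",
)
```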

9
ODIN
  • Online Database of INterlinear text
  • Storing and indexing IGT found in scholarly
    documents on the Web
  • Searchable by language name, language family,
    concept/gram, etc.
  • Current size:
  • 36,439 instances
  • 725 languages

10
(No Transcript)
11
Data Enrichment
12
The goal
  • Original IGT: three lines
  • Enriched IGT
  • English phrase structure (PS), dependency
    structure (DS)
  • Source PS and DS
  • Word alignment between source and English
    translation

13
Three steps
  • Parse the English translation
  • Align the source sentence and its English
    translation
  • Project the English PS and DS onto the source side

14
Step 1: Parsing the English translation
The teacher gave a book to the boy yesterday
15
Step 2: Word alignment
16
Source-gloss alignment
17
Gloss-translation alignment
18
Heuristic word aligner
  • Gave-3sg the teacher book to-the boy yesterday
  • The teacher gave a book to the boy yesterday
  • The aligner aligns two words if they have the
    same root form.
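
A minimal sketch of this heuristic, assuming a crude suffix-stripping normalizer in place of whatever root-form analysis was actually used:

```python
def root(word):
    """Crude root-form normalization: lowercase, keep the material before
    the first gloss hyphen ('Gave-3sg' -> 'gave'), and strip a few English
    inflectional suffixes. A stand-in for a real morphological analyzer."""
    w = word.lower().split("-")[0]
    for suffix in ("ing", "ed", "es", "s"):
        if w.endswith(suffix) and len(w) > len(suffix) + 2:
            return w[: -len(suffix)]
    return w

def heuristic_align(gloss_words, trans_words):
    """Align gloss word i to translation word j iff they share a root form."""
    return [(i, j)
            for i, g in enumerate(gloss_words)
            for j, t in enumerate(trans_words)
            if root(g) == root(t)]

gloss = "Gave-3sg the teacher book to-the boy yesterday".split()
trans = "The teacher gave a book to the boy yesterday".split()
print(heuristic_align(gloss, trans))  # (0, 2) links 'Gave-3sg' to 'gave'
```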

19
Limitation of heuristic word aligner
  • 1SG pig-NNOM.SG grasp-PST and cat-NNOM.SG
  • I caught the pig and the cat
  • ('grasp' and 'caught' share no root form, so the
    heuristic cannot align them.)

20
Statistical word aligner
  • GIZA++ package (Och and Ney, 2000)
  • It implements the IBM models (Brown et al., 1993)
  • Widely used in the statistical MT field
  • Parallel corpus: formed by the gloss and
    translation lines of all the IGT examples in ODIN
    (preparation sketched below)
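
A sketch of the corpus preparation, reusing the IGT record sketched earlier; the file names and the one-sentence-per-line format are assumptions about what a GIZA-style aligner consumes.

```python
def write_bitext(igts, gloss_path="odin.gloss", trans_path="odin.trans"):
    """Write the gloss and translation lines of each IGT instance as a
    sentence-aligned bitext, one pair per line, for a statistical aligner."""
    with open(gloss_path, "w", encoding="utf-8") as g_out, \
         open(trans_path, "w", encoding="utf-8") as t_out:
        for igt in igts:
            g_out.write(igt.gloss.lower() + "\n")
            t_out.write(igt.translation.lower() + "\n")

write_bitext([welsh])  # the example instance defined above
```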

21
Improving word aligner
  • Train both directions (gloss→trans, trans→gloss)
    and combine the results
  • Split words in the gloss line into morphemes
  • 1SG pig-NNOM.SG grasp-PST and
    cat-NNOM.SG
  • → 1SG pig -NNOM .SG grasp -PST and cat -NNOM
    .SG
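
A sketch of the split, keeping the '-' and '.' separators attached to the affixes so that grams stay distinguishable from stems; the exact tokenization convention is an assumption.

```python
import re

def split_gloss(gloss_line):
    """Split each gloss token into morphemes at '-' and '.',
    keeping the separator on the affix: 'pig-NNOM.SG' -> 'pig -NNOM .SG'."""
    morphemes = []
    for token in gloss_line.split():
        morphemes.extend(re.findall(r"[-.]?[^-.]+", token))
    return morphemes

print(" ".join(split_gloss("1SG pig-NNOM.SG grasp-PST and cat-NNOM.SG")))
# 1SG pig -NNOM .SG grasp -PST and cat -NNOM .SG
```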

22
Improving word aligner (cont.)
  • Pedro-NOM Goyo-ACC yesterday horse-ACC
    steal-PRFV-SAY-PRES
  • Pedro says Goyo has stolen the horse yesterday.
  • Add (x, x) sentence pairs: (Pedro, Pedro),
    (Goyo, Goyo), ... (see the sketch below)
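
A sketch of this step: any word that appears verbatim in both lines after the morpheme split, typically a proper name, is emitted as a single-word sentence pair and appended to the aligner's training data (split_gloss is from the sketch above).

```python
def identity_pairs(gloss_morphemes, trans_words):
    """Return (x, x) single-word sentence pairs for words shared verbatim
    by both lines, e.g., untranslated proper names."""
    shared = set(w.lower() for w in gloss_morphemes) & \
             set(w.lower() for w in trans_words)
    return [(w, w) for w in sorted(shared)]

gloss = split_gloss("Pedro-NOM Goyo-ACC yesterday horse-ACC steal-PRFV-SAY-PRES")
trans = "Pedro says Goyo has stolen the horse yesterday .".split()
print(identity_pairs(gloss, trans))
# [('goyo', 'goyo'), ('horse', 'horse'), ('pedro', 'pedro'), ('yesterday', 'yesterday')]
```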
23
Step 3 Projecting structures
  • Projecting DS
  • Previous work
  • (Hwa et al., 2002)
  • (Quirk et al., 2005)
  • Projecting PS

24
Projecting phrase structure
25
Projecting PS
  • Copy the English PS and remove all the unaligned
    English words
  • Replace English words with the corresponding source
    words
  • Starting from the root, reorder the children of each
    node
  • Attach unaligned source words
  • (Steps 1 and 2 are sketched below.)
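
A sketch of steps 1 and 2 under an assumed tree representation; step 3 (reordering) is spelled out on slide 39, and step 4 is omitted here. The Node class and the alignment format are illustrative assumptions.

```python
import copy
from itertools import count

class Node:
    """Phrase-structure node; leaves carry a word, and every node records
    the set of source-word positions it covers (its phrase span)."""
    def __init__(self, label, children=(), word=None):
        self.label, self.children, self.word = label, list(children), word
        self.positions = set()

def project_ps(eng_root, align, src_words):
    """Copy the English PS, drop unaligned English leaves (and any internal
    node left childless), then substitute aligned source words.
    `align` maps English leaf index -> list of source-word indices."""
    tree = copy.deepcopy(eng_root)
    leaf_index = count()                      # running English leaf counter

    def walk(node):
        if node.word is not None:             # a leaf
            i = next(leaf_index)
            if not align.get(i):              # step 1: unaligned, drop it
                return None
            node.positions = set(align[i])    # step 2: substitute words
            node.word = " ".join(src_words[j] for j in sorted(node.positions))
            return node
        node.children = [c for c in map(walk, node.children) if c is not None]
        for child in node.children:
            node.positions |= child.positions # propagate the phrase span
        return node if node.children else None

    return walk(tree)
```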

26
Starting with English PS
The teacher gave a book to the boy yesterday
27
Replacing English words
28
Reordering children
29
Calculating phrase spans
30
Reordering NP and VP
31
Removing VP
32
Removing a node in PS
33
After removing VP
34
Reordering VBD and NP
35
Removing NP
36
Merging IN and DT
37
Before reordering
38
After reordering
39
Reordering two children of x: y1 and y2
  • Let Si be the phrase span of yi
  • S1 and S2 don't overlap: reorder the two nodes
    according to the spans
  • S1 ⊂ S2: remove y2
  • S1 ⊃ S2: remove y1
  • S1 and S2 overlap, and neither is a strict subset
    of the other: remove both nodes;
    if y1 and y2 are leaf nodes, merge them
  • (See the sketch below.)
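
The same rule as executable pseudocode, operating on the Node objects from the projection sketch; here "removing" a node splices it out and promotes its children (as on slide 32), and the leaf-merge convention is an assumption.

```python
def splice_out(parent, node):
    """Remove `node`, promoting its children into its place."""
    i = parent.children.index(node)
    parent.children[i:i + 1] = node.children

def reorder_pair(parent, y1, y2):
    """Apply the span rule to two children y1, y2 of `parent`."""
    s1, s2 = y1.positions, y2.positions
    if not (s1 & s2):                          # disjoint: order by span
        if min(s2) < min(s1):
            i, j = parent.children.index(y1), parent.children.index(y2)
            parent.children[i], parent.children[j] = y2, y1
    elif s1 < s2:                              # strict subset: remove y2
        splice_out(parent, y2)
    elif s2 < s1:                              # strict superset: remove y1
        splice_out(parent, y1)
    elif y1.word is not None and y2.word is not None:
        y1.label += "+" + y2.label             # overlapping leaves: merge
        y1.word += " " + y2.word
        y1.positions = s1 | s2
        parent.children.remove(y2)
    else:                                      # partial overlap: remove both
        splice_out(parent, y1)
        splice_out(parent, y2)
```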

40
Attaching unaligned source words
41
Information that can be extracted from enriched
IGT
  • Grammars for source language
  • Transfer rules
  • Examples with interesting properties (e.g.,
    crossing dependencies)

42
Grammars
S → VBD NP NP PP NP
NP → DT NN
NP → NN
PP → IN+DT NN
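
A sketch of how such rules can be read off the projected trees (Node objects as in the projection sketch), accumulating counts over the corpus:

```python
from collections import Counter

def extract_cfg(trees):
    """Count CFG productions over projected phrase-structure trees."""
    rules = Counter()
    def walk(node):
        if node.children:
            rhs = " ".join(child.label for child in node.children)
            rules[(node.label, rhs)] += 1
            for child in node.children:
                walk(child)
    for tree in trees:
        walk(tree)
    return rules  # e.g., rules[("NP", "DT NN")] -> frequency of that rule
```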
43
Examples of crossing dependencies
  • Inepo kow-ta bwuise-k into
    mis-ta
  • 1SG pig-NNOM.SG grasp-PST and cat-NNOM.SG
  • I caught the pig and the cat
  • (Martinez Fabian, 2006)

44
Examples of crossing dependencies
45
Examples of crossing dependencies
46
Outline
  • Background: IGT and ODIN
  • Data enrichment
  • Experiments
  • Conclusion and future work

47
Experiments
  • Test on a small set of IGT examples for seven
    languages
  • SVO: German (GER) and Hausa (HUA)
  • SOV: Korean (KKN) and Yaqui (YAQ)
  • VSO: Irish (GLI) and Welsh (WLS)
  • VOS: Malagasy (MEX)

48
Test set
Numbers in the last row come from the Ethnologue
(Gordon, 2005).
Human annotators checked system output and
corrected:
  • English DS
  • word alignment
  • source DS
49
Heuristic word aligner
→ High precision, low recall.
50
Statistical word aligner training data
51
When gloss words are not split into morphemes
52
When gloss words are split into morphemes
A significant improvement: 0.812 → 0.909
53
When (x,x) pairs are added to training data
Adding (x, x) pairs: 0.909 → 0.919
Combining two word aligners: 0.919 → 0.928
54
Projection results
55
Oracle results with perfect English DS and/or
word alignment
Potential improvement: 81.45 → 90.64
56
Remaining errors
  • Oracle result: 90.64
  • Manually checked 43 errors in the German data:
  • 26 (60.5%) due to translation divergence (e.g.,
    head switching)
  • 8 (18.6%) due to mistakes of the projection
    heuristics
  • 9 (20.9%) due to non-exact translation

57
An example of non-exact translation
  • der Antrag des oder der
    Dozenten
  • the petition of-the.SG or of-the.PL
    docent.MSC
  • the petition of the docent
  • (Daniels, 2001)

58
Extracted CFG for Yaqui
  • S → NP VP 49/77
  • S → VP 9/77
  • VP → NP Verb 23/95
  • VP → Verb 17/95
  • VP → NP NP Verb 2/95
  • VP → NP Verb CC NP 2/95
  • → Yaqui looks like an SOV language

59
Extracted CFGs
60
Conclusion
  • We present a methodology for projecting structures
    (DS and PS) from English onto source-language data.
  • Applied to seven languages with promising
    results:
  • Word alignment: 94.03
  • Source DS: 81.45
  • Source DS (oracle): 90.64
  • From the enriched data, we extract CFGs and examples
    of crossing dependencies.

61
Future direction: theoretical linguistics
  • For a particular language (e.g., Yaqui), find the
    answers to the following questions:
  • What is the word order: SVO, SOV, VSO, ...?
  • Does it have a double-object construction?
  • Can a coordinated phrase be discontinuous? (e.g.,
    NP1 Verb and NP2)
  • ...
  • Our plan:
  • Improve current algorithms
  • Test our system on more languages

62
Future direction: computational linguistics
  • For a particular language, we want to build
  • a part-of-speech tagger and a parser
  • Our plan: use enriched data as seed and
    experiment with prototype-driven learning
    strategies (Haghighi and Klein, 2006)
  • an MT system
  • Our plan:
  • Use enriched data as seed, as in (Quirk and
    Corston-Oliver, 2006)
  • Test translation divergence automatically for
    dozens or even hundreds of languages.

63
Thank you
64
Backup slides
65
Structural queries on the source side
  • Find examples of double objects
  • Find examples of long-distance wh-movement
  • Determine the word order between
  • subject and VP
  • noun and relative clause
  • verb and PP
  • → Need to know the structure of the source
    sentence

66
Gloss-translation alignment
  • Both are in English
  • 1SG pig-NNOM.SG grasp-PST and
    cat-NNOM.SG
  • I caught the pig and the cat
  • We experimented with two word aligners

67
Combining two word aligners
  • (1) Combine the alignment outputs: union,
    intersection, refined (see the sketch below)
  • (2) Add the aligned pairs produced by the heuristic
    word aligner to the training data
  • (3) Modify the heuristic word aligner so that two
    words are aligned if
  • they have the same root form, or
  • they are good translations according to the
    translation model produced by GIZA
  • (3) yields a modest gain: (0.914, 0.919)
    → 0.928
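
A sketch of option (1), treating each aligner's output as a set of (gloss index, translation index) pairs. The 'refined' variant here is a simplification in the spirit of Och and Ney's refined symmetrization, not the exact algorithm.

```python
def combine(heur, giza):
    """Union, intersection, and a simplified 'refined' combination
    of two alignment sets."""
    union, inter = heur | giza, heur & giza
    refined = set(inter)
    for i, j in sorted(union - inter):
        # add a union-only link only if both of its words are still unaligned
        if all(i != a and j != b for a, b in refined):
            refined.add((i, j))
    return union, inter, refined
```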

68
Projecting DS
  • Copy the English DS and remove all the unaligned
    English words
  • Replace English words with corresponding source
    words
  • Remove duplicates if any
  • Attach unaligned source words
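
A sketch of the first three steps, assuming a head-map representation of the DS (word index -> head index, -1 for the root); attaching unaligned source words follows the heuristics of (Quirk et al., 2005) and is omitted here.

```python
def project_ds(eng_heads, align):
    """Project a dependency structure onto the source sentence.
    `eng_heads` maps English word index -> head index (-1 for the root);
    `align` maps English word index -> list of source-word indices.
    Multiply-aligned words keep their first source position, which also
    collapses duplicates."""
    e2s = {e: min(src) for e, src in align.items() if src}
    src_heads = {}
    for e, h in eng_heads.items():
        if e not in e2s:
            continue                      # unaligned English word: dropped
        while h != -1 and h not in e2s:
            h = eng_heads[h]              # climb past dropped heads
        src_heads[e2s[e]] = e2s[h] if h != -1 else -1
    return src_heads
```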

69
Starting with English DS
The teacher gave a book to the boy yesterday
70
Replacing English words with source words
71
Removing duplicates
72
Attaching unaligned source words
The heuristics described in (Quirk et al., 2005)
73
Summary of the DS projection algorithm
74
Links between the two structures
75
Links between two structures
76
Links between two structures
One can extract transfer rules, treelets, etc.
77
Are MEX and GLI VOS or VSO?
  • MEX
  • S → VP 90/102
  • S → Verb 11/102
  • GLI
  • S → VP NP 22/41
  • S → Verb 19/41