Title: Towards automatic enrichment and analysis of linguistic data for low-density languages
1. Towards automatic enrichment and analysis of linguistic data for low-density languages
- Fei Xia
- University of Washington
- Joint work with William Lewis and Dan Jinguji
2. Motivation: theoretical linguistics
- For a particular language (e.g., Yaqui), find the answers to the following questions:
  - What is the word order: SVO, SOV, VSO, ...?
  - Does it have a double-object construction?
  - Can a coordinated phrase be discontinuous (e.g., NP1 Verb and NP2)?
  - ...
- We want to know the answers for hundreds of languages.
3. Motivation: computational linguistics
- For a particular language, we want to build:
  - a part-of-speech tagger and a parser
    - Common approach: create a treebank
  - an MT system
    - Common approach:
      - collect parallel data
      - test translation divergence (Dorr, 1994; Fox, 2002; Hwa et al., 2002)
4. Main ideas
- Projecting structures from a resource-rich language (e.g., English) to a low-density language
- Tapping the large body of Web-based linguistic data → using the ODIN dataset
5. Structure projection
- Previous work:
  - (Yarowsky & Ngai, 2001): POS tags and NP boundaries
  - (Xi & Hwa, 2005): POS tags
  - (Hwa et al., 2002): dependency structures
  - (Quirk et al., 2005): dependency structures
- Our work:
  - Projects both dependency structures and phrase structures
  - Does not require a large amount of parallel data or hand-aligned data
  - Can be applied to hundreds of languages
6. Outline
- Background: IGT and ODIN
- Data enrichment
  - Word alignment
  - Structure projection
  - Grammar extraction
- Experiments
- Conclusion and future work
7. Background: IGT and ODIN
8. Interlinear Glossed Text (IGT)
- Rhoddodd yr athro lyfr i'r bachgen ddoe
- Gave-3sg the teacher book to-the boy yesterday
- 'The teacher gave a book to the boy yesterday'
- (Bailyn, 2001)
9. ODIN
- Online Database of INterlinear text
- Stores and indexes IGT found in scholarly documents on the Web
- Searchable by language name, language family, concept/gram, etc.
- Current size:
  - 36,439 instances
  - 725 languages
10. (Slide contains only a screenshot; no transcript)
11. Data Enrichment
12. The goal
- Original IGT: three lines
- Enriched IGT:
  - English phrase structure (PS) and dependency structure (DS)
  - Source PS and DS
  - Word alignment between the source sentence and the English translation
13. Three steps
- Parse the English translation
- Align the source sentence and its English translation
- Project the English PS and DS onto the source side
14. Step 1: Parsing the English translation
The teacher gave a book to the boy yesterday
15. Step 2: Word alignment
16. Source-gloss alignment
17. Gloss-translation alignment
18. Heuristic word aligner
- Gave-3sg the teacher book to-the boy yesterday
- The teacher gave a book to the boy yesterday
- The aligner aligns two words if they have the same root form.
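As a sketch, such an aligner fits in a few lines. The `root` function below is a hypothetical stand-in for a real stemmer (an assumption, not the talk's actual code): it lowercases a token and strips gloss affixes such as `-3sg`.

```python
def root(word):
    # Hypothetical root extractor: lowercase the token and keep only
    # the part before the first hyphen, stripping gloss markers
    # such as "-3sg" and "-the".
    return word.lower().split("-")[0]

def heuristic_align(gloss, trans):
    """Link gloss word i to translation word j when their roots match."""
    return [(i, j)
            for i, g in enumerate(gloss)
            for j, t in enumerate(trans)
            if root(g) == root(t)]

gloss = "Gave-3sg the teacher book to-the boy yesterday".split()
trans = "The teacher gave a book to the boy yesterday".split()
links = heuristic_align(gloss, trans)
# e.g. (0, 2): Gave-3sg ~ gave; (3, 4): book ~ book
```

Note that a frequent function word like "the" links to every occurrence on the other side, which a real system would have to disambiguate.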
19. Limitation of the heuristic word aligner
- 1SG pig-NNOM.SG grasp-PST and cat-NNOM.SG
- I caught the pig and the cat
- The gloss and translation share no root forms (e.g., grasp vs. caught), so the heuristic aligner finds no links.
20. Statistical word aligner
- GIZA package (Och and Ney, 2000)
  - Implements the IBM models (Brown et al., 1993)
  - Widely used in the statistical MT field
- Training data: a parallel corpus formed by the gloss and translation lines of all the IGT examples in ODIN
21. Improving the word aligner
- Train in both directions (gloss→trans, trans→gloss) and combine the results
- Split words in the gloss line into morphemes:
  - 1SG pig-NNOM.SG grasp-PST and cat-NNOM.SG
  - → 1SG pig -NNOM -SG grasp -PST and cat -NNOM -SG
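The splitting step above can be sketched as follows, under the simplifying assumption that hyphens and periods inside a gloss token delimit morphemes and that each non-initial morpheme keeps a leading hyphen:

```python
import re

def split_gloss_word(word):
    # Assumed convention: "-" and "." delimit morphemes inside a gloss
    # token; keep a leading "-" on each non-initial morpheme so that
    # "pig-NNOM.SG" becomes ["pig", "-NNOM", "-SG"].
    parts = re.split(r"[-.]", word)
    return [parts[0]] + ["-" + p for p in parts[1:] if p]

def split_gloss_line(line):
    """Apply the split to every token of a gloss line."""
    return [m for w in line.split() for m in split_gloss_word(w)]

line = "1SG pig-NNOM.SG grasp-PST and cat-NNOM.SG"
```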
22. Improving the word aligner (cont.)
- Pedro-NOM Goyo-ACC yesterday horse-ACC steal-PRFV-SAY-PRES
- Pedro says Goyo has stolen the horse yesterday.
- Add (x,x) sentence pairs: (Pedro, Pedro), (Goyo, Goyo), ...
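One way to generate such identity pairs is sketched below (an assumption about the exact selection criterion): any token that appears verbatim on both sides of some sentence pair, such as a proper name, yields a one-word (x, x) training pair that nudges the statistical aligner toward aligning x with itself.

```python
def identity_pairs(corpus):
    """For every token that appears verbatim on both sides of some
    sentence pair (e.g. a proper name), emit a one-word (x, x)
    training pair.  corpus: list of (gloss_tokens, trans_tokens)."""
    pairs, seen = [], set()
    for gloss, trans in corpus:
        for tok in gloss:
            if tok in trans and tok not in seen:
                seen.add(tok)
                pairs.append(([tok], [tok]))
    return pairs

# Gloss side already split into morphemes, as in slide 21.
corpus = [(["Pedro", "-NOM", "Goyo", "-ACC", "yesterday",
            "horse", "-ACC", "steal", "-PRFV", "-SAY", "-PRES"],
           "Pedro says Goyo has stolen the horse yesterday .".split())]
pairs = identity_pairs(corpus)
```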
23. Step 3: Projecting structures
- Projecting DS
  - Previous work: (Hwa et al., 2002), (Quirk et al., 2005)
- Projecting PS
24. Projecting phrase structure
25. Projecting PS
- Copy the English PS and remove all the unaligned English words
- Replace English words with the corresponding source words
- Starting from the root, reorder the children of each node
- Attach unaligned source words
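The first two steps can be sketched over a toy tree encoding (each internal node is `(label, children)` with `(pos, word)` leaves; the tree and alignment below are illustrative assumptions, not the system's real data structures):

```python
def project_words(node, alignment):
    """Steps 1-2 of the projection: drop unaligned English leaves and
    substitute each remaining English word with its aligned source word.
    An internal node is (label, [children]); a leaf is (pos, word)."""
    label, children = node
    if isinstance(children, str):                # leaf
        src = alignment.get(children)
        return (label, src) if src else None     # None = drop the word
    kept = [c for c in (project_words(ch, alignment) for ch in children) if c]
    return (label, kept) if kept else None

# A fragment of the English PS, with a hypothetical alignment to the
# Welsh IGT example ("a" is unaligned and gets dropped).
tree = ("S", [("NP", [("DT", "the"), ("NN", "teacher")]),
              ("VP", [("VBD", "gave"),
                      ("NP", [("DT", "a"), ("NN", "book")])])])
alignment = {"the": "yr", "teacher": "athro",
             "gave": "Rhoddodd", "book": "lyfr"}
projected = project_words(tree, alignment)
```

Reordering and the attachment of unaligned source words (steps 3 and 4) operate on this projected tree afterwards.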
26. Starting with the English PS
The teacher gave a book to the boy yesterday
27. Replacing English words
28. Reordering children
29. Calculating phrase spans
30. Reordering NP and VP
31. Removing VP
32. Removing a node in the PS
33. After removing VP
34. Reordering VBD and NP
35. Removing NP
36. Merging IN and DT
37. Before reordering
38. After reordering
39. Reordering two children y1 and y2 of a node x
- Let Si be the phrase span of yi
- If S1 and S2 do not overlap: reorder the two nodes according to their spans
- If S1 ⊂ S2: remove y2
- If S1 ⊃ S2: remove y1
- If S1 and S2 overlap and neither is a strict subset of the other: remove both nodes
  - If y1 and y2 are leaf nodes, merge them instead
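The pairwise decision above translates directly into code (a sketch: spans are `(start, end)` source-word index pairs, end-exclusive; the leaf `merge` case is left out):

```python
def compare_spans(s1, s2):
    """Decide what to do with siblings y1, y2 given their source spans.
    Spans are (start, end) word-index pairs, end-exclusive."""
    (a1, b1), (a2, b2) = s1, s2
    if b1 <= a2 or b2 <= a1:                    # disjoint: just reorder
        return "reorder"
    if a2 <= a1 and b1 <= b2 and s1 != s2:      # S1 strict subset of S2
        return "remove_y2"
    if a1 <= a2 and b2 <= b1 and s1 != s2:      # S2 strict subset of S1
        return "remove_y1"
    return "remove_both"                        # overlap, no containment
```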
40. Attaching unaligned source words
(diagram: a node y with children yi, yj, yk)
41. Information that can be extracted from enriched IGT
- Grammars for the source language
- Transfer rules
- Examples with interesting properties (e.g., crossing dependencies)
42. Grammars
- S → VBD NP NP PP NP
- NP → DT NN
- NP → NN
- PP → IN+DT NN
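Reading such rules off a projected tree is a straightforward traversal. The sketch below uses the same toy `(label, children)` encoding (the tree itself is a simplified, illustrative assumption); counts over a whole treebank accumulate the same way:

```python
from collections import Counter

def extract_rules(node, counts):
    """Count one CFG production per internal node of a (label, children)
    tree; leaves are (pos, word) and contribute no rule."""
    label, children = node
    if isinstance(children, str):
        return
    counts[(label, tuple(ch[0] for ch in children))] += 1
    for ch in children:
        extract_rules(ch, counts)

# A projected tree for the Welsh example, simplified for illustration.
tree = ("S", [("VBD", "Rhoddodd"),
              ("NP", [("DT", "yr"), ("NN", "athro")]),
              ("NP", [("NN", "lyfr")])])
counts = Counter()
extract_rules(tree, counts)
```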
43. Examples of crossing dependencies
- Inepo kow-ta bwuise-k into mis-ta
- 1SG pig-NNOM.SG grasp-PST and cat-NNOM.SG
- I caught the pig and the cat
- (Martínez Fabián, 2006)
44. Examples of crossing dependencies
45. Examples of crossing dependencies
46. Outline
- Background: IGT and ODIN
- Data enrichment
- Experiments
- Conclusion and future work
47. Experiments
- Test on a small set of IGT examples for seven languages:
  - SVO: German (GER) and Hausa (HUA)
  - SOV: Korean (KKN) and Yaqui (YAQ)
  - VSO: Irish (GLI) and Welsh (WLS)
  - VOS: Malagasy (MEX)
48. Test set
- Numbers in the last row come from the Ethnologue (Gordon, 2005)
- Human annotators checked the system output and corrected:
  - English DS
  - word alignment
  - source DS
49. Heuristic word aligner
→ High precision, low recall.
50. Statistical word aligner: training data
51. When gloss words are not split into morphemes
52. When gloss words are split into morphemes
A significant improvement: 0.812 → 0.909
53. When (x,x) pairs are added to the training data
- Adding (x,x) pairs: 0.909 → 0.919
- Combining the two word aligners: 0.919 → 0.928
54. Projection results
55. Oracle results with perfect English DS and/or word alignment
Potential improvement: 81.45 → 90.64
56. Remaining errors
- Oracle result: 90.64
- We manually checked the 43 errors in the German data:
  - 26 (60.5%) due to translation divergence (e.g., head switching)
  - 8 (18.6%) due to mistakes of the projection heuristics
  - 9 (20.9%) due to non-exact translation
57. An example of non-exact translation
- der Antrag des oder der Dozenten
- the petition of-the.SG or of-the.PL docent.MSC
- 'the petition of the docent'
- (Daniels, 2001)
58. Extracted CFG for Yaqui
- S → NP VP 49/77
- S → VP 9/77
- VP → NP Verb 23/95
- VP → Verb 17/95
- VP → NP NP Verb 2/95
- VP → NP Verb CC NP 2/95
- ⇒ Yaqui looks like an SOV language
59. Extracted CFGs
60. Conclusion
- We presented a methodology for projecting structures (DS and PS) from English onto source-language data
- Applied to seven languages, with promising results:
  - Word alignment: 94.03
  - Source DS: 81.45
  - Source DS (oracle): 90.64
- From the enriched data, we extract CFGs and examples of crossing dependencies
61. Future direction: theoretical linguistics
- For a particular language (e.g., Yaqui), find the answers to the following questions:
  - What is the word order: SVO, SOV, VSO, ...?
  - Does it have a double-object construction?
  - Can a coordinated phrase be discontinuous (e.g., NP1 Verb and NP2)?
  - ...
- Our plan:
  - Improve the current algorithms
  - Test our system on more languages
62. Future direction: computational linguistics
- For a particular language, we want to build:
  - a part-of-speech tagger and a parser
    - Our plan: use the enriched data as a seed and experiment with prototype-driven learning strategies (Haghighi and Klein, 2006)
  - an MT system
    - Our plan:
      - Use the enriched data as a seed, as in (Quirk and Corston-Oliver, 2006)
      - Test translation divergence automatically for dozens or even hundreds of languages
63. Thank you
64. Backup slides
65. Structural queries on the source side
- Find examples of double objects
- Find examples of long-distance wh-movement
- Determine the word order between:
  - subject and VP
  - noun and relative clause
  - verb and PP
- ⇒ We need to know the structure of the source sentence
66. Gloss-translation alignment
- Both lines are in English:
  - 1SG pig-NNOM.SG grasp-PST and cat-NNOM.SG
  - I caught the pig and the cat
- We experimented with two word aligners
67. Combining the two word aligners
- (1) Combine the alignment output: union, intersection, refined
- (2) Add the aligned pairs produced by the heuristic word aligner to the training data
- (3) Modify the heuristic word aligner so that two words are aligned if:
  - they have the same root form, or
  - they are good translations according to the translation model produced by GIZA
- (3) yields a modest gain: (0.914, 0.919) → 0.928
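Combination (1) is just set algebra over the two directional link sets. A minimal sketch (the link sets below are hypothetical; the "refined" heuristic, which grows the intersection toward the union, is omitted):

```python
def combine(links_gt, links_tg, method="intersection"):
    """Combine gloss->trans links with trans->gloss links (option 1).
    links_tg is inverted so both sets live in gloss->trans space."""
    fwd = set(links_gt)
    rev = {(g, t) for (t, g) in links_tg}
    return fwd | rev if method == "union" else fwd & rev

# Hypothetical link sets from the two training directions.
gt = [(0, 2), (1, 0), (3, 4)]   # gloss index -> translation index
tg = [(2, 0), (4, 3), (6, 5)]   # translation index -> gloss index
```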
68. Projecting DS
- Copy the English DS and remove all the unaligned English words
- Replace English words with the corresponding source words
- Remove duplicates, if any
- Attach unaligned source words
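The core of the steps above can be sketched on a flat head-index encoding of the DS (a word list plus a parallel list of head indices; a simplified sketch that assumes a one-to-one alignment and omits duplicate removal and unaligned-source attachment):

```python
def project_ds(words, heads, alignment):
    """Sketch of DS projection: drop unaligned English words, reattach
    their dependents to the nearest kept ancestor, and substitute the
    aligned source words.  heads[i] is the index of word i's head,
    -1 for the root."""
    kept = [i for i, w in enumerate(words) if w in alignment]
    new_index = {old: new for new, old in enumerate(kept)}

    def kept_ancestor(i):
        # Climb the head chain until we hit a kept word or the root.
        while i != -1 and i not in new_index:
            i = heads[i]
        return i

    src_words = [alignment[words[i]] for i in kept]
    src_heads = [new_index.get(kept_ancestor(heads[i]), -1) for i in kept]
    return src_words, src_heads

words = ["the", "teacher", "gave", "a", "book"]
heads = [1, 2, -1, 4, 2]   # "gave" is the root; "a" depends on "book"
alignment = {"the": "yr", "teacher": "athro",
             "gave": "Rhoddodd", "book": "lyfr"}   # "a" is unaligned
```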
69. Starting with the English DS
The teacher gave a book to the boy yesterday
70. Replacing English words with source words
71. Removing duplicates
72. Attaching unaligned source words
- The heuristics described in (Quirk et al., 2005)
(diagram: nodes yi, yj, yk)
73. Summary of the DS projection algorithm
74. Links between the two structures
75. Links between the two structures
76. Links between the two structures
One can extract transfer rules, treelets, etc.
77. Are MEX and GLI VOS or VSO?
- MEX:
  - S → VP 90/102
  - S → Verb 11/102
- GLI:
  - S → VP NP 22/41
  - S → Verb 19/41