Title: Towards automatic enrichment and analysis of linguistic data for low-density languages
1. Towards automatic enrichment and analysis of linguistic data for low-density languages
- Fei Xia
- University of Washington
- Joint work with William Lewis and Dan Jinguji
2. Motivation: theoretical linguistics
- For a particular language (e.g., Yaqui), find the answers to the following questions:
  - What is the word order: SVO, SOV, VSO, ...?
  - Does it have a double-object construction?
  - Can a coordinated phrase be discontinuous (e.g., NP1 Verb and NP2)?
  - ...
- We want to know the answers for hundreds of languages.
3. Motivation: computational linguistics
- For a particular language, we want to build:
  - a part-of-speech tagger and a parser
    - Common approach: create a treebank
  - an MT system
    - Common approach:
      - collect parallel data
      - test translation divergence (Dorr, 1994; Fox, 2002; Hwa et al., 2002)
4. Main ideas
- Projecting structures from a resource-rich language (e.g., English) to a low-density language
- Tapping the large body of Web-based linguistic data → using the ODIN dataset
5. Structure projection
- Previous work:
  - (Yarowsky & Ngai, 2001): POS tags and NP boundaries
  - (Xi & Hwa, 2005): POS tags
  - (Hwa et al., 2002): dependency structures
  - (Quirk et al., 2005): dependency structures
- Our work:
  - Projects both dependency structures and phrase structures
  - Does not require a large amount of parallel data or hand-aligned data
  - Can be applied to hundreds of languages
6. Outline
- Background: IGT and ODIN
- Data enrichment
  - Word alignment
  - Structure projection
  - Grammar extraction
- Experiments
- Conclusion and future work
7. Background: IGT and ODIN
8. Interlinear Glossed Text (IGT)
- Rhoddodd yr athro lyfr i'r bachgen ddoe
- Gave-3sg the teacher book to-the boy yesterday
- 'The teacher gave a book to the boy yesterday'
- (Bailyn, 2001)
9. ODIN
- Online Database of INterlinear text
- Stores and indexes IGT found in scholarly documents on the Web
- Searchable by language name, language family, concept/gram, etc.
- Current size:
  - 36,439 instances
  - 725 languages
10. (Slide contains only a screenshot; no transcript)
11. Data Enrichment
12. The goal
- Original IGT: three lines
- Enriched IGT:
  - English phrase structure (PS) and dependency structure (DS)
  - Source PS and DS
  - Word alignment between the source sentence and the English translation
13. Three steps
- Parse the English translation
- Align the source sentence and its English translation
- Project the English PS and DS onto the source side
14. Step 1: Parsing the English translation
The teacher gave a book to the boy yesterday
15. Step 2: Word alignment
16. Source-gloss alignment
17. Gloss-translation alignment
18. Heuristic word aligner
- Gave-3sg the teacher book to-the boy yesterday
- The teacher gave a book to the boy yesterday
- The aligner aligns two words if they have the same root form.
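As a sketch, such an aligner fits in a few lines. The `root` function below is a hypothetical stand-in for a real stemmer (an assumption, not the talk's actual code): it lowercases a token and strips gloss affixes such as `-3sg`.

```python
def root(word):
    # Hypothetical root extractor: lowercase the token and keep only
    # the part before the first hyphen, stripping gloss markers
    # such as "-3sg" and "-the".
    return word.lower().split("-")[0]

def heuristic_align(gloss, trans):
    """Link gloss word i to translation word j when their roots match."""
    return [(i, j)
            for i, g in enumerate(gloss)
            for j, t in enumerate(trans)
            if root(g) == root(t)]

gloss = "Gave-3sg the teacher book to-the boy yesterday".split()
trans = "The teacher gave a book to the boy yesterday".split()
links = heuristic_align(gloss, trans)
# e.g. (0, 2): Gave-3sg ~ gave; (3, 4): book ~ book
```

Note that a frequent function word like "the" links to every occurrence on the other side, which a real system would have to disambiguate.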
19. Limitation of the heuristic word aligner
- 1SG pig-NNOM.SG grasp-PST and cat-NNOM.SG
- I caught the pig and the cat
- The gloss and translation share no root forms (e.g., grasp vs. caught), so the heuristic aligner finds no links.
20. Statistical word aligner
- GIZA package (Och and Ney, 2000)
  - Implements the IBM models (Brown et al., 1993)
  - Widely used in the statistical MT field
- Training data: a parallel corpus formed by the gloss and translation lines of all the IGT examples in ODIN
21. Improving the word aligner
- Train in both directions (gloss→trans, trans→gloss) and combine the results
- Split words in the gloss line into morphemes:
  - 1SG pig-NNOM.SG grasp-PST and cat-NNOM.SG
  - → 1SG pig -NNOM -SG grasp -PST and cat -NNOM -SG
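The splitting step above can be sketched as follows, under the simplifying assumption that hyphens and periods inside a gloss token delimit morphemes and that each non-initial morpheme keeps a leading hyphen:

```python
import re

def split_gloss_word(word):
    # Assumed convention: "-" and "." delimit morphemes inside a gloss
    # token; keep a leading "-" on each non-initial morpheme so that
    # "pig-NNOM.SG" becomes ["pig", "-NNOM", "-SG"].
    parts = re.split(r"[-.]", word)
    return [parts[0]] + ["-" + p for p in parts[1:] if p]

def split_gloss_line(line):
    """Apply the split to every token of a gloss line."""
    return [m for w in line.split() for m in split_gloss_word(w)]

line = "1SG pig-NNOM.SG grasp-PST and cat-NNOM.SG"
```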
22. Improving the word aligner (cont.)
- Pedro-NOM Goyo-ACC yesterday horse-ACC steal-PRFV-SAY-PRES
- Pedro says Goyo has stolen the horse yesterday.
- Add (x,x) sentence pairs: (Pedro, Pedro), (Goyo, Goyo), ...
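One way to generate such identity pairs is sketched below (an assumption about the exact selection criterion): any token that appears verbatim on both sides of some sentence pair, such as a proper name, yields a one-word (x, x) training pair that nudges the statistical aligner toward aligning x with itself.

```python
def identity_pairs(corpus):
    """For every token that appears verbatim on both sides of some
    sentence pair (e.g. a proper name), emit a one-word (x, x)
    training pair.  corpus: list of (gloss_tokens, trans_tokens)."""
    pairs, seen = [], set()
    for gloss, trans in corpus:
        for tok in gloss:
            if tok in trans and tok not in seen:
                seen.add(tok)
                pairs.append(([tok], [tok]))
    return pairs

# Gloss side already split into morphemes, as in slide 21.
corpus = [(["Pedro", "-NOM", "Goyo", "-ACC", "yesterday",
            "horse", "-ACC", "steal", "-PRFV", "-SAY", "-PRES"],
           "Pedro says Goyo has stolen the horse yesterday .".split())]
pairs = identity_pairs(corpus)
```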
23. Step 3: Projecting structures
- Projecting DS
  - Previous work: (Hwa et al., 2002), (Quirk et al., 2005)
- Projecting PS
24. Projecting phrase structure
25. Projecting PS
- Copy the English PS and remove all the unaligned English words
- Replace English words with the corresponding source words
- Starting from the root, reorder the children of each node
- Attach unaligned source words
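The first two steps can be sketched over a toy tree encoding (each internal node is `(label, children)` with `(pos, word)` leaves; the tree and alignment below are illustrative assumptions, not the system's real data structures):

```python
def project_words(node, alignment):
    """Steps 1-2 of the projection: drop unaligned English leaves and
    substitute each remaining English word with its aligned source word.
    An internal node is (label, [children]); a leaf is (pos, word)."""
    label, children = node
    if isinstance(children, str):                # leaf
        src = alignment.get(children)
        return (label, src) if src else None     # None = drop the word
    kept = [c for c in (project_words(ch, alignment) for ch in children) if c]
    return (label, kept) if kept else None

# A fragment of the English PS, with a hypothetical alignment to the
# Welsh IGT example ("a" is unaligned and gets dropped).
tree = ("S", [("NP", [("DT", "the"), ("NN", "teacher")]),
              ("VP", [("VBD", "gave"),
                      ("NP", [("DT", "a"), ("NN", "book")])])])
alignment = {"the": "yr", "teacher": "athro",
             "gave": "Rhoddodd", "book": "lyfr"}
projected = project_words(tree, alignment)
```

Reordering and the attachment of unaligned source words (steps 3 and 4) operate on this projected tree afterwards.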
26. Starting with the English PS
The teacher gave a book to the boy yesterday
27. Replacing English words
28. Reordering children
29. Calculating phrase spans
30. Reordering NP and VP
31. Removing VP
32. Removing a node in the PS
33. After removing VP
34. Reordering VBD and NP
35. Removing NP
36. Merging IN and DT
37. Before reordering
38. After reordering
39. Reordering two children y1 and y2 of a node x
- Let Si be the phrase span of yi
- If S1 and S2 do not overlap: reorder the two nodes according to their spans
- If S1 ⊂ S2: remove y2
- If S1 ⊃ S2: remove y1
- If S1 and S2 overlap and neither is a strict subset of the other: remove both nodes
  - If y1 and y2 are leaf nodes, merge them instead
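The pairwise decision above translates directly into code (a sketch: spans are `(start, end)` source-word index pairs, end-exclusive; the leaf `merge` case is left out):

```python
def compare_spans(s1, s2):
    """Decide what to do with siblings y1, y2 given their source spans.
    Spans are (start, end) word-index pairs, end-exclusive."""
    (a1, b1), (a2, b2) = s1, s2
    if b1 <= a2 or b2 <= a1:                    # disjoint: just reorder
        return "reorder"
    if a2 <= a1 and b1 <= b2 and s1 != s2:      # S1 strict subset of S2
        return "remove_y2"
    if a1 <= a2 and b2 <= b1 and s1 != s2:      # S2 strict subset of S1
        return "remove_y1"
    return "remove_both"                        # overlap, no containment
```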
40. Attaching unaligned source words
(diagram: a node y with children yi, yj, yk)
41. Information that can be extracted from enriched IGT
- Grammars for the source language
- Transfer rules
- Examples with interesting properties (e.g., crossing dependencies)
42. Grammars
- S → VBD NP NP PP NP
- NP → DT NN
- NP → NN
- PP → IN+DT NN
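Reading such rules off a projected tree is a straightforward traversal. The sketch below uses the same toy `(label, children)` encoding (the tree itself is a simplified, illustrative assumption); counts over a whole treebank accumulate the same way:

```python
from collections import Counter

def extract_rules(node, counts):
    """Count one CFG production per internal node of a (label, children)
    tree; leaves are (pos, word) and contribute no rule."""
    label, children = node
    if isinstance(children, str):
        return
    counts[(label, tuple(ch[0] for ch in children))] += 1
    for ch in children:
        extract_rules(ch, counts)

# A projected tree for the Welsh example, simplified for illustration.
tree = ("S", [("VBD", "Rhoddodd"),
              ("NP", [("DT", "yr"), ("NN", "athro")]),
              ("NP", [("NN", "lyfr")])])
counts = Counter()
extract_rules(tree, counts)
```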
43. Examples of crossing dependencies
- Inepo kow-ta bwuise-k into mis-ta
- 1SG pig-NNOM.SG grasp-PST and cat-NNOM.SG
- I caught the pig and the cat
- (Martínez Fabián, 2006)
44. Examples of crossing dependencies
45. Examples of crossing dependencies
46. Outline
- Background: IGT and ODIN
- Data enrichment
- Experiments
- Conclusion and future work
47. Experiments
- Test on a small set of IGT examples for seven languages:
  - SVO: German (GER) and Hausa (HUA)
  - SOV: Korean (KKN) and Yaqui (YAQ)
  - VSO: Irish (GLI) and Welsh (WLS)
  - VOS: Malagasy (MEX)
48. Test set
- Numbers in the last row come from the Ethnologue (Gordon, 2005)
- Human annotators checked the system output and corrected:
  - English DS
  - word alignment
  - source DS
49. Heuristic word aligner
→ High precision, low recall.
50. Statistical word aligner: training data
51. When gloss words are not split into morphemes
52. When gloss words are split into morphemes
A significant improvement: 0.812 → 0.909
53. When (x,x) pairs are added to the training data
- Adding (x,x) pairs: 0.909 → 0.919
- Combining the two word aligners: 0.919 → 0.928
54. Projection results
55. Oracle results with perfect English DS and/or word alignment
Potential improvement: 81.45 → 90.64
56. Remaining errors
- Oracle result: 90.64
- We manually checked the 43 errors in the German data:
  - 26 (60.5%) due to translation divergence (e.g., head switching)
  - 8 (18.6%) due to mistakes of the projection heuristics
  - 9 (20.9%) due to non-exact translation
57. An example of non-exact translation
- der Antrag des oder der Dozenten
- the petition of-the.SG or of-the.PL docent.MSC
- 'the petition of the docent'
- (Daniels, 2001)
58. Extracted CFG for Yaqui
- S → NP VP 49/77
- S → VP 9/77
- VP → NP Verb 23/95
- VP → Verb 17/95
- VP → NP NP Verb 2/95
- VP → NP Verb CC NP 2/95
- ⇒ Yaqui looks like an SOV language
59. Extracted CFGs
60. Conclusion
- We presented a methodology for projecting structures (DS and PS) from English onto source-language data
- Applied to seven languages, with promising results:
  - Word alignment: 94.03
  - Source DS: 81.45
  - Source DS (oracle): 90.64
- From the enriched data, we extract CFGs and examples of crossing dependencies
61. Future direction: theoretical linguistics
- For a particular language (e.g., Yaqui), find the answers to the following questions:
  - What is the word order: SVO, SOV, VSO, ...?
  - Does it have a double-object construction?
  - Can a coordinated phrase be discontinuous (e.g., NP1 Verb and NP2)?
  - ...
- Our plan:
  - Improve the current algorithms
  - Test our system on more languages
62. Future direction: computational linguistics
- For a particular language, we want to build:
  - a part-of-speech tagger and a parser
    - Our plan: use the enriched data as a seed and experiment with prototype-driven learning strategies (Haghighi and Klein, 2006)
  - an MT system
    - Our plan:
      - Use the enriched data as a seed, as in (Quirk and Corston-Oliver, 2006)
      - Test translation divergence automatically for dozens or even hundreds of languages
63. Thank you
64. Backup slides
65. Structural queries on the source side
- Find examples of double objects
- Find examples of long-distance wh-movement
- Determine the word order between:
  - subject and VP
  - noun and relative clause
  - verb and PP
- ⇒ We need to know the structure of the source sentence
66. Gloss-translation alignment
- Both lines are in English:
  - 1SG pig-NNOM.SG grasp-PST and cat-NNOM.SG
  - I caught the pig and the cat
- We experimented with two word aligners
67. Combining the two word aligners
- (1) Combine the alignment output: union, intersection, refined
- (2) Add the aligned pairs produced by the heuristic word aligner to the training data
- (3) Modify the heuristic word aligner so that two words are aligned if:
  - they have the same root form, or
  - they are good translations according to the translation model produced by GIZA
- (3) yields a modest gain: (0.914, 0.919) → 0.928
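Combination (1) is just set algebra over the two directional link sets. A minimal sketch (the link sets below are hypothetical; the "refined" heuristic, which grows the intersection toward the union, is omitted):

```python
def combine(links_gt, links_tg, method="intersection"):
    """Combine gloss->trans links with trans->gloss links (option 1).
    links_tg is inverted so both sets live in gloss->trans space."""
    fwd = set(links_gt)
    rev = {(g, t) for (t, g) in links_tg}
    return fwd | rev if method == "union" else fwd & rev

# Hypothetical link sets from the two training directions.
gt = [(0, 2), (1, 0), (3, 4)]   # gloss index -> translation index
tg = [(2, 0), (4, 3), (6, 5)]   # translation index -> gloss index
```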
68. Projecting DS
- Copy the English DS and remove all the unaligned English words
- Replace English words with the corresponding source words
- Remove duplicates, if any
- Attach unaligned source words
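The core of the steps above can be sketched on a flat head-index encoding of the DS (a word list plus a parallel list of head indices; a simplified sketch that assumes a one-to-one alignment and omits duplicate removal and unaligned-source attachment):

```python
def project_ds(words, heads, alignment):
    """Sketch of DS projection: drop unaligned English words, reattach
    their dependents to the nearest kept ancestor, and substitute the
    aligned source words.  heads[i] is the index of word i's head,
    -1 for the root."""
    kept = [i for i, w in enumerate(words) if w in alignment]
    new_index = {old: new for new, old in enumerate(kept)}

    def kept_ancestor(i):
        # Climb the head chain until we hit a kept word or the root.
        while i != -1 and i not in new_index:
            i = heads[i]
        return i

    src_words = [alignment[words[i]] for i in kept]
    src_heads = [new_index.get(kept_ancestor(heads[i]), -1) for i in kept]
    return src_words, src_heads

words = ["the", "teacher", "gave", "a", "book"]
heads = [1, 2, -1, 4, 2]   # "gave" is the root; "a" depends on "book"
alignment = {"the": "yr", "teacher": "athro",
             "gave": "Rhoddodd", "book": "lyfr"}   # "a" is unaligned
```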
69. Starting with the English DS
The teacher gave a book to the boy yesterday
70. Replacing English words with source words
71. Removing duplicates
72. Attaching unaligned source words
- The heuristics described in (Quirk et al., 2005)
(diagram: nodes yi, yj, yk)
73. Summary of the DS projection algorithm
74. Links between the two structures
75. Links between the two structures
76. Links between the two structures
One can extract transfer rules, treelets, etc.
77. Are MEX and GLI VOS or VSO?
- MEX:
  - S → VP 90/102
  - S → Verb 11/102
- GLI:
  - S → VP NP 22/41
  - S → Verb 19/41