Title: Catch the Link Combining Clues for Word Alignment
1Catch the Link! Combining Clues for Word Alignment
- Jörg Tiedemann
- Uppsala University
- joerg_at_stp.ling.uu.se
2Outline
- Background
- What do we want?
- What do we have?
- What do we need?
- Clue Alignment
- What is a clue?
- How do we find clues?
- How do we use clues?
- What do we get?
3What do we want?
- automatically
- language independent
[Figure: processing pipeline with the labels Source, Translation 1, Translation 2, Parallel corpus, Sentence aligner, Word aligner, Aligned corpus, Token links, Type links]
4What do we have?
- tokeniser (ca. 99%)
- POS tagger (ca. 96%)
- lemmatiser (ca. 99%)
- shallow parser (ca. 92%), parser (> 80%)
- sentence aligner (ca. 96%)
- word aligner
- 75% precision
- 45% recall
5What's the problem with Word Alignment?
(1) Alsop says, "I have a horror of the bad American practice of choosing up sides in other people's politics, ..."
(2) Alsop förklarar "Jag fasar för den amerikanska ovanan att välja sida i andra människors politik, ..."
(Saul Bellow, To Jerusalem and Back: A Personal Account)

(1) Neutralitetspolitiken stöds av ett starkt försvar till värn för vårt oberoende.
(2) Our policy of neutrality is underpinned by a strong defence.
(The Declarations of the Swedish Government, 1988)

(1) Armén kommer att reformeras och effektiviseras.
(2) The army will be reorganized with the aim of making it more effective.
(The Declarations of the Swedish Government, 1988)
- Word alignment challenges
- non-linear mapping
- grammatical/lexical differences
- translation gaps
- translation extensions
- idiomatic expressions
- multi-word equivalences
(1) I take the middle seat, which I dislike, but I am not really put out.
(2) Jag tar mittplatsen, vilket jag inte tycker om, men det gör mig inte så mycket.
(Saul Bellow, To Jerusalem and Back: A Personal Account)

(1) Our Hasid is in his late twenties.
(2) Vår chassid är bortåt de trettio.
(Saul Bellow, To Jerusalem and Back: A Personal Account)
6So what? What are the real problems?
- Word alignment
- uses simple, fixed tokenisation
- fails to identify appropriate translation units
- ignores contextual dependencies
- ignores relevant linguistic information
- uses poor morphological analyses
7What do we need?
- flexible tokenisation
- possible multi-word units
- linguistic tools for several languages
- integration of linguistic knowledge
- combination of knowledge resources
- alignment in context
8Let's go!
- Clue Alignment!
- finding clues
- combining clues
- aligning words
9Word Alignment Clues
- The United Nations conference has started today .
- DT NNP NNP NN VBZ VBN RB
- Idag började FN-konferensen .
- RG0S V@IIAS NCUSN@DS
- conference / konferensen
10Word Alignment Clues
- Def.: A word alignment clue $C_i(s,t)$ is a probability which indicates an association between two lexical items, s and t, from parallel texts.
- Def.: A lexical item is a set of words with associated features attached to it.
11How do we find clues? (1)
- Clues can be estimated from association scores
- $C_i(s,t) = w_i \, A_i(s,t)$
- co-occurrence
- Dice coefficient: $A_1(s,t) = \mathrm{Dice}(s,t)$
- Mutual information: $A_2(s,t) = I(s;t)$
- string similarity
- longest common subsequence ratio: $A_3(s,t) = \mathrm{LCSR}(s,t)$
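The two association measures named on this slide can be sketched in a few lines. This is only an illustration, not the implementation behind the talk: the function names `dice`, `lcs_length` and `lcsr` and the raw-count arguments are assumptions, and LCSR is taken here as the length of the longest common subsequence divided by the length of the longer string.

```python
def dice(cooc, freq_s, freq_t):
    """Dice coefficient from a co-occurrence count and the two marginal counts."""
    if freq_s + freq_t == 0:
        return 0.0
    return 2.0 * cooc / (freq_s + freq_t)


def lcs_length(a, b):
    """Length of the longest common subsequence of a and b (dynamic programming)."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcsr(s, t):
    """Longest common subsequence ratio (assumed: LCS length / longer string length)."""
    if not s or not t:
        return 0.0
    return lcs_length(s, t) / max(len(s), len(t))


# Hypothetical counts, only to show the call
example_dice = dice(cooc=2, freq_s=4, freq_t=4)
score_conference = lcsr("conference", "FN-konferensen")
score_nations = lcsr("Nations", "FN-konferensen")
```

Under this definition the two string pairs come out at roughly 0.57 and 0.29, which agrees with the string similarity scores used in the overlap example on slide 15.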
12How do we find clues? (2)
- Clues can be estimated from training data
- $C_i(s,t) = w_i \, P(f_t \mid f_s) \approx w_i \, \mathrm{freq}(f_t, f_s) / \mathrm{freq}(f_s)$
- $f_s$, $f_t$ are features of s and t, e.g.
- part-of-speech sequences of s, t
- phrase category (NP, VP etc), syntactic function
- word position
- context features
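The relative-frequency estimate above can be illustrated with a small counting routine. It is a sketch under assumptions: the function name `estimate_feature_clues` is hypothetical, freq(f_s) is counted over the training links only, a single weight is shared by all feature pairs, and the tag pairs in the toy data are invented for the demonstration.

```python
from collections import Counter


def estimate_feature_clues(training_links, weight=1.0):
    """Estimate weight * freq(f_t, f_s) / freq(f_s) from linked feature pairs.

    training_links: iterable of (f_s, f_t) pairs, one per linked item pair.
    """
    pair_freq = Counter()
    src_freq = Counter()
    for f_s, f_t in training_links:
        pair_freq[(f_s, f_t)] += 1
        src_freq[f_s] += 1
    return {
        (f_s, f_t): weight * count / src_freq[f_s]
        for (f_s, f_t), count in pair_freq.items()
    }


# Invented toy data: POS tags of linked English/Swedish items
toy_links = [("VBZ", "V@IPAS"), ("VBZ", "V@IPAS"), ("VBZ", "V@IIAS"), ("RB", "RG0S")]
toy_clues = estimate_feature_clues(toy_links)
```

Any other feature from the list (phrase category, word position, context) could be plugged in the same way, as long as it can be used as a dictionary key.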
13How do we use clues? (1)
- Clues are simply sets of association measures
- The crucial point: we have to combine them!
- If $C_i(s,t) = P(a_i)$, define the total clue as
- $C_{all}(s,t) = P(A) = P(a_1 \cup a_2 \cup \dots \cup a_n)$
- Clues are not mutually exclusive!
- $P(a_1 \cup a_2) = P(a_1) + P(a_2) - P(a_1 \cap a_2)$
- Assume independence!
- $P(a_1 \cap a_2) = P(a_1) \, P(a_2)$
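Under the independence assumption the disjunction has a closed form, which is worth writing out because it is the rule that produces the combined scores in the examples that follow. The derivation uses only the definitions on this slide; the product form for n clues is the standard generalisation and is not stated on the slide itself.

```latex
\begin{align*}
P(a_1 \cup a_2) &= P(a_1) + P(a_2) - P(a_1)\,P(a_2) \\
                &= 1 - \bigl(1 - P(a_1)\bigr)\bigl(1 - P(a_2)\bigr) \\[4pt]
C_{all}(s,t)    &= P(a_1 \cup \dots \cup a_n)
                 = 1 - \prod_{i=1}^{n} \bigl(1 - C_i(s,t)\bigr) \\[4pt]
\text{e.g.}\quad
1 - (1 - 0.4)(1 - 0.5)(1 - 0.29) &= 1 - 0.6 \cdot 0.5 \cdot 0.71 = 0.787
\end{align*}
```

The last line uses the three scores that apply to the pair Nations / FN-konferensen in the overlap example on slide 15.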
14How do we use clues? (2)
- Clues can refer to any set of tokens from source and target language segments.
- overlaps
- inclusions
- Def.: A clue shares its indication with all member tokens!
- allow clue combinations at the level of single tokens
15Clue overlaps - an example
- The United Nations conference has started today.
- Idag började FN-konferensen.
Clue 1 (co-occurrence)
United Nations -> FN-konferensen 0.4
Nations conference -> FN-konferensen 0.5
United -> FN-konferensen 0.3
Clue 2 (string similarity)
conference -> FN-konferensen 0.57
Nations -> FN-konferensen 0.29
Clue_all
United -> FN-konferensen 0.58
Nations -> FN-konferensen 0.787
conference -> FN-konferensen 0.785
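The numbers in this example can be reproduced with a short routine that lets every clue share its score with all member tokens and then combines the scores per token pair as a disjunction of independent events. This is a sketch, not the talk's implementation: the function name `build_clue_matrix` is hypothetical, and tokens are keyed by their word form, which only works because no word repeats in this sentence pair.

```python
from collections import defaultdict


def build_clue_matrix(clues):
    """Combine clues per token pair as 1 - prod(1 - score).

    clues: iterable of (source_tokens, target_tokens, score).
    Returns a dict mapping (source_token, target_token) to the combined score.
    """
    remaining = defaultdict(lambda: 1.0)  # running product of (1 - score)
    for src_tokens, trg_tokens, score in clues:
        for s in src_tokens:
            for t in trg_tokens:
                remaining[(s, t)] *= 1.0 - score
    return {pair: 1.0 - rest for pair, rest in remaining.items()}


# Clue scores as listed on this slide
slide_clues = [
    (("United", "Nations"), ("FN-konferensen",), 0.4),       # co-occurrence
    (("Nations", "conference"), ("FN-konferensen",), 0.5),   # co-occurrence
    (("United",), ("FN-konferensen",), 0.3),                 # co-occurrence
    (("conference",), ("FN-konferensen",), 0.57),            # string similarity
    (("Nations",), ("FN-konferensen",), 0.29),               # string similarity
]
combined = build_clue_matrix(slide_clues)
```

With these inputs the combined scores for United, Nations and conference against FN-konferensen are 0.58, 0.787 and 0.785, matching the Clue_all values above.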
16The Clue Matrix
Clue 1 (co-occurrence)
The United Nations -> FN-konferensen 0.5
United Nations -> FN-konferensen 0.4
has -> började 0.2
started -> började 0.6
started today -> idag 0.3
Nations conference -> började 0.4
Clue 2 (string similarity)
conference -> FN-konferensen 0.57
Nations -> FN-konferensen 0.29
today -> idag 0.4
Clue matrix (combined scores, empty cells have no clue)
            Idag   började  FN-konferensen
The                         0.5
United                      0.7
Nations            0.4      0.787
conference         0.4      0.57
has                0.2
started     0.3    0.72
today       0.58
17Clue Alignment (1)
- general principles
- combine all clues and fill the matrix
- highest score = best link
- allow overlapping links only
- if there is no better link for both tokens
- if tokens are next to each other
- links which overlap at one point form a link
cluster
18Clue Alignment (2)
- the alignment procedure
- 1. find the best link
- 2. remove the best link (set its value to 0)
- 3. check for overlaps
- accept: add to set of link clusters
- dismiss: otherwise
- 4. continue with 1 until no more links are found
- (or all values are below a certain threshold)
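The procedure can be sketched as a greedy search over the clue matrix. Several details are assumptions rather than statements from the slides: the function name `clue_align` is hypothetical, the overlap check is read as "accept if at least one of the two tokens is still unlinked and the new token is adjacent to the cluster it joins", and the assignment of the 0.4 and 0.3 scores to matrix cells follows one reading of the clue list on slide 16.

```python
def clue_align(matrix, threshold=0.0):
    """Greedy link-cluster search.

    matrix maps (source_position, target_position) to a combined clue score.
    Returns a list of (source_positions, target_positions) clusters.
    """
    # Visiting cells in descending score order is equivalent to repeatedly
    # taking the best remaining link and setting its value to 0.
    cells = sorted(
        ((score, i, j) for (i, j), score in matrix.items() if score > threshold),
        key=lambda cell: -cell[0],
    )
    clusters = []      # each cluster: (set of source positions, set of target positions)
    src_cluster = {}   # source position -> cluster index
    trg_cluster = {}   # target position -> cluster index
    for _score, i, j in cells:
        ci = src_cluster.get(i)
        cj = trg_cluster.get(j)
        if ci is None and cj is None:
            # neither token is linked yet: start a new cluster
            clusters.append(({i}, {j}))
            src_cluster[i] = trg_cluster[j] = len(clusters) - 1
        elif ci is None:
            # target already linked: accept only an adjacent source token
            src_side, _ = clusters[cj]
            if any(abs(i - k) == 1 for k in src_side):
                src_side.add(i)
                src_cluster[i] = cj
        elif cj is None:
            # source already linked: accept only an adjacent target token
            _, trg_side = clusters[ci]
            if any(abs(j - k) == 1 for k in trg_side):
                trg_side.add(j)
                trg_cluster[j] = ci
        # otherwise both tokens have better links already: dismiss
    return [(sorted(s), sorted(t)) for s, t in clusters]


src = ["The", "United", "Nations", "conference", "has", "started", "today"]
trg = ["Idag", "började", "FN-konferensen"]
example_matrix = {
    (0, 2): 0.5, (1, 2): 0.7, (2, 2): 0.787, (3, 2): 0.57,
    (2, 1): 0.4, (3, 1): 0.4, (4, 1): 0.2, (5, 1): 0.72,
    (5, 0): 0.3, (6, 0): 0.58,
}
readable = [
    (" ".join(src[i] for i in s), " ".join(trg[j] for j in t))
    for s, t in clue_align(example_matrix)
]
```

On this matrix the sketch ends with three clusters: "The United Nations conference" with "FN-konferensen", "has started" with "började", and "today" with "Idag". Passing a threshold such as 0.4 would drop the weak link between "has" and "började".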
19Clue Alignment (3)
Clue matrix as on slide 16; the value of each link is set to 0 once it has been processed.
1. Best link: Nations -> FN-konferensen 0.787
   Link clusters: Nations -> FN-konferensen
2. Best link: started -> började 0.72
   Link clusters: Nations -> FN-konferensen; started -> började
3. Best link: United -> FN-konferensen 0.7
   Link clusters: United Nations -> FN-konferensen; started -> började
4. Best link: today -> idag 0.58
   Link clusters: United Nations -> FN-konferensen; started -> började; today -> idag
5. Best link: conference -> FN-konferensen 0.57
   Link clusters: United Nations conference -> FN-konferensen; started -> började; today -> idag
6. Best link: The -> FN-konferensen 0.5
   Link clusters: The United Nations conference -> FN-konferensen; started -> började; today -> idag
7. Best link: has -> började 0.2
   Link clusters: The United Nations conference -> FN-konferensen; has started -> började; today -> idag
20Bootstrapping
- again: clues can be estimated from training data
- self-training: use available links as training data
- goal: learn new clues for the next step
- risk: increased noise (lower precision)
21Learning Clues
- POS clue
- assumption: word pairs with certain POS tags are more likely to be translations of each other than other word pairs
- features: POS-tag sequences
- position clue
- assumption: translations are relatively close to each other (esp. in related languages)
- features: relative word positions
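A position clue of this kind could be learned by counting how far apart the linked words are. The sketch below is an assumption about what "relative word positions" means (target position minus source position); the function name `position_clue` is hypothetical and the toy links are invented.

```python
from collections import Counter


def position_clue(links):
    """Relative frequency of each position offset among known links.

    links: iterable of (source_position, target_position) pairs.
    """
    offsets = Counter(t - s for s, t in links)
    total = sum(offsets.values())
    return {offset: count / total for offset, count in offsets.items()}


# Invented toy links: three with offset 0, one with -1, one with +1
toy_position_links = [(0, 0), (1, 1), (2, 1), (3, 4), (4, 4)]
toy_position_clue = position_clue(toy_position_links)
```

In a bootstrapping setup the input would be the links found in the previous alignment step, so any noise in those links carries over into the learned scores.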
22So much for the theory! Results?!
- The setup: corpus and basic tools
- Saul Bellow's "To Jerusalem and Back: A Personal Account", English/Swedish, about 170,000 words
- English POS-tagger (Grok), trained on Brown, PTB
- English shallow parser (Grok), trained on PTB
- English stemmer, suffix truncation
- Swedish POS-tagger (TnT), trained on SUC
- Swedish CFG parser (Megyesi), rule-based
- Swedish lemmatiser, database taken from SUC
23Results!?! not yet
- basic clues
- Dice coefficient (≥ 0.3)
- LCSR (0.4), ≥ 3 characters/string
- learned clues
- POS clue
- position clue
- clue alignment threshold 0.4
- uniform normalisation (0.5)
24Results!!! Come on!
- Preliminary results (work in progress)
- Evaluation: 500 random samples have been linked manually (gold standard)
- Metrics: precision_PWA and recall_PWA (Ahrenberg et al., 2000)
25Give me more numbers!
- The impact of parsing.
- How much do we gain?
- Alignment results with n-grams, (shallow)
parsing, and both
26One more thing.
- Stemming, lemmatisation and all that
- Do we need morphological analyses for Swedish and
English?
27Conclusions
- Combining clues helps to find links
- Linguistic knowledge helps
- POS tags are valuable clues
- word position gives hints for related languages
- parsing helps with the segmentation problem
- lemmatisation gives higher recall
- We need more experiments, tests with other language pairs, more/other clues
- recall and precision are still low
29POS clues - examples
score               source        target
------------------------------------------------------------
0.915479582146249   VBZ           V@IPAS
0.91304347826087    WRB           RH0S
0.761904761904762   VBP           V@IPAS
0.701943844492441   RB            RG0S
0.674033149171271   VBD           V@IIAS
0.666666666666667   DT NNP NN     NCUSN@DS
0.647058823529412   PRP VBZ       PF@USS@S V@IPAS
0.625               NNS NNP       NP00N@0S
0.611859838274933   VB            V@N0AS
0.6                 RBR           RGCS
0.5                 DT JJ JJ NN   DF@US@S AQP0SNDS NCUSN@DS
30Position clues - examples
score                mapping
------------------------------------
0.245022348638765    x -> 0
0.12541095637398     x -> -1
0.0896900742491966   x -> 1
0.0767611096745595   x -> -2
0.0560378264563555   x -> -3
0.0514572790070555   x -> 2
0.0395256916996047   x -> 6 7 8
31Open Questions
- Normalisation!
- How do we estimate the weights $w_i$?
- Non-contiguous phrases
- Why not allow long distance clusters?
- Independence assumption
- What is the impact of dependencies?
- Alignment clues
- What is a bad clue, what is a good one?
- Contextual clues
32Clue alignment - example
          min  fru  undrar  road  varför  jag  beställde  en  koscherlunch   .
amused      0    0       0     0       0    0          0   0             0   0
,           0    0       0     0       0    0          0   0             0  48
my         81   63       0     0       0    0          0   0             0   0
wife       58   80       0     0       0    0          0   0             0   0
asks        0    0      42     0       0    0          0   0             0   0
why         0    0       0     0      74    0          0   0             0   0
i           0    0       0     0       0    0          0   0             0   0
ordered     0    0       0     0       0    0         36   0             0   0
the         0    0       0     0       0    0          0  70            70   0
kosher      0   34       0     0       0    0          0  53            86   0
lunch       0   34       0     0       0    0          0  41            81   0
.           0    0       0     0       0    0          0   0             0  76
33Alignment - examples
the Middle East -> Mellersta Östern
afford -> kosta på
at least -> åtminstone
an American satellite -> en satellit
common sense -> sunda förnuftet
Jerusalem area -> Jerusalemområdet
kosher lunch -> koscherlunch
leftist anti-Semitism -> vänsterantisemitism
left-wing intellectuals -> vänsterintellektuella
literary history -> litteraturhistoriska
manuscript collection -> handskriftsamling
Marine orchestra -> marinkårsorkester
marionette theater -> marionetteatern
mathematical colleagues -> matematikkolleger
mental character -> mentalitet
far too -> alldeles
34Alignment - examples
a banquet -> en bankett
a battlefield -> ett slagfält
a day -> dagen
the Arab states -> arabstaterna
the Arab world -> arabvärlden
the baggage carousel -> bagagekarusellen
the Communist dictatorships -> kommunistdiktaturerna
The Fatah terrorists -> Al Fatah-terroristerna
the defense minister -> försvarsministern
the defense minister -> försvarsminister
the daughter -> dotter
the first President -> förste president
35Alignment - examples
American imperial interests -> amerikanska imperialistintressenas
Chicago schools -> Chicagos skolor
decidedly anti-Semitic -> avgjort antisemitiska
his identity -> sin identitet
his interest -> sitt intresse
his interviewer -> hans intervjuare
militant Islam -> militanta muhammedanismen
no longer -> inte längre
sophisticated arms -> avancerade vapen
still clearly -> uppenbarligen ännu
dozen Russian -> dussin ryska
exceedingly intelligent -> utomordentligt intelligent
few drinks -> några drinkar
goyish democracy -> gojernas demokrati
industrialized countries -> industrialiserade länderna
has become -> har blivit
36Gold standard - MWUs
link: Secretary of State -> Utrikesminister
link type: regular
unit type: multi -> single
source text: Secretary of State Henry Kissinger has won the Middle Eastern struggle by drawing Egypt into the American camp.
target text: Utrikesminister Henry Kissinger har vunnit slaget om Mellanöstern genom att dra in Egypten i det amerikanska lägret.
37Gold standard - fuzzy links
link: unrelated -> inte tillhör hans släkt
link type: fuzzy
unit type: single -> multi
source text: And though he is not permitted to sit beside women unrelated to him or to look at them or to communicate with them in any manner (all of which probably saves him a great deal of trouble), he seems a good-hearted young man and he is visibly enjoying himself.
target text: Och fastän han inte får sitta bredvid kvinnor som inte tillhör hans släkt eller se på dem eller meddela sig med dem på något sätt (alltsammans saker som utan tvivel besparar honom en mängd bekymmer) verkar han vara en godhjärtad ung man, och han ser ut att trivas gott.
38Gold standard - null links
link: do ->
link type: null
unit type: single -> null
source text: "How is it that you do not know English?"
target text: "Hur kommer det sig att ni inte talar engelska?"
39Gold standard - morphology
link: the masses -> massorna
link type: regular
unit type: multi -> single
source text: Arafat was unable to complete the classic guerrilla pattern and bring the masses into the struggle.
target text: Arafat har inte kunnat fullborda det klassiska gerillamönstret och föra in massorna i kampen.
40Evaluation metrics
- Csrc: number of overlapping source tokens in (partially) correct link proposals; Csrc = 0 for incorrect link proposals
- Ctrg: number of overlapping target tokens in (partially) correct link proposals; Ctrg = 0 for incorrect link proposals
- Ssrc: number of source tokens proposed by the system
- Strg: number of target tokens proposed by the system
- Gsrc: number of source tokens in the gold standard
- Gtrg: number of target tokens in the gold standard
41Evaluation metrics - example
42Corpus markup (Swedish)
<s lang="sv" id="9">
  <c id="c-1" type="NP">
    <w span="0:3" pos="PF@NS0@S" id="w9-1" stem="det">Det</w>
  </c>
  <c id="c-2" type="VC">
    <w span="4:2" pos="V@IPAS" id="w9-2" stem="vara">är</w>
  </c>
  <c id="c-3">
    <w span="7:3" pos="CCS" id="w9-3" stem="som">som</w>
  </c>
  <c id="c-4" type="NPMAX">
    <c id="c-5" type="NP">
      <w span="11:3" pos="DI@NS@S" id="w9-4" stem="en">ett</w>
      <w span="15:5" pos="NCNSN@IS" id="w9-5">besök</w>
    </c>
    <c id="c-6" type="PP">
      <c id="c-7">
        <w span="21:1" pos="SPS" id="w9-6" stem="1">i</w>
      </c>
      <c id="c-8" type="NP">
        <w span="23:9" pos="NCUSN@DS" id="w9-7" stem="barndom">barndomen</w>
      </c>
    </c>
  </c>
</s>
43Corpus markup (English)
<s lang="en" id="9">
  <chunk type="NP" id="c-1">
    <w span="0:2" pos="PRP" id="w9-1">It</w>
  </chunk>
  <chunk type="VP" id="c-2">
    <w span="3:2" pos="VBZ" id="w9-2" stem="be">is</w>
  </chunk>
  <chunk type="NP" id="c-3">
    <w span="6:2" pos="PRP" id="w9-3">my</w>
    <w span="9:9" pos="NN" id="w9-4">childhood</w>
  </chunk>
  <chunk type="VP" id="c-4">
    <w span="19:9" pos="VBD" id="w9-5">revisited</w>
  </chunk>
  <chunk id="c-5">
    <w span="28:1" pos="." id="w9-6">.</w>
  </chunk>
</s>
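Markup of this kind can be read with a standard XML parser to recover the tokens, tags and chunk boundaries that the clues rely on. The sketch below is an illustration only: the function name `read_chunks` is hypothetical, and the sample string is a shortened reconstruction of the English sentence on this slide, assuming that the stripped characters were `<`, `>`, `=` and `:`.

```python
import xml.etree.ElementTree as ET

# Shortened, reconstructed sample (assumed attribute syntax)
SENTENCE = """<s lang="en" id="9">
  <chunk type="NP" id="c-1"><w span="0:2" pos="PRP" id="w9-1">It</w></chunk>
  <chunk type="VP" id="c-2"><w span="3:2" pos="VBZ" id="w9-2" stem="be">is</w></chunk>
  <chunk type="NP" id="c-3"><w span="6:2" pos="PRP" id="w9-3">my</w><w span="9:9" pos="NN" id="w9-4">childhood</w></chunk>
</s>"""


def read_chunks(xml_text):
    """Return a list of (chunk_type, [(word, pos, stem), ...]) for one sentence."""
    root = ET.fromstring(xml_text)
    chunks = []
    for chunk in root.iter("chunk"):
        words = [(w.text, w.get("pos"), w.get("stem")) for w in chunk.iter("w")]
        chunks.append((chunk.get("type"), words))
    return chunks


parsed = read_chunks(SENTENCE)
```

A missing `stem` attribute comes back as `None`, so a caller could fall back to the surface form in that case. The Swedish markup uses `c` instead of `chunk` and nests its elements, so it would need a recursive variant.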
44 is that all?
- How good are the new clues?
- Alignment results with learned clues only
- (neither LCSR nor Dice)