Improved Sentence Alignment for Building a Parallel Subtitle Corpus

Jörg Tiedemann, CLIN 17, Leuven.

1
Improved Sentence Alignment for Building a
Parallel Subtitle Corpus
  • Jörg Tiedemann
  • j.tiedemann_at_rug.nl
  • Alfa Informatica
  • Rijksuniversiteit Groningen

2
Motivation
  • Why Movie Subtitles?
  • available on-line in many languages
  • various genres, different time periods
  • growing resource
  • (mainly) transcribed speech
  • many idiomatic expressions, slang, etc

3
Example: Wayne's World
English:
  00:00:26,500 --> 00:00:28,434  Spend all day with us.
  00:00:28,502 --> 00:00:30,436  There are two-- pardon me--
  00:00:30,504 --> 00:00:34,440  two of everything in every Noah's arcade.
  00:00:34,508 --> 00:00:36,361  That means two of Zantar,
  00:00:36,361 --> 00:00:36,884  That means two of Zantar,
  00:00:36,962 --> 00:00:40,454  Bay Wolf, Ninja Commando, Snake-azon,
  00:00:40,532 --> 00:00:41,464  Psycho Chopper...
  00:00:41,533 --> 00:00:43,467  It's really good seeing you, Benjamin.
Dutch:
  00:00:32,298 --> 00:00:35,267  De wereld van Wayne
  00:00:35,869 --> 00:00:38,963  Er zijn twee, excuseer me, twee van Zantar.
  00:00:39,205 --> 00:00:41,173  ...gestoorde helicopters...
  00:00:41,541 --> 00:00:45,272  Het is goed om je weer te zien, Benjamin.
4
Source: Opensubtitles.org
  • multilingual subtitle collection (user uploads)
  • no illegal downloads (they claim)
  • no registration required
  • extra information (release year, genre, rating)
  • I got their database of ca. 308,000 files
  • 232,643 subtitles for 18,900 movies in 59
    languages

5
Pre-processing

XML (markup only partially preserved):
  value="00:00:26,500" / Spend all day with us . value="00:00:28,434" /
  id="2.1" There are id="2.3" two -- id="2.5" pardon me id="2.7" -- value="00:00:30,436" / value="00:00:30,504" / two of everything in id="2.12" every Noah' s arcade . value="00:00:34,440" /
SRT:
  1  00:00:26,500 --> 00:00:28,434  Spend all day with us.
  2  00:00:28,502 --> 00:00:30,436  There are two-- pardon me--
  3  00:00:30,504 --> 00:00:34,440  two of everything in every Noah's arcade.
  4  00:00:34,508 --> 00:00:36,361  That means two of Zantar,
  5  00:00:36,361 --> 00:00:36,884  That means two of Zantar,
6
... further steps
  • filter with textcat
  • sort by language, release year, and main genre
  • align only one version per movie and language
  • 22,794 pairs of aligned subtitles for 2,780
    movies
  • 361 language pairs, combinations of 29 languages
  • about 23 million sentence alignments
  • between 10,000 and 9 million words per language pair

7
Length-based sentence alignment
  • Standard approach: Gale & Church
  • assume high correlation between sentence lengths
    in source and target language
  • Time slot length approach
  • same as Gale & Church but with time lengths instead
    of character lengths
  • Both approaches fail badly!
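
A minimal sketch of the length-based match cost behind Gale & Church-style alignment, assuming the character-length parameters c = 1 and s² = 6.8 from their 1993 paper. The function name is hypothetical, and the dynamic-programming search and alignment-type priors are left out. The time slot variant would pass durations instead of character counts, presumably with re-estimated parameters.

```python
import math

# Assumed defaults: values Gale & Church (1993) report for character lengths.
C = 1.0    # expected number of target characters per source character
S2 = 6.8   # variance of the target length per source character


def length_match_cost(len_src: float, len_tgt: float) -> float:
    """Return -log P(length difference), lower means a better match.

    P is the two-sided tail probability of the normalized length
    difference under a standard normal distribution.
    """
    if len_src == 0 and len_tgt == 0:
        return 0.0
    if len_src == 0:
        # The formula normalizes by the source length; treat an empty
        # source segment as the worst possible 1:1 match in this sketch.
        return float("inf")
    delta = (len_tgt - len_src * C) / math.sqrt(len_src * S2)
    # 2 * (1 - Phi(|delta|)) equals erfc(|delta| / sqrt(2)).
    prob = math.erfc(abs(delta) / math.sqrt(2.0))
    return -math.log(max(prob, 1e-300))


if __name__ == "__main__":
    # Similar lengths cost little, very different lengths cost a lot.
    print(length_match_cost(20, 21), length_match_cost(20, 40))
```

The cost grows with the normalized length difference, which is why free translations, insertions, and deletions in subtitles push the dynamic program into follow-up errors.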

8
Where are the problems?
English:
  00:00:26,500 --> 00:00:28,434  Spend all day with us.
  00:00:28,502 --> 00:00:30,436  There are two-- pardon me--
  00:00:30,504 --> 00:00:34,440  two of everything in every Noah's arcade.
  00:00:34,508 --> 00:00:36,361  That means two of Zantar,
  00:00:36,361 --> 00:00:36,884  That means two of Zantar,
  00:00:36,962 --> 00:00:40,454  Bay Wolf, Ninja Commando, Snake-azon,
  00:00:40,532 --> 00:00:41,464  Psycho Chopper...
  00:00:41,533 --> 00:00:43,467  It's really good seeing you, Benjamin.
Dutch:
  00:00:32,298 --> 00:00:35,267  De wereld van Wayne
  00:00:35,869 --> 00:00:38,963  Er zijn twee, excuseer me, twee van Zantar.
  00:00:39,205 --> 00:00:41,173  ...gestoorde helicopters...
  00:00:41,541 --> 00:00:45,272  Het is goed om je weer te zien, Benjamin.
9
Limitations of the length-based alignment
approaches
  • typical problem: follow-up errors
  • subtitle-specific problems
  • many insertions/deletions (at all positions)
  • very free translations, lots of paraphrasing
  • merging, splitting, leaving out information
  • adding information (titles, background sounds)

10
Sentence alignment using time-slot overlaps
  • We align only subtitles for identical movie files!
  • → corresponding text fragments should be shown at
    approximately the same time
  • → iteratively align text fragments with the largest
    time overlap
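
The overlap idea can be sketched as follows. This is a simplified, one-pass illustration (each source slot is linked to the target slot it overlaps most), not the full iterative procedure from the talk; the function names are hypothetical, and the time slots are those of the Wayne's World example, in seconds.

```python
from typing import List, Optional, Tuple

Slot = Tuple[float, float]  # (start, end) in seconds


def overlap(a: Slot, b: Slot) -> float:
    """Length in seconds of the intersection of two time slots (0 if disjoint)."""
    return max(0.0, min(a[1], b[1]) - max(a[0], b[0]))


def best_overlap_links(src: List[Slot], tgt: List[Slot]) -> List[Optional[int]]:
    """For each source slot, return the index of the target slot with the
    largest time overlap, or None if it overlaps no target slot."""
    links: List[Optional[int]] = []
    for s in src:
        scores = [overlap(s, t) for t in tgt]
        best = max(range(len(tgt)), key=scores.__getitem__, default=None)
        links.append(best if best is not None and scores[best] > 0 else None)
    return links


english = [(26.500, 28.434), (28.502, 30.436), (30.504, 34.440),
           (34.508, 36.361), (36.361, 36.884), (36.962, 40.454),
           (40.532, 41.464), (41.533, 43.467)]
dutch = [(32.298, 35.267), (35.869, 38.963), (39.205, 41.173),
         (41.541, 45.272)]

links = best_overlap_links(english, dutch)
print(links)  # [None, None, 0, 0, 1, 1, 2, 3]
```

Source slots that share a target index form many-to-one alignments, and slots linked to None are left unaligned because nothing is displayed at the same time in the other language.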
11
Time overlaps
English:
  00:00:26,500 --> 00:00:28,434  Spend all day with us.
  00:00:28,502 --> 00:00:30,436  There are two-- pardon me--
  00:00:30,504 --> 00:00:34,440  two of everything in every Noah's arcade.
  00:00:34,508 --> 00:00:36,361  That means two of Zantar,
  00:00:36,361 --> 00:00:36,884  That means two of Zantar,
  00:00:36,962 --> 00:00:40,454  Bay Wolf, Ninja Commando, Snake-azon,
  00:00:40,532 --> 00:00:41,464  Psycho Chopper...
  00:00:41,533 --> 00:00:43,467  It's really good seeing you, Benjamin.
Dutch:
  00:00:32,298 --> 00:00:35,267  De wereld van Wayne
  00:00:35,869 --> 00:00:38,963  Er zijn twee, excuseer me, twee van Zantar.
  00:00:39,205 --> 00:00:41,173  ...gestoorde helicopters...
  00:00:41,541 --> 00:00:45,272  Het is goed om je weer te zien, Benjamin.
Slide annotations: "no overlap"; "2:1 alignment, best overlap"
12
English: Spend all day with us . There are two -- pardon me -- two of everything in every Noah' s arcade . That means two of Zantar , That means two of Zantar , Bay Wolf , Ninja Commando , Snake- azon , Psycho Chopper ... It' s really good seeing you , Benjamin . You haven' t been into Shakey' s for so long . Well , I' ve been real busy . It' s two for you ' cause one won' t do . All this week , kids under 6 get every fifth -- There' s a new pet . Ch- Ch- Chia Chia Pet -- the pottery that grows . They are very fast . Simple . Plug it in , and insert the plug from just about anything . Simple . Even for our customers in Waukegan , Elgin , and Aurora -- We' ll be there right on time .

Dutch: De wereld van Wayne Er zijn twee , excuseer me , twee van Zantar . ... gestoorde helicopters ... Het is goed om je weer te zien , Benjamin . Je bent al heel lang niet meer in Shakey' s geweest . Ik heb het heel erg druk . Het zijn er twee voor jou , want eentje zal het niet doen . De hele week , krijgen kinderen onder de zes elke vijfde ... Er is een nieuw huisdier Het Chia huisdier . Het aardewerk dat groeit . Zij zijn erg snel . Simpel . Plug het in .

sentence length alignment
13
English: Spend all day with us . There are two -- pardon me -- two of everything in every Noah' s arcade . That means two of Zantar , That means two of Zantar , Bay Wolf , Ninja Commando , Snake- azon , Psycho Chopper ... It' s really good seeing you , Benjamin . You haven' t been into Shakey' s for so long . Well , I' ve been real busy . It' s two for you ' cause one won' t do . All this week , kids under 6 get every fifth -- There' s a new pet . Ch- Ch- Chia Chia Pet -- the pottery that grows . They are very fast . Simple . Plug it in , and insert the plug from just about anything .

Dutch: De wereld van Wayne Er zijn twee , excuseer me , twee van Zantar . ... gestoorde helicopters ... Het is goed om je weer te zien , Benjamin . Je bent al heel lang niet meer in Shakey' s geweest . Ik heb het heel erg druk . Het zijn er twee voor jou , want eentje zal het niet doen . De hele week , krijgen kinderen onder de zes elke vijfde ... Er is een nieuw huisdier Het Chia huisdier . Het aardewerk dat groeit . Zij zijn erg snel . Simpel . Plug het in .

time length alignment
14
English: Spend all day with us . There are two -- pardon me -- two of everything in every Noah' s arcade . That means two of Zantar , That means two of Zantar , Bay Wolf , Ninja Commando , Snake- azon , Psycho Chopper ... It' s really good seeing you , Benjamin . You haven' t been into Shakey' s for so long . Well , I' ve been real busy . It' s two for you ' cause one won' t do . All this week , kids under 6 get every fifth -- There' s a new pet . Ch- Ch- Chia Chia Pet -- the pottery that grows . They are very fast . Simple . Plug it in , and insert the plug from just about anything .

Dutch: De wereld van Wayne Er zijn twee , excuseer me , twee van Zantar . ... gestoorde helicopters ... Het is goed om je weer te zien , Benjamin . Je bent al heel lang niet meer in Shakey' s geweest . Ik heb het heel erg druk . Het zijn er twee voor jou , want eentje zal het niet doen . De hele week , krijgen kinderen onder de zes elke vijfde ... Er is een nieuw huisdier Het Chia huisdier . Het aardewerk dat groeit . Zij zijn erg snel . Simpel . Plug het in . Het is simpel !

time overlap alignment
15
Evaluation
  • manual evaluation on a small sample
  • 2 language pairs (Dutch/English, Dutch/German)
  • 5 randomly selected movies per language pair
  • 10 initial sentence alignments
  • 10 sentence alignments in the middle
  • 10 final sentence alignments

16
Evaluation
approach         pair      correct  partially correct  wrong
sentence length  dut-eng   64.2      9.2               26.6
sentence length  dut-ger   62.3     12.3               25.3
time length      dut-eng   54.6      6.9               38.6
time length      dut-ger   57.5      9.8               32.7
time overlap     dut-eng   73.1      8.7               18.2
time overlap     dut-ger   85.7      6.8                7.5
17
Conclusions
  • extensive multilingual subtitle corpus
  • encoded in XML (UTF-8)
  • automatically sentence aligned
  • character/time length correspondence doesn't work
    well for sentence alignment
  • time overlap works well for sentence alignment

18
Discussion
  • time slot overlap works best
  • but still several problems
  • tokenization/sentence splitting incompatibilities
  • subtitles often start at different time offsets
  • differences in subtitle speed
  • the last 2 issues require time normalization: add
    parameters for time offset and time scaling factor
  • both can be computed using 2 fixed points
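
A minimal sketch of how a time offset and a time scaling factor could be derived from two fixed points, assuming a linear relation between the two subtitle timelines. The function names and the example anchor values are hypothetical, not taken from the talk.

```python
from typing import Tuple

Anchor = Tuple[float, float]  # (time in subtitle A, time in subtitle B), seconds


def fit_time_mapping(first: Anchor, second: Anchor) -> Tuple[float, float]:
    """Return (scale, offset) such that time_b = scale * time_a + offset
    passes through both anchor points."""
    (a1, b1), (a2, b2) = first, second
    if a1 == a2:
        raise ValueError("anchor points must have different source times")
    scale = (b2 - b1) / (a2 - a1)
    offset = b1 - scale * a1
    return scale, offset


def normalize(time_a: float, scale: float, offset: float) -> float:
    """Map a time stamp from subtitle A onto the timeline of subtitle B."""
    return scale * time_a + offset


# Hypothetical anchors: the same utterances located in both subtitle files.
scale, offset = fit_time_mapping((10.0, 12.0), (110.0, 117.0))
print(scale, offset, normalize(60.0, scale, offset))
```

Anchors far apart (one near the start, one near the end of the movie) make the estimated scaling factor less sensitive to small timing differences at either point.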

19
Evaluation: alignment position
approach         position  correct  partially correct  wrong
sentence length  initial   65.1      8.7               26.2
sentence length  middle    63.6      9.1               27.3
sentence length  final     61.7     13.5               24.8
time length      initial   63.0      8.9               28.1
time length      middle    61.4      6.8               31.9
time length      final     42.0      8.0               50.0
time overlap     initial   84.5     10.1                5.4
time overlap     middle    76.8      4.9               18.3
time overlap     final     70.8      9.5               19.7
20
Evaluation: selected movies
Dutch - English
  Animation    2003  Finding Nemo
  Comedy       2004  Win a Date With Tad Hamilton
  Crime        2001  Training Day
  Documentary  2005  Grizzly Man
  Sci-Fi       2004  Cube Zero
Dutch - German
  Action       1989  Batman
  Action       2001  Rush Hour 2
  Comedy       1986  Peggy Sue Got Married
  Crime        2002  Cidade de Deus
  Drama        2002  The Ring
21
Evaluation by genre (time overlap)
genre        correct  partially correct  wrong
Action       93.1      3.5                3.5
Animation    36.8     18.4               44.7
Comedy       94.6      3.3                2.2
Crime        97.9      2.2                0.0
Documentary  84.1     12.7                3.2
Drama        33.3     26.7               40.0
Sci-Fi       22.5     14.3               63.3
... need a bigger sample!
22
Evaluation by year (time overlap)
year  correct  partially correct  wrong
1986  82.8     10.3                6.9
1989  92.6      0.0                7.4
2001  95.7      4.3                0.0
2002  78.3      8.7               13.0
2003  36.8     18.4               44.7
2004  66.1      6.3               27.7
2005  84.1     12.7                3.2
23
Evaluation by movie (time overlap)
pair     movie                         correct  partially correct  wrong
dut-eng  cube_zero                      22.4    14.3               63.3
dut-eng  finding_nemo                   36.8    18.4               44.7
dut-eng  grizzly_man                    84.1    12.7                3.2
dut-eng  training_day                   96.8     3.2                0.0
dut-eng  win_a_date_with_tad_hamilton  100.0     0.0                0.0
dut-ger  batman                         92.6     0.0                7.4
dut-ger  cidade_de_deus                100.0     0.0                0.0
dut-ger  peggy_sue_got_married          82.8    10.3                6.9
dut-ger  rush_hour_2                    93.5     6.5                0.0
dut-ger  the_ring                       33.3    26.7               40.0
24
20 largest bitexts
language  nr sentences        nr words
pair      source    target    source      target
eng-spa   592,355   524,412   4,696,792   4,071,345
por-spa   443,521   414,725   3,124,539   3,170,790
cze-eng   403,605   421,135   2,581,318   3,260,751
eng-por   397,085   370,866   3,071,277   2,611,508
eng-slv   394,941   376,971   3,036,584   2,343,233
eng-swe   386,269   339,953   2,971,600   2,441,469
dut-eng   378,475   425,600   2,804,742   3,338,842
dut-spa   367,421   359,944   2,729,557   2,739,981
cze-por   365,676   366,861   2,311,908   2,532,080
cze-spa   361,038   335,278   2,278,212   2,532,657
cze-rum   347,454   345,553   2,220,880   2,491,271
por-rum   340,227   335,356   2,352,743   2,412,681
cze-slv   328,751   335,555   2,093,731   2,123,347
eng-pob   323,621   308,458   2,525,747   2,183,897
pob-spa   320,934   293,701   2,280,703   2,340,992
por-slv   320,691   323,199   2,229,287   2,015,813
eng-rum   313,346   300,392   2,459,545   2,138,001
dut-por   310,259   320,390   2,256,436   2,283,083
slv-spa   310,201   279,681   1,957,404   2,146,695
rum-slv   308,970   311,632   2,229,393   1,954,050
25
Language checking
  • textcat filter using 46 language models
  • 1920 unknown (not recognized)
  • 1768 maybe (several matches; first language matches)
  • 1554 rejected
  • languages in the corpus (after textcat)
  • chi bul dan cze dut ell eng est fin ger fre heb
    ice hrv hun ita jpn lav lit nor pob pol por rum
    rus slv spa swe tur

26
Time slot lengths (in seconds, shown in parentheses)
English:
  00:00:26,500 --> 00:00:28,434  (1.9)  Spend all day with us.
  00:00:28,502 --> 00:00:30,436  (1.9)  There are two-- pardon me--
  00:00:30,504 --> 00:00:34,440  (3.9)  two of everything in every Noah's arcade.
  00:00:34,508 --> 00:00:36,361  (1.8)  That means two of Zantar,
  00:00:36,361 --> 00:00:36,884  (0.5)  That means two of Zantar,
  00:00:36,962 --> 00:00:40,454  (3.5)  Bay Wolf, Ninja Commando, Snake-azon,
  00:00:40,532 --> 00:00:41,464  (0.9)  Psycho Chopper...
  00:00:41,533 --> 00:00:43,467  (1.9)  It's really good seeing you, Benjamin.
Dutch:
  00:00:32,298 --> 00:00:35,267  (3)    De wereld van Wayne
  00:00:35,869 --> 00:00:38,963  (3.1)  Er zijn twee, excuseer me, twee van Zantar.
  00:00:39,205 --> 00:00:41,173  (2)    ...gestoorde helicopters...
  00:00:41,541 --> 00:00:45,272  (3.7)  Het is goed om je weer te zien, Benjamin.
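
The slot lengths above follow directly from the time stamps. A minimal sketch, assuming SRT-style `HH:MM:SS,mmm` time stamps; the function names are hypothetical.

```python
def parse_timestamp(stamp: str) -> float:
    """Convert an SRT-style time stamp 'HH:MM:SS,mmm' to seconds."""
    clock, millis = stamp.split(",")
    hours, minutes, seconds = (int(part) for part in clock.split(":"))
    return hours * 3600 + minutes * 60 + seconds + int(millis) / 1000.0


def slot_length(start: str, end: str) -> float:
    """Duration in seconds of the time slot between two time stamps."""
    return parse_timestamp(end) - parse_timestamp(start)


# First English slot of the Wayne's World example.
print(round(slot_length("00:00:26,500", "00:00:28,434"), 1))  # 1.9
```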
27
Example: Delicatessen (1991)
  1  00:04:47,162 --> 00:04:49,152  2 pounds. And fat ones.
  2  00:04:50,022 --> 00:04:51,752  That's shoulder, right?
  3  00:04:52,262 --> 00:04:55,472  The Kube Brother's usual. Tell me about it!
  4  00:04:56,792 --> 00:04:58,012  How much?
  5  00:04:58,802 --> 00:04:59,812  2 measures.

  1  00:04:42,785 --> 00:04:47,984  950 gram. Schoon aan de haak. - Het is toch wel een schouderstuk, hé ?
  2  00:04:48,185 --> 00:04:52,542  Zoals altijd voor de gebroeders Kube. U zult tevreden zijn.
  3  00:04:52,745 --> 00:04:56,021  Hoeveel krijgt u van ons ? - Twee eenheden.
28
Pre-processing

XML (markup only partially preserved):
  value="00:04:47,162" / 2 pounds . And id="2.2" fat ones id="2.4" . value="00:04:49,152" /
  id="3.1" That' s id="3.3" shoulder , right ?
  ...
SRT:
  1  00:04:47,162 --> 00:04:49,152  2 pounds. And fat ones.
  2  00:04:50,022 --> 00:04:51,752  That's shoulder, right?
  3  00:04:52,262 --> 00:04:55,472  The Kube Brother's usual. Tell me about it!
  4  00:04:56,792 --> 00:04:58,012  How much?
  5  00:04:58,802 --> 00:04:59,812  2 measures.
29
Pre-processing
  • 2 subtitle formats used: SRT and SUB
  • language-specific character encoding → UTF-8
  • automatic tokenization/sentence splitting (no
    manual correction)
  • language checking
  • filter out incorrect uploads using textcat for
    language classification (46 language models for
    UTF-8 text)
  • sort by language, release year, and main genre
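
The encoding step can be sketched as follows. The mapping from language code to legacy encoding is purely an assumption for illustration (the talk does not list the encodings used), and the function name is hypothetical.

```python
# Hypothetical mapping from language code to an assumed legacy encoding.
ASSUMED_ENCODINGS = {
    "dut": "cp1252",
    "ger": "cp1252",
    "rus": "cp1251",
    "ell": "cp1253",
}


def to_utf8(raw: bytes, lang: str) -> bytes:
    """Decode subtitle bytes with the encoding assumed for the language
    and re-encode them as UTF-8. Undecodable bytes are replaced rather
    than raising, since uploads are not manually corrected."""
    encoding = ASSUMED_ENCODINGS.get(lang, "utf-8")
    return raw.decode(encoding, errors="replace").encode("utf-8")


if __name__ == "__main__":
    legacy = "h\u00e9 ?".encode("cp1252")
    print(to_utf8(legacy, "dut").decode("utf-8"))
```

Running the language check after this conversion matches the order on the slide, where the textcat models are applied to UTF-8 text.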

30
Wayne's World
English:
  00:00:26,500  Spend all day with us.
  00:00:28,502  There are two-- pardon me--
  00:00:30,504  two of everything in every Noah's arcade.
  00:00:34,508  That means two of Zantar,
  00:00:36,361  That means two of Zantar,
  00:00:36,962  Bay Wolf, Ninja Commando, Snake-azon,
  00:00:40,532  Psycho Chopper...
  00:00:41,533  It's really good seeing you, Benjamin.
  00:00:43,535  You haven't been into Shakey's for so long.
  00:00:45,537  Well, I've been real busy.
  00:00:47,973  It's two for you 'cause one won't do.
  00:00:51,410  All this week, kids under 6
  00:00:53,545  get every fifth--
  00:00:54,546  There's a new pet.
  00:00:55,547  Ch-Ch-Chia
  00:00:56,815  Chia Pet-- the pottery that grows.
  00:00:59,251  They are very fast.
  00:01:02,121  Simple. Plug it in,
  00:01:03,622  and insert the plug from just about anything.
Dutch:
  00:00:32,298  De wereld van Wayne
  00:00:35,869  Er zijn twee, excuseer me, twee van Zantar.
  00:00:39,205  ...gestoorde helicopters...
  00:00:41,541  Het is goed om je weer te zien, Benjamin.
                Je bent al heel lang niet meer in Shakey's geweest.
  00:00:45,845  Ik heb het heel erg druk.
  00:00:47,580  Het zijn er twee voor jou, want eentje zal het niet doen.
  00:00:51,117  De hele week, krijgen kinderen onder de zes elke vijfde...
  00:00:54,287  Er is een nieuw huisdier Het Chia huisdier.
  00:00:56,456  Het aardewerk dat groeit.
  00:00:59,526  Zij zijn erg snel.
  00:01:03,263  Simpel. Plug het in.