Mutual bilingual terminology extraction

Slides: 17

Provided by: in91

Learn more at: http://www.lrec-conf.org

Tags: bilingual | extraction | mutual | terminology

1
Mutual bilingual terminology extraction

2
Introduction

Terms and terminology
Terms: linguistic units which have a specialised use.
Terminology: the system of terms in a subject field.
Terminology is vital for specialised communication, in both monolingual and multilingual contexts.

3
Monolingual and multilingual terminology processing

Monolingual terminology processing
Three steps: extraction, validation, and organisation.
Automatic extraction approaches: linguistic (may produce noise), statistical (may overlook important but low-frequency terms), and hybrid.
Bilingual/multilingual term extraction
The same three steps as in monolingual terminology processing: extraction, validation, and organisation.
Relies on parallel corpora aligned at a certain level.
Different models are used to align term candidates.
Alignment is treated as an independent step.

4
Our approach: mutual bilingual term extraction

Alignment plays an active role in term
extraction.
Automatic alignment is used to propagate the
strengths of terminology extraction from one
language into another.
Relying on the availability of parallel corpora
aligned at sentence level.

5
Mutual term extraction: three steps

1. Lists of term candidates are extracted for the source and target languages.
2. Term candidates from the target language are aligned to those in the source language.
3. If a term candidate in the target language is aligned to a term candidate in the source language, its term score is increased, i.e. the candidate is promoted.
Steps 1-3 can be repeated many times.

6
Mono-lingual term extraction

7
Term alignment

Contingency table-based method: log-likelihood is used to estimate how likely it is that a term candidate in the source language is translated into a given term candidate in the target language.
The table is built using a parallel corpus aligned at sentence level.
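
The slides do not give the exact formula, so the following is only a sketch under stated assumptions: it uses Dunning's G2 form of the log-likelihood ratio over a 2x2 contingency table, and it decides whether a sentence "contains" a term by plain substring matching. The function names and the toy sentence pairs are hypothetical, chosen for illustration.

```python
import math

def contingency_table(pairs, src_term, tgt_term):
    """Count aligned sentence pairs by term presence on each side.

    Returns (a, b, c, d): both terms present, source only, target only, neither.
    Substring matching is an assumption made for this sketch.
    """
    a = b = c = d = 0
    for src_sent, tgt_sent in pairs:
        in_src = src_term in src_sent
        in_tgt = tgt_term in tgt_sent
        if in_src and in_tgt:
            a += 1
        elif in_src:
            b += 1
        elif in_tgt:
            c += 1
        else:
            d += 1
    return a, b, c, d

def log_likelihood(a, b, c, d):
    """Dunning's G2 = 2 * sum(observed * ln(observed / expected)) for a 2x2 table."""
    n = a + b + c + d

    def cell(obs, row_total, col_total):
        # A zero cell contributes nothing (limit of x * ln x as x -> 0).
        if obs == 0:
            return 0.0
        return obs * math.log(obs * n / (row_total * col_total))

    return 2.0 * (
        cell(a, a + b, a + c)
        + cell(b, a + b, b + d)
        + cell(c, c + d, a + c)
        + cell(d, c + d, b + d)
    )

# Toy sentence-aligned corpus (invented for illustration only).
pairs = [
    ("the lymph node is swollen", "el ganglio linfático está inflamado"),
    ("lymph node biopsy", "biopsia de ganglio linfático"),
    ("the tumour is small", "el tumor es pequeño"),
    ("cancer treatment", "tratamiento del cáncer"),
]
table = contingency_table(pairs, "lymph node", "ganglio linfático")
score = log_likelihood(*table)
```

G2 is large for any strong association, positive or negative, so a practical aligner would probably also check that the two terms co-occur more often than chance predicts before accepting a pair.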

8
Contingency table for "lymph node" and "ganglio linfático"

9
Boosting algorithms

Hypothesis: the term score of a term candidate in one language can be used to improve the term score of its aligned candidate in the other language, and vice versa, via boosting processes.
Given that:
AL(T1, T2): alignment score of the two term candidates T1 and T2.
TCs(T): term score of the candidate T in the source language.
TCt(T): term score of the candidate T in the target language.
BT(TC1, TC2): boosting function, i.e. how the term score of the aligned term affects the target term score.
Example: simple addition, BT(TC1, TC2) = TC1 + TC2.

10
Boosting algorithms (cont.)

Single boosting: the boosting process is performed on the target language only.
Foreach term candidate Tt in the target language
  Ts = argmax over Ti of AL(Tt, Ti)
  TCt(Tt) = BT(TCs(Ts), TCt(Tt))
Double boosting: the boosting process is performed on both the source and target languages.
Foreach term candidate Ts in the source language
  Tt = argmax over Ti of AL(Ts, Ti)
  TCs(Ts) = BT(TCs(Ts), TCt(Tt))
Foreach term candidate Tt in the target language
  Ts = argmax over Ti of AL(Tt, Ti)
  TCt(Tt) = BT(TCs(Ts), TCt(Tt))
Here Ti ranges over the term candidates of the other language.
Recursive boosting: the boosting process is repeated for both languages until the term candidate lists are stabilised.
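
The three boosting variants could be sketched in Python as below. This is one possible reading of the slide, not the authors' implementation. The assumptions are: alignment scores are stored in a dictionary keyed by (source term, target term); a candidate is boosted only when its best alignment score exceeds a hypothetical threshold `min_al`; in double boosting the target pass sees the already-boosted source scores; and "stabilised" means the ranking of both lists no longer changes. All function and parameter names are illustrative, and both candidate lists are assumed to be non-empty.

```python
def add_boost(tc1, tc2):
    """Simple addition, the example boosting function BT from the slide."""
    return tc1 + tc2

def single_boost(tc_s, tc_t, al, bt=add_boost, min_al=0.0):
    """Boost target scores using each target candidate's best-aligned source term."""
    boosted = {}
    for tt, score in tc_t.items():
        ts = max(tc_s, key=lambda cand: al.get((cand, tt), 0.0))
        aligned = al.get((ts, tt), 0.0) > min_al  # assumption: skip unaligned terms
        boosted[tt] = bt(tc_s[ts], score) if aligned else score
    return boosted

def double_boost(tc_s, tc_t, al, bt=add_boost, min_al=0.0):
    """Boost source scores first, then target scores from the updated source scores."""
    new_s = {}
    for ts, score in tc_s.items():
        tt = max(tc_t, key=lambda cand: al.get((ts, cand), 0.0))
        aligned = al.get((ts, tt), 0.0) > min_al
        new_s[ts] = bt(score, tc_t[tt]) if aligned else score
    new_t = single_boost(new_s, tc_t, al, bt, min_al)
    return new_s, new_t

def ranking(scores):
    """Terms ordered by descending score, ties broken alphabetically."""
    return sorted(scores, key=lambda term: (-scores[term], term))

def recursive_boost(tc_s, tc_t, al, bt=add_boost, max_rounds=20):
    """Repeat double boosting until both rankings stop changing (or max_rounds)."""
    for _ in range(max_rounds):
        new_s, new_t = double_boost(tc_s, tc_t, al, bt)
        stable = ranking(new_s) == ranking(tc_s) and ranking(new_t) == ranking(tc_t)
        tc_s, tc_t = new_s, new_t
        if stable:
            break
    return tc_s, tc_t

# Toy scores (invented): the non-terms start out ranked above the real terms.
tc_s = {"lymph node": 5, "the patient": 6}
tc_t = {"ganglio linfático": 2, "el paciente": 3}
al = {("lymph node", "ganglio linfático"): 9.0}
final_s, final_t = recursive_boost(tc_s, tc_t, al)
```

In the toy data the aligned pair is promoted above the unaligned candidates in both languages, which is the effect the hypothesis describes. Because simple addition makes raw scores grow on every round, the sketch stops on ranking stability instead of score convergence.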

11
Parameters

Factors affecting the outcome of the proposed algorithms: the alignment function AL, the mechanism used to calculate the initial term scores TCs and TCt, and the boosting function BT.
Different combinations of these functions have been tested.
The best term score function is frequency, and the best boosting function is simple addition.
In follow-up research, we propose several probabilistic models which provide a better probabilistic foundation for the boosting function.
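
As a rough illustration of frequency as the initial term score, the sketch below counts how often each candidate occurs across the segments of one language. How candidates are proposed in the first place is not described on the slide, so the candidate list is simply taken as given; the function name and the case-insensitive substring counting are assumptions.

```python
from collections import Counter

def frequency_scores(segments, candidates):
    """Initial term score = number of occurrences of each candidate in the corpus.

    Matching is case-insensitive and substring-based (an assumption of this sketch).
    """
    counts = Counter()
    for segment in segments:
        text = segment.lower()
        for cand in candidates:
            counts[cand] += text.count(cand.lower())
    return dict(counts)

# Toy segments (invented for illustration only).
segments = ["Lymph node biopsy", "The lymph node and another lymph node"]
scores = frequency_scores(segments, ["lymph node", "biopsy", "tumour"])
```

Scores of this kind can serve as the TCs and TCt inputs to the boosting step.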

12
Evaluation data, gold standard, and evaluation metrics

Data
MedlinePlus parallel texts (English/Spanish) on the topic of cancer.
9,250 segments for each language.
31,498 English words, 30,344 Spanish words.
Aligned with Trados WinAlign, then manually corrected.
Gold standard
389 English terms, 442 Spanish terms, and 357 term pairs have been validated and used as a gold standard.
Evaluation metrics
F-measure
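
The slide names F-measure without saying which variant was used. A minimal sketch, assuming the balanced F1 over exact string matches between the extracted terms and the gold standard (with `beta` as a hypothetical knob for other weightings), might look like this:

```python
def f_measure(extracted, gold, beta=1.0):
    """Return (precision, recall, F) for extracted terms against a gold standard."""
    extracted, gold = set(extracted), set(gold)
    true_positives = len(extracted & gold)
    precision = true_positives / len(extracted) if extracted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    b2 = beta * beta
    f = (1 + b2) * precision * recall / (b2 * precision + recall)
    return precision, recall, f

# Toy term lists (invented for illustration only).
precision, recall, f1 = f_measure(
    extracted=["lymph node", "biopsy", "the patient", "tumour"],
    gold=["lymph node", "biopsy", "chemotherapy"],
)
```

Exact matching is strict; an evaluation that credits partial or lemmatised matches would need a different comparison than set intersection.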

13
Evaluation results

Alignment accuracy
In total, the algorithm suggests 472 translation pairs, of which 374 are confirmed as correct translations. This puts the alignment accuracy at 374/472 ≈ 0.79, i.e. roughly 0.8.
Term extraction performance improved by 10 to 25

14
Results (cont.)

15
Conclusion and future directions