Mutual bilingual terminology extraction

Learn more at: http://www.lrec-conf.org
Transcript and Presenter's Notes


1
Mutual bilingual terminology extraction
  • Le An Ha, Gabriela Fernandez, Ruslan Mitkov,
    Gloria Corpas
  • University of Wolverhampton
  • Universidad de Sevilla
  • Universidad de Malaga
  • E-mail: l.a.ha, r.mitkov_at_wlv.ac.uk,
    gfernan_at_us.es, gcorpas_at_ya.com

2
Introduction
  • Terms and terminology
  • Terms: linguistic units which have a specialised
    use.
  • Terminology: the system of terms in a subject
    field.
  • Terminology is vital for specialised
    communication, in both monolingual and
    multilingual contexts.

3
Monolingual and multilingual terminology processing
  • Monolingual terminology processing
  • Three steps: extraction, validation, and
    organisation.
  • Automatic extraction approaches: linguistic (may
    produce noise), statistical (may overlook
    important but low-frequency terms), and hybrid
    approaches
  • Bilingual/multilingual term extraction
  • The same three steps as in monolingual
    terminology processing: extraction, validation,
    and organisation
  • Relies on parallel corpora aligned at a certain
    level
  • Different models are used to align term candidates
  • Alignment is treated as an independent step

4
Our approach: mutual bilingual term extraction
  • Alignment plays an active role in term
    extraction.
  • Automatic alignment is used to propagate the
    strengths of terminology extraction from one
    language into another.
  • Relies on the availability of parallel corpora
    aligned at sentence level.

5
Mutual term extraction: three steps
  • 1. Lists of term candidates are extracted for the
    source and target languages.
  • 2. Term candidates from the target language are
    aligned to those in the source language.
  • 3. If a term candidate in the target language is
    aligned to a term candidate in the source
    language, its term score is increased and the
    candidate is promoted.
  • Steps 1-3 can be repeated many times.

6
Monolingual term extraction
  • Lexical-syntactic-statistical approach
  • Lexical-syntactic: POS patterns
  • English: [AN]+(NP)?[AN]*N
  • Spanish: N[NA]*(PN)?[NA]*
  • Statistical measures
  • Different measures tested
  • Frequency is chosen
7
Term alignment
  • Contingency table-based method: log-likelihood is
    used to estimate how likely it is that a term
    candidate in the source language is translated
    into a given term candidate in the target
    language
  • The table is built using a parallel corpus
    aligned at sentence level
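
As an illustration of the contingency-table idea, the sketch below counts co-occurrences over aligned sentence pairs and computes a log-likelihood ratio. The slides do not give the exact formula, so the G² statistic used here (2 × Σ observed × ln(observed / expected)) is an assumption, as are the function names and the plain substring matching.

```python
import math


def contingency_table(pairs, src_term, tgt_term):
    """Count aligned sentence pairs by presence of each term.

    Returns (a, b, c, d): both terms present, source only, target only, neither.
    Matching is plain substring search, a simplification.
    """
    a = b = c = d = 0
    for src_sentence, tgt_sentence in pairs:
        in_src = src_term in src_sentence
        in_tgt = tgt_term in tgt_sentence
        if in_src and in_tgt:
            a += 1
        elif in_src:
            b += 1
        elif in_tgt:
            c += 1
        else:
            d += 1
    return a, b, c, d


def log_likelihood(a, b, c, d):
    """G2 statistic for a 2x2 table; empty cells contribute nothing."""
    n = a + b + c + d
    rows = (a + b, c + d)
    cols = (a + c, b + d)
    observed = ((a, b), (c, d))
    total = 0.0
    for i in range(2):
        for j in range(2):
            o = observed[i][j]
            if o > 0:
                # expected count = rows[i] * cols[j] / n
                total += o * math.log(o * n / (rows[i] * cols[j]))
    return 2.0 * total


pairs = [
    ("the lymph node is swollen", "el ganglio linfático está inflamado"),
    ("lymph node biopsy", "biopsia de ganglio linfático"),
    ("talk to your doctor", "hable con su médico"),
    ("cancer treatment", "tratamiento del cáncer"),
]
table = contingency_table(pairs, "lymph node", "ganglio linfático")
score = log_likelihood(*table)
```

Note that G² is high for strong negative association as well as positive, so an alignment function built on it would likely also check that the terms co-occur more often than expected.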

8
Contingency table for lymph node and ganglio linfático

9
Boosting algorithms
  • Hypothesis: the term score of a term candidate in
    one language can be used to improve the term
    score of its aligned candidate in the other
    language, and vice versa, via boosting processes
  • Given that:
  • AL(T1,T2): alignment score of the two term
    candidates T1 and T2.
  • TCs[T]: term score of the candidate T in the
    source language
  • TCt[T]: term score of the candidate T in the
    target language
  • BT(TC1,TC2): boosting function, i.e. how the
    term score of the aligned term affects the target
    term score. Example: simple addition,
    BT(TC1,TC2) = TC1 + TC2

10
Boosting algorithms (cont.)
  • Single boosting: the boosting process is performed on
    the target language only
  • For each term candidate Tt in the target language:
  • Ts = argmax over source candidates Ti of AL(Tt,Ti)
  • TCt[Tt] = BT(TCs[Ts],TCt[Tt])
  • Double boosting: the boosting process is performed on
    both source and target languages
  • For each term candidate Ts in the source language:
  • Tt = argmax over target candidates Ti of AL(Ts,Ti)
  • TCs[Ts] = BT(TCs[Ts],TCt[Tt])
  • For each term candidate Tt in the target language:
  • Ts = argmax over source candidates Ti of AL(Tt,Ti)
  • TCt[Tt] = BT(TCs[Ts],TCt[Tt])
  • Recursive boosting: the boosting process is repeated
    for both languages until the term candidate lists
    are stabilised.
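
The three boosting variants above might look like this in code. It is a sketch under stated assumptions rather than the authors' implementation: scores are held in dicts, al(ts, tt) takes the source term first, the min_al threshold (skip the boost when the best alignment score is not above it) is a hypothetical addition, the target pass of double boosting sees the already-boosted source scores (a literal sequential reading of the pseudocode), and "stabilised" is interpreted as "the rankings no longer change".

```python
def add_boost(aligned_score, own_score):
    """Simple addition, BT(TC1, TC2) = TC1 + TC2."""
    return aligned_score + own_score


def boost(own, other, al, bt=add_boost, min_al=0.0):
    """Return a boosted copy of `own`, using the best-aligned term in `other`.

    al(own_term, other_term) gives the alignment score of a pair.
    """
    boosted = dict(own)
    if not other:
        return boosted
    for term in own:
        partner = max(other, key=lambda cand: al(term, cand))
        if al(term, partner) > min_al:  # hypothetical threshold
            boosted[term] = bt(other[partner], own[term])
    return boosted


def single_boost(tcs, tct, al, **kw):
    """Boost target-language scores only."""
    return boost(tct, tcs, lambda tt, ts: al(ts, tt), **kw)


def double_boost(tcs, tct, al, **kw):
    """Boost source scores, then target scores using the new source scores."""
    new_tcs = boost(tcs, tct, al, **kw)
    new_tct = boost(tct, new_tcs, lambda tt, ts: al(ts, tt), **kw)
    return new_tcs, new_tct


def ranking(scores):
    """Terms ordered by descending score, ties broken alphabetically."""
    return sorted(scores, key=lambda term: (-scores[term], term))


def recursive_boost(tcs, tct, al, max_rounds=10, **kw):
    """Repeat double boosting until both rankings stop changing."""
    for _ in range(max_rounds):
        new_tcs, new_tct = double_boost(tcs, tct, al, **kw)
        stable = (ranking(new_tcs) == ranking(tcs)
                  and ranking(new_tct) == ranking(tct))
        tcs, tct = new_tcs, new_tct
        if stable:
            break
    return tcs, tct


# Toy data: frequencies as term scores, made-up alignment scores.
tcs = {"lymph node": 5, "side effect": 3}
tct = {"ganglio linfático": 2, "efecto secundario": 4, "vez": 6}
alignments = {
    ("lymph node", "ganglio linfático"): 9.0,
    ("side effect", "efecto secundario"): 7.0,
}


def al(ts, tt):
    return alignments.get((ts, tt), 0.0)


single = single_boost(tcs, tct, al)
double_src, double_tgt = double_boost(tcs, tct, al)
```

In the toy data the unaligned candidate "vez" keeps its score while the two aligned Spanish candidates overtake it, which is the promotion effect the slides describe.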

11
Parameters
  • Factors affecting the outcome of the proposed
    algorithms: the alignment function AL, the
    mechanism used to calculate the initial term scores
    TCs and TCt, and the boosting function BT.
  • Different combinations of these functions have
    been tried.
  • The best term score function is frequency, and
    the best boosting function is simple addition.
  • In future work, we propose several
    probabilistic models which provide a better
    probabilistic foundation for the boosting
    function.

12
Evaluation data, gold standard, and evaluation metrics
  • Data
  • MedlinePlus parallel texts (English/Spanish) on
    the topic of Cancer
  • 9,250 segments for each language
  • 31,498 English words, 30,344 Spanish words
  • Aligned with Trados WinAlign, then manually corrected
  • Gold standard
  • 389 English terms, 442 Spanish terms, and 357
    term pairs have been validated and used as a gold
    standard.
  • Evaluation metrics
  • F-measure
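
For reference, the F-measure of an extracted term list against a gold standard can be computed as below. The slides do not say how the candidate list was cut off or which beta was used, so treating the output as a plain set and defaulting to beta = 1 are assumptions, and precision_recall_f is a hypothetical name.

```python
def precision_recall_f(extracted, gold, beta=1.0):
    """Return (precision, recall, F-beta) for two collections of terms."""
    extracted, gold = set(extracted), set(gold)
    true_positives = len(extracted & gold)
    precision = true_positives / len(extracted) if extracted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    b2 = beta * beta
    f = (1 + b2) * precision * recall / (b2 * precision + recall)
    return precision, recall, f


# Toy example: 4 extracted candidates, 3 gold terms, 2 in common.
p, r, f = precision_recall_f(
    ["lymph node", "side effect", "last time", "many people"],
    ["lymph node", "side effect", "bone marrow"],
)
```

With two of four candidates correct and two of three gold terms found, precision is 1/2, recall is 2/3, and F1 is 4/7.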

13
Evaluation results
  • Alignment accuracy
  • In total, the algorithm suggests 472 translation
    pairs, of which 374 are confirmed as correct
    translations. This puts the accuracy of
    the alignment at about 0.8 (374/472 ≈ 0.79).
  • Term extraction performance improved by 10 to 25%

14
Results (cont.)

15
Conclusion and future directions
  • A promising approach, but
  • More research will be needed
  • A better mathematical foundation
  • Probabilistic models
  • More experiments
  • Other domains and language pairs
  • Legal
  • English-Hindi

16
Thank you very much
  • Questions? Comments? Criticisms?