1
The PERTOMed Project: Exploiting and validating
terminological resources of comparable
Russian-French-English corpora within
Pharmacovigilance
  • Cedric BOUSQUET
  • INSERM U729 (Faculté de Médecine - Paris 5)
  • cedric.bousquet_at_spim.jussieu.fr
  • Maria ZIMINA
  • EA2290 SYLED (Paris 3) / CRIM-INaLCO (Paris)
  • zimina_at_msh-paris.fr

2
Outline
  • Introduction. The PERTOMed Project
  • Background
  • Objectives
  • Material: parallel French/English vs. comparable
    Russian corpora
  • Research methodology
  • SYNTEX
  • Repeated Segments extraction
  • Multiple co-occurrences
  • Collaboration domain expert / corpus linguist
  • Discussion
  • Positive results
  • Limits
  • Conclusions and future work

3
The PERTOMed Project
  • Project Directors
  • Marie-Christine Jaulent (INSERM, Paris).
  • Jean Charlet (INSERM, Paris).
  • Partners
  • INSERM U729, Faculté de Médecine - Paris 5
    (France).
  • ERSS: Équipe de Recherche en Syntaxe et
    Sémantique, UMR 5610 CNRS and Toulouse le Mirail
    University (France).
  • CRIM: Centre de Recherche en Ingénierie
    Multilingue, INaLCO (Paris, France).

4
The PERTOMed Project
  • Developing terminological resources in medicine
    is a major issue for collecting data and
    browsing knowledge bases.
  • The objective of the PERTOMed project (Production
    et évaluation de ressources terminologiques et
    ontologiques dans le domaine de la médecine) was
    to build terminological or ontological resources
    from texts in the medical domain.
  • Potential applications concern several fields:
  • Pharmacovigilance
  • Pneumology
  • Drug-drug interactions
  • Multilingual terminologies

5
Pharmacy-related issues in Russia
  • Several pharmaceutical companies are present in
    Russia; medicines produced in the EU or USA are
    also commercialised in Russia.
  • Required qualities of product information
    translations: Precision, Reproducibility,
    Exactness
  • High quality of drug product information
    translations is vital for:
  • Pharmaceutical companies
  • Russian regulatory authorities
  • Medical doctors
  • Pharmacists
  • Consumers

6
Pharmacovigilance
  • According to the World Health Organization (WHO),
    pharmacovigilance is the science and activities
    relating to the detection, assessment,
    understanding and prevention of adverse effects
    or any other drug-related problems.

7
Available international terminologies in
Pharmacovigilance
  • WHO-ART (World Health Organization Adverse
    Reaction Terminology) was developed in English
    with translations into French, German, Spanish,
    Portuguese and Italian.
  • MedDRA (Medical Dictionary for Regulatory
    Activities) defines fully equivalent medical
    terms in different languages, including English,
    French, German, Japanese and Spanish.

8
Objectives
  • To propose methods for creating terminological
    resources from comparable French-English-Russian
    corpora on adverse drug reactions.
  • To build a trilingual French-English-Russian
    terminological resource describing adverse drug
    reactions.

9
Available resources
  • Parallel French-English medical text corpora:
    Summaries of Product Characteristics (SPC).
  • Comparable medical corpora on Russian Web sites.

10
SPC: Summary of Product Characteristics
  • The European Medicines Agency (EMEA) is a
    decentralised EU body with headquarters in
    London.
  • Companies submit a single marketing authorisation
    application to the EMEA.
  • In case of approval by the Committee for
    Medicinal Products for Human Use (CHMP),
    applicants receive a single marketing
    authorisation valid for the entire EU.
  • The SPCs are provided in all EU languages
    (undesirable effects are described in Section
    4.8).

11
French-English corpus from the PERTOMed Project
(C. Bousquet)
  • 156 SPCs in French and English downloaded as PDF
    files.
  • NLP processing by SYNTEX (French/English parser
    and term extractor).

12
SYNTEX (D. Bourigault, S. Ozdowska)
  • Step 1: Sentence alignment (JAPA)
  • Step 2: Part-of-Speech tagging (TreeTagger)
  • Step 3: Parsing (SYNTEX)
  • syntactic dependencies are identified (subjects,
    direct and indirect objects of verbs)
  • Step 4: Identification of anchor pairs
  • cognates, translation equivalents within aligned
    sentences
  • Step 5: Alignment by syntactic propagation

[Figure: an aligned sentence pair with the subject
dependency marked in both languages]
EN: the two medicinal products are used concurrently.
FR: ces deux produits sont administrés de manière
concomitante.
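Step 4 above (identification of anchor pairs) can be illustrated with a minimal Python sketch that proposes cognate pairs inside one aligned sentence pair. It assumes a plain string-similarity threshold computed with the standard library `difflib`; the function name `cognate_pairs`, the 0.7 threshold and the minimum word length are illustrative assumptions, not the procedure SYNTEX actually uses.

```python
from difflib import SequenceMatcher

def cognate_pairs(src_tokens, tgt_tokens, threshold=0.7, min_len=4):
    """Pair each source word with its most similar target word (assumed heuristic)."""
    pairs = []
    for s in src_tokens:
        if len(s) < min_len:          # short function words are skipped
            continue
        best, best_score = None, 0.0
        for t in tgt_tokens:
            if len(t) < min_len:
                continue
            score = SequenceMatcher(None, s.lower(), t.lower()).ratio()
            if score > best_score:
                best, best_score = t, score
        # keep the pair only if the spellings are close enough
        if best is not None and best_score >= threshold:
            pairs.append((s, best))
    return pairs

# The aligned sentence pair shown on the slide
en = "the two medicinal products are used concurrently".split()
fr = "ces deux produits sont administrés de manière concomitante".split()
anchors = cognate_pairs(en, fr)
```

On this sentence pair the sketch retains "products" / "produits" as an anchor. Pairs that are translations but not cognates (for example "used" / "administrés") would have to come from a bilingual lexicon or from the propagation step.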
13
Comparable resources on medicinal products in
Russia (J. Ivanova and I. Nuk)
  • Russian Websites selected for the Project
  • RECIPE: http://www.recipe.ru
  • RLS: http://www.rlsnet.ru
  • Russian Vidal: http://www.vidal.ru
  • Criteria for comparability with SPC
  • Degree of specialisation
  • Clarity and precision
  • Recognition by domain experts in Russia
  • Information granularity
  • Style (summarization)
  • Possible text-to-text alignment: direct search in
    Russian by active component or medicinal product

14
The RECIPE Website
The site of legal pharmacological documentation:
Medline user manual, index of Russian bio-medical
Websites, several criteria to search for medical
products (including ICD-10). http://www.recipe.ru
15
The RLS Website
The RLS site (http://www.rlsnet.ru/). RLS is the
acronym of Регистр лекарственных средств России
(Register of Medical Substances of Russia). It is
an encyclopaedia of medical products and product
descriptions.
16
The Russian Vidal Website
Russian Vidal (http://www.vidal.ru) is edited and
regularly updated by the private company
AstraPharmService in accordance with the
Industrial Standard of the Russian Federation.
17
Russian corpus from the PERTOMed Project: results
of Correspondence Analysis (Lexico3)
Regardless of their various origins (different
Websites were used to collect information), the
descriptions of medical products in Russian
gathered within the corpus tend to share common
lexical characteristics.
18
Material: parallel vs. comparable corpora
  • Difficulties
  • Corpus size differences
  • Information coverage?
  • Degree of comparability?
  • NLP tools/methods for comparable multilingual
    text processing?

Delimiting characters .,!?/_-\"'()
19
Methods for building terminologies from
comparable corpora (1/2)
  • If two words are mutual translations, their
    collocates are likely to correspond as well.
  • Collocation is defined as a co-occurrence
    relation.
  • Domain-specific words co-occur with general words
    (so general bilingual dictionaries can be used).
  • Mapping through a bilingual dictionary:
  • Build context vectors for source and target
    words.
  • Translate context vectors.
  • Compute similarity between source and target
    context vectors.
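
The three dictionary-mapping steps above can be sketched in Python as follows. This is a toy illustration under stated assumptions: the window size, the cosine measure, the tiny `fr_en` dictionary and the two example sentences are all made up for the example, and none of the function names come from the project.

```python
from collections import Counter
from math import sqrt

def context_vector(tokens, target, window=2):
    """Count the words found within `window` positions of each occurrence of `target`."""
    vec = Counter()
    for i, tok in enumerate(tokens):
        if tok != target:
            continue
        start, end = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:
                vec[tokens[j]] += 1
    return vec

def translate_vector(vec, dictionary):
    """Map each context word to its dictionary translations; unknown words are dropped."""
    out = Counter()
    for word, count in vec.items():
        for translation in dictionary.get(word, []):
            out[translation] += count
    return out

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm = sqrt(sum(x * x for x in u.values())) * sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

# Toy comparable "corpora" and a toy general-language dictionary (assumptions)
fr_tokens = "syndrome grippal avec fièvre et fatigue".split()
en_tokens = "flu like syndrome with fever and fatigue".split()
fr_en = {"avec": ["with"], "et": ["and"], "fatigue": ["fatigue"], "syndrome": ["syndrome"]}

source = translate_vector(context_vector(fr_tokens, "fièvre"), fr_en)
candidates = ["fever", "syndrome"]
scores = {c: cosine(source, context_vector(en_tokens, c)) for c in candidates}
best = max(scores, key=scores.get)   # candidate whose context is most similar
```

Here the translated context of "fièvre" is closest to the context of "fever", which is therefore ranked first. On real comparable corpora the vectors would be built over many occurrences and usually weighted by an association measure rather than raw counts.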

20
Methods for building terminologies from
comparable corpora (2/2)
  • Statistical Machine Translation
  • A translation model is learned from existing
    translations (parallel corpora).
  • Alignment probabilities are introduced to refine
    the model.
  • Limits: considerable amounts of training data
    are required; several heuristics are possible.
  • Mixed approaches
  • Syntactic relations transfer, co-occurrence
    relations, dictionary mapping, alignment
    probabilities
  • Problems
  • Lack of equivalence between tools performing
    similar tasks on different languages
  • Term extraction from comparable corpora is not
    yet satisfactory.

21
Repeated Segments extraction
Repeated Segments (SALEM 1987): series of
consecutive forms whose frequency is greater than
or equal to 2 in the corpus.
FR
  • syndrome de stevens johnson 27
  • syndrome pseudo grippal 23
  • de syndrome de 13
  • syndrome de lyell 11
  • syndrome de détresse respiratoire 8
  • syndrome de stevens johnson et 8
  • de syndrome de stevens johnson 7
  • un syndrome de 7
  • syndrome d'hyperstimulation ovarienne 7
  • un syndrome grippal 5
  • un syndrome pseudo grippal 5
  • syndrome de turner 5
EN
  • stevens johnson syndrome 25
  • respiratory distress syndrome 8
  • ovarian hyperstimulation syndrome 7
  • flu like syndrome 6
  • multiforme stevens johnson syndrome 6
  • adult respiratory distress syndrome 5
  • erythema multiforme stevens johnson syndrome 5
RU
  • гриппоподобный синдром 13
  • ??????? ???????? ???????? 7
  • ??? ??????? 2
  • ??????? ??????????????? ???????? 3
  • ??????? ? ???????????????? ????????? 2
  • ??????? ??????? 2
  • ??????? ?????? ??????? 2
  • ??????? ????????????? ????????? 2
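The definition of repeated segments given on this slide can be turned into a short Python sketch: count every run of consecutive forms and keep those seen at least twice. The maximum segment length and the toy token list are assumptions for the example, and the sketch ignores the rule, usual in textometric tools, that a segment should not cross a punctuation delimiter.

```python
from collections import Counter

def repeated_segments(tokens, min_freq=2, max_len=5):
    """Return {segment: frequency} for runs of 2..max_len consecutive forms seen >= min_freq times."""
    counts = Counter()
    for n in range(2, max_len + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return {" ".join(seg): freq for seg, freq in counts.items() if freq >= min_freq}

# Toy token sequence built from forms that appear on the slide
tokens = ("syndrome de stevens johnson ou syndrome de lyell "
          "et syndrome de stevens johnson").split()
segments = repeated_segments(tokens)
```

In this toy sequence "syndrome de" is found three times and "syndrome de stevens johnson" twice, while "syndrome de lyell" occurs only once and is therefore not a repeated segment.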
22
Multiple co-occurrences
MARTINEZ (2003): The method is based on iterative
calculation of lexical attractions. Filtering
techniques reduce the number of contextual
explorations.
[Figure: co-occurrence network grown step by step
from the pole A through the forms B to I]
Only non-inclusive paths are selected.
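The iterative exploration can be approximated by the following Python sketch. It is a simplification under several assumptions: a raw co-frequency threshold stands in for the lexical attraction computation, contexts are plain sets of forms, and "non-inclusive" is read as "not contained in a longer selected path". The function names `expand` and `non_inclusive` are hypothetical.

```python
from collections import Counter

def expand(contexts, path, min_cofreq=2, max_depth=3):
    """Grow `path` with forms that co-occur often enough in the contexts containing the whole path."""
    selected = [c for c in contexts if set(path) <= c]
    counts = Counter(w for c in selected for w in c if w not in path)
    attracted = sorted(w for w, f in counts.items() if f >= min_cofreq)
    if not attracted or len(path) >= max_depth:
        return [path]                      # nothing more to add: the path is complete
    paths = []
    for w in attracted:
        paths.extend(expand(contexts, path + (w,), min_cofreq, max_depth))
    return paths

def non_inclusive(paths):
    """Keep only the sets of forms that are not strictly included in another selected set."""
    sets = {frozenset(p) for p in paths}
    maximal = [s for s in sets if not any(s < other for other in sets)]
    return sorted(tuple(sorted(s)) for s in maximal)

# Toy contexts, each reduced to the set of forms it contains
contexts = [{"A", "B", "C"}, {"A", "B", "C"}, {"A", "B", "D"}, {"A", "E"}, {"A", "E"}]
networks = non_inclusive(expand(contexts, ("A",)))
```

Starting from the pole A, the sketch keeps the groups A-B-C and A-E; D is filtered out because it co-occurs with A only once.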
23
Choosing comparable textual units as starting
points for exploration
со стороны (F216)
CONTEXT
  • FR: troubles du système nerveux: insomnie,
    hypoesthésie, paresthésies.
  • EN: nervous system disorders: dizziness,
    paraesthesia, hyperaesthesia.
  • RU: со стороны нервной системы: [list of ten
    Russian terms lost to character encoding]
24
Exploring collocation networks: French
  • Contexte n°1292 (15 formes dont 3 vedettes),
    Densité info. 0.20: troubles du métabolisme et de
    la nutrition: augmentation des triglycérides
    sériques, augmentation du cholestérol sérique.
  • Contexte n°1414 (12 formes dont 3 vedettes),
    Densité info. 0.25: troubles du métabolisme et de
    la nutrition: augmentation de la créatinine,
    hypokaliémie.
  • Contexte n°1425 (12 formes dont 3 vedettes),
    Densité info. 0.25: troubles du métabolisme et de
    la nutrition: élévation de l'urée sanguine.
  • Contexte n°3180 (10 formes dont 3 vedettes),
    Densité info. 0.30: troubles du métabolisme et de
    la nutrition: oedèmes, oedèmes périphériques
  • Contexte n°4667 (12 formes dont 3 vedettes),
    Densité info. 0.25: troubles du métabolisme et de
    la nutrition: élévation de l'urée sanguine.
  • Contexte n°6157 (10 formes dont 3 vedettes),
    Densité info. 0.30: troubles du métabolisme et de
    la nutrition: fréquent: hypertriglycéridémie,
    hyperglycémie
  • Contexte n°8334 (13 formes dont 3 vedettes),
    Densité info. 0.23: troubles du métabolisme et de
    la nutrition: prise de poids ou amaigrissement,
    oedèmes.
  • Contexte n°10151 (19 formes dont 3 vedettes),
    Densité info. 0.16: troubles du métabolisme et de
    la nutrition: très fréquents: perte de poids;
    fréquents: perte d'appétit, prise de poids
Legend: f = co-frequency, s = specificity,
c = number of contexts
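The "Densité info." value printed with each context appears to be the number of pole forms ("vedettes") divided by the total number of forms in the context, rounded to two decimals. This reading is inferred from the figures on the slide, not taken from the tool's documentation; the short Python check below reproduces the reported values under that assumption. The function name `info_density` is hypothetical.

```python
def info_density(n_forms, n_poles):
    """Share of pole forms among all forms of a context, rounded as on the slide (assumed formula)."""
    return round(n_poles / n_forms, 2)

# (number of forms, number of pole forms, density reported on the slide)
reported = {
    1292: (15, 3, 0.20),
    1414: (12, 3, 0.25),
    3180: (10, 3, 0.30),
    8334: (13, 3, 0.23),
    10151: (19, 3, 0.16),
}
computed = {n: info_density(forms, poles) for n, (forms, poles, _) in reported.items()}
```

Under this reading, a short context that consists mostly of pole forms gets a high density, and a long enumeration of adverse effects gets a low one.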
25
Exploring collocation networks: English
  • Contexte n°32 (4 formes dont 3 vedettes),
    Densité info. 0.75: metabolism and nutrition
    disorders
  • Contexte n°3715 (7 formes dont 3 vedettes),
    Densité info. 0.43: metabolism and nutrition
    disorders: oedema, peripheral oedema
  • Contexte n°3763 (5 formes dont 3 vedettes),
    Densité info. 0.60: metabolism and nutrition
    disorders: hypokalaemia.
  • Contexte n°5392 (23 formes dont 3 vedettes),
    Densité info. 0.13: metabolism and nutrition
    disorders: very common: hypercholesterolemia,
    hypertriglyceridemia (hyperlipemia),
    hypokalaemia, increased lactic dehydrogenase
    (ldh); common: liver function tests abnormal:
    increased sgot, increased sgpt.
  • Contexte n°7698 (7 formes dont 3 vedettes),
    Densité info. 0.43: metabolism and nutrition
    disorders: common: hypertriglyceridaemia,
    hyperglycaemia
  • Contexte n°8856 (11 formes dont 3 vedettes),
    Densité info. 0.27: metabolism and nutrition
    disorders: abnormal renal function tests
    (increased creatinine, bun)
  • Contexte n°9771 (9 formes dont 3 vedettes),
    Densité info. 0.33: metabolism and nutrition
    disorders: weight gain or loss, oedema.
  • Contexte n°11578 (13 formes dont 3 vedettes),
    Densité info. 0.23: metabolism and nutrition
    disorders: very common: weight loss; common:
    decreased appetite, weight increase
Legend: f = co-frequency, s = specificity,
c = number of contexts
26
Exploring collocation networks: Russian
[The Russian text of these contexts was lost to
character encoding; only the context headers are
recoverable.]
Contexts for the first four-form pole expression:
  • Contexte n°150 (10 formes dont 4 vedettes),
    Densité info. 0.40
  • Contexte n°832 (37 formes dont 4 vedettes),
    Densité info. 0.11
  • Contexte n°930 (18 formes dont 4 vedettes),
    Densité info. 0.22
  • Contexte n°1439 (10 formes dont 4 vedettes),
    Densité info. 0.40
  • Contexte n°1459 (12 formes dont 4 vedettes),
    Densité info. 0.33
Contexts for the second four-form pole expression:
  • Contexte n°647 (16 formes dont 4 vedettes),
    Densité info. 0.25
  • Contexte n°718 (23 formes dont 4 vedettes),
    Densité info. 0.17
  • Contexte n°873 (31 formes dont 4 vedettes),
    Densité info. 0.13
Legend: f = co-frequency, s = specificity,
c = number of contexts
27
Combining with French-English lexicon extracted
from parallel corpus
  • Automatic segmentation into textual units
    (Lexico3)
  • Forms, Repeated Segments (Russian-French-English)
  • Identification of anchor pairs (starting points)
  • Frequency counts, cognates, general words,
    French/English terminology
  • (syndrome / syndrome / синдром)
  • Trilingual collocation networks (COOCS)
  • Identification of similar context vectors.
  • Semi-automatic segmentation into terminological
    units.
  • Cross-language check.
  • Expert validation.

28
Building trilingual terminology: collaboration
domain expert/corpus linguist
  • Two different kinds of knowledge / skills
  • From corpus linguist
  • Methodological knowledge: tools and methods for
    text exploration.
  • Quantitative results on corpora.
  • From domain expert
  • Domain specific knowledge on ADRs.
  • Choice of relevant terms/contexts when several
    variants attested in texts.

29
Results: trilingual lexicon on the Web
PERTOMed Server: http://baneyx.net/SPIP/
Each trilingual entry comprises the following
fields:
  • Simple term (with possible variants)
  • Abbreviation (if applicable)
  • Related composed term(s)
  • Domain(s)
  • Medical product(s) concerned
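One way to picture such an entry is the Python sketch below, which stores the listed fields in a small data class and serialises them to XML with the standard library. The class name `TrilingualEntry` and the XML element names are assumptions made for illustration; the schema actually used on the PERTOMed server is not shown in the slides.

```python
from dataclasses import dataclass, field
import xml.etree.ElementTree as ET

@dataclass
class TrilingualEntry:
    terms: dict                                   # language code -> list of term variants
    abbreviation: str = ""
    composed_terms: list = field(default_factory=list)
    domains: list = field(default_factory=list)
    products: list = field(default_factory=list)

    def to_xml(self):
        """Serialise the entry; element and attribute names are hypothetical."""
        entry = ET.Element("entry")
        for lang, variants in self.terms.items():
            for variant in variants:
                ET.SubElement(entry, "term", lang=lang).text = variant
        if self.abbreviation:
            ET.SubElement(entry, "abbreviation").text = self.abbreviation
        for tag, values in (("composedTerm", self.composed_terms),
                            ("domain", self.domains),
                            ("product", self.products)):
            for value in values:
                ET.SubElement(entry, tag).text = value
        return ET.tostring(entry, encoding="unicode")

# Example built from the flu-like syndrome terms discussed in the presentation
entry = TrilingualEntry(terms={
    "ru": ["гриппоподобный синдром"],
    "fr": ["syndrome pseudo-grippal"],
    "en": ["influenza-like symptom", "flu-like symptom"],
})
xml_text = entry.to_xml()
```

Keeping the variants as a list per language lets one entry hold several attested forms, which matches the "with possible variants" note in the field list.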
30
Results: choosing terms
гриппоподобный синдром / syndrome pseudo-grippal
/ influenza-like symptom, flu-like symptom
31
Results: choosing domains
32
Discussion: positive results
  • 430 validated trilingual terminological entries
    in XML format
  • 2002 simple terminological records (single word
    terms)
  • 1006 complex terminological records (50)
    (multiword terms)
  • Co-occurrence relations

33
Discussion: limits
  • Lexical coverage.
  • Contextual access.
  • Presentation (no visual aids for navigation
    yet).
  • Evaluation difficulties
  • Choice of criteria.
  • Comparable resources needed.

34
Conclusions
  • Creating terminological resources from comparable
    corpora must contend with the intrinsic
    heterogeneity of texts.
  • The challenge of exploring texts from different
    cultural and linguistic sources should be taken
    into account in the feasibility study of a
    terminology project.
  • Creating a Russian Internet corpus in the field
    of Pharmacovigilance is pioneering work.
  • The use of a textometric approach for exploring
    comparable corpora gives encouraging results.
  • Our methods should be improved by taking into
    account new tools and resources available for
    processing Russian texts.

35
Future work
  • Intertextual exploration on the document level
    based on visual aids

DISTRIBUTION INVENTORY OF REPEATED SEGMENTS
[Figure: tree of repeated segments that extend the
Russian form sequence "со стороны" (frequency 216);
the listed extensions have frequencies between 2
and 16. The Cyrillic text of the extensions was
lost to character encoding.]
36
Publications on the PERTOMed Project
  • Baneyx A., Charlet J., Jaulent M.-C. (2005)
    "Building medical ontologies based on terminology
    extraction from texts: methodological
    propositions". In Proceedings of the 10th
    Conference on Artificial Intelligence in Medicine
    in Europe, Lecture Notes in Computer Science,
    Aberdeen, GB, July 2005. Springer.
  • Jaulent M.-C., Charlet J. (2006) "PERTOMed:
    Production et évaluation de ressources
    terminologiques et ontologiques dans le domaine
    de la médecine". PERTOMed Rapport de fin de
    projet, INSERM U729.
  • Nuk I., Ivanova J. (2005) "Création d'une
    terminologie français/russe dans le domaine de la
    pharmacovigilance". Mémoire de DESS (dir. Monique
    Slodzian), Centre de Recherche en Ingénierie
    Multilingue, INaLCO.
  • Ozdowska S., Névéol A., Thirion B. (2005)
    "Traduction compositionnelle automatique de
    bitermes dans des corpus anglais/français
    alignés". Actes de la Conférence Terminologie et
    Intelligence Artificielle, TIA'05, Rouen, France.