Title: The PERTOMed Project: Exploiting and validating terminological resources of comparable RussianFrench

1
The PERTOMed Project: Exploiting and validating
terminological resources of comparable
Russian-French-English corpora within
Pharmacovigilance

Cedric BOUSQUET
INSERM U729 (Faculté de Médecine - Paris 5)
cedric.bousquet_at_spim.jussieu.fr
Maria ZIMINA
EA2290 SYLED (Paris 3) / CRIM-INaLCO (Paris)
zimina_at_msh-paris.fr

2
Outline

Introduction. The PERTOMed Project
Background
Objectives
Material: parallel French/English vs. comparable
Russian corpora
Research methodology
SYNTEX
Repeated Segments extraction
Multiple co-occurrences
Collaboration domain expert / corpus linguist
Discussion
Positive results
Limits
Conclusions and future work

3
The PERTOMed Project

Project Directors
Marie-Christine Jaulent (INSERM, Paris).
Jean Charlet (INSERM, Paris).
Partners
INSERM U729, Faculté de Médecine - Paris 5
(France).
ERSS: Equipe de Recherche en Syntaxe et
Sémantique, UMR 5610 CNRS and Toulouse le Mirail
University (France).
CRIM: Centre de Recherche en Ingénierie
Multilingue, INaLCO (Paris, France).

4
The PERTOMed Project

Development of terminological resources in
medicine is a major issue to allow collecting
data and browsing knowledge databases.
The objective of the PERTOMed project (Production
et évaluation de ressources terminologiques et
ontologiques dans le domaine de la médecine) was
to build terminological or ontological resources
from texts in the medical domain.
Potential applications concern several fields:
Pharmacovigilance
Pneumology
Drug-drug interactions
Multilingual terminologies

5
Pharmacy-related issues in Russia

Several pharmaceutical companies are present in
Russia: medicines produced in the EU or the USA are also
commercialised in Russia.
Required qualities of translation of product
information: precision, reproducibility,
exactness.
High quality of drug product information
translations is vital for:
Pharmaceutical companies
Russian regulatory authorities
Medical doctors
Pharmacists
Consumers

6
Pharmacovigilance

According to the World Health Organization (WHO),
pharmacovigilance is the science and activities
relating to the detection, assessment,
understanding and prevention of adverse effects
or any other drug-related problems.

7
Available international terminologies in
Pharmacovigilance

WHO-ART (World Health Organization Adverse
Reaction Terminology) was developed in English
with translations into French, German, Spanish,
Portuguese and Italian.
MedDRA (Medical Dictionary for Regulatory
Activities) defines fully equivalent medical
terms in different languages, including English,
French, German, Japanese and Spanish.

8
Objectives

To propose methods for creating terminological
resources from comparable French-English-Russian
corpora on adverse drug reactions.
To build a trilingual French-English-Russian
terminological resource describing adverse drug
reactions.

9
Available resources

Parallel French-English medical text
corpora: Summaries of Product Characteristics
(SPC).
Comparable medical corpora on Russian Web sites.

10
SPC: Summary of Product Characteristics

The European Medicines Agency (EMEA) is a
decentralised EU body with headquarters in
London.
Companies submit a single marketing authorisation
application to the EMEA.
If the Committee for Medicinal Products for
Human Use (CHMP) gives its approval,
applicants receive a single marketing authorisation
valid for the entire EU.
The SPCs are provided in all EU languages
(undesirable effects are described in Section
4.8).

11
French-English corpus from the PERTOMed Project
(C. Bousquet)

156 SPCs in French and English downloaded as PDF
files.
NLP processing by SYNTEX (French/English parser
and term extractor).

12
SYNTEX (D. Bourigault, S. Ozdowska)

Step 1: Sentence alignment (JAPA)
Step 2: Part-of-Speech tagging (TreeTagger)
Step 3: Parsing (SYNTEX)
syntactic dependencies are identified (subjects,
direct and indirect objects of verbs)
Step 4: Identification of anchor pairs
cognates, translation equivalents within aligned
sentences
Step 5: Alignment by syntactic propagation

Example (the subject relation is identified in both aligned sentences):
EN: the two medicinal products are used concurrently.
FR: ces deux produits sont administrés de
manière concomitante.
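
The propagation step can be illustrated with a minimal sketch. It assumes dependencies are stored as (head, relation, dependent) triples and that an anchor pair such as products/produits has already been found. The function name `propagate`, the triple format and the toy data are assumptions made for illustration; this is not the SYNTEX implementation.

```python
from typing import List, Set, Tuple

Dep = Tuple[str, str, str]   # (head, relation, dependent), an assumed format
Pair = Tuple[str, str]       # (source word, target word)


def propagate(anchors: Set[Pair], src_deps: List[Dep], tgt_deps: List[Dep]) -> Set[Pair]:
    """Extend anchor pairs along dependency relations of the same type."""
    aligned = set(anchors)
    changed = True
    while changed:
        changed = False
        for s_head, s_rel, s_dep in src_deps:
            for t_head, t_rel, t_dep in tgt_deps:
                if s_rel != t_rel:
                    continue
                # Dependents already aligned: align the two heads.
                if (s_dep, t_dep) in aligned and (s_head, t_head) not in aligned:
                    aligned.add((s_head, t_head))
                    changed = True
                # Heads already aligned: align the two dependents.
                elif (s_head, t_head) in aligned and (s_dep, t_dep) not in aligned:
                    aligned.add((s_dep, t_dep))
                    changed = True
    return aligned


# Toy data built from the example sentence pair; the anchor is an assumed cognate pair.
src = [("used", "subject", "products")]
tgt = [("administrés", "subject", "produits")]
result = propagate({("products", "produits")}, src, tgt)
```

Starting from the anchor pair, the shared subject relation lets the sketch propose used/administrés as a new alignment candidate.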
13
Comparable resources on medicinal products in
Russia (J. Ivanova and I. Nuk)

Russian Websites selected for the Project:
RECIPE http://www.recipe.ru
RLS http://www.rlsnet.ru
Russian Vidal http://www.vidal.ru
Criteria for comparability with SPC:
Degree of specialisation
Clarity and precision
Recognition by domain experts in Russia
Information granularity
Style (summarization)
Possible text-to-text alignment: direct search in
Russian by active component or medicinal product

14
The RECIPE Website
The site of legal pharmacological documentation:
Medline user manual, index of Russian biomedical
Websites, several criteria to search for medical
products (including ICD-10). http://www.recipe.ru
15
The RLS Website
The RLS site (RLS is the acronym of Регистр лекарственных
средств России, Register of Medical Substances of
Russia), http://www.rlsnet.ru/: encyclopaedia of
medical products and product descriptions.
16
The Russian Vidal Website
Russian Vidal (http://www.vidal.ru) is edited and
regularly updated by the private company
AstraPharmService in accordance with the
Industrial Standard of Russian Federation.
17
Russian corpus from the PERTOMed Project: results
of Correspondence Analysis (Lexico3)
Regardless of their various origins (different Websites
were used to collect information), the descriptions of
medical products in Russian gathered within the
corpus tend to share common lexical
characteristics.
18
Material: parallel vs. comparable corpora

Difficulties
Corpus size differences
Information coverage?
Degree of comparability?
NLP tools/methods for comparable multilingual
text processing?

Delimiting characters: .,!?/_-\"'()
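
A minimal sketch of how such a delimiter list can drive the segmentation of text into forms. It assumes that whitespace also separates forms, that text is lowercased (as the repeated segments shown later in the deck suggest), and that the backslash in the list is an escape artifact rather than a delimiter. The name `split_forms` is hypothetical; this is not Lexico3's own code.

```python
DELIMITERS = set(".,!?/_-\"'()")  # taken from the slide; backslash assumed to be an escape


def split_forms(text: str) -> list:
    """Split text into lowercase forms at whitespace and delimiter characters."""
    forms, current = [], []
    for ch in text.lower():
        if ch.isspace() or ch in DELIMITERS:
            if current:                      # close the form being built
                forms.append("".join(current))
                current = []
        else:
            current.append(ch)
    if current:                              # flush the last form
        forms.append("".join(current))
    return forms


forms = split_forms("Stevens-Johnson syndrome (rare), flu-like syndrome.")
```

With the hyphen treated as a delimiter, "flu-like" yields the two forms "flu" and "like", which matches the segment "flu like syndrome" listed on the repeated segments slide.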
19
Methods for building terminologies from
comparable corpora (1/2)

If two words are mutual translations, their
collocates are likely to correspond as well.
Collocation is defined as a co-occurrence
relation.
Domain-specific words co-occur with general words
(possibility to use general bilingual
dictionaries).
Mapping through a bilingual dictionary:
Build context vectors for source and target
words.
Translate context vectors.
Compute similarity between source and target
context vectors.
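
The three steps above can be sketched as follows. This is a minimal illustration that assumes a fixed co-occurrence window, raw counts as vector weights and cosine similarity as the measure; the slides do not specify any of these. The function names, the toy sentences and the small dictionary are invented for the example.

```python
import math
from collections import Counter


def context_vector(tokens, word, window=2):
    """Count the forms occurring within `window` positions of `word`."""
    vec = Counter()
    for i, tok in enumerate(tokens):
        if tok == word:
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vec[tokens[j]] += 1
    return vec


def translate_vector(vec, dictionary):
    """Map each context word through a general bilingual dictionary."""
    out = Counter()
    for w, c in vec.items():
        if w in dictionary:
            out[dictionary[w]] += c
    return out


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


# Toy data, invented for illustration only.
fr = ["syndrome", "avec", "fièvre", "et", "fatigue"]
en = ["syndrome", "with", "fever", "and", "fatigue"]
fr_en = {"avec": "with", "fièvre": "fever", "et": "and"}

source = translate_vector(context_vector(fr, "syndrome"), fr_en)
scores = {cand: cosine(source, context_vector(en, cand)) for cand in ("syndrome", "fatigue")}
best = max(scores, key=scores.get)
```

On this toy data the English candidate whose context best matches the translated French context is "syndrome"; on real comparable corpora the ranking is far noisier.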

20
Methods for building terminologies from
comparable corpora (2/2)

Statistical Machine Translation:
A translation model is learned from existing
translations (parallel corpora).
Alignment probabilities are introduced to refine
the model.
Limits: considerable amounts of training data are required;
several heuristics are possible.
Mixed approaches:
Syntactic relations transfer, co-occurrence
relations, dictionary mapping, alignment
probabilities.
Problems:
Lack of equivalence between tools performing
similar tasks on different languages.
Term extraction from comparable corpora is not
yet satisfactory.

21
Repeated Segments extraction
Repeated Segments (SALEM 1987): series of
consecutive forms whose frequency is greater than
or equal to 2 in the corpus.
FR
syndrome de stevens johnson 27
syndrome pseudo grippal 23
de syndrome de 13
syndrome de lyell 11
syndrome de détresse respiratoire 8
syndrome de stevens johnson et 8
de syndrome de stevens johnson 7
un syndrome de 7
syndrome d'hyperstimulation ovarienne 7
un syndrome grippal 5
un syndrome pseudo grippal 5
syndrome de turner 5
EN
stevens johnson syndrome 25
respiratory distress syndrome 8
ovarian hyperstimulation syndrome 7
flu like syndrome 6
multiforme stevens johnson syndrome 6
adult respiratory distress syndrome 5
erythema multiforme stevens johnson syndrome 5
RU
?????????????? ??????? 13
??????? ???????? ???????? 7
??? ??????? 2
??????? ??????????????? ???????? 3
??????? ? ???????????????? ????????? 2
??????? ??????? 2
??????? ?????? ??????? 2
??????? ????????????? ????????? 2
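
The definition of repeated segments lends itself to a short sketch: count every run of consecutive forms and keep those seen at least twice. The sketch assumes segments of two to five forms and ignores delimiter boundaries, which is a simplification. The name `repeated_segments` and the toy sequence are invented for illustration.

```python
from collections import Counter


def repeated_segments(forms, max_len=5, min_freq=2):
    """Return segments of 2..max_len consecutive forms with frequency >= min_freq."""
    counts = Counter()
    for n in range(2, max_len + 1):
        for i in range(len(forms) - n + 1):
            counts[tuple(forms[i:i + n])] += 1
    return {" ".join(seg): c for seg, c in counts.items() if c >= min_freq}


# Toy sequence of forms, invented for illustration.
toy = ("syndrome de stevens johnson et syndrome de lyell "
       "ou syndrome de stevens johnson").split()
segments = repeated_segments(toy)
```

Here "syndrome de" occurs three times and "syndrome de stevens johnson" twice, while "syndrome de lyell" occurs once and is therefore not a repeated segment.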
22
Multiple co-occurrences
MARTINEZ (2003): the method is based on iterative
calculation of lexical attractions. Filtering
techniques reduce the number of contextual
explorations.
(Diagram: successive co-occurrence networks linking forms A to I.)
Only non-inclusive paths are selected.
23
Choosing comparable textual units as starting
points for exploration
?? ??????? (F216)
CONTEXT
FR: troubles du système nerveux insomnie hypoesthésie paresthésies.
EN: nervous system disorders dizziness, paraesthesia, hyperaesthesia.
RU: ?? ??????? ??????? ??????? ??????????????, ???????? ????, ???????, ?????????, ??????????, ???????????, ???????????, ????????? ???, ???????????, ???????.
24
Exploring collocation networks: French
Contexte n1292 (15 formes dont 3 vedettes) Densité info. 0.20
troubles du métabolisme et de la nutrition augmentation des triglycérides sériques, augmentation du cholestérol sérique.
Contexte n1414 (12 formes dont 3 vedettes) Densité info. 0.25
troubles du métabolisme et de la nutrition augmentation de la créatinine, hypokaliémie.
Contexte n1425 (12 formes dont 3 vedettes) Densité info. 0.25
troubles du métabolisme et de la nutrition élévation de l'urée sanguine.
Contexte n3180 (10 formes dont 3 vedettes) Densité info. 0.30
troubles du métabolisme et de la nutrition oedèmes, oedèmes périphériques
Contexte n4667 (12 formes dont 3 vedettes) Densité info. 0.25
troubles du métabolisme et de la nutrition élévation de l'urée sanguine.
Contexte n6157 (10 formes dont 3 vedettes) Densité info. 0.30
troubles du métabolisme et de la nutrition fréquent hypertriglycéridémie, hyperglycémie
Contexte n8334 (13 formes dont 3 vedettes) Densité info. 0.23
troubles du métabolisme et de la nutrition prise de poids ou amaigrissement, oedèmes.
Contexte n10151 (19 formes dont 3 vedettes) Densité info. 0.16
troubles du métabolisme et de la nutrition très fréquents perte de poids fréquents perte d'appétit, prise de poids
Legend: f = co-frequency, s = specificity, c = number of contexts
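
The "Densité info." values in the listing are consistent with the number of headwords ("vedettes") divided by the number of forms in the context, rounded to two decimals. This reading is inferred from the figures shown and is not stated in the slides; the function name `info_density` is hypothetical.

```python
def info_density(n_forms: int, n_headwords: int) -> float:
    """Share of headwords among the forms of a context, rounded to 2 decimals."""
    return round(n_headwords / n_forms, 2)


# (forms, headwords) pairs copied from the French contexts listed on the slide.
densities = [info_density(f, 3) for f in (15, 12, 10, 13, 19)]
```

Under this assumption a short context made up mostly of headwords scores high, while a long enumeration of adverse effects scores low.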
25
Exploring collocation networks: English
Contexte n32 (4 formes dont 3 vedettes) Densité info. 0.75
metabolism and nutrition disorders
Contexte n3715 (7 formes dont 3 vedettes) Densité info. 0.43
metabolism and nutrition disorders oedema, peripheral oedema
Contexte n3763 (5 formes dont 3 vedettes) Densité info. 0.60
metabolism and nutrition disorders hypokalaemia.
Contexte n5392 (23 formes dont 3 vedettes) Densité info. 0.13
metabolism and nutrition disorders very common hypercholesterolemia, hypertriglyceridemia (hyperlipemia) hypokalaemia increased lactic dehydrogenase (ldh) common liver function tests abnormal increased sgot, increased sgpt.
Contexte n7698 (7 formes dont 3 vedettes) Densité info. 0.43
metabolism and nutrition disorders common hypertriglyceridaemia, hyperglycaemia
Contexte n8856 (11 formes dont 3 vedettes) Densité info. 0.27
metabolism and nutrition disorders abnormal renal function tests (increased creatinine, bun)
Contexte n9771 (9 formes dont 3 vedettes) Densité info. 0.33
metabolism and nutrition disorders weight gain or loss, oedema.
Contexte n11578 (13 formes dont 3 vedettes) Densité info. 0.23
metabolism and nutrition disorders very common weight loss common decreased appetite, weight increase
Legend: f = co-frequency, s = specificity, c = number of contexts
26
Exploring collocation networks: Russian
Contexte n150 (10 formes dont 4 vedettes) Densité info. 0.40
?? ??????? ?????? ??????? ?????????? ????????? ???????, ????? ??? ?????????????.
Contexte n832 (37 formes dont 4 vedettes) Densité info. 0.11
c10500 ???????? ???????? ?? ??????? ?????? ??????? ????? ?????????? ???????????, ??? ? ??? ?????????? ?????? ????????????????? ??????????, ???? ???????? ????????, ???????????????? ????????????????? ? ???????? ????????????, ????? ??? ?????????? ??????????, ??????, ??????????????, ?????????? ???????, ????????????, ???????, ????????, ???????????.
Contexte n930 (18 formes dont 4 vedettes) Densité info. 0.22
?? ??????? ?????? ??????? ??????????????????, ?????????? (????? ??????????) ????? ????, ???????? ??????, ?????????????, ????????????? ??????????, ????????????? ????, ???.
Contexte n1439 (10 formes dont 4 vedettes) Densité info. 0.40
?? ??????? ?????? ??????? ?????????????, ???????? ??????, ?????????? ????????, ????????????.
Contexte n1459 (12 formes dont 4 vedettes) Densité info. 0.33
?? ??????? ?????? ??????? ?????????????, ?????????? ????????? ???????, ???????? ?? ???????????? ????.
/
Contexte n647 (16 formes dont 4 vedettes) Densité info. 0.25
?? ??????? ??????????????? ??????? ???????? ???????, ?????, ????????? ?????????? act, ??? ? ??? ??????? ?????? ????????.
Contexte n718 (23 formes dont 4 vedettes) Densité info. 0.17
c752 ???????? ???????? ?? ??????? ??????????????? ??????? ???????? ???? ? ?????????? ? ?????????????? ???????, ???????, ?????, ??????, ???????? ????????, ????????? ?????????? ?????????? ???????????.
Contexte n873 (31 formes dont 4 vedettes) Densité info. 0.13
?? ??????? ??????????????? ??????? ?????-????????? ?????????? ??? ????????-????????? ?????????? ???, ???, ?? ? ?????? ?????? ??????????, ???????, ?????, ??????, ???? ? ?????? ? ????????? ???????-???????, ??????? ????????????????? ???????.
Legend: f = co-frequency, s = specificity, c = number of contexts
27
Combining with French-English lexicon extracted
from parallel corpus

Automatic segmentation into textual units
(Lexico3)
Forms, Repeated Segments (Russian-French-English)
Identification of anchor pairs (starting points):
frequency counts, cognates, general words,
French/English terminology
(syndrome / syndrome / синдром)
Trilingual collocation networks (COOCS)
Identification of similar context vectors.
Semi-automatic segmentation into terminological
units.
Cross-language check.
Expert validation.

28
Building trilingual terminology: collaboration
domain expert/corpus linguist

Two different kinds of knowledge / skills
From corpus linguist:
Methodological knowledge: tools and methods for
text exploration.
Quantitative results on corpora.
From domain expert:
Domain-specific knowledge on ADRs.
Choice of relevant terms/contexts when several
variants are attested in texts.

29
Results: trilingual lexicon on the Web
PERTOMed Server http://baneyx.net/SPIP/
Each trilingual entry comprises the following
fields:
- Simple term (with possible variants)
- Abbreviation (if applicable)
- Related composed term(s)
- Domain(s)
- Medical product(s) concerned
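
A minimal sketch of how an entry with these fields could be held in memory and serialised to XML. The class name, the element names (`entry`, `term`, `composedTerm` and so on) and the `lang` attribute are hypothetical; the actual PERTOMed schema is not shown in the slides. The Russian term in the example is an assumed rendering of the corresponding slide entry.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class TrilingualEntry:
    # language code -> simple term followed by its variants
    terms: Dict[str, List[str]]
    abbreviation: str = ""
    composed_terms: List[str] = field(default_factory=list)
    domains: List[str] = field(default_factory=list)
    products: List[str] = field(default_factory=list)

    def to_xml(self) -> ET.Element:
        """Build an <entry> element; all tag names are illustrative."""
        root = ET.Element("entry")
        for lang, forms in self.terms.items():
            for form in forms:
                ET.SubElement(root, "term", lang=lang).text = form
        if self.abbreviation:
            ET.SubElement(root, "abbreviation").text = self.abbreviation
        for tag, values in (("composedTerm", self.composed_terms),
                            ("domain", self.domains),
                            ("product", self.products)):
            for value in values:
                ET.SubElement(root, tag).text = value
        return root


entry = TrilingualEntry(terms={
    "ru": ["гриппоподобный синдром"],  # assumed rendering of the Russian term
    "fr": ["syndrome pseudo-grippal"],
    "en": ["influenza-like symptom", "flu-like symptom"],
})
xml_text = ET.tostring(entry.to_xml(), encoding="unicode")
```

Keeping the variants as a list per language means a cross-language check can compare entries without parsing free text.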
30
Results: choosing terms
гриппоподобный синдром / syndrome pseudo-grippal
/ influenza-like symptom, flu-like symptom
31
Results: choosing domains
32
Discussion: positive results

430 validated trilingual terminological entries
in XML format
2002 simple terminological records (single word
terms)
1006 complex terminological records (50)
(multiword terms)
Co-occurrence relations

33
Discussion: limits

Lexical coverage.
Contextual access.
Presentation (no visual aids for navigation
yet).
Evaluation difficulties:
Choice of criteria.
Comparable resources needed.

34
Conclusions

Creating terminological resources from comparable
corpora must contend with the intrinsic heterogeneity of
texts.
The challenge of exploring texts coming from
different cultural and linguistic sources should
be taken into account in the feasibility study of a
terminology project.
Creating a Russian Internet corpus in the field
of pharmacovigilance is pioneering work.
The use of a textometric approach for comparable
corpora exploration gives encouraging results.
Our methods should be improved by taking into
account the availability of new tools / resources
for processing Russian texts.

35
Future work

Intertextual exploration on the document level
based on visual aids

DISTRIBUTION INVENTORY OF REPEATED SEGMENTS
(frequency, then segment)
216 ?? ???????
2 ?? ??????? ??????????? ? ?????????????? ??????? ???????
16 ?? ??????? ???
3 ?? ??????? ??? ???????? ???? ??????????????
3 ?? ??????? ??? ? ?????????????? ??????? ???????
4 ?? ??????? ??? ????????
2 ?? ??????? ??? ???????? ?????????? ???????????? ???????? ????
5 ?? ??????? ??????????? ???????
2 ?? ??????? ??????????? ??????? ????????
10 ?? ??????? ?????? ????????
2 ?? ??????? ?????? ???????? ????????
2 ?? ??????? ?????? ???????? ????
4 ?? ??????? ?????? ???????? ???????
2 ?? ??????? ?????
2 ?? ??????? ???????????? ???????????
4 ?? ??????? ????????????????? ???????
36
Publications on the PERTOMed Project

Baneyx A., Charlet J., Jaulent M.-C. (2005)
"Building medical ontologies based on terminology
extraction from texts: methodological
propositions". In Proceedings of the 10th
Conference on Artificial Intelligence in Medicine
in Europe, Lecture Notes in Computer Science,
Aberdeen, GB, July 2005. Springer.
Jaulent M.-C., Charlet J. (2006) "PERTOMed:
Production et évaluation de ressources
terminologiques et ontologiques dans le domaine
de la médecine". PERTOMed, Rapport de fin de
projet, INSERM U729.
Nuk I., Ivanova J. (2005) "Création d'une
terminologie français/russe dans le domaine de la
pharmacovigilance". Mémoire de DESS (dir. Monique
Slodzian), Centre de Recherche en Ingénierie
Multilingue, INaLCO.
Ozdowska S., Névéol A., Thirion B. (2005)
"Traduction compositionnelle automatique de
bitermes dans des corpus anglais/français
alignés". Actes de la Conférence Terminologie et
Intelligence Artificielle, TIA'05, Rouen, France.