Applying a Lexical Similarity Measure to Compare Portuguese Term Collections

1
Applying a Lexical Similarity Measure to Compare
Portuguese Term Collections
Programa de Pós-Graduação em Ciência da Computação
Faculdade de Informática
Pontifícia Universidade Católica - RS

Marcirio Silveira Chaves
Vera Lúcia Strube de Lima

2
Context I

Ontological Structures and Ontologies
Reuse of Knowledge
Semantic Web

3
Context II

Mapping of OSs
Similarity between terms (distinct nuances)
Similarity Measures
Edit Distance
Two levels of similarity
The Portuguese language

4
Summary

Context
Research Question
Similarity Measures
Lexical Similarity Measure
Experiments
Final Remarks and Future Work

5
Research Question

How to map similar concepts between distinct
ontological structures?
Hypothesis
Independently created ontological structures share
a degree of similarity that can be detected and
used to build a mapping between them.

6
Similarity Measures I

Lexical Level
Edit Distance (Levenshtein 1966) - ED
The minimum number of
insertions,
deletions or
substitutions (reversals)
necessary to transform one string into another
using a dynamic programming algorithm.
e.g. ED(ocidente, oriente) = 2
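The definition above can be sketched in a few lines of Python. This is a minimal illustration of the standard dynamic programming formulation, not the authors' implementation; the function name edit_distance is an assumption chosen for this example.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, computed row by row with dynamic programming."""
    # prev[j] = distance between the empty prefix of a and the first j chars of b
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance between the first i chars of a and the empty string
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute ca with cb (or keep a match)
            ))
        prev = curr
    return prev[-1]


print(edit_distance("ocidente", "oriente"))  # 2
```

The two operations in the slide's example are one substitution (c to r) and one deletion (d).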

7
Similarity Measures II

Lexical Level
String Matching (Maedche and Staab 2002) - SM
e.g.
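The formula and example on this slide did not survive in the transcript. As a hedged sketch, the String Matching measure is commonly stated as SM(a, b) = max(0, (min(|a|, |b|) - ED(a, b)) / min(|a|, |b|)); the code below assumes that form. The function names and the handling of empty strings are assumptions made for this example.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via dynamic programming (one row kept at a time)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def string_matching(a: str, b: str) -> float:
    """SM in [0, 1]: 1 means identical strings, 0 means no useful overlap."""
    shortest = min(len(a), len(b))
    if shortest == 0:
        return 0.0  # assumption: an empty term matches nothing
    return max(0.0, (shortest - edit_distance(a, b)) / shortest)


# Reusing the word pair from the Edit Distance slide: (7 - 2) / 7
print(round(string_matching("ocidente", "oriente"), 3))  # 0.714
```

Normalizing by the shorter string keeps the score comparable across terms of different lengths, and the max with 0 prevents negative values when the edit distance exceeds that length.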

8
Lexical Similarity Measure I
k ranges up to the number of words in the term
that has fewer words.
9
Lexical Similarity Measure II

Example

10
Characteristics of the Experiment

Multidomain Experiment
OSs from
Brazilian Senate Thesaurus (OSA)
São Paulo University - USP Thesaurus (OSB)
single-word and multiword terms
validation and evaluation
Terms in OSA were split into two sets, one for
each phase
Terms in OSB were left uncategorized during both
phases.

11
First Letter Heuristic
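
The transcript does not say what the heuristic does. One plausible reading, offered here only as an assumption, is that two terms are compared only when they start with the same letter, which cuts the number of pairwise comparisons. The function name candidate_pairs and the sample terms are hypothetical.

```python
from collections import defaultdict


def candidate_pairs(terms_a, terms_b):
    """Yield (a, b) pairs whose first letters match, ignoring case.

    Assumption: this is one possible reading of a "first letter heuristic",
    not a description confirmed by the slides.
    """
    buckets = defaultdict(list)
    for term in terms_b:
        if term:
            buckets[term[0].lower()].append(term)
    for term in terms_a:
        if term:
            for other in buckets.get(term[0].lower(), []):
                yield term, other


pairs = list(candidate_pairs(["ocidente", "oriente"], ["oriente", "ambiente"]))
print(pairs)  # [('ocidente', 'oriente'), ('oriente', 'oriente')]
```

This sketch does not fold accented initials, so a term starting with "á" would not be paired with one starting with "a".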
12
Relevant Numbers

Evaluation phase
single-word terms
1,823 from OSA
7,039 from OSB
multiword terms
4,701 from OSA
16,986 from OSB.
2,887 pairs of terms found similar (using SM or LS)

13
Evaluation

Analysis of Results

14
Analysis of Group G5

Most of the pairs analyzed during the evaluation
phase (73%)
907 single-word terms
1,211 multiword terms.
Excerpt from group G5: single-word and multiword
terms
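
The share quoted for G5 can be checked against the counts on this slide. The check below assumes the denominator is the 2,887 similar pairs reported on the Relevant Numbers slide; the variable names are chosen for this example.

```python
single_word_pairs = 907
multiword_pairs = 1211
all_similar_pairs = 2887  # assumption: total from the Relevant Numbers slide

g5_pairs = single_word_pairs + multiword_pairs
share = g5_pairs / all_similar_pairs

print(g5_pairs, f"{share:.1%}")  # 2118 73.4%
```

Under that assumption the counts agree with the roughly 73% figure given for G5.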

15
Same Domain Experiment I

GEODESC Thesaurus (2,083 terms)
USP Thesaurus (429 terms) belonging to the
Geosciences domain.

16
Same Domain Experiment II

Analysis of Group A

Pairs of terms considered dissimilar by SM and
similar by LS
17
Same Domain Experiment III

Analysis of Group A

Pairs of terms considered similar by SM and LS
18
Same Domain Experiment IV

Analysis of Group B

Multiword and single-word pairs of terms
19
Same Domain Experiment V

Analysis of Group B

Contribution of the penalties without the
first-letter heuristic
20
Final Remarks and Future Work

About this work
Creation, validation and evaluation of the LS measure
One of the first efforts to deal with Portuguese
term collections
Experiments with terms from multidomain
as well as same-domain structures
Future Work
Application of the LS measure to other languages
Treatment of the semantic-structural level of the
structures