Title: Applying a Lexical Similarity Measure to Compare Portuguese Term Collections
1 Applying a Lexical Similarity Measure to Compare Portuguese Term Collections
Programa de Pós-Graduação em Ciência da Computação, Faculdade de Informática,
Pontifícia Universidade Católica - RS
- Marcirio Silveira Chaves
- Vera Lúcia Strube de Lima
2 Context I
- Ontological Structures and Ontologies
- Reuse of Knowledge
- Semantic Web
3 Context II
- Mapping of ontological structures (OSs)
- Similarity between terms (distinct nuances)
- Similarity Measures
- Edit Distance
- Two levels of similarity
- The Portuguese language
4 Summary
- Context
- Research Question
- Similarity Measures
- Lexical Similarity Measure
- Experiments
- Final Remarks and Future Work
5 Research Question
- How to map similar concepts between distinct
ontological structures?
- Hypothesis
- There is a degree of similarity among independently
created ontological structures, which can be
detected in order to allow a mapping.
6 Similarity Measures I
- Lexical Level
- Edit Distance (Levenshtein 1966) - ED
- The minimum number of
- insertions,
- deletions or
- substitutions (reversals)
- necessary to transform one string into another,
computed with a dynamic programming algorithm.
- e.g. ED(ocidente, oriente) = 2
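The edit distance described on this slide can be sketched in a few lines of Python. This is a minimal illustration of the standard dynamic programming formulation, not the authors' implementation; the function name `edit_distance` and the unit cost for every operation are assumptions made for the example.

```python
def edit_distance(source: str, target: str) -> int:
    """Levenshtein distance with unit costs (assumed for this sketch)."""
    # Row 0: turning the empty prefix of source into target[:j] takes j insertions.
    previous = list(range(len(target) + 1))
    for i, s_char in enumerate(source, start=1):
        # Column 0: turning source[:i] into the empty string takes i deletions.
        current = [i]
        for j, t_char in enumerate(target, start=1):
            substitution_cost = 0 if s_char == t_char else 1
            current.append(min(
                previous[j] + 1,                      # delete s_char
                current[j - 1] + 1,                   # insert t_char
                previous[j - 1] + substitution_cost,  # substitute or keep
            ))
        previous = current
    return previous[-1]


# The pair from the slide: substitute c -> r, then delete d.
print(edit_distance("ocidente", "oriente"))  # 2
```

Only two rows of the table are kept at a time, so memory grows with the length of the second string rather than with the product of both lengths.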
7 Similarity Measures II
- Lexical Level
- String Matching (Maedche and Staab 2002) - SM
- e.g.
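As an illustration of SM, the sketch below assumes the definition published by Maedche and Staab (2002): the edit distance is subtracted from the length of the shorter string, divided by that length, and clipped at zero, giving a score in [0, 1]. The function names and the handling of empty strings are assumptions of this example, and the pair used is borrowed from the edit distance slide.

```python
def edit_distance(source: str, target: str) -> int:
    """Levenshtein distance with unit costs, computed row by row."""
    previous = list(range(len(target) + 1))
    for i, s_char in enumerate(source, start=1):
        current = [i]
        for j, t_char in enumerate(target, start=1):
            cost = 0 if s_char == t_char else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1]


def string_matching(first: str, second: str) -> float:
    """SM score in [0, 1]; 1.0 means identical strings."""
    shorter = min(len(first), len(second))
    if shorter == 0:
        # Assumption of this sketch: an empty string matches nothing.
        return 0.0
    return max(0.0, (shorter - edit_distance(first, second)) / shorter)


# ED(ocidente, oriente) = 2 and the shorter string has 7 characters,
# so the score is (7 - 2) / 7.
print(round(string_matching("ocidente", "oriente"), 3))  # 0.714
```

Dividing by the shorter length makes the measure strict for short strings: two edits out of seven characters already lower the score to about 0.71.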
8 Lexical Similarity Measure I
The index k runs up to the number of words in the term
with the fewest words.
9 Lexical Similarity Measure II
10 Characteristics of the Experiment
- Multidomain Experiment
- OSs from
- Brazilian Senate Thesaurus (OSA)
- University of São Paulo (USP) Thesaurus (OSB)
- single-word and multiword terms
- validation and evaluation
- Terms in OSA were categorized into two sets for each
phase
- Terms in OSB remained without categorization
during both phases.
11 First Letter Heuristic
12 Relevant Numbers
- Evaluation phase
- single-word terms
- 1,823 from OSA
- 7,039 from OSB
- multiword terms
- 4,701 from OSA
- 16,986 from OSB
- 2,887 pairs of terms judged similar (using SM or LS)
13 Evaluation
14 Analysis of Group G5
- Most of the pairs analyzed during the evaluation
phase (73%)
- 907 single-word terms
- 1,211 multiword terms
- Excerpt from group G5: single-word and multiword
terms
15 Same Domain Experiment I
- GEODESC Thesaurus (2,083 terms)
- USP Thesaurus (429 terms) belonging to the
Geosciences domain.
16 Same Domain Experiment II
Pairs of terms considered dissimilar by SM and
similar by LS
17 Same Domain Experiment III
Pairs of terms considered similar by both SM and LS
18 Same Domain Experiment IV
Multiword and single-word pairs of terms
19 Same Domain Experiment V
Contribution of the penalties without the first
letter heuristic
20 Final Remarks and Future Work
- About this work
- Creation, validation and evaluation of the LS measure
- One of the first efforts to deal with Portuguese
term collections
- Experiments with terms belonging to multidomain
as well as same-domain structures
- Future Work
- Application of the LS measure to other languages
- Treatment of the semantic-structural level in the
structures