Applying a Lexical Similarity Measure to Compare Portuguese Term Collections

1
Applying a Lexical Similarity Measure to Compare
Portuguese Term Collections
Programa de Pós-Graduação em Ciência da Computação
Faculdade de Informática
Pontifícia Universidade Católica - RS
  • Marcirio Silveira Chaves
  • Vera Lúcia Strube de Lima

2
Context I
  • Ontological Structures (OSs) and Ontologies
  • Reuse of Knowledge
  • Semantic Web

3
Context II
  • Mapping of OSs
  • Similarity between terms (distinct nuances)
  • Similarity Measures
  • Edit Distance
  • Two levels of similarity
  • The Portuguese language

4
Summary
  • Context
  • Research Question
  • Similarity Measures
  • Lexical Similarity Measure
  • Experiments
  • Final Remarks and Future Work

5
Research Question
  • How to map similar concepts between distinct
    ontological structures?
  • Hypothesis
  • There is a degree of similarity among independently
    created ontological structures, which can be detected
    in order to allow a mapping.

6
Similarity Measures I
  • Lexical Level
  • Edit Distance (Levenshtein 1966) - ED
  • The minimum number of insertions, deletions or
    substitutions (reversals) necessary to transform one
    string into another, computed with a dynamic
    programming algorithm.
  • e.g. ED(ocidente, oriente) = 2
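
The edit distance described on this slide can be sketched as follows. This is a minimal illustration of the standard dynamic programming formulation with unit costs, not the authors' implementation; the function name `edit_distance` is an assumption made for this example.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between a and b, with unit cost per operation."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into the empty string takes i deletions
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute ca by cb, or keep a match
            ))
        prev = curr
    return prev[-1]


# The slide's example: replace "c" with "r" and delete "d".
print(edit_distance("ocidente", "oriente"))  # 2
```

Keeping only two rows of the table is a design choice for brevity; it uses memory proportional to the length of the second string.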

7
Similarity Measures II
  • Lexical Level
  • String Matching (Maedche and Staab 2002) - SM
  • e.g.
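
The slide's own SM example is not preserved in this transcript. As a hedged sketch, assuming the commonly cited Maedche and Staab definition SM(a, b) = max(0, (min(|a|, |b|) - ED(a, b)) / min(|a|, |b|)), the measure could be computed as below. The function names and the handling of empty strings are assumptions made for this example only.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance with unit costs (two-row dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def string_matching(a: str, b: str) -> float:
    """SM score in [0, 1]: 1 means identical, 0 means no lexical match."""
    shorter = min(len(a), len(b))
    if shorter == 0:
        # Assumption: an empty string matches only another empty string.
        return 1.0 if a == b else 0.0
    return max(0.0, (shorter - edit_distance(a, b)) / shorter)


# Reusing the term pair from the edit distance slide: (7 - 2) / 7.
print(string_matching("ocidente", "oriente"))
```

Under this definition the score is normalized by the shorter string, so a long term compared with a very short one drops to 0 quickly.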

8
Lexical Similarity Measure I
k ranges up to the number of words in the term with
the fewest words.

9
Lexical Similarity Measure II
  • Example

10
Characteristics of the Experiment
  • Multidomain Experiment
  • OSs from
  • Brazilian Senate Thesaurus (OSA)
  • University of São Paulo (USP) Thesaurus (OSB)
  • single-word and multiword terms
  • validation and evaluation
  • Terms in OSA categorized into two sets for each
    phase
  • Terms in OSB remained without categorization
    during both phases.

11
First Letter Heuristic

12
Relevant Numbers
  • Evaluation phase
  • single-word terms
  • 1,823 from OSA
  • 7,039 from OSB
  • multiword terms
  • 4,701 from OSA
  • 16,986 from OSB
  • 2,887 pairs of terms found similar (using SM or LS)

13
Evaluation
  • Analysis of Results

14
Analysis of Group G5
  • Most of the pairs analyzed during the evaluation
    phase (73%)
  • 907 single-word terms
  • 1,211 multiword terms.
  • Extract of group G5 single-word and multiword
    terms
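
The figure of 73 on this slide is consistent with a percentage. As a quick arithmetic check, assuming the denominator is the 2,887 similar pairs reported under Relevant Numbers (the variable names are illustrative only):

```python
single_word_pairs = 907
multiword_pairs = 1211
similar_pairs_total = 2887  # assumption: total taken from the Relevant Numbers slide

g5_pairs = single_word_pairs + multiword_pairs
g5_share = g5_pairs / similar_pairs_total

# Prints the size of group G5 and its share of all similar pairs, in percent.
print(g5_pairs, round(100 * g5_share, 1))
```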

15
Same Domain Experiment I
  • GEODESC Thesaurus (2,083 terms)
  • USP Thesaurus (429 terms) belonging to the
    Geosciences domain.

16
Same Domain Experiment II
  • Analysis of Group A

Pairs of terms considered dissimilar by SM and
similar by LS

17
Same Domain Experiment III
  • Analysis of Group A

Pairs of terms considered similar by SM and LS

18
Same Domain Experiment IV
  • Analysis of Group B

Multiword and single-word pairs of terms

19
Same Domain Experiment V
  • Analysis of Group B

Contribution of the penalties without the first
letter heuristic

20
Final Remarks and Future Work
  • About this work
  • Creation, validation and evaluation of the LS measure
  • One of the first efforts to deal with Portuguese
    term collections
  • Experiments with terms from multidomain as well as
    same-domain structures
  • Future Work
  • Application of the LS measure to other languages
  • Treatment of the semantic-structural level of the
    structures