Website Term Browser Un sistema interactivo y multiling - PowerPoint PPT Presentation

About This Presentation
Title:

Website Term Browser Un sistema interactivo y multiling

Description:

Departamento de Lenguajes y Sistemas Informáticos, Universidad Nacional de Educación a Distancia. Doctoral thesis: Website Term Browser Un sistema interactivo y ... – PowerPoint PPT presentation

Slides: 46
Provided by: Anse151

Transcript and Presenter's Notes



1
Website Term Browser: An interactive and multilingual
text search system based on linguistic techniques
Departamento de Lenguajes y Sistemas Informáticos,
Universidad Nacional de Educación a Distancia
Doctoral thesis
  • Anselmo Peñas Padilla
  • Advisors:
  • Julio Gonzalo Arroyo
  • María Felisa Verdejo Maíllo

2
Structure
  • I. Problem definition and goals
  • II. Experiments in Lexical Ambiguity and Indexing
  • III. Website Term Browser
  • IV. Evaluation framework

3
Classic Information Retrieval
I. Problem definition and goals
  • Retrieve documents relevant to the user's information
    need
  • Presupposes:
  • Static information needs
  • Value is found in the retrieved set of documents
    (not in the searching process)
  • Ignores:
  • The task (purpose) that gives rise to the information need
  • Changes in the information need
  • Interactivity
  • Imprecise information needs
  • Users develop strategies without system aid
  • Help users to express and refine their
    information needs

Information need
4
Language barriers
I. Problem definition and goals
  • Problems in query formulation:
  • Users don't know the appropriate domain
    terminology
  • Users can't express their information need in a
    foreign language
  • Translinguality
  • Natural language characteristics:
  • Lexical ambiguity
  • Terminology variation
  • Help users to overcome language barriers

5
General approaches
I. Problem definition and goals
Information Retrieval
Natural Language Processing
6
Natural Language Processing
I. Problem definition and goals
  • Help users to express and refine their
    information needs?
  • Open field in IR
  • Help users to overcome language barriers?
  • Phrase extraction and normalization
  • Explicit disambiguation (POS, WSD)
  • Bad strategies, or too much error in automatic
    processing?
  • Conceptual indexing

7
Goals
I. Problem definition and goals
  • Study the role of automatic linguistic techniques
    within the classic IR model:
  • Phrase indexing, POS tagging, WSD
  • Semantic distinction of phrases
  • Viability of conceptual indexing
  • Section II: Experiments in Lexical Ambiguity and
    Indexing

8
Goals
I. Problem definition and goals
  • Develop a model
  • to help users to express and refine their
    information needs
  • to help users to overcome language barriers
  • Bringing the collection terminology to users
  • Morpho-syntactic, semantic and translingual
    variations
  • Without the need for thesaurus construction
  • Establish an appropriate evaluation framework
  • Sections III and IV: Website Term Browser

9
Proposed approach
Information Retrieval
Natural Language Processing
10
Structure
  • I. Problem definition and goals
  • II. Experiments in Lexical Ambiguity and Indexing
  • III. Website Term Browser
  • IV. Evaluation framework

11
Contents
II. Experiments in Lexical Ambiguity and Indexing
  • Morpho-syntactic ambiguity in IR
  • Phrase indexing
  • Semantic distinction of lexical compounds in IR
  • Conceptual indexing
  • ITEM Search Engine
  • Conclusions

IR-SEMCOR: hand-annotated test collection
12
Morpho-syntactic ambiguity in IR
II. Experiments in Lexical Ambiguity and Indexing
Texts:
  • ...particle crosses the wall...
  • ...canadian red cross...
  • ...boat to cross mississippi river...
13
(No Transcript)
14
Phrase indexing
II. Experiments in Lexical Ambiguity and Indexing
Texts:
  • ...a guide for the fisher who...
  • ...information on cat care...
  • ...arboreal carnivorous called fisher cat...
15
(No Transcript)
16
Semantic distinction of compounds
II. Experiments in Lexical Ambiguity and Indexing
Types of lexical compounds
  • Automatic classification through WordNet:
  • Endocentric: one component is a hypernym
  • Appositional: all components are hypernyms
  • Exocentric: no component is a hypernym
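
The three compound types above can be illustrated with a minimal Python sketch. It assumes the hypernyms of the whole compound have already been looked up (the slide names WordNet as the source; here a hand-made set stands in for it). The function name and the example data are hypothetical, and treating "some but not all components" as endocentric is an assumption for compounds with more than two components.

```python
def classify_compound(components, compound_hypernyms):
    """Classify a lexical compound by how many of its components
    are hypernyms of the compound as a whole.

    components: list of component words, e.g. ["apple", "tree"]
    compound_hypernyms: set of hypernyms of the whole compound
                        (hand-made here; assumed to come from WordNet)
    """
    hits = sum(1 for word in components if word in compound_hypernyms)
    if hits == 0:
        return "exocentric"      # no component is a hypernym
    if hits == len(components):
        return "appositional"    # every component is a hypernym
    return "endocentric"         # some, but not all, components are hypernyms


# Hypothetical examples with hand-written hypernym sets.
EXAMPLES = {
    ("apple", "tree"): {"tree", "plant"},
    ("player", "coach"): {"player", "coach", "person"},
    ("fisher", "cat"): {"marten", "carnivore", "mammal"},
}

for parts, hypernyms in EXAMPLES.items():
    print(" ".join(parts), "->", classify_compound(list(parts), hypernyms))
```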

17
(No Transcript)
18
Conceptual Indexing
II. Experiments in Lexical Ambiguity and Indexing
Texts:
  • ...spring... ...muelle...
  • ...spring... ...fountain... ...fuente...
  • ...spring... ...springtime... ...primavera...
Conceptual index: n03114639, n05727069, n09151839
Query: spring
  • This model can improve text retrieval (Gonzalo
    1998; Gonzalo 1999)
  • Depending on the WSD error rate
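
A minimal sketch of the idea, assuming error-free WSD: documents in any language are indexed by synset identifier rather than by word, so a disambiguated query retrieves documents across languages. The three identifiers are the ones shown on the slide; which sense each one denotes, the toy lexicon, the document ids and the function names are all assumptions made for illustration.

```python
from collections import defaultdict

# Synset ids from the slide; their assignment to senses is assumed.
COIL, FOUNTAIN, SEASON = "n03114639", "n05727069", "n09151839"

# Toy lexicon: (language, word) -> candidate synset ids.
LEXICON = {
    ("en", "spring"): [COIL, FOUNTAIN, SEASON],
    ("es", "muelle"): [COIL],
    ("en", "fountain"): [FOUNTAIN],
    ("es", "fuente"): [FOUNTAIN],
    ("en", "springtime"): [SEASON],
    ("es", "primavera"): [SEASON],
}


def build_conceptual_index(docs):
    """docs maps doc_id -> synset ids chosen by an (assumed perfect) WSD step."""
    index = defaultdict(set)
    for doc_id, synsets in docs.items():
        for synset in synsets:
            index[synset].add(doc_id)
    return index


def search(index, lang, word, sense=None):
    """Retrieve documents by concept. Without a chosen sense, every
    candidate synset of the word is used, so ambiguity hurts precision."""
    candidates = [sense] if sense else LEXICON.get((lang, word), [])
    hits = set()
    for synset in candidates:
        hits |= index.get(synset, set())
    return hits


DOCS = {"en-1": [COIL], "es-1": [COIL], "es-2": [FOUNTAIN], "es-3": [SEASON]}
INDEX = build_conceptual_index(DOCS)

print(sorted(search(INDEX, "en", "spring", sense=COIL)))  # disambiguated query
print(sorted(search(INDEX, "en", "spring")))              # ambiguous query
```

The ambiguous query matches every document, which is why the benefit of this model depends on the WSD error rate.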

19
Synset indexing with no errors in WSD
20
Conceptual Indexing
II. Experiments in Lexical Ambiguity and Indexing
  • Although explicit disambiguation strategies
    applied to indexing:
  • POS tagging
  • Phrase indexing
  • Word Sense Disambiguation
  • don't produce a significant improvement in IR,
  • conceptual indexing based on synsets:
  • Needs automatic WSD accuracy near the
    state of the art (60%)
  • Permits Cross-Language Information Retrieval
  • Qualitative evaluation justifies developing a
    prototype

21
  • Textual representation: the query is translated into
    the target language
  • Conceptual representation: query and documents are
    compared at a conceptual level
  • Selection of newspaper determines the target
    language
  • Selection of query language
  • Selection of WSD strategy
  • Retrieved documents
22
ITEM Search Engine
II. Experiments in Lexical Ambiguity and Indexing
  • Conceptual indexing seems attractive, but there are
    some unsolved challenges
  • Low accuracy in Word Sense Disambiguation due to:
  • Unrestricted domains in EWN (EuroWordNet)
  • Fine-grained sense distinctions
  • Indexing units ≠ translation units
  • Loss of information in word-by-word
    disambiguation
  • High cost, low benefit
  • Users perceive a slower and less transparent
    system

23
Conclusions
II. Experiments in Lexical Ambiguity and Indexing
  • Don't subordinate NLP to the classic IR model
  • Even an improvement of 10% wouldn't change users'
    perception
  • Think of users
  • Find new paradigms in Information Access
  • At a higher level, closer to users
  • Consider users' tasks
  • Consider users' interaction
  • New places for NLP techniques in IR
  • Interaction over partial NLP processing
  • A proposal: Terminology Retrieval and Term Browsing

24
Structure
  • I. Problem definition and goals
  • II. Experiments in Lexical Ambiguity and Indexing
  • III. Website Term Browser
  • IV. Evaluation framework

25
Contents
III. Website Term Browser
  • Terminology Retrieval
  • Term extraction
  • Indexing
  • Retrieval model
  • Query expansion and translation
  • Website Term Browser interface

26
Terminology Retrieval
III. Transition to an interactive model
  • Term Browsing
  • Navigate through relevant terminology
  • Access information from retrieved terms
  • Terminology Retrieval
  • Retrieve relevant terms related to the query
  • Phrase extraction
  • Phrase indexing
  • Phrase retrieval
  • Recall is more important than precision in term
    extraction
  • Relaxing linguistic processing is possible
  • Premise: don't lose phrases

27
Term extraction
III. Transition to an interactive model
  • Syntactic pattern (Spanish, English, French,
    Italian, Catalan)
  • phr_content phr_closed phr_content
    phr_content
  • phr_content: noun, adjective, number, infinitive,
    participle
  • phr_closed: article, preposition, conjunction
  • Needs POS tagging
  • High computational cost
  • Tagging oriented to phrase detection
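
A recall-oriented extractor along these lines can be sketched in Python. The sketch assumes the pattern means "a phrase starts and ends with a phr_content word and may contain phr_closed words in between", and that each token already carries a coarse tag; the tag names, the one-letter encoding and the function name are hypothetical. Only maximal matches are returned, not their sub-phrases.

```python
import re

CONTENT = {"noun", "adjective", "number", "infinitive", "participle"}
CLOSED = {"article", "preposition", "conjunction"}


def extract_phrases(tagged_tokens):
    """tagged_tokens: list of (word, tag) pairs.

    Each tag is encoded as one letter (C = content, F = closed-class,
    O = other) so the syntactic pattern becomes a regular expression
    over the tag string.
    """
    codes = "".join(
        "C" if tag in CONTENT else "F" if tag in CLOSED else "O"
        for _, tag in tagged_tokens
    )
    phrases = []
    # A content word, then one or more groups of
    # (optional closed-class words + content word).
    for match in re.finditer(r"C(?:F*C)+", codes):
        words = [word for word, _ in tagged_tokens[match.start():match.end()]]
        phrases.append(" ".join(words))
    return phrases


sentence = [
    ("la", "article"), ("degradación", "noun"), ("de", "preposition"),
    ("la", "article"), ("capa", "noun"), ("de", "preposition"),
    ("ozono", "noun"), ("aumenta", "verb"),
]
print(extract_phrases(sentence))
```

Because the tagger only needs to separate content words from closed-class words, a cheap tagging step oriented to phrase detection is enough for this kind of matching.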

28
Indexing
III. Transition to an interactive model
  • Steps:
  • Text pre-processing and listing of words
  • Word tagging (oriented to phrase detection)
  • Phrase detection and lemmatization of components
  • Document indexing and statistics (document
    frequency)
  1. Phrase selection (subsumption and lexicalization
    degree)
  2. Phrase indexing
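
The last steps can be sketched as a small phrase index in Python. This is an illustration under assumptions, not the thesis implementation: phrases are taken as tuples of already lemmatized content words, the phrase selection step is skipped, and ranking retrieved terms by matched lemmas and then by document frequency is a hypothetical choice. All names are made up for the example.

```python
from collections import defaultdict


def build_phrase_index(doc_phrases):
    """doc_phrases: {doc_id: iterable of phrases}, each phrase a tuple
    of lemmatized content words.

    Returns (phrase_docs, lemma_phrases):
      phrase_docs   phrase -> set of documents containing it
      lemma_phrases lemma  -> set of phrases containing that lemma
    """
    phrase_docs = defaultdict(set)
    lemma_phrases = defaultdict(set)
    for doc_id, phrases in doc_phrases.items():
        for phrase in phrases:
            phrase_docs[phrase].add(doc_id)
            for lemma in phrase:
                lemma_phrases[lemma].add(phrase)
    return phrase_docs, lemma_phrases


def retrieve_terms(query_lemmas, phrase_docs, lemma_phrases):
    """Return phrases sharing at least one lemma with the query, ordered by
    number of matched lemmas, then document frequency, then alphabetically."""
    scores = defaultdict(int)
    for lemma in set(query_lemmas):
        for phrase in lemma_phrases.get(lemma, ()):
            scores[phrase] += 1
    return sorted(scores, key=lambda p: (-scores[p], -len(phrase_docs[p]), p))


COLLECTION = {
    "d1": [("capa", "ozono"), ("agujero", "ozono")],
    "d2": [("capa", "ozono")],
    "d3": [("medio", "ambiente")],
}
PHRASE_DOCS, LEMMA_PHRASES = build_phrase_index(COLLECTION)
print(retrieve_terms(["ozono"], PHRASE_DOCS, LEMMA_PHRASES))
```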

29
Retrieval model
III. Transition to an interactive model
query
30
Query expansion and translation
III. Transition to an interactive model
Query words, each with its expansion (Spanish) and translation (English):
  • Tratados
    Expansion: acuerdo, capitulación, concertación, convenio, cuidar, pacto, manejar, procesar
    Translation: accord, discourse, handle, manage, pact, process, treat, treatise, treaty
  • de
  • Prohibición
    Expansion: embargo, entredicho, interdicción, interdicto, proscripción
    Translation: ban, interdiction, prohibition, proscription
  • de
  • Pruebas
    Expansion: cata, catadura, degustación, ensayo, escandallo, experimento, gustación, muestreo, tanteo
    Translation: demonstrate, establish, exhibit, experiment, experimentation, fall, fitting, indicate, point, present, proof, prove, run, sample, sampling, shew, show, taste, test, trial, try
  • Nucleares
    Expansion: nuclear
    Translation: nuclear
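
One plausible way to use such candidate lists, sketched in Python: score each phrase found in the collection by how many query words have at least one candidate inside it, so that noisy word-by-word translations only matter when they co-occur in a real phrase. The table holds a subset of the translations listed on the slide; the scoring rule, the function names and the sample phrases are assumptions.

```python
# Candidate English translations per query word (subset of the slide's lists).
TRANSLATIONS = {
    "tratados": {"accord", "pact", "treatise", "treaty"},
    "prohibición": {"ban", "interdiction", "prohibition", "proscription"},
    "pruebas": {"experiment", "proof", "sample", "test", "trial"},
    "nucleares": {"nuclear"},
}


def coverage(phrase_lemmas, candidate_sets):
    """Number of query words with at least one candidate in the phrase."""
    lemmas = set(phrase_lemmas)
    return sum(1 for candidates in candidate_sets if lemmas & candidates)


def rank_phrases(phrases, query_words, table=TRANSLATIONS):
    """Order phrases by coverage (highest first); drop phrases with none.
    Query words missing from the table (e.g. "de") are ignored."""
    candidate_sets = [table[word] for word in query_words if word in table]
    scored = [(coverage(phrase, candidate_sets), phrase) for phrase in phrases]
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [phrase for score, phrase in scored if score > 0]


QUERY = ["tratados", "de", "prohibición", "de", "pruebas", "nucleares"]
PHRASES = [
    ("peace", "treaty"),
    ("nuclear", "test", "ban", "treaty"),
    ("ozone", "hole"),
]
print(rank_phrases(PHRASES, QUERY))
```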
31
Query in Spanish
Hierarchy of terms
Ranking of documents
English
Spanish
Catalan
32
(No Transcript)
33
Structure
  • I. Problem definition and goals
  • II. Experiments in Lexical Ambiguity and Indexing
  • III. Website Term Browser
  • IV. Evaluation framework

34
Evaluation of Terminology Retrieval
IV. Evaluation framework
  • Compare:
  • Terminology Retrieval
  • Hand-crafted Multilingual Thesaurus

35
(No Transcript)
36
Evaluation of Terminology Retrieval
IV. Evaluation framework
  • Recall of mono-lexical terms (lemmas):
  • Monolingual: 85-95%
  • Translingual: 55-65%
  • Recall of poly-lexical terms (phrases):
  • Monolingual: 40-65%
  • Translingual: 10-45%
  • Loss of recall due to:
  • Phrase extraction (mainly POS tagging): 3-17%
  • Phrase indexing (mainly lemmatization): 2-34%
  • Phrase selection: 12-37%
  • Lack of connections between different languages
    in EWN
  • Lack of adjective hierarchies in EWN

37
Usefulness of Term Browsing
IV. Evaluation framework
  • Previous experiences in interactivity evaluation
    (TREC) need:
  • Precise queries
  • Laboratory conditions
  • Controlled users
  • There aren't differences between systems
  • Identifying the better approaches is not possible
  • A new framework is proposed here:
  • Real work environment
  • Record users' interaction
  • Compare the use of:
  • Term area provided by WTB
  • Document ranking provided by Google

38
QUERY
RECONSULT WITH TERM
EXPLORE TERM
EXPLORE DOCUMENT
39
Usefulness of Term Browsing
IV. Evaluation framework
  • 2318 sessions with interaction
  • An average of 5.16 actions per session
  • EXPLORE_TERM is used in 65%

LOG FILE
539 2001/03/14 12:10:33 QUERY UNED 193.146.241.164 ozone hole
2001/03/14 12:11:20 EXPLORE_TERM 539684 degradación de la capa de ozono
2001/03/14 12:11:29 EXPLORE_DOC http://www.uned.es/doctorado/0108.htm
...
EXPLORE_TERM RECONSULT EXPLORE_DOC ...
40
Usefulness of Term Browsing
IV. Evaluation framework
                                  All queries   1-word queries   >1-word queries
First action after QUERY
  EXPLORE_DOC                         42%            47%              39%
  EXPLORE_TERM                        51%            45%              55%
  RECONSULT                            7%             8%               6%
Last action before finishing the session with EXPLORE_DOC
  QUERY                               50%            57%              46%
  EXPLORE_TERM                        44%            38%              47%
  RECONSULT                            6%             5%               7%
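
Figures of this kind can be derived from the interaction log with a few lines of Python. The sketch below assumes each session has already been parsed into an ordered list of action names (QUERY, EXPLORE_TERM, EXPLORE_DOC, RECONSULT); the function name and the sample sessions are hypothetical and do not reproduce the thesis data.

```python
from collections import Counter


def first_action_after_query(sessions):
    """sessions: list of sessions, each an ordered list of action names.

    For each session, take the action that immediately follows the first
    QUERY and return the distribution as rounded percentages.
    """
    counts = Counter()
    for actions in sessions:
        for previous, current in zip(actions, actions[1:]):
            if previous == "QUERY":
                counts[current] += 1
                break  # only the first QUERY of the session counts
    total = sum(counts.values())
    if total == 0:
        return {}
    return {action: round(100 * n / total) for action, n in counts.items()}


SAMPLE_SESSIONS = [
    ["QUERY", "EXPLORE_TERM", "EXPLORE_DOC"],
    ["QUERY", "EXPLORE_DOC"],
    ["QUERY", "EXPLORE_TERM", "RECONSULT", "EXPLORE_DOC"],
    ["QUERY", "RECONSULT"],
]
print(first_action_after_query(SAMPLE_SESSIONS))
```

Splitting the sessions by the number of words in the initial query before calling the function would give the per-column breakdown.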

41
Structure
  • I. Problem definition and goals
  • II. Experiments in Lexical Ambiguity and Indexing
  • III. Website Term Browser
  • IV. Evaluation framework

42
Conclusions
  • Lexical ambiguity has been studied using
    IR-Semcor
  • Evaluation free of automatic processing errors
  • Explicit disambiguation at indexing doesn't seem
    to improve retrieval (POS, WSD, semantic
    distinction of lexical compounds)
  • Conceptual indexing based on EuroWordNet synsets
    still has some challenges to solve
  • Think of users to find new places for NLP

43
Conclusions
  • A search model based on extraction, retrieval and
    browsing of terminology has been developed
  • User oriented
  • Interaction over terminological information
  • Intermediate way between free searching and
    thesaurus-guided searching
  • Without the need for thesaurus construction
  • Bringing the collection terminology to users
  • Morpho-syntactic and semantic variations
  • Translinguality

44
Conclusions
  • An evaluation framework for Terminology Retrieval
    and Term Browsing has been established
  • Points the way to improving Terminology Retrieval
  • Users appreciate Term Browsing
  • WTB phrasal information can substantially
    complement the document ranking provided by the
    search engines

45
Website Term Browser: An interactive and multilingual
text search system based on linguistic techniques
Departamento de Lenguajes y Sistemas Informáticos,
Universidad Nacional de Educación a Distancia
Doctoral thesis
  • Anselmo Peñas Padilla
  • Advisors:
  • Julio Gonzalo Arroyo
  • María Felisa Verdejo Maíllo