1
Corpus-based Terminology Extraction applied to
Information Access
Corpus Linguistics 2001, Lancaster, UK
  • Anselmo Peñas, Felisa Verdejo and Julio Gonzalo
  • NLP Group, Dpto. Lenguajes y Sistemas
    Informáticos,
  • UNED, Spain

2
Content
  • Introduction
  • Resources, Tools and Corpora
  • Terminology Extraction (TE)
  • Evaluation of the TE procedure
  • Terminology-based Information Access
  • Conclusions

3
Introduction: Framework
  • The European Treasury Browser (ETB) project
  • Web site of Educational Resources (primary and
    secondary school)
  • Context of New Technologies
  • Objective: to build the structures to organise
    and retrieve educational resources
  • Similar systems
  • The Educational Resources Information Centre
  • The British Education Index

6
Introduction: use of Thesauri
  • Thesauri
  • Definition: controlled vocabulary, structured by
    relations
  • Structure: descriptors and relations (NT, BT,
    RT)
  • Existing educational thesauri
  • Don't cover primary and secondary school
    vocabulary within the new technologies context
  • Construction of a multilingual thesaurus is
    needed for the ETB project purposes

Terminology Lists
7
Objectives of the work
  • To build the Spanish list of candidate terms for
    the ETB multilingual thesaurus.
  • To develop a general procedure to obtain
    terminology lists
  • In an automatic way
  • Independently of the application domain
  • To explore effective ways of Information
    Retrieval
  • using the terminology lists instead of a thesaurus
  • to bridge the gap between users and collection
    languages

9
Resources and Tools
  • Resources
  • Semantic network: EuroWordNet
  • Monolingual dictionary (VOX)
  • Bilingual dictionary (VOX)
  • Tools
  • Tokeniser
  • Morphological analyser
  • POS tagger
  • Shallow parser (based on syntactic patterns)

10
Corpora
  • Corpus of educational resources
  • 1,075 documents (670,646 words) from
  • Programa de Nuevas Tecnologías
  • (http://www.pntic.mec.es/main_recursos.html)
  • Aldea Global
  • (http://sauce.pntic.mec.es/alglobal)
  • Corpus of international news
  • 7,364 documents (2.9 million words)
  • (http://www.elpais.es/internac)
  • Pre-processing
  • (html tags treatment, language detection,
    detection of repeated pages and chunks, etc.)
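
The pre-processing steps listed above could be sketched roughly as follows. This is a minimal illustration using only the Python standard library, not the authors' pipeline: the names `TextExtractor`, `html_to_text` and `drop_repeated_pages` are hypothetical, exact-match hashing is an assumed stand-in for the repeated-page detection, and language detection is left out.

```python
import hashlib
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.parts.append(data)


def html_to_text(html):
    """Strip tags and collapse runs of whitespace into single spaces."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return re.sub(r"\s+", " ", " ".join(parser.parts)).strip()


def drop_repeated_pages(pages):
    """Keep the text of the first page seen for each distinct (lower-cased) text.

    Assumption: two pages count as repeated only if their extracted text is identical.
    """
    seen = set()
    unique = []
    for html in pages:
        text = html_to_text(html)
        digest = hashlib.sha1(text.lower().encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(text)
    return unique
```

Detecting repeated chunks inside pages would need the same hashing applied to paragraphs or sentences rather than whole pages.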

12
Terminology Extraction (TE)
  • Terminology List
  • List of mono-lexical and poly-lexical terms
    that are common in a specific domain
  • Steps of Terminology Extraction
  • 1. Term detection
  • 2. Term weighting
  • 3. Term selection

13
1. Term Detection (mono-lexical)
  • (Over both corpora, Educational Resources and
    International News)
  • Processing
  • Tokenising
  • Lemmatising, Tagging
  • Removal of erroneous strings, abbreviations and
    words from other languages
  • Extraction of nouns, verbs and adjectives
  • Result
  • List of candidate lemmas with their
  • Term frequency (any form) in both collections
  • Document frequency in both collections
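
The result of this step, a list of lemmas with term and document frequencies, could be computed along these lines. The sketch assumes the text has already been tokenised, lemmatised and tagged by external tools; the function name `count_lemmas` and the tag labels `N`, `V`, `A` are assumptions made for the example.

```python
from collections import Counter

# Assumed tag labels for nouns, verbs and adjectives.
CONTENT_TAGS = {"N", "V", "A"}


def count_lemmas(documents):
    """documents: list of documents, each a list of (lemma, tag) pairs.

    Returns (term_frequency, document_frequency), both Counters keyed by lemma.
    Only nouns, verbs and adjectives are counted.
    """
    tf, df = Counter(), Counter()
    for doc in documents:
        lemmas = [lemma for lemma, tag in doc if tag in CONTENT_TAGS]
        tf.update(lemmas)       # every occurrence, whatever the surface form
        df.update(set(lemmas))  # at most once per document
    return tf, df
```

Running it once over the educational corpus and once over the news corpus gives the two pairs of frequencies the slide mentions.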

14
1. Term Detection (poly-lexical)
  • (Over Educational Resources corpus)
  • Processing
  • Tokenising, Lemmatising, Tagging
  • Shallow parsing (Syntactic pattern recognition)

Syntactic patterns for Spanish terminological phrases:
N N N A N A Prep N A N A Prep Art N A N A Prep V N A Prep V N A
  • Result
  • List of candidate terminological phrases
  • Term frequency in the collection
  • Document frequency in the collection
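
Shallow parsing by syntactic patterns can be illustrated as matching tag sequences over tagged text. The sketch below is only an approximation of the idea: the two patterns in `PATTERNS` are illustrative assumptions rather than the slide's full inventory, and `extract_phrases` and `count_phrases` are hypothetical names.

```python
from collections import Counter

# Illustrative patterns only (noun + adjective, noun + preposition + noun).
PATTERNS = [("N", "A"), ("N", "Prep", "N")]


def extract_phrases(tagged_doc, patterns=PATTERNS):
    """Return every lemma sequence whose tag sequence equals one of the patterns.

    tagged_doc: list of (lemma, tag) pairs for one document.
    """
    tags = [tag for _, tag in tagged_doc]
    phrases = []
    for start in range(len(tagged_doc)):
        for pattern in patterns:
            end = start + len(pattern)
            if tuple(tags[start:end]) == pattern:
                phrases.append(" ".join(lemma for lemma, _ in tagged_doc[start:end]))
    return phrases


def count_phrases(documents, patterns=PATTERNS):
    """Term frequency and document frequency of the candidate phrases."""
    tf, df = Counter(), Counter()
    for doc in documents:
        found = extract_phrases(doc, patterns)
        tf.update(found)
        df.update(set(found))
    return tf, df
```

For example, `extract_phrases([("proyecto", "N"), ("curricular", "A")])` returns `["proyecto curricular"]`.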

15
2. Term weighting
  • Empirical measure
  • Proportional to
  • term frequency
  • document frequency
  • Inversely proportional to
  • term frequency in other domain
  • Normalisation
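
The slide gives the properties of the weight but not its formula. One function with those properties, shown purely as an assumption, multiplies term frequency by document frequency in the study domain, divides by the term frequency in the other domain (plus one, to avoid division by zero), and normalises by the largest value. The name `weight_terms` is hypothetical.

```python
def weight_terms(tf_domain, df_domain, tf_other):
    """Assumed weighting, not the authors' formula.

    tf_domain, df_domain: frequencies in the study domain (dicts keyed by term).
    tf_other: term frequencies in the contrast domain.
    Returns weights normalised to the range [0, 1].
    """
    raw = {
        term: tf_domain[term] * df_domain.get(term, 0) / (1 + tf_other.get(term, 0))
        for term in tf_domain
    }
    top = max(raw.values(), default=0)
    return {term: (score / top if top else 0.0) for term, score in raw.items()}
```

With equal frequencies in the study domain, a term that is also frequent in the news corpus ends up with a much lower weight than one that is absent from it.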

16
3. Term Selection
  • Removal of infrequent terms in the study domain
  • Removal of very frequent terms in other domains
  • Ranking of terms according to their weight
  • Selection of top terms in the terminology list
    (thresholds to obtain 2,000 / 3,000 terms from
    the ≈75,000 detected terms)
  • Addition of phrases with relevant components
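
The selection steps above amount to two filters, a ranking, a cut-off, and a final pass over the phrases. A minimal sketch follows; the threshold values, the function names and the rule "a phrase is added if any of its words is a selected term" are all assumptions chosen for the example.

```python
def select_terms(weights, df_domain, tf_other, min_df=2, max_tf_other=100, top_n=3000):
    """Filter, rank by weight and truncate.

    min_df, max_tf_other and top_n are placeholder thresholds, not the slide's values.
    """
    kept = [
        term for term in weights
        if df_domain.get(term, 0) >= min_df and tf_other.get(term, 0) <= max_tf_other
    ]
    kept.sort(key=lambda term: (-weights[term], term))  # highest weight first
    return kept[:top_n]


def add_phrases(selected, phrases):
    """Append phrases that contain at least one selected mono-lexical term."""
    chosen = set(selected)
    extra = [
        phrase for phrase in phrases
        if phrase not in chosen and any(word in chosen for word in phrase.split())
    ]
    return selected + extra
```

Lowering `top_n` shortens the list that documentalists must review, at the cost of dropping some valid terms.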

18
Evaluation: Visual exploration
  • Automatic generation of result pages in HTML
  • Purpose
  • To help in the decisions of the prototype
    development
  • To evaluate the measures and techniques and to
    suggest improvements or modifications
  • To give further information to documentalists in
    order to assist final decisions in thesaurus
    construction

21
Evaluation: Visual exploration
22
Evaluation: Precision
Proyectos curriculares (Proyecto curricular)
  • Manual classification of the 2,856 selected terms

Proyecto curricular
Ciencias sociales
Sistema operativo
Profesorado materiales ?
Alumnos ingleses
Biblioteca nacional
With little effort, a large number of accurate
terms are proposed to documentalists
23
Evaluation: Precision
  • With a lower number of candidates, the precision
    increases
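
The relation between list size and precision can be checked by computing precision at several cut-offs of the ranked list. The sketch below assumes a set of manually accepted terms is available; `precision_at` is a hypothetical name and the data in the usage note are placeholders, not the evaluation results.

```python
def precision_at(ranked_terms, accepted, cutoff):
    """Fraction of the first `cutoff` ranked terms that were manually accepted."""
    top = ranked_terms[:cutoff]
    if not top:
        return 0.0
    return sum(1 for term in top if term in accepted) / len(top)
```

With the placeholder ranking `["t1", "t2", "t3", "t4"]` and accepted set `{"t1", "t2", "t3"}`, the precision is 1.0 at cut-off 2 and 0.75 at cut-off 4.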

25
Terminology-based Information Access
  • Terminology Extraction in Information Retrieval
    provides
  • At indexing: a way to add poly-lexical terms to
    the indexes without an explosion of n-grams
  • Term browsing: a way to navigate through the
    terminology and access the documents from the
    terms (without the use of thesauri)

26
Terminology-based Information Access
  • A difference with TE: terminology list truncation
  • (as the query gives the relevant terms, the task
    is now concerned with recall rather than precision
    of terms)
  • A new task: to retrieve terminology
  • Poly-lexical terms are retrieved from
    mono-lexical ones
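
Retrieving poly-lexical terms from mono-lexical ones can be pictured as an inverted index from words to the phrases that contain them. The following sketch is one possible reading of that idea, with hypothetical names (`build_term_index`, `retrieve_terms`) and a simple ranking by number of matching query words that the slides do not specify.

```python
from collections import defaultdict


def build_term_index(phrase_docs):
    """phrase_docs: dict mapping each poly-lexical term to the ids of documents containing it.

    Returns a dict mapping each component word to the set of phrases that contain it.
    """
    by_word = defaultdict(set)
    for phrase in phrase_docs:
        for word in phrase.split():
            by_word[word].add(phrase)
    return by_word


def retrieve_terms(query_words, by_word):
    """Return phrases sharing a word with the query, best matches first."""
    hits = defaultdict(int)
    for word in set(query_words):
        for phrase in by_word.get(word, ()):
            hits[phrase] += 1
    return sorted(hits, key=lambda phrase: (-hits[phrase], phrase))
```

Because no phrase is discarded by a threshold, the lookup favours recall; the document ids stored in `phrase_docs` then give access to the documents from each retrieved term.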

29
Terminology-based Information Access
  • Terminology retrieval
  • To bridge the gap between
  • Collection terminology
  • Query terms
  • Requires
  • Query expansion
  • Query translation
  • But produces noise in the retrieval
  • However, phrases provide an excellent way to
    reduce ambiguity (Ballesteros & Croft, 1998)

31
Terminology-based Information Access
  • Query: Tratados de Prohibición de Pruebas Nucleares
  • Tratados
  • Expansion: acuerdo, capitulación, concertación,
    convenio, cuidar, pacto, manejar, procesar
  • Translation: accord, discourse, handle, manage,
    pact, process, treat, treatise, treaty
  • Prohibición
  • Expansion: embargo, entredicho, interdicción,
    interdicto, proscripción
  • Translation: ban, interdiction, prohibition,
    proscription
  • Pruebas
  • Expansion: cata, catadura, degustación, ensayo,
    escandallo, experimento, gustación, muestreo,
    tanteo
  • Translation: demonstrate, establish, exhibit,
    experiment, experimentation, fall, fitting,
    indicate, point, present, proof, prove, run,
    sample, sampling, shew, show, taste, test, trial,
    try
  • Nucleares
  • Expansion: nuclear
  • Translation: nuclear
  • Nuclear test ban treaty?
  • Nuclear fitting interdiction manage?
  • Nuclear taste proscription process?
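
The example shows why word-by-word expansion and translation is noisy: every combination of alternatives is a candidate. One way to use phrases for disambiguation, sketched here under stated assumptions, is to keep only the combinations that match a term actually present in the collection terminology. The function name, the order-insensitive matching and the sample terminology list are hypothetical.

```python
from itertools import product


def filter_by_terminology(alternatives, known_terms):
    """Keep word combinations that correspond to a known poly-lexical term.

    alternatives: one list of candidate words per query word.
    known_terms: phrases extracted from the collection.
    Matching ignores word order (an assumption made to cope with
    differences between Spanish and English phrase structure).
    """
    known = {frozenset(term.lower().split()): term for term in known_terms}
    matches = []
    for combo in product(*alternatives):
        term = known.get(frozenset(combo))
        if term is not None and term not in matches:
            matches.append(term)
    return matches


# A few of the translations from the slide, one list per query word.
alternatives = [
    ["treaty", "manage", "process"],
    ["ban", "interdiction", "proscription"],
    ["test", "fitting", "taste"],
    ["nuclear"],
]
# Hypothetical collection terminology.
known_terms = ["nuclear test ban treaty", "nuclear power plant"]
print(filter_by_terminology(alternatives, known_terms))
```

Of the 27 combinations in this reduced example, only the one corresponding to "nuclear test ban treaty" survives the filter.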

34
Conclusions
  • Extraction of relevant terms in Spanish for the
    ETB project domain (primary and secondary school
    / new technologies)
  • Automatic process using free resources such as
    web pages
  • Exploring contexts and statistical data via
    Internet
  • Development of a search engine based on
    terminology extraction
  • Using terminology lists in an intermediate way
    between free-searching and thesaurus-guided
    searching
  • Without the need for thesaurus construction
  • Bridging the distance between the terms used in
    the query and the terminology used in the
    collection (even in different languages)

35
Thanks for your attention