A Suite to Compile and Analyze an LSP Corpus

Learn more at: http://www.lrec-conf.org
1
A Suite to Compile and Analyze an LSP Corpus
  • 6th International Conference
  • on Language Resources and Evaluation
  • LREC 2008

Rogelio Nazar, Jorge Vivaldi, M. Teresa Cabré
{rogelio.nazar, jorge.vivaldi, teresa.cabre}@upf.edu
2
Introduction
  • This system (JAGUAR) is a set of tools for
    compiling and exploring an LSP corpus from the
    web
  • http://jaguar.iula.upf.edu
  • Usage examples:
      • Terminology extraction
      • Bilingual lexicon extraction
      • Neologism extraction
  • Architecture: a system divided into two main
    modules
      • Compilation of an LSP corpus from the web
      • Analysis of the corpus with statistical
        techniques

3
Module 1: Compilation of an LSP corpus from the web
  • Document retrieval by querying search engines
  • Classification of the collection on the basis of
    two axes:
      • Degree of relevance to the topic
          • Possibility of corpus tuning with user
            feedback
      • Degree of specialization of the document
          • Structure of the document (abstract,
            introduction, etc.)
          • System for bibliographical references, etc.

The final classification is the result of
combining these factors.
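
The slides do not say how the two axes are combined. As a minimal sketch only, the example below assumes a weighted average of a relevance score and a specialization score; the `Document` class, its field names, the weight `alpha` and the score range are all hypothetical and are not taken from JAGUAR.

```python
from dataclasses import dataclass


@dataclass
class Document:
    # Hypothetical container; the field names are not from the slides.
    url: str
    relevance: float       # degree of relevance to the topic, in [0, 1]
    specialization: float  # degree of specialization, in [0, 1]


def combined_score(doc: Document, alpha: float = 0.5) -> float:
    """Weighted average of the two axes (an assumed combination rule)."""
    return alpha * doc.relevance + (1.0 - alpha) * doc.specialization


def rank_documents(docs, alpha=0.5):
    """Return documents sorted from highest to lowest combined score."""
    return sorted(docs, key=lambda d: combined_score(d, alpha), reverse=True)
```

With `alpha=0.5` both axes weigh equally; moving `alpha` towards 1 favours topical relevance over specialization.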
4
Module 1: Compilation of an LSP corpus from the web
Classification by degree of relevance to the topic
5
Module 1: Compilation of an LSP corpus from the web
Classification by degree of relevance to the
topic: co-occurrence graphs
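
The transcript does not describe how the co-occurrence graphs are built. The sketch below is one simple way to obtain such a graph, assuming sentence-level co-occurrence, whitespace tokenization and a frequency threshold; the function name and the `min_count` parameter are hypothetical.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_graph(sentences, min_count=2):
    """Build an undirected co-occurrence graph as {word: {neighbour: count}}.

    Two words are linked when they appear in the same sentence at least
    `min_count` times across the collection (an assumed criterion).
    """
    pair_counts = Counter()
    for sentence in sentences:
        # Sort the distinct words so each pair is always counted as (a, b).
        words = sorted(set(sentence.lower().split()))
        pair_counts.update(combinations(words, 2))

    graph = {}
    for (a, b), count in pair_counts.items():
        if count >= min_count:
            graph.setdefault(a, {})[b] = count
            graph.setdefault(b, {})[a] = count
    return graph
```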
6
Module 1: Compilation of an LSP corpus from the web
Evaluation of the document classification
Cumulative precision in the ranking of documents
for the term "spastic diplegia".
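
Cumulative precision over a ranking can be computed as the share of relevant documents among the top k, for each k. The sketch below assumes the ranking is given as a list of relevance judgements; the function name is hypothetical and the data behind the slide's figure is not reproduced here.

```python
def cumulative_precision(ranked_labels):
    """Precision after each position of a ranked list.

    `ranked_labels` holds True for a relevant document and False otherwise,
    ordered from the top of the ranking downwards.
    """
    precisions = []
    relevant_so_far = 0
    for rank, is_relevant in enumerate(ranked_labels, start=1):
        relevant_so_far += bool(is_relevant)
        precisions.append(relevant_so_far / rank)
    return precisions
```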
7
Module 1: Compilation of an LSP corpus from the web
Evaluation of the document classification
Precision and recall for the experiments.
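
For reference, precision and recall follow their standard definitions: the share of retrieved documents that are relevant, and the share of relevant documents that were retrieved. A minimal sketch, assuming documents are identified by hashable IDs (the function name is hypothetical):

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a set of retrieved document IDs."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall
```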
8
Module 1: Compilation of an LSP corpus from the web
Evaluation of the document classification
Probability distribution of precision as a random
variable (performance of 10,000 random
classifiers).
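
A baseline of this kind can be approximated by simulation. The sketch below assumes each random classifier selects a fixed number of documents uniformly at random from a labelled collection; the function names, the `seed` parameter and the selection scheme are assumptions, since the slides give only the number of classifiers.

```python
import random


def random_classifier_precisions(labels, n_selected, n_trials=10_000, seed=0):
    """Precision of `n_trials` classifiers that pick documents at random.

    `labels` is a list of booleans (True = relevant). Each trial selects
    `n_selected` documents uniformly without replacement.
    """
    rng = random.Random(seed)
    precisions = []
    for _ in range(n_trials):
        chosen = rng.sample(labels, n_selected)
        precisions.append(sum(chosen) / n_selected)
    return precisions


def empirical_p_value(observed_precision, random_precisions):
    """Share of random classifiers that match or beat the observed precision."""
    at_least = sum(p >= observed_precision for p in random_precisions)
    return at_least / len(random_precisions)
```

Comparing a real classifier's precision with this empirical distribution indicates how likely such a result would be by chance.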
9
Module 2: Analysis of the corpus with statistical
techniques
  • 1. Input from module 1 or from a user-compiled
    corpus
  • 2. Main functions
      • Measures of vocabulary richness
      • Analysis of sample representativeness
      • Automatic language recognition
      • KWIC search
      • N-gram extraction and sorting
      • Collocation extraction
      • Measures of association
      • Models of term distribution
      • Coefficients for vector comparison
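
To make a few of the listed functions concrete, the sketch below gives textbook versions of a vocabulary richness measure (type-token ratio), n-gram counting, KWIC search, an association measure (pointwise mutual information) and a vector comparison coefficient (cosine). These are not JAGUAR's implementations: all function names are hypothetical, and the choice of these particular measures is an assumption.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase whitespace tokenization (a deliberate simplification)."""
    return text.lower().split()


def type_token_ratio(tokens):
    """A simple vocabulary richness measure: distinct words / total words."""
    return len(set(tokens)) / len(tokens) if tokens else 0.0


def ngrams(tokens, n=2):
    """Count n-grams and return them sorted by descending frequency."""
    grams = zip(*(tokens[i:] for i in range(n)))
    return Counter(grams).most_common()


def kwic(tokens, keyword, width=3):
    """Keyword-in-context lines: `width` tokens on each side of each hit."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, tok, right))
    return lines


def pmi(tokens, w1, w2):
    """Pointwise mutual information of the adjacent pair (w1, w2), in bits."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    if not bigrams[(w1, w2)]:
        return float("-inf")
    p_pair = bigrams[(w1, w2)] / (len(tokens) - 1)
    p_w1 = unigrams[w1] / len(tokens)
    p_w2 = unigrams[w2] / len(tokens)
    return math.log2(p_pair / (p_w1 * p_w2))


def cosine(vec_a, vec_b):
    """Cosine coefficient between two sparse vectors stored as dicts."""
    dot = sum(vec_a[k] * vec_b.get(k, 0) for k in vec_a)
    norm_a = math.sqrt(sum(v * v for v in vec_a.values()))
    norm_b = math.sqrt(sum(v * v for v in vec_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0
```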

10
  • http://rc16.upf.es/jaguar

19
Conclusions
  • We have presented the JAGUAR system, a set of
    tools for compiling and exploring an LSP corpus
    from the web
  • The main characteristics of this suite are the
    following:
      • It can collect an LSP corpus from the web,
        ensuring thematic adequacy and an appropriate
        degree of specialization for a given domain
      • It offers tools to explore such a collection
        statistically through a friendly interface
      • It has also been conceived as a library
      • The original algorithms have been successfully
        evaluated
  • Its use saves time and effort in the analysis of
    a corpus and also offers new insights: a
    perspective on the data that is invisible to the
    naked eye.

20
Future Work
  • The project is now growing in different
    directions:
      • Progressive enhancement with new functions and
        algorithms
      • Conversion into a desktop application
21
Thanks!