Title: A Suite to Compile and Analyze an LSP Corpus
1 A Suite to Compile and Analyze an LSP Corpus
- 6th International Conference on Language Resources and Evaluation, LREC 2008
Rogelio Nazar, Jorge Vivaldi, M. Teresa Cabré
rogelio.nazar jorge.vivaldi teresa.cabre_at_upf.edu
2 Introduction
- This system (JAGUAR) is a set of tools for compiling and exploring an LSP corpus from the web
- http://jaguar.iula.upf.edu
- Usage Examples
- Terminology extraction
- Bilingual lexicon extraction
- Neologism extraction
- Architecture: a system divided into two main modules
- Compilation of an LSP corpus from the web
- Analysis of the corpus with statistical techniques
3 Module 1: Compilation of an LSP corpus from the web
- Document retrieval by querying search engines
- Classification of the collection on the basis of two axes
- Degree of relevance to the topic
- Possibility of corpus tuning with user feedback
- Degree of specialization of the document
- Structure of the document (abstract, introduction, etc.)
- System for bibliographical references, etc.
The final classification is the result of the combination of these factors.
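One way to picture how two axes could be merged into a single ranking is a weighted sum. The slides do not say how JAGUAR combines its factors, so the sketch below is only an illustration: the `ScoredDocument` class, the score ranges, and the equal weights are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class ScoredDocument:
    """Hypothetical container: both scores are assumed to lie in [0, 1]."""
    url: str
    relevance: float        # degree of relevance to the topic
    specialization: float   # degree of specialization of the document


def combined_score(doc, w_relevance=0.5, w_specialization=0.5):
    # Assumption: a plain weighted sum; the real system may combine differently.
    return w_relevance * doc.relevance + w_specialization * doc.specialization


def rank_documents(docs, **weights):
    # Highest combined score first.
    return sorted(docs, key=lambda d: combined_score(d, **weights), reverse=True)


docs = [
    ScoredDocument("doc_a", relevance=0.9, specialization=0.2),
    ScoredDocument("doc_b", relevance=0.7, specialization=0.8),
    ScoredDocument("doc_c", relevance=0.1, specialization=0.9),
]
ranking = [d.url for d in rank_documents(docs)]
print(ranking)
```

Changing the weights shifts the ranking toward topical relevance or toward specialization, which is one simple way user feedback could tune a corpus.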
4 Module 1: Compilation of an LSP corpus from the web
Classification by degree of relevance to the topic
5 Module 1: Compilation of an LSP corpus from the web
Classification by degree of relevance to the topic: co-occurrence graphs
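A minimal sketch of how a co-occurrence graph might be built from a document collection. It assumes whitespace tokenization and treats two terms as co-occurring when they appear in the same document; the function name `cooccurrence_graph`, the `min_count` threshold, and the toy documents are illustrative and not taken from JAGUAR.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_graph(documents, min_count=1):
    """Return {(term_a, term_b): count} for term pairs that share a document."""
    edges = Counter()
    for text in documents:
        # Each term counts once per document; sorting keeps pair order stable.
        terms = sorted(set(text.lower().split()))
        edges.update(combinations(terms, 2))
    # Keep only edges seen in at least min_count documents.
    return {pair: n for pair, n in edges.items() if n >= min_count}


docs = [
    "spastic diplegia cerebral palsy",
    "spastic diplegia treatment",
    "cerebral palsy treatment",
]
graph = cooccurrence_graph(docs, min_count=2)
print(graph)
```

Edges that survive the threshold connect terms that recur together across the collection, which is the kind of structure a relevance classifier could exploit.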
6 Module 1: Compilation of an LSP corpus from the web
Evaluation of the document classification
Cumulative precision in the ranking of documents for the term "spastic diplegia".
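Cumulative precision over a ranking can be computed as the share of relevant documents among those seen so far, recalculated at every rank position. The sketch below assumes binary relevance judgments; the sample judgments are invented for illustration and are not the results shown on the slide.

```python
def cumulative_precision(relevance_flags):
    """Precision after each rank: relevant documents seen / documents seen."""
    precisions = []
    hits = 0
    for rank, is_relevant in enumerate(relevance_flags, start=1):
        hits += bool(is_relevant)
        precisions.append(hits / rank)
    return precisions


# Hypothetical judgments for the top five ranked documents.
judgments = [True, True, False, True, False]
curve = cumulative_precision(judgments)
print(curve)
```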
7 Module 1: Compilation of an LSP corpus from the web
Evaluation of the document classification
Precision and recall for the experiments.
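For reference, precision and recall for a set of selected documents can be sketched as follows. The document identifiers are made up; only the standard definitions are assumed.

```python
def precision_recall(selected, relevant):
    """Precision = TP / |selected|, recall = TP / |relevant|."""
    selected, relevant = set(selected), set(relevant)
    true_positives = len(selected & relevant)
    precision = true_positives / len(selected) if selected else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical example: four documents selected, three truly relevant.
p, r = precision_recall({"d1", "d2", "d3", "d4"}, {"d1", "d2", "d5"})
print(p, r)
```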
8 Module 1: Compilation of an LSP corpus from the web
Evaluation of the document classification
Probability distribution of precision as a random variable (performance of 10,000 random classifiers).
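A baseline of this kind can be simulated by letting many random classifiers pick documents and recording the precision each one achieves. The sketch below uses 10,000 trials as on the slide; the collection size, the number of relevant documents, the selection size, and the observed precision of 0.9 are assumptions chosen only for illustration.

```python
import random


def random_precision_distribution(n_docs, n_relevant, n_selected,
                                  trials=10_000, seed=0):
    """Precision of `trials` classifiers that each select documents at random."""
    rng = random.Random(seed)
    labels = [True] * n_relevant + [False] * (n_docs - n_relevant)
    precisions = []
    for _ in range(trials):
        chosen = rng.sample(labels, n_selected)  # sampling without replacement
        precisions.append(sum(chosen) / n_selected)
    return precisions


def empirical_p_value(distribution, observed):
    """Share of random classifiers that do at least as well as `observed`."""
    return sum(p >= observed for p in distribution) / len(distribution)


dist = random_precision_distribution(n_docs=100, n_relevant=30, n_selected=20)
mean_precision = sum(dist) / len(dist)
p_value = empirical_p_value(dist, observed=0.9)
print(round(mean_precision, 3), p_value)
```

The mean of the simulated distribution sits near the share of relevant documents in the collection, so a classifier whose precision lies far in the right tail is unlikely to owe its result to chance.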
9 Module 2: Analysis of the corpus with statistical techniques
- 1. Input from module 1 or from a user-compiled corpus
- 2. Main functions
- Measures of vocabulary richness
- Analysis of sample representativeness
- Automatic language recognition
- KWIC search
- N-gram extraction and sorting
- Collocation extraction
- Measures of association
- Models of term distribution
- Coefficients for vector comparison
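Several of the functions listed above can be sketched in a few lines each. The choices below are assumptions, since the slides do not name specific measures: type-token ratio stands in for vocabulary richness, pointwise mutual information for the association measures, and cosine for vector comparison. All function names are hypothetical and tokenization is a plain whitespace split.

```python
import math
from collections import Counter


def tokenize(text):
    return text.lower().split()


def type_token_ratio(tokens):
    """Vocabulary richness: distinct word forms / total tokens."""
    return len(set(tokens)) / len(tokens) if tokens else 0.0


def kwic(tokens, keyword, window=2):
    """Keyword in context: (left context, keyword, right context) per hit."""
    hits = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            hits.append((left, tok, right))
    return hits


def ngrams(tokens, n=2):
    """Frequency of every contiguous n-token sequence."""
    return Counter(zip(*(tokens[i:] for i in range(n))))


def pmi(tokens, w1, w2):
    """Pointwise mutual information (base 2) of the bigram (w1, w2)."""
    unigrams = Counter(tokens)
    joint = ngrams(tokens, 2)[(w1, w2)]
    if joint == 0:
        return float("-inf")
    n = len(tokens)
    p_joint = joint / (n - 1)  # there are n - 1 bigram positions
    return math.log2(p_joint / ((unigrams[w1] / n) * (unigrams[w2] / n)))


def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


tokens = tokenize("spastic diplegia is a form of cerebral palsy "
                  "and spastic diplegia affects the legs")
ttr = type_token_ratio(tokens)
concordance = kwic(tokens, "diplegia")
top_bigram = ngrams(tokens, 2).most_common(1)[0]
association = pmi(tokens, "spastic", "diplegia")
similarity = cosine(Counter(tokens[:8]), Counter(tokens[8:]))
print(ttr, concordance, top_bigram, association, similarity)
```

Sorting the n-gram counter by frequency, or by an association score such as the one above, is a common first step toward collocation extraction.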
19 Conclusions
- We have presented the system JAGUAR, a set of tools for compiling and exploring an LSP corpus from the web
- The main characteristics of this suite are the following
- It is able to collect an LSP corpus from the web, ensuring thematic adequacy and degree of specialization for a given domain
- It offers tools to statistically explore such a collection in a friendly interface
- It has also been conceived as a library
- The original algorithms have been successfully evaluated
- Its use saves time and effort in the analysis of a corpus, also offering new insights, a perspective of the data invisible to the naked eye.
20 Future Work
- The project is now growing in different directions
- Progressive enhancement with new functions and algorithms
- Turning into a desktop application
21 Thanks!