Title: BancTrad: a parallel annotated corpora web interface
1. BancTrad: a parallel annotated corpora web interface
T. Badia, G. Boleda, C. Colominas, M. Garmendia, A. Gonzàlez, M. Quixal
Faculty of Translation and Interpreting, Universitat Pompeu Fabra
II International Conference on Specialised Translation, March 2002
The authors wish to thank Stefan Bott for his technical help.
2. Contents
- Overview
- Extralinguistic mark-up and alignment
- Linguistic analysis
- Search machine architecture
- Demo: search possibilities and applications
3. Overview
- BancTrad: a parallel annotated corpora web interface
- Languages: Catalan, Spanish, English, German, French
- Sources: work done in translation courses, publishing houses, Internet
4. Mark-up and alignment
- Extra-linguistic mark-up
  - Original and translation references
  - Register (colloquial, standard, learned)
  - Type of text (normative, descriptive, literary...)
  - Genre
  - ...
- Alignment (Déjà Vu, Atril)
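The mark-up features listed above can be pictured as a small record attached to each text, with the alignment stored as pairs of segments. The following is only a minimal sketch: the class names, field names, and sample values are hypothetical and are not taken from BancTrad itself.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# Register values as listed on the slide.
REGISTERS = {"colloquial", "standard", "learned"}


@dataclass
class TextMetadata:
    """Extralinguistic features of one text (field names are illustrative)."""
    reference: str   # bibliographic reference of the text
    language: str    # e.g. "ca", "es", "en", "de", "fr"
    register: str    # one of REGISTERS
    text_type: str   # e.g. "normative", "descriptive", "literary"
    genre: str

    def __post_init__(self) -> None:
        if self.register not in REGISTERS:
            raise ValueError(f"unknown register: {self.register!r}")


@dataclass
class AlignedText:
    """An original, its translation, and their aligned segment pairs."""
    original: TextMetadata
    translation: TextMetadata
    segments: List[Tuple[str, str]] = field(default_factory=list)

    def add_segment(self, source: str, target: str) -> None:
        """Store one (original segment, translated segment) pair."""
        self.segments.append((source, target))
```

A pair of texts would then be built by creating two `TextMetadata` records and calling `add_segment` once per aligned segment exported from the alignment tool.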
6. MS Word form for the mark-up of extralinguistic features
8. Alignment with Déjà Vu Maintenance (Atril)
9. Linguistic analysis
- Tagger for Catalan
  - CATCG (UPF; Badia et al. 2000)
  - linguistic
  - lemma, POS, syntactic function
- Tagger for French, German and English
  - TreeTagger (IMS Stuttgart; Schmid 1997)
  - stochastic
  - lemma, POS
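To make the annotation layers concrete, the sketch below reads tagger output in a one-token-per-line, tab-separated layout (word form, POS tag, lemma), which is the layout TreeTagger prints when asked for tokens and lemmas. The `Token` type, the function name, and the sample tags are assumptions for illustration only; the CATCG output, which also carries a syntactic function, would need an extra column.

```python
from typing import List, NamedTuple


class Token(NamedTuple):
    """One annotated token: word form, part-of-speech tag, lemma."""
    form: str
    pos: str
    lemma: str


def parse_tagged(text: str) -> List[Token]:
    """Parse tab-separated 'form<TAB>pos<TAB>lemma' lines into tokens."""
    tokens = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines between sentences
        form, pos, lemma = line.split("\t")
        tokens.append(Token(form, pos, lemma))
    return tokens


# Illustrative sample; the tags are examples, not real tagger output.
sample = "The\tDT\tthe\nhouses\tNNS\thouse\n"
tokens = parse_tagged(sample)
```

Each `Token` then exposes the two annotation levels named on the slide (`lemma` and `pos`) alongside the word form.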
10. The search engine (I)
- Text to corpus: Corpus Workbench (CWB) tools (IMS Stuttgart; Christ 1994)
- Corpus query tool: CQP (Corpus Query Processor), from the CWB
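The CWB encoding tools take "verticalized" text as input: one token per line, with its annotations in tab-separated columns, and structural units such as sentences marked by XML-style tags on lines of their own. The sketch below shows that conversion step under simple assumptions; the function name, the `<s>` tag choice, and the column order (form, POS, lemma) are illustrative and not necessarily those used in BancTrad.

```python
from typing import Iterable, List, Tuple

AnnotatedToken = Tuple[str, str, str]  # (form, pos, lemma)


def to_vertical(sentences: Iterable[List[AnnotatedToken]]) -> str:
    """Render annotated sentences as verticalized text."""
    lines = []
    for sentence in sentences:
        lines.append("<s>")  # sentence start (structural mark-up)
        for form, pos, lemma in sentence:
            # One token per line, annotations separated by tabs.
            lines.append("\t".join((form, pos, lemma)))
        lines.append("</s>")  # sentence end
    return "\n".join(lines) + "\n"


vertical = to_vertical([[("The", "DT", "the"), ("houses", "NNS", "house")]])
```

The resulting string can be written to a file and handed to the corpus encoder; each column becomes one searchable attribute of the token.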
11. The search engine (II)
Query routing through the client/server architecture (the query travels from left to right, the results the other way round)
12. Search possibilities
Three levels of expertise:
a) Basic mode: sequences of specific word forms
b) Intermediate mode: sequences of five words, possibly with annotations (form, lemma, morphosyntactic tag, and syntactic function)
c) Expert mode: queries expressed in the syntax of CQP (using regular expressions)
http://glotis.upf.es
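One way to relate the intermediate and expert modes is to see a filled-in form as a CQP query: each word slot becomes a bracketed token pattern whose constraints are joined with `&`, and an empty slot becomes `[]`, which matches any token. The sketch below builds such a string. The attribute names `lemma`, `pos`, and `func` are assumptions (`word` is the usual CWB name for the word form), and this is not the interface's actual code.

```python
from typing import Dict, List

# Attribute names are hypothetical; a real corpus defines its own.
ATTRIBUTES = ("word", "lemma", "pos", "func")
MAX_SLOTS = 5  # the intermediate mode offers sequences of five words


def token_pattern(constraints: Dict[str, str]) -> str:
    """Turn one slot's constraints into a CQP token pattern."""
    unknown = set(constraints) - set(ATTRIBUTES)
    if unknown:
        raise ValueError(f"unknown attributes: {sorted(unknown)}")
    parts = []
    for attr in ATTRIBUTES:  # fixed order gives a stable query string
        if attr in constraints:
            value = constraints[attr]
            if '"' in value:
                raise ValueError("double quotes are not supported here")
            parts.append(f'{attr}="{value}"')
    return "[" + " & ".join(parts) + "]"


def build_query(slots: List[Dict[str, str]]) -> str:
    """Join the slot patterns into one CQP sequence query."""
    if not 1 <= len(slots) <= MAX_SLOTS:
        raise ValueError(f"expected 1 to {MAX_SLOTS} slots")
    return " ".join(token_pattern(slot) for slot in slots)


query = build_query([{"lemma": "house"}, {}, {"pos": "V.*"}])
```

Values are passed through unchanged, so they are interpreted as regular expressions, which is what lets a tag pattern such as `V.*` cover a whole family of tags.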
13. Applications
- Teaching
- Translating
- Doing research (linguistics, translation theory)
- Creating further language resources
  - multilingual dictionaries
  - chunkers
  - stochastic-based machine translation systems
14. References
- Badia, T., G. Boleda, M. Quixal, E. Bofias (2001) "A modular architecture for the processing of free text", in Proceedings of the Workshop on Modular Programming applied to NLP at EUROLAN 2001, Iasi.
- Christ, Oliver (1994) "A modular and flexible architecture for an integrated corpus query system", COMPLEX'94, Budapest. http://www.ims.uni-stuttgart.de/projekte/CorpusWorkbench
- Schmid, Helmut (1997) "Probabilistic Part-of-Speech Tagging Using Decision Trees", in Daniel Jones and Harold Somers, editors, New Methods in Language Processing, Studies in Computational Linguistics, UCL Press, London, pp. 154-164. http://www.ims.uni-stuttgart.de/projekte/corplex/TreeTagger/DecisionTreeTagger.html