Title: From an informal textual lexicon to a well-structured lexical database: An experiment in data reverse engineering
1 From an informal textual lexicon to a well-structured lexical database: An experiment in data reverse engineering
- WCRE 2001, Stuttgart, October 2001
- Gérard Huet, INRIA
2 Introduction
- We report on some experiments in data reverse engineering applied to computational linguistics resources.
- We started from a Sanskrit-to-French dictionary in TeX input format.
3 History
- 1994 - non-structured lexicon in TeX
- 1996 - available on the Internet
- 1998 - 10000 entries - invariants design
- 1999 - reverse engineering
- 2000 - Hypertext version on the Web, euphony
processor, grammatical engine
4 (figure slide; no transcript)
5 Initial semi-structured format
- \word{kumAra}{kum\=ara} m.
- gar\c con, jeune homme
- prince page cavalier
- myth. np. de Kum\=ara ``Prince'',
- \'epith. de Skanda Renou 241
- -- n. or pur
- -- \fem{kum\=ar\={\i}} adolescente, jeune fille, vierge.
6 Systematic structuring with macros
- \word{kumAra}{kum\=ara}
- \sm \sem{garçon, jeune homme}
- \or \sem{prince page cavalier}
- \or \myth{np. de Kum\=ara ``Prince'', épith. de Skanda} Renou 241
- \role \sn \sem{or pur}
- \role \fem{kum\=ar\={\i}} \sem{adolescente, jeune fille, vierge}
- \fin
7 Uniform cross-referencing
- \word{kumaara}
- \sm
- \sem{garçon, jeune homme}
- \or \sem{prince page cavalier}
- \or \myth{np. de \npd{Kumaara} "Prince", épith. de \np{Skanda}} Renou 241
- \role \sn \sem{or pur}
- \role \fem{kumaarii} \sem{adolescente, jeune fille, vierge}
- \fin
8 Structured computations
- Grinding: the data is parsed, compiled into typed abstract syntax, and a processor is applied to each entry in order to realise some uniform computation. This processor is a parameter to the grinding functor.
- For instance, a printing process may compile into a backend generator in some rendering format. It may itself be parametric, as a virtual typesetter: thus the original TeX format may be restored, but an HTML processor is easily derived as well.
- Other processes may, for instance, call a grammatical processor in order to generate a full lexicon of flexed forms. This lexicon of flexed forms is itself the basis of further computational linguistics tools (segmenter, tagger, syntax analyser, corpus editors, etc.).
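The grinding functor can be sketched in OCaml. Everything here (the toy `entry` type, `ENTRY_PROCESSOR`, `Grind`, `HtmlPrinter`) is an illustrative assumption, not Huet's actual code; it only shows the idea of a traversal parameterised by a per-entry processor.

```ocaml
(* A toy abstract syntax for entries: a headword with sense strings. *)
type entry = { headword : string; senses : string list }

(* The processor is the parameter of the grinding functor. *)
module type ENTRY_PROCESSOR = sig
  type result
  val process : entry -> result
end

module Grind (P : ENTRY_PROCESSOR) = struct
  (* Apply the processor uniformly to every entry of the lexicon. *)
  let run (lexicon : entry list) : P.result list =
    List.map P.process lexicon
end

(* One instantiation: a trivial HTML-like backend generator. *)
module HtmlPrinter = struct
  type result = string
  let process e =
    "<b>" ^ e.headword ^ "</b>: " ^ String.concat "; " e.senses
end

module HtmlGrinder = Grind (HtmlPrinter)

let () =
  let lexicon = [ { headword = "kumaara"; senses = [ "boy"; "prince" ] } ] in
  List.iter print_endline (HtmlGrinder.run lexicon)
```

A TeX backend would be another instantiation of the same functor, which is how one source feeds both renderings.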
9 (figure slide; no transcript)
10 (figure slide; no transcript)
11 Processing chains
12 Close-up view of functionalities
13 Tools used
- Printable document
  - Knuth TeX, Metafont, LaTeX2e
  - Velthuis devnag font, ligature comp
  - Adobe Postscript, Pdf, Acrobat
- Hypertext document
  - W3C HTTP, HTML, CSS
  - Unicode UTF-8
  - Chris Fynn Indic Times Font
- Processing, search
  - INRIA Objective Caml
14 Requirements, comments
- The abstract syntax ought to accommodate the freedom of style of the concrete syntax input, with the minimum changes needed to avoid ambiguities.
- This resulted in 3 mutually recursive parsers (generic dictionary, Sanskrit, French) with respectively 253, 14 and 54 productions. Such a structure, implicit in the data, would have been hard to design a priori.
- The typed nature of the abstract syntax (DTD) enforced, on the other hand, a much stricter discipline than what TeX allowed, leading to much improvement in catching input mistakes.
- Total processing of the source document (2 MB of text) takes only 30 seconds on a current PC.
15 Each entry is a (typed) tree
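A plausible shape for such a typed entry tree can be sketched in OCaml, mirroring the \word / \sem / \or / \myth / \role / \fem macros of the previous slides. These type and field names are illustrative assumptions, not Huet's actual abstract syntax.

```ocaml
(* Illustrative abstract syntax for a dictionary entry tree. *)
type gloss =
  | Sem of string   (* \sem : a French gloss *)
  | Myth of string  (* \myth: a mythological note *)

type role = {
  gender : string option;  (* \sm, \sn, \fem ... *)
  stem : string option;    (* e.g. a derived feminine stem *)
  glosses : gloss list;    (* alternatives separated by \or *)
}

type entry = {
  headword : string;  (* the \word argument, e.g. kumaara *)
  roles : role list;  (* the entry body, terminated by \fin *)
}

(* The kumaara entry of slide 7, rendered as such a tree. *)
let kumaara = {
  headword = "kumaara";
  roles = [
    { gender = Some "m."; stem = None;
      glosses = [ Sem "garçon, jeune homme";
                  Sem "prince, page, cavalier";
                  Myth "np. de Kumaara \"Prince\", épith. de Skanda" ] };
    { gender = Some "n."; stem = None; glosses = [ Sem "or pur" ] };
    { gender = Some "f."; stem = Some "kumaarii";
      glosses = [ Sem "adolescente, jeune fille, vierge" ] };
  ];
}
```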
16 Grammatical information
- type gender = Mas | Neu | Fem | Any
- type number = Singular | Dual | Plural
- type case = Nom | Acc | Ins | Dat | Abl | Gen | Loc | Voc
17 The verb system
- type voice = Active | Reflexive
- and mode = Indicative | Imperative | Causative | Intensive | Desiderative
- and tense = Present of mode | Perfect | Imperfect | Aorist | Future
- and nominal = Pp | Ppr of voice | Ppft | Ger | Infi | Peri
- and verbal = Conjug of (tense * voice) | Passive | Absolutive | Conditional | Precative | Optative of voice | Nominal of nominal | Derived of (verbal * verbal)
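Restored as compilable OCaml (the `|` and `*` separators lost in extraction are an assumption about the original source), the declarations of slides 16-17 can be exercised directly; the `describe` traversal below is an illustrative addition, not part of the system.

```ocaml
(* Slides 16-17 as compilable OCaml type declarations. *)
type gender = Mas | Neu | Fem | Any
type number = Singular | Dual | Plural
type case = Nom | Acc | Ins | Dat | Abl | Gen | Loc | Voc

type voice = Active | Reflexive
and mode = Indicative | Imperative | Causative | Intensive | Desiderative
and tense = Present of mode | Perfect | Imperfect | Aorist | Future
and nominal = Pp | Ppr of voice | Ppft | Ger | Infi | Peri
and verbal =
  | Conjug of (tense * voice)
  | Passive
  | Absolutive
  | Conditional
  | Precative
  | Optative of voice
  | Nominal of nominal
  | Derived of (verbal * verbal)

(* An illustrative traversal: naming a verbal form by pattern matching. *)
let rec describe = function
  | Conjug (Present Indicative, Active) -> "present indicative active"
  | Conjug _ -> "finite form"
  | Passive -> "passive"
  | Absolutive -> "absolutive"
  | Conditional -> "conditional"
  | Precative -> "precative"
  | Optative _ -> "optative"
  | Nominal Pp -> "past participle"
  | Nominal _ -> "nominal form"
  | Derived (v, _) -> "derived from " ^ describe v

let () = print_endline (describe (Conjug (Present Indicative, Active)))
```

Such total pattern matches are what makes the typed abstract syntax catch input mistakes that TeX silently accepted.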
18 Key points
- Each entry is a structured piece of data on which one may compute
- Consistency and completeness checks:
  - every reference is defined exactly once; there are no dangling references
  - etymological origins, when known, are systematically listed
  - lexicographic ordering at every level is mechanically enforced
- Specialised views are easily extracted
- Search engines are easily programmable
- Maintenance and transfer to new technologies is ensured
- Independence from input format, diacritics conventions, etc.
- The technology is scalable to much bigger corpora
19 Lexicon-grammar interactions
- The index engine, when given a string which is not a stem defined in one of the entries of the lexicon, attempts to find it in the persistent database of flexed forms; if it is found there, it proposes the corresponding lexicon entry or entries
- From within the lexicon, the grammatical engine may be called online as a CGI which lists the declensions of a given stem. It is directly accessible from the gender declarations, because of an important scoping invariant:
  - every substantive stem is within the scope of one or more genders
  - every gender declaration is within the scope of a unique substantive stem
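The index engine's fallback can be sketched minimally, assuming plain string maps for the lexicon and for the flexed-forms database. All names here, and the sample form `kumaaram`, are illustrative assumptions.

```ocaml
module SMap = Map.Make (String)

type lookup_result =
  | Entry of string            (* the query is itself a lexicon stem *)
  | Via_flexed of string list  (* stems reached through a flexed form *)
  | Unknown

(* First try the lexicon of stems; failing that, the flexed-forms
   database, which maps each flexed form to its possible stems. *)
let lookup ~lexicon ~flexed query =
  if SMap.mem query lexicon then Entry query
  else
    match SMap.find_opt query flexed with
    | Some stems -> Via_flexed stems
    | None -> Unknown

let () =
  let lexicon = SMap.(empty |> add "kumaara" ()) in
  let flexed = SMap.(empty |> add "kumaaram" [ "kumaara" ]) in
  assert (lookup ~lexicon ~flexed "kumaara" = Entry "kumaara");
  assert (lookup ~lexicon ~flexed "kumaaram" = Via_flexed [ "kumaara" ]);
  assert (lookup ~lexicon ~flexed "xyz" = Unknown)
```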
20 The segmenter/tagger
- Word segmentation is done by euphony, defined as a reversible rational relation and analysed by a non-deterministic transducer compiled from the trie of flexed forms
- Each segment may be decorated with the set of stems/cases leading to the corresponding flexed form
- This yields an interactive tagger which may be used for corpus analysis
- Towards a full computational linguistics platform
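The trie of flexed forms underlying the transducer can be sketched as follows. This toy character trie is illustrative only; the actual system's shared lexical trees are considerably more refined.

```ocaml
module CMap = Map.Make (Char)

(* A trie node: whether a word ends here, plus children per character. *)
type trie = { accepting : bool; forest : trie CMap.t }

let empty = { accepting = false; forest = CMap.empty }

(* Insert a word character by character. *)
let add t word =
  let rec enter t i =
    if i = String.length word then { t with accepting = true }
    else
      let c = word.[i] in
      let sub =
        match CMap.find_opt c t.forest with Some s -> s | None -> empty
      in
      { t with forest = CMap.add c (enter sub (i + 1)) t.forest }
  in
  enter t 0

(* Test membership by walking down the branches. *)
let mem t word =
  let rec go t i =
    if i = String.length word then t.accepting
    else
      match CMap.find_opt word.[i] t.forest with
      | Some s -> go s (i + 1)
      | None -> false
  in
  go t 0

let () =
  let t = List.fold_left add empty [ "kumaara"; "kumaarii" ] in
  assert (mem t "kumaarii");
  assert (not (mem t "kumaar"))
```

A segmenting transducer then explores such a trie non-deterministically, undoing euphony at each candidate word boundary.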
21 Reverse engineering strategies
- In the standard methodology, some unstructured text or low-level document is converted once and for all into a high-level structured document, on which all future processing is applied.
- Here we convert a semi-structured text document into a more structured document, in a closely resembling surface form, which may be parsed into an abstract structure on which further processing applies. The surface form is kept with minimal disturbance, and the refinement process may be iterated, so that more structure may percolate with time in a continuous life cycle of the data: e.g. valency, quotations, source references.
22 Advantages of the refined scheme
- Comments are kept within the data in their original form, and may progressively participate in the structuring process
- The data acquisition/revision processes are not disturbed
- Data revision/accretion may thus proceed asynchronously with respect to the processing agents' own life cycle
- The data is kept in transparent text form, as opposed to being buried in a database or other proprietary format, with the attendant risks of long-term obsolescence or vendor dependency
23 Computational linguistics as an important application area for reverse engineering
- Computational resource data (lexicons, corpora, etc.) live over very long periods, and keep evolving
- Multi-layer paradigms and the interconnection of platforms for different languages make it necessary to reprocess this data
- Computational linguistics is coming of age for large-scale applications, for instance in information retrieval and in speech-recognition interfaces
24 Ocaml as a reverse engineering workhorse
- From LISP to Scheme to ML to Haskell
- Caml mixes imperative and applicative paradigms
- Ocaml has a small, efficient runtime
- Ocaml has both bytecode and native-code compilers
- Ocaml creates small stand-alone applications on current platforms
- Ocaml can call C code, and conversely, with marshalling processes
- Ocaml has a powerful module system with functors
- Ocaml has an object-oriented layer without extra penalty
- Camlp4 provides powerful meta-programming (macros, parsers)
- Ocaml has an active users/contributors community (Consortium)
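Several of these features serve the pipeline directly; for instance, Ocaml's built-in Marshal module provides the kind of persistence a flexed-forms database needs. A minimal sketch (the helper names and file handling are illustrative):

```ocaml
(* Write any Ocaml value to a file in Marshal's binary format. *)
let save file v =
  let oc = open_out_bin file in
  Marshal.to_channel oc v [];
  close_out oc

(* Read it back. Note: from_channel is untyped, so the caller must
   annotate the expected type; a wrong annotation is unchecked. *)
let load file =
  let ic = open_in_bin file in
  let v = Marshal.from_channel ic in
  close_in ic;
  v

let () =
  let file = Filename.temp_file "demo" ".bin" in
  save file [ "kumaara"; "kumaarii" ];
  let stems : string list = load file in
  Sys.remove file;
  assert (stems = [ "kumaara"; "kumaarii" ])
```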