Title: Design of an Electronic Sanskrit Reader


1
Design of an Electronic Sanskrit Reader
  • SALA XXI,
  • Konstanz, October 2001
  • Gérard Huet
  • INRIA

2
History
  • 1994 - personal lexicon in TeX
  • 1996 - available on Internet
  • 1998 - 10000 entries - invariants design
  • 1999 - reverse engineering
  • 2000 - Hypertext version on the Web, sandhi
    processor, grammatical engine
  • 2001 - Segmenter, tagger

6
Processing chains
7
Close-up view of functionalities
8
Tools used
  • Printable document
    • Knuth TeX, Metafont, LaTeX2e
    • Velthuis devnag font ligature comp
    • Adobe PostScript, PDF, Acrobat
  • Hypertext document
    • W3C HTTP, HTML, CSS
    • Unicode UTF-8
    • Chris Fynn Indic Times Font
  • Processing and search
    • INRIA Objective Caml

9
Each entry is a (typed) tree
N.B. Syntax is really morphology; usage is
part-of-speech roles plus meanings
10
Grammatical information
  • type gender = Mas | Neu | Fem | Any
  • type number = Singular | Dual | Plural
  • type case = Nom | Acc | Ins | Dat | Abl | Gen
    | Loc | Voc
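
The declarations above read as Objective Caml variant types. The following is a minimal sketch, assuming the `=` and `|` separators that the transcript dropped. The `abbrev_*` helpers and the `tag` printer are hypothetical additions, not taken from the author's code.

```ocaml
type gender = Mas | Neu | Fem | Any
type number = Singular | Dual | Plural
type case = Nom | Acc | Ins | Dat | Abl | Gen | Loc | Voc

(* Hypothetical abbreviations, chosen for illustration only. *)
let abbrev_gender = function
  | Mas -> "m." | Neu -> "n." | Fem -> "f." | Any -> "*"

let abbrev_number = function
  | Singular -> "sg." | Dual -> "du." | Plural -> "pl."

let abbrev_case = function
  | Nom -> "nom." | Acc -> "acc." | Ins -> "ins." | Dat -> "dat."
  | Abl -> "abl." | Gen -> "gen." | Loc -> "loc." | Voc -> "voc."

(* Render a morphological tag as case, number, gender. *)
let tag c n g =
  String.concat " " [abbrev_case c; abbrev_number n; abbrev_gender g]

let () = print_endline (tag Acc Singular Neu)
```

Because each category is a closed variant, the compiler reports any pattern match that forgets a case or a gender.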

11
The verb system
  • type voice = Active | Reflexive
  • and mode = Indicative | Imperative | Causative
    | Intensive | Desiderative
  • and tense = Present of mode | Perfect
    | Imperfect | Aorist | Future
  • and nominal = Pp | Ppr of voice | Ppft | Ger
    | Infi | Peri
  • and verbal = Conjug of (tense * voice)
    | Passive | Absolutive
    | Conditional | Precative
    | Optative of voice
    | Nominal of nominal
    | Derived of (verbal * verbal)
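
The verb system can be written as one group of mutually recursive types. This is a sketch, assuming the transcript lost the `=`, `|` and `*` symbols. The `voice_of` function is a hypothetical helper added to show how such values are inspected.

```ocaml
type voice = Active | Reflexive
and mode = Indicative | Imperative | Causative | Intensive | Desiderative
and tense = Present of mode | Perfect | Imperfect | Aorist | Future
and nominal = Pp | Ppr of voice | Ppft | Ger | Infi | Peri
and verbal =
  | Conjug of (tense * voice)
  | Passive | Absolutive | Conditional | Precative
  | Optative of voice
  | Nominal of nominal
  | Derived of (verbal * verbal)

(* Hypothetical helper: the voice carried directly by a verbal form,
   if it carries one. *)
let voice_of = function
  | Conjug (_, v) | Optative v | Nominal (Ppr v) -> Some v
  | Passive | Absolutive | Conditional | Precative
  | Nominal _ | Derived _ -> None

(* Example value: present indicative, active voice. *)
let present_active = Conjug (Present Indicative, Active)
```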

12
Governance templates (Grammatical valence)
  • \word{ga.n} ... \sem{imputer qqc. <acc.> à qqn.
    <loc.>}
  • \ca{chandayati} \sem{gratifier qqn. <acc.> de
    <i.>}
  • \word{niyuj} ... \sem{confier qqc. <acc.> à qqn.
    <loc.>}
  • \root{krii} ... \sem{acheter (qqc. <acc.> à qqn.
    <g. abl.>)}
  • Other specific notations for synonyms, antonyms,
    cross-references.

13
Key points
  • Each entry is a structured piece of data on which
    one may compute
  • Consistency and completeness checks
    • every reference is defined exactly once; there is
      no dangling reference
    • etymological origins, when known, are
      systematically listed
    • lexicographic ordering at every level is
      mechanically enforced
  • Specialised views are easily extracted
  • Search engines are easily programmable
  • Maintenance and transfer to new technologies are
    ensured
  • Independence from input format, diacritics
    conventions, etc.
  • The technology is scalable to a much bigger corpus
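
Checks of this kind are short once entries are structured data. The sketch below assumes a deliberately simplified, hypothetical `entry` record (a headword plus the headwords it cites). Ordering uses OCaml's built-in string comparison, which is an assumption and not the Sanskrit alphabetical order.

```ocaml
(* Hypothetical, simplified entry: a headword and its cross-references. *)
type entry = { head : string; refs : string list }

(* Headwords defined more than once. *)
let duplicates entries =
  let heads = List.sort compare (List.map (fun e -> e.head) entries) in
  let rec go = function
    | a :: (b :: _ as rest) -> if a = b then a :: go rest else go rest
    | _ -> []
  in
  List.sort_uniq compare (go heads)

(* Cross-references that point at no entry. *)
let dangling entries =
  let defined h = List.exists (fun e -> e.head = h) entries in
  entries
  |> List.map (fun e -> List.filter (fun r -> not (defined r)) e.refs)
  |> List.concat
  |> List.sort_uniq compare

(* Are the headwords in strictly increasing order? *)
let rec ordered = function
  | a :: (b :: _ as rest) -> a.head < b.head && ordered rest
  | _ -> true
```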

14
Generic reuse of the technology
  • The structure of the dictionary separates, as much
    as possible, 3 layers
    • Sanskrit
    • French
    • generic dictionary structure
  • Thus the French meanings, at the leaves, could be
    replaced by e.g. English definitions or glosses.
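
One way to picture this separation is an entry type that is generic in the type of its leaves. This is a sketch under that assumption. The `sense` and `entry` types and the `map_entry` function are hypothetical, and the glosses in the usage note are illustrative rather than quoted from the dictionary.

```ocaml
(* Generic layer: the tree shape does not depend on the gloss language. *)
type 'gloss sense = Gloss of 'gloss | Group of 'gloss sense list
type 'gloss entry = { stem : string; senses : 'gloss sense list }

(* Replace every leaf, keeping the Sanskrit stem and the tree shape. *)
let rec map_sense f = function
  | Gloss g -> Gloss (f g)
  | Group ss -> Group (List.map (map_sense f) ss)

let map_entry f e =
  { stem = e.stem; senses = List.map (map_sense f) e.senses }
```

For instance, `map_entry translate { stem = "kusuma"; senses = [Gloss "fleur"] }` would yield the same entry with its leaves rewritten by a user-supplied `translate` function.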

15
Morphological analysis, sandhi
  • Sanskrit is pronounced as written
  • and thus is written as pronounced
  • Phonetic alliteration is rendered by morphology
    junction (sandhi)
  • The sentence is formed of words joined by
    external sandhi
  • Compound words are also formed by external sandhi
  • Whereas flexion, prefixing and suffixing use
    internal sandhi
  • External sandhi is local, internal sandhi is less so
  • Sandhi analysis is non-deterministic and
    sometimes involves sem
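
To make external sandhi concrete, here is a minimal sketch that joins words with a handful of rules. The five rules are the ones quoted in the segmentation examples of this talk. The triple representation `(u, v, w)` and the `join` and `sentence` functions are assumptions made for illustration. A real rule set is much larger.

```ocaml
(* A rule (u, v, w): a word ending in u, followed by a word starting
   with v, is written with w at the junction. *)
let rules = [
  ("m", "n", ".mn");
  ("s", {|"s|}, {|.h"s|});
  ("m", "g", ".mg");
  ("s", "k", ".hk");
  ("as", "d", "od");
]

let ends_with w u =
  let lw = String.length w and lu = String.length u in
  lw >= lu && String.sub w (lw - lu) lu = u

let starts_with w v =
  let lw = String.length w and lv = String.length v in
  lw >= lv && String.sub w 0 lv = v

(* Join two words with the first applicable rule, else concatenate. *)
let join w1 w2 =
  let applies (u, v, _) = ends_with w1 u && starts_with w2 v in
  match List.find_opt applies rules with
  | Some (u, v, w) ->
    String.sub w1 0 (String.length w1 - String.length u)
    ^ w
    ^ String.sub w2 (String.length v) (String.length w2 - String.length v)
  | None -> w1 ^ w2

(* Glue a list of words left to right. *)
let sentence = function
  | [] -> ""
  | w :: ws -> List.fold_left join w ws
```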

16
Grammatical engine
  • In Sanskrit, declension is determined by stem and
    gender
  • Sanskrit is very regular, since the classical
    language was frozen by Pânini (4th century BC),
    who invented context-free notation
  • But it spans about 35 centuries, and thus there
    are many exceptions
  • Substantive (adjectives, pronouns, numerals)
    declension may be arranged in 84 tables of 24
    endings (3 numbers × 8 cases)
  • Then internal sandhi is applied to a stem and an
    ending
  • Two applications
    • online declension of words given with gender
      (cgi-bin)
    • offline computation of flexed forms (2000 pages
      of double-column fine print)
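
A table of 24 endings can be pictured as a flat array indexed by number and case. The sketch below assumes a number-major layout, which is a choice made here and not stated in the talk. Internal sandhi is passed in as a parameter because its rules are outside the scope of this example. The names `slot` and `decline` are hypothetical.

```ocaml
type number = Singular | Dual | Plural
type case = Nom | Acc | Ins | Dat | Abl | Gen | Loc | Voc

let numbers = [Singular; Dual; Plural]
let cases = [Nom; Acc; Ins; Dat; Abl; Gen; Loc; Voc]

(* Position of x in a list; raises Not_found if absent. *)
let rec index x = function
  | [] -> raise Not_found
  | y :: ys -> if x = y then 0 else 1 + index x ys

(* Slot of a (number, case) pair in a flat table of 3 * 8 = 24 endings. *)
let slot n c = index n numbers * List.length cases + index c cases

(* Flexed form: internal sandhi applied to the stem and the ending. *)
let decline ~sandhi (endings : string array) stem n c =
  sandhi stem endings.(slot n c)

let () =
  (* Placeholder endings, not real Sanskrit; plain concatenation
     stands in for internal sandhi. *)
  let placeholder = Array.init 24 (fun i -> "-e" ^ string_of_int i) in
  print_endline (decline ~sandhi:(^) placeholder "stem" Dual Acc)
```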

17
Interactions lexicon-grammar
  • The index engine, when given a string which is
    not a stem defined in one of the entries of the
    lexicon, attempts to find it in the persistent
    database of flexed forms and, if it is found
    there, proposes the corresponding lexicon entry
    or entries
  • From within the lexicon, the grammatical engine
    may be called online as a CGI which lists the
    declensions of a given stem. It is directly
    accessible from the gender declarations, because
    of an important scoping invariant
    • every substantive stem is within the scope of
      one or more genders
    • every gender declaration is within the scope of
      a unique substantive stem
18
Inverting external sandhi
  • External sandhi rules are of a finite-state
    nature
  • The flexed forms lexicon index may be seen as the
    graph of a deterministic finite automaton
    recognizing its words
  • This tree may be uniformly decorated by relevant
    sandhi rules seen as non-deterministic choice
    points
  • This structure may be evaluated as a finite-state
    transducer graph segmenting an input text as
    words joined by sandhi

19
Examples of segmentation
  • Chunk: o.mnama.h"sivaaya
  • may be segmented as
    • om with sandhi m|n -> .mn
    • namas with sandhi s|"s -> .h"s
    • "sivaaya with no sandhi
  • Chunk: kusuma.mgopiibhya.hk.r.s.nodadati
  • may be segmented as
    • kusumam with sandhi m|g -> .mg
    • gopiibhyas with sandhi s|k -> .hk
    • k.r.s.nas with sandhi as|d -> od
    • dadati with no sandhi
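
These two examples can be reproduced with a naive search that inverts each rule. This is a sketch, not the finite-state transducer described in the talk: it scans a small hypothetical word list and returns every segmentation it finds. The rule triples `(u, v, w)` mean that a final `u` followed by an initial `v` is written `w`.

```ocaml
let rules = [
  ("m", "n", ".mn"); ("s", {|"s|}, {|.h"s|});
  ("m", "g", ".mg"); ("s", "k", ".hk"); ("as", "d", "od");
]

(* Hypothetical miniature lexicon of flexed forms. *)
let lexicon =
  ["om"; "namas"; {|"sivaaya|}; "kusumam"; "gopiibhyas"; "k.r.s.nas"; "dadati"]

let starts_with s p =
  String.length s >= String.length p && String.sub s 0 (String.length p) = p

let ends_with s p =
  let ls = String.length s and lp = String.length p in
  ls >= lp && String.sub s (ls - lp) lp = p

let drop s n = String.sub s n (String.length s - n)

(* All ways to read s as lexicon words joined by sandhi. *)
let rec segment s : string list list =
  let try_word w =
    let whole = if s = w then [[w]] else [] in
    let plain =
      if starts_with s w && String.length s > String.length w
      then List.map (fun ws -> w :: ws) (segment (drop s (String.length w)))
      else [] in
    let by_rule (u, v, r) =
      if not (ends_with w u) then []
      else
        let stem = String.sub w 0 (String.length w - String.length u) in
        let surface = stem ^ r in
        if not (starts_with s surface) then []
        else
          (* Undo the rule: the next word must start with v. *)
          let rest = v ^ drop s (String.length surface) in
          if String.length rest >= String.length s then []
          else List.map (fun ws -> w :: ws) (segment rest) in
    whole @ plain @ List.concat (List.map by_rule rules) in
  List.concat (List.map try_word lexicon)
```

The result is a list because the analysis is non-deterministic. With this tiny lexicon each chunk has a single reading, but a larger lexicon would generally return several.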

20
From segments to tagged lemmas
  • Chunk: kusuma.mgopiibhya.hk.r.s.nodadati
  • may be lemmatized with tags as
    • kusumam < acc. sg. n. of kusuma
      | nom. sg. n. of kusuma
      | voc. sg. n. of kusuma >
      with sandhi m|g -> .mg
    • gopiibhyas < abl. pl. f. of gopa
      | dat. pl. f. of gopa >
      with sandhi s|k -> .hk
    • k.r.s.nas < nom. sg. m. of k.r.s.na > with
      sandhi as|d -> od
    • dadati <> with no sandhi

21
Future work
  • Verb conjugation tables preparation - full flexed
    forms database
  • Fixing sandhi analysis for bahuvrihi compounds
  • Choice of taggings from concord and valency
    constraints
  • Semantic guidance from ontology classification
  • and we shall then be able to semi-automatically
    index corpora, towards
    • computer-aided concordance of corpus
    • computer-aided preparation of critical editions
    • statistical analysis of corpus (co-occurrence,
      style, etc.)
    • computer-aided accretion of lexicon
    • fully indexed citations
    • extraction of corpus-specific lexicons
    • diachrony control of lexical information