Cultural Heritage Language Technologies

Provided by: jeffryd
Transcript and Presenter's Notes

1
Cultural Heritage Language Technologies
  • Jeffrey A. Rydberg-Cox
  • Assistant Professor
  • University of Missouri at Kansas City
  • Department of English
  • Co-Director, Classical Studies Program
  • rydbergcoxj_at_umkc.edu

2
Cultural Heritage Language Technologies
  • A collaborative project to create computational
    tools for the study of Ancient Greek, Early
    Modern Latin, and Old Norse texts in a network of
    affiliated digital libraries.
  • Project funding provided by the National Science
    Foundation and the European Union International
    Digital Library Collaborative Research Program

3
United States Partners
  • Classical Studies Program and Department of
    English, University of Missouri at Kansas City
  • The Perseus Project, Tufts University
  • The Stoa Consortium, University of Kentucky
  • Scandinavian Section, University of California at
    Los Angeles

4
European Partners
  • The Newton Project and the Department of Computer
    Science, Imperial College
  • Faculty of Classics, Cambridge University
  • Istituto di Linguistica Computazionale del CNR,
    Pisa
  • Arnamagnaean Institute, University of Copenhagen

5
Project Goals
  • Adapt techniques from the fields of computational
    linguistics, information retrieval and
    visualization, and data mining for students and
    scholars in the humanities.
  • Establish an international framework for the
    long-term preservation of data, the sharing of
    metadata, and interoperability between affiliated
    digital libraries
  • Lower the barriers to reading Greek, Latin and
    Old Norse texts in their original languages.

6
Core Technology
  • Core digital library technology will be provided
    as open-source software to all collaborators by
    Perseus
  • Proven DL system that delivers 8.5 million pages
    a month over the web
  • System already in use by three American
    collaborators
  • Applications will be integrated into the
    production system and made widely available
  • System design allows for modularity so that
    applications can be used in other DL environments
    or on their own

7
Testbeds
  • Greek and Latin texts from Perseus (6 million
    words of Greek, 4 million words of Latin,
    parallel English translations)
  • Works of Isaac Newton from the Newton Project at
    Imperial College
  • Early modern Latin texts from the Stoa Consortium
    at the University of Kentucky
  • Old Norse texts from the University of California
    at Los Angeles and the Arnamagnaean Institute
  • Texts from the Archimedes Project (DFG/NSF funded
    DL for the history of mechanics)

8
Language Technologies
  • NLP technology is mature with a focus on
    commercial and national security applications.
  • TREC
  • TIDES
  • CLEF
  • Which of these technologies are language
    dependent and, therefore, need to be optimized
    for cultural heritage languages?
  • Which of these technologies are most useful for
    users in the humanities?

9
Parsers
  • The extension or development of morphological
    analysis facilities for early modern Latin and
    Old Norse is fundamental for these applications.
  • Simple stemming techniques (e.g. Porter's
    algorithm) are not precise enough.
  • In highly inflected languages, lexical
    normalization is required in order to have enough
    data to obtain statistically significant results.
  • Perseus Project will provide a parser for
    Classical Greek and Latin.
  • The Istituto di Linguistica Computazionale del
    CNR, Pisa will develop a system for early modern
    Latin.
  • The University of California at Los Angeles and
    the Arnamagnaean Institute will create a
    parser for Old Norse.
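
The contrast between stemming and lexical normalization can be sketched as follows. This is a minimal illustration, not the project's parser: the lexicon, the suffix list, and the function names are all assumptions, and the suffix stripper is a crude stand-in for a stemmer rather than Porter's algorithm (which targets English).

```python
from collections import Counter

# Hypothetical toy lexicon mapping inflected Latin forms to a lemma.
# A real morphological analyzer would generate these analyses.
LEMMA_LEXICON = {
    "fero": "fero", "tulit": "fero", "latum": "fero", "ferunt": "fero",
    "rex": "rex", "regis": "rex", "regem": "rex",
}


def naive_stem(word, suffixes=("unt", "um", "it", "is", "em", "o")):
    """Strip the first matching suffix (illustrative stemmer stand-in)."""
    for suffix in suffixes:
        if word.endswith(suffix) and len(word) > len(suffix) + 1:
            return word[: -len(suffix)]
    return word


def lemmatize(word):
    """Look the form up in the lexicon; fall back to the form itself."""
    return LEMMA_LEXICON.get(word, word)


tokens = ["tulit", "ferunt", "latum", "regem", "regis", "rex"]
stem_counts = Counter(naive_stem(t) for t in tokens)
lemma_counts = Counter(lemmatize(t) for t in tokens)

print(dict(stem_counts))   # counts scattered over several stems
print(dict(lemma_counts))  # counts pooled under two lemmas
```

In this toy sample the suffix stripper spreads six tokens over five stems, while the lexicon lookup pools them under two lemmas, which is the pooling that statistical measures on inflected languages depend on.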

10
Integrated Reading Environment
  • Parser is used to automatically generate
    hypertext leading to word study tool
  • Word study tool shows lexical form, links to
    dictionaries, grammars, frequency, and search
    tools
  • Perseus text display technology initially built
    for Greek, Latin, and English. Will be
    generalized for Old Norse (and other languages)
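
One way to picture the automatic hypertext is a function that wraps each word of a passage in a link to a word study page. The sketch below is an assumption-laden illustration: the URL pattern (`example.org/wordstudy`), its parameter names, and the function `link_tokens` are hypothetical and do not describe the Perseus implementation, and a real system would attach the parser's analysis instead of linking the raw form.

```python
from html import escape
from urllib.parse import quote

# Hypothetical URL pattern for a word study tool (not a real endpoint).
WORD_STUDY_URL = "https://example.org/wordstudy?form={form}&lang={lang}"


def link_tokens(text, lang):
    """Wrap every whitespace-separated token in a word study link."""
    links = []
    for token in text.split():
        url = WORD_STUDY_URL.format(form=quote(token), lang=quote(lang))
        links.append('<a href="{}">{}</a>'.format(escape(url), escape(token)))
    return " ".join(links)


html_line = link_tokens("arma virumque cano", "la")
print(html_line)
```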

11
Multi-Lingual Information Retrieval
  • Useful technology for non-specialist scholars and
    students who know a little bit of Greek, Latin,
    or Old Norse, but who are not able to form
    intelligent queries in the original language.
  • The Perseus Project already has a parser, lexica,
    and parallel corpora to develop these tools for
    Classical Greek and Latin.
  • As Old Norse and Early Modern Latin resources are
    developed, we will scale the tools for these new
    languages.
  • It is essential to develop common indexing
    formats so that the tool will easily scale to
    other languages.
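
A minimal sketch of the idea, assuming a lemma-level bilingual lexicon and a lemma index share a common format: an English query is translated into target-language lemmas and matched against the index. The lexicon entries, passage identifiers, and function name are invented for illustration.

```python
# Hypothetical English-to-Latin lexicon at the lemma level.
EN_TO_LA = {
    "king": ["rex"],
    "war": ["bellum"],
    "love": ["amor", "amo"],
}

# Hypothetical lemma index: lemma -> identifiers of passages containing it.
LEMMA_INDEX = {
    "rex": {"passage-1", "passage-3"},
    "bellum": {"passage-2", "passage-3"},
    "amor": {"passage-4"},
}


def search(query):
    """Rank passages by how many English query terms they match."""
    scores = {}
    for term in query.lower().split():
        hits = set()
        for lemma in EN_TO_LA.get(term, []):
            hits |= LEMMA_INDEX.get(lemma, set())
        for passage in hits:
            scores[passage] = scores.get(passage, 0) + 1
    # Highest score first; ties broken alphabetically for stable output.
    return sorted(scores, key=lambda p: (-scores[p], p))


print(search("king war"))
```

Because the search only touches the lexicon and the index, adding a language in this sketch means supplying those two tables in the same shape, which is the point of a common indexing format.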

12
Information Visualization
  • Search results are most frequently presented in a
    list that is either unranked or sorted by a
    metric (e.g. tf x idf) that is opaque to the
    end user.
  • Collaborators at Imperial College have done a
    great deal of work clustering and visualizing
    English search results.
  • We will adapt these clustering algorithms for our
    target languages.
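
For readers unfamiliar with the metric, the sketch below computes one common variant of tf x idf (raw term count times the natural log of N / document frequency) over a toy set of lemmatized documents. The documents and the function name are assumptions; the slides do not say which weighting variant is used.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return {doc_id: {term: tf * log(N / df)}} for tokenized documents."""
    n_docs = len(documents)
    doc_freq = Counter()
    for tokens in documents.values():
        doc_freq.update(set(tokens))  # count each term once per document
    scores = {}
    for doc_id, tokens in documents.items():
        counts = Counter(tokens)
        scores[doc_id] = {
            term: count * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        }
    return scores


# Hypothetical lemmatized documents.
docs = {
    "d1": ["rex", "bellum", "rex"],
    "d2": ["bellum", "amor"],
    "d3": ["bellum", "navis"],
}
scores = tf_idf(docs)
print(scores["d1"])
```

A term that occurs in every document ("bellum" here) scores zero, while a term concentrated in one document scores high. A bare ranked list hides this reasoning from the end user.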

13
Sammon Map

14
Radial Map

15
Interactive Radial

16
Vocabulary Profiles
  • Even extremely simple applications such as word
    lists can be useful for non-specialist users
  • Integration with other measures such as
    frequency, relative frequency, tf x idf scores,
    and measures of lexical richness can make these
    tools even more useful.
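
A vocabulary profile of this kind could be sketched as below, assuming the text has already been reduced to lemmas. The function name and sample data are hypothetical, and the type/token ratio is used here as one simple measure of lexical richness among several possible ones.

```python
from collections import Counter


def vocabulary_profile(lemmas):
    """Return (rows, type_token_ratio) for a list of lemmas.

    Each row is (lemma, frequency, relative frequency), most frequent first.
    """
    counts = Counter(lemmas)
    total = len(lemmas)
    rows = [(lemma, n, n / total) for lemma, n in counts.most_common()]
    type_token_ratio = len(counts) / total if total else 0.0
    return rows, type_token_ratio


rows, ttr = vocabulary_profile(["rex", "bellum", "rex", "amor"])
for lemma, freq, rel in rows:
    print(f"{lemma:10s} {freq:3d} {rel:.2f}")
print("type/token ratio:", ttr)
```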

17
Syntactic Parsing Tools
  • Resolution of attachment ambiguity
  • Determination of subcategorization frames
  • Generation of parse trees
  • These tasks are more difficult for inflected
    languages because we cannot rely on word order

18
Integration of Expert Knowledge
  • While automatic processes can provide a great
    deal of useful information, scholars will want to
    correct, annotate, and extend automatic results
  • Usual practice is for scholars to print screen
    dumps and hand annotate
  • We need systems to capture this knowledge about
    morphology and syntax and reintegrate it into the
    DL system

19
Digital Library Interoperability
  • We have moved from docu-island CD-ROM
    publications to large bunker digital libraries.
  • Interoperable DLs are the next step (e.g. the
    National SMETE Digital Library services projects)

20
Parallel Presentation of Different Versions of Texts
  • Humanists often work with different versions or
    translations of texts
  • The existing Perseus infrastructure allows users
    to switch easily between different versions
  • OAI metadata will be used to identify other
    digital versions of the document and allow users
    to easily switch to those versions

21
Citation Reversal
  • Perseus system takes citations of works from one
    text and displays them as hyperlinks in the text
    being cited
  • Thus, a reader of Iliad 1.1 can see that
    Aristotle's Rhetoric cites this passage and
    immediately move to that text
  • Citation reversal will be extended as a service
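
Citation reversal amounts to inverting a list of (citing text, cited passage) pairs so that each cited passage knows who cites it. The sketch below illustrates that inversion only. The record format and function name are assumptions, and apart from the Rhetoric example taken from the slide, the entries are placeholders.

```python
from collections import defaultdict

# Hypothetical citation records: (citing text, cited passage).
CITATIONS = [
    ("Aristotle, Rhetoric", "Iliad 1.1"),
    ("Placeholder Commentary A", "Iliad 1.1"),
    ("Placeholder Commentary A", "Odyssey 1.1"),
]


def reverse_citations(citations):
    """Map each cited passage to the texts that cite it."""
    cited_by = defaultdict(list)
    for citing, cited in citations:
        cited_by[cited].append(citing)
    return dict(cited_by)


cited_by = reverse_citations(CITATIONS)
print(cited_by["Iliad 1.1"])
```

A reading environment could then look up the passage on screen in `cited_by` and render each entry as a hyperlink back to the citing text.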