Requirements, Tools, and Architectures for Annotated Corpora

EAGLES/ISLE Workshop, LREC 2000, Athens, Greece.



1
Requirements, Tools, and Architectures for
Annotated Corpora
Data Architectures and Software Support for Large
Corpora: Towards an American National Corpus
  • Nancy Ide, Vassar College
  • Chris Brew, Ohio State University

2
Resources are expensive!
  • funders expect to amortize cost of resource
    creation over several projects
  • researchers don't want to reinvent the wheel
  • want to be able to accommodate uses for corpora
    and tools that may not yet be envisaged

3
  • cross-disciplinary acceptance is no longer optional
  • we need
    • reusability to avoid unnecessary labor and cost
    • flexibility and extensibility to accommodate
      different applications, different modes and
      media, different approaches, and potential future
      uses

4
Areas for consideration
  • Annotation formats
    • format of annotations themselves
  • Encoding formats
    • markup scheme used to identify and delineate
      elements in the data
  • Data architecture
    • organization of data in terms of document
      structure, linkage
  • Tools architecture
    • framework for tool interoperability
  • Tool support components
    • facilities to enable tools to work efficiently

5
Annotation Formats
  • need not be identical to achieve commonality
  • must work toward specifications that enable
    mapping among annotations of the same type
  • EAGLES/ISLE guidelines
    • layered model
      • universally agreed-upon and applicable
        specifications at the bottom
      • modules for specific languages, applications,
        and/or theoretical approaches at higher levels
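
The idea of mapping among annotations of the same type through a shared bottom layer can be sketched as follows. This is only an illustration: the core categories and both mapping tables are hypothetical, the tag names are small fragments chosen as examples, and nothing here is taken from the EAGLES/ISLE guidelines themselves.

```python
# Hypothetical bottom-layer categories shared by all schemes (an assumption).
CORE = {"noun", "verb", "other"}

# Illustrative fragments of two tag sets, each mapped to the shared core layer.
PENN_TO_CORE = {"NN": "noun", "NNS": "noun", "VBD": "verb"}
CLAWS_TO_CORE = {"NN1": "noun", "NN2": "noun", "VVD": "verb"}


def to_core(tag, mapping):
    """Map a scheme-specific tag to the core layer; unknown tags become 'other'."""
    return mapping.get(tag, "other")


def comparable(tag_a, map_a, tag_b, map_b):
    """Two annotations agree at the core layer if they map to the same category."""
    return to_core(tag_a, map_a) == to_core(tag_b, map_b)
```

The two schemes never need identical tags. They only need a mapping into the same lower layer, which is where comparison happens.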

6
Encoding Formats
  • standardized formats required for
    • data interchange
    • enabling easy human-readable display and access
  • may or may not serve as direct input to tools
  • but must be capable of capturing all information
    that is input and output of tools

7
XML
  • international standard, web compatible
  • used in several corpus-handling applications
    • LT XML (Edinburgh)
    • ATLAS (NIST)
    • XCES (EAGLES)
    • American National Corpus
  • provides good tools for linkage, search and
    extraction, validation and error reduction

8
Data Architectures
  • must support
    • full range of annotation types
    • alternative annotations and versions
    • different languages
    • different media and modalities (e.g., text,
      speech signal, audio, video, image)
    • potentially complex linkage among documents,
      parts of documents, and different modalities

9
"Stand-off" Data Architecture
  • annotations maintained in separate documents that
    point back to the original
    • yields a hyper-document composed of the
      original text and all annotations
  • increasingly accepted as the appropriate
    architecture for language resources
    • MULTEXT, LT NSL and LT XML, ATLAS, CES and XCES,
      ANC
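
A minimal sketch of the stand-off idea, assuming character offsets are used as the pointing mechanism (the slides do not fix one). The class name, field names, and the two sample layers are hypothetical and not drawn from any of the systems listed above.

```python
from dataclasses import dataclass

BASE_TEXT = "Resources are expensive"  # the original; annotations never modify it


@dataclass(frozen=True)
class StandoffAnnotation:
    start: int   # character offset into the base text (inclusive)
    end: int     # character offset into the base text (exclusive)
    layer: str   # which separate annotation document this record belongs to
    label: str


# Two separate "annotation documents" that point back at the same base text.
POS_LAYER = [
    StandoffAnnotation(0, 9, "pos", "noun"),
    StandoffAnnotation(10, 13, "pos", "verb"),
    StandoffAnnotation(14, 23, "pos", "adjective"),
]
CHUNK_LAYER = [StandoffAnnotation(0, 9, "chunk", "NP")]


def resolve(annotation, base=BASE_TEXT):
    """Follow the pointer back to the span of base text it covers."""
    return base[annotation.start:annotation.end]


def hyper_document(base, *layers):
    """Combine the base text and all layers into one view, in document order."""
    merged = sorted(
        (a for layer in layers for a in layer),
        key=lambda a: (a.start, a.end, a.layer),
    )
    return [(resolve(a, base), a.layer, a.label) for a in merged]
```

Calling `hyper_document(BASE_TEXT, POS_LAYER, CHUNK_LAYER)` produces the combined view on demand, while each layer can still be stored, versioned, or replaced on its own.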

10
Advantages
  • avoids unwieldy documents
  • allows for versioning, alternative annotations
  • XML mechanisms support complex inter-document
    linkage, linking various media
  • XSLT enables selecting from, transforming, and adding
    to multiple documents to create a new document
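
The kind of multi-document combination described in the last bullet can be illustrated without an XSLT processor. The sketch below uses Python's standard `xml.etree.ElementTree` as a stand-in for what a stylesheet would do; the element names (`text`, `w`, `annotations`, `ann`) and the `ref`/`pos` attributes are assumptions made for the example, not part of any published encoding.

```python
import xml.etree.ElementTree as ET

BASE_XML = (
    '<text><w id="w1">Resources</w><w id="w2">are</w>'
    '<w id="w3">expensive</w></text>'
)
POS_XML = (
    '<annotations><ann ref="w1" pos="noun"/>'
    '<ann ref="w2" pos="verb"/></annotations>'
)


def merge(base_xml, annotation_xml):
    """Build a new document from a base document and a stand-off annotation file."""
    base = ET.fromstring(base_xml)
    annotations = ET.fromstring(annotation_xml)
    # Select: index the annotations by the id of the element they point at.
    pos_by_ref = {a.get("ref"): a.get("pos") for a in annotations.iter("ann")}
    # Transform and add: copy each word into a new tree, adding its annotation.
    merged = ET.Element("text")
    for word in base.iter("w"):
        out = ET.SubElement(merged, "w", id=word.get("id"))
        out.text = word.text
        if word.get("id") in pos_by_ref:
            out.set("pos", pos_by_ref[word.get("id")])
    return merged
```

Neither input is changed. The merged tree is a third document, which is what makes alternative annotations and versions cheap to maintain.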

11
Data Models
  • XML support for easy transduction of tags makes
    a common tag set less of an issue
  • but there must be a common underlying data model
    • formalized description of data objects
      (composition, attributes, class membership,
      applicable procedures, etc.)
    • relations among these, independent of
      instantiation in any particular form

12
The data model...
  • must be able to capture structure and relations
    in diverse types of data and annotations
  • impacts the design of annotation schema, encoding
    formats, data and tool architectures
  • is the most important current need for
    corpus-based work

13
Existing models
  • TIPSTER
    • object-oriented
    • designed for use in IE
  • ATLAS
    • annotation graph formalism
    • designed for use in speech

Design strongly influenced by background
assumptions that may not scale up

14
Abstraction
  • an annotation is a one- or two-way link between
    • an annotation object, and
    • a point or span (or a list/set of points or
      spans) within a base data set
  • Links may or may not have a semantics
  • Points and spans may be objects, or sets/lists of
    objects
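
One way to write this abstraction down, as a sketch only: all class and field names below are hypothetical, and integer offsets into a string stand in for whatever addressing the base data set actually uses.

```python
from dataclasses import dataclass
from typing import Tuple, Union


@dataclass(frozen=True)
class Point:
    offset: int


@dataclass(frozen=True)
class Span:
    start: int  # inclusive
    end: int    # exclusive


Target = Union[Point, Span]


@dataclass(frozen=True)
class Annotation:
    content: dict                # the annotation object
    targets: Tuple[Target, ...]  # one or more points or spans in the base data
    relation: str = ""           # optional semantics for the link; empty if none
    two_way: bool = False        # whether the base also points to the annotation


def covered(annotation, base):
    """Return the base material under each target; a point has no extent."""
    return [
        base[t.start:t.end] if isinstance(t, Span) else ""
        for t in annotation.targets
    ]
```

A list of spans lets one annotation cover discontinuous material. For instance, with the base text `"look the word up"`, a single annotation with targets `Span(0, 4)` and `Span(14, 16)` covers both halves of the phrasal verb.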

15
Observations
  • assumes fundamental linearity of objects in the
    base
    • time line (speech)
    • sequence of characters, words, sentences, etc.
    • pixel data
    • etc.
  • the granularity of the data representation and
    encoding is critical

Targets may be individual objects or sets or
lists of objects, so information with more than
one dimension is accommodated

16
Implications
  • annotation scheme must be mappable to the
    structures defined for annotation objects
  • encoding scheme must be able to capture the
    object structure and relations expressed in the
    model (e.g., class membership and inheritance)
  • requires sophisticated means to specify linkage
  • consider logistics of identifying spans by
    enclosing them in start and end tags (enabling
    hierarchical grouping of objects in the data),
    vs. explicit addressing of start and end points
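
The contrast in the last bullet can be made concrete by converting one representation into the other. The sketch below, assuming character offsets as the explicit addressing scheme and using invented element names (`s`, `np`, `ap`), reads spans marked by enclosing tags and emits the same spans as start and end points.

```python
import xml.etree.ElementTree as ET

INLINE = "<s><np>Resources</np> are <ap>expensive</ap></s>"


def inline_to_standoff(xml_string):
    """Return (plain_text, [(start, end, tag), ...]) recovered from inline markup."""
    root = ET.fromstring(xml_string)
    text_parts = []
    spans = []
    position = 0

    def walk(element):
        nonlocal position
        start = position
        if element.text:
            text_parts.append(element.text)
            position += len(element.text)
        for child in element:
            walk(child)
            # Text following a child still belongs to the enclosing element.
            if child.tail:
                text_parts.append(child.tail)
                position += len(child.tail)
        spans.append((start, position, element.tag))

    walk(root)
    return "".join(text_parts), sorted(spans)
```

Enclosing tags give hierarchical grouping for free but force annotations to nest. Explicit offsets lose the implicit hierarchy yet allow overlapping spans, which is one reason the choice matters for the data model.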

17
Implications...
  • must be possible to represent objects and
    relations in some form that is both usable by a
    variety of tools and prevents information loss
  • ideally, in a variety of formats suitable to
    different tools and applications

18
Recommendation
  • Form a group to study this, consisting of
    representatives of
    • different areas of LE (text, speech, etc.)
    • different languages, geographical locations
    • different media
    • different user needs
    • Information Retrieval and Computer Science

19
Tools and Tool Architectures
  • must support multi-lingual, multi-modal data
  • must be flexible
    • adaptable to different annotation schemes,
      different applications
  • must be extensible
  • must be reusable

20
Existing systems
  • MULTEXT (1994)
    • developed fundamental data and tool architecture
      for corpora used in subsequent systems
    • tool modularity, pipeline tool architecture
    • API
    • SGML encoding standard for linguistic annotation
      (CES)
    • concept of "stand-off" annotation
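
A pipeline of modular tools can be sketched in a few lines. This is not MULTEXT's actual interface: the shared document representation (a dictionary of layers) and both toy tools are assumptions chosen to keep the example short.

```python
from typing import Callable, Dict, List

Document = Dict[str, list]          # hypothetical: layer name -> list of records
Tool = Callable[[Document], Document]


def tokenizer(doc: Document) -> Document:
    """Add a 'tokens' layer of (start, end) character spans over the text."""
    text = doc["text"][0]
    tokens, position = [], 0
    for word in text.split():
        start = text.index(word, position)
        tokens.append((start, start + len(word)))
        position = start + len(word)
    return {**doc, "tokens": tokens}


def length_tagger(doc: Document) -> Document:
    """Toy annotator: label each token span as 'short' or 'long'."""
    tags = [(s, e, "long" if e - s > 3 else "short") for s, e in doc["tokens"]]
    return {**doc, "tags": tags}


def run_pipeline(doc: Document, tools: List[Tool]) -> Document:
    """Apply each tool in order; every tool reads and extends the same document."""
    for tool in tools:
        doc = tool(doc)
    return doc
```

Because every tool takes and returns the same representation, tools can be reordered, replaced, or inserted without changes to their neighbours, and each one adds a layer rather than rewriting the text.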

21
  • LT XML (1999), U of Edinburgh
    • grew out of MULTEXT
    • views XML files as either
      • flat stream of markup and text
      • tree-structured XML
    • powerful query language
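
The two views of an XML file can be illustrated with Python's standard library. This does not use or imitate the LT XML API; `xml.etree.ElementTree` is simply a readily available stand-in, and the sample document is invented.

```python
import io
import xml.etree.ElementTree as ET

XML = "<s><w>Resources</w><w>are</w><w>expensive</w></s>"


def stream_view(xml_string):
    """Flat view: (event, tag) pairs in the order the markup is encountered."""
    source = io.BytesIO(xml_string.encode("utf-8"))
    return [
        (event, element.tag)
        for event, element in ET.iterparse(source, events=("start", "end"))
    ]


def tree_view(xml_string):
    """Tree view: parse the whole document, then navigate it by structure."""
    root = ET.fromstring(xml_string)
    return [child.text for child in root.findall("w")]
```

The stream view suits pipeline tools that process a large corpus in a single pass, while the tree view suits queries that depend on structural context.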

22
  • GATE (Sheffield)
    • implements TIPSTER data and tool architecture
    • object model for data and annotation
    • modular tool design, very extensible
  • ATLAS (2000) (NIST)
    • still in development
    • layered data and tool architecture similar to
      previous systems
    • annotation graph formalism instantiated in XML
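
For readers unfamiliar with annotation graphs, the general shape is nodes that may carry a time offset and labeled arcs between them. The sketch below is a loose illustration under that assumption; the class, its methods, and the sample labels are hypothetical and do not reproduce the ATLAS data structures or API.

```python
from dataclasses import dataclass, field


@dataclass
class AnnotationGraph:
    # node id -> time offset in seconds, or None if the node is unanchored
    node_times: dict = field(default_factory=dict)
    # each arc is (source node, target node, layer, label)
    arcs: list = field(default_factory=list)

    def add_node(self, node_id, time=None):
        self.node_times[node_id] = time

    def add_arc(self, source, target, layer, label):
        if source not in self.node_times or target not in self.node_times:
            raise KeyError("both endpoints must be existing nodes")
        self.arcs.append((source, target, layer, label))

    def labels_between(self, start_time, end_time):
        """Return (layer, label) for arcs lying entirely inside a time window."""
        found = []
        for source, target, layer, label in self.arcs:
            s, t = self.node_times[source], self.node_times[target]
            if s is not None and t is not None and start_time <= s and t <= end_time:
                found.append((layer, label))
        return found
```

Arcs on different layers can share nodes, so a word-level and a phrase-level annotation of the same stretch of speech sit in one graph without either nesting inside the other.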

23
Agreement on tools/systems
  • tool architecture
    • "plug-and-play"
    • modular
  • layered design
    • physical storage representation
    • intermediate data representation (model)
    • API to enable application development
  • query capability
  • stand-off data architecture

24
Details to work out
  • data model
  • level to which to extend the notion of modularity
    • gross function, or minimal function?
  • best means to accommodate different languages,
    modalities
    • engine-based approach, with language- or
      medium-specific knowledge as data?

25
Tool Support Components
  • resources are large
    • compression and indexing required for a usable
      system
  • compression is easy
    • excellent compression techniques for XML data
  • indexing is trickier
    • good techniques for full-text search exist
    • but...may not scale up to more complex data
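
Both support components can be sketched with the Python standard library. General-purpose `zlib` compression stands in here for the XML-specific techniques the slide alludes to, and the inverted index is the plain full-text kind; the function names and sample documents are invented for the example.

```python
import re
import zlib
from collections import defaultdict


def compress(xml_string):
    """Compress an XML document with a general-purpose compressor."""
    return zlib.compress(xml_string.encode("utf-8"))


def decompress(blob):
    return zlib.decompress(blob).decode("utf-8")


def build_index(documents):
    """Inverted index: lower-cased word -> set of ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in re.findall(r"\w+", text.lower()):
            index[word].add(doc_id)
    return index


def search(index, *words):
    """Return ids of documents that contain every query word."""
    sets = [index.get(word.lower(), set()) for word in words]
    return set.intersection(*sets) if sets else set()
```

This index knows only which words occur in which documents. It cannot answer a query such as "nouns inside a noun phrase annotated by a particular tool", which is the sense in which full-text techniques may not scale up to more complex data.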

26
Non-traditional data
  • Documents with diagrams, engineering drawings
  • Illustrated books, with body text and
    illustration intermingled or overlaid
  • Manuscripts in which the physical details of the
    calligraphy and media matter
  • Interlinked texts, including output of machine
    translation systems, speech transcription
    efforts, lexicographic endeavors
  • Databases of phonetic phenomena
  • Personal and public information spaces: hard disk
    folder structures, mailing list archives,
    personal email archives, voice mailboxes, etc.
  • Dialogue
  • etc.

27
Recommendations
  • develop architectures that abandon the notion of
    a single distinguished time line
  • adopt ideas from the database community
    • work on semi-structured data
    • work that views XML documents as a collection of
      documents with additional tags and relations
      between tags

28
Conclusion
  • design tools and resources not based on needs of
    a particular research community
  • open architecture approach
  • build on existing standards, emerging consensus
  • (widely) distributed development
  • involve other relevant communities (IR, CS)