Requirements, Tools, and Architectures for Annotated Corpora

About This Presentation

Description:

EAGLES/ISLE Workshop, LREC 2000, Athens, Greece.

Slides: 29

Provided by: drewfn

Learn more at: https://www.cs.vassar.edu

Transcript and Presenter's Notes

1
Requirements, Tools, and Architectures for
Annotated Corpora
Data Architectures and Software Support for Large
Corpora: Towards an American National Corpus

Nancy Ide, Vassar College
Chris Brew, Ohio State University

2
Resources are expensive!

funders expect to amortize cost of resource
creation over several projects
researchers don't want to reinvent the wheel
want to be able to accommodate uses for corpora
and tools that may not yet be envisaged

3
cross-disciplinary acceptance is no longer optional
we need
reusability to avoid unnecessary labor and cost
flexibility and extensibility to accommodate
different applications, different modes and
media, different approaches, and potential future
uses

4
Areas for consideration

Annotation formats
format of annotations themselves
Encoding formats
markup scheme used to identify and delineate
elements in the data
Data architecture
organization of data in terms of document
structure, linkage
Tools architecture
framework for tool interoperability
Tool support components
facilities to enable tools to work efficiently

5
Annotation Formats

need not be identical to achieve commonality
must work toward specifications that enable
mapping among annotations of the same type
EAGLES/ISLE guidelines
layered model
universally agreed-upon and applicable
specifications at the bottom
modules for specific languages, applications,
and/or theoretical approaches at higher levels.
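
As an illustration of mapping among annotations of the same type, the following minimal sketch maps two part-of-speech tagsets onto a small shared core layer. The tag tables are loosely modeled on Penn Treebank and CLAWS tags, and the core category names are assumptions made for the example; none of this is taken from the EAGLES/ISLE guidelines themselves.

```python
# Hypothetical mapping tables: two tagsets mapped to a small shared core layer.
# The core category names are illustrative, not taken from the EAGLES guidelines.
PENN_TO_CORE = {"NN": "NOUN", "NNS": "NOUN", "VBD": "VERB", "JJ": "ADJ"}
CLAWS_TO_CORE = {"NN1": "NOUN", "NN2": "NOUN", "VVD": "VERB", "AJ0": "ADJ"}


def to_core(tagged, table):
    """Map (token, tag) pairs to the core layer; unknown tags become 'X'."""
    return [(token, table.get(tag, "X")) for token, tag in tagged]


penn = [("dogs", "NNS"), ("barked", "VBD")]
claws = [("dogs", "NN2"), ("barked", "VVD")]

# Different annotation formats, same result at the shared layer.
same_core = to_core(penn, PENN_TO_CORE) == to_core(claws, CLAWS_TO_CORE)
```

The finer distinctions (singular vs. plural, tense) would live in the higher, tagset-specific layers; only the bottom layer needs to be agreed upon.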

6
Encoding Formats

standardized formats required for
data interchange
enabling easy human-readable display and access
may or may not serve as direct input to tools
but must be capable of capturing all information
that is input and output of tools

7
XML

international standard, web compatible
used in several corpus-handling applications
LT XML (Edinburgh)
ATLAS (NIST)
XCES (EAGLES)
American National Corpus
provides good tools for linkage, search and
extraction, validation and error reduction

8
Data Architectures

must support
full range of annotation types
alternative annotations and versions
different languages
different media and modalities (e.g., text,
speech signal, audio, video, image)
potentially complex linkage among documents,
parts of documents, and different modalities

9
"Stand-off" Data Architecture

annotations maintained in separate documents that
point back to the original
yields a hyper-document composed of the
original text and all annotations
increasingly accepted as the appropriate
architecture for language resources
MULTEXT, LT NSL and LT XML, ATLAS, CES and XCES,
ANC
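
A minimal sketch of the stand-off idea, assuming character offsets as the pointing mechanism (the systems listed above use their own linkage mechanisms; the layer names and tuple layout here are hypothetical). The base text is left untouched and each annotation layer is kept separately.

```python
# The base text is never modified; annotations live elsewhere and point into it.
base_text = "Nancy Ide teaches at Vassar College."

# Two independent annotation "documents" (hypothetical layers) over the same base.
# Each entry is (start offset, end offset, label), end exclusive.
tokens = [(0, 5, "tok"), (6, 9, "tok"), (10, 17, "tok"), (18, 20, "tok"),
          (21, 27, "tok"), (28, 35, "tok"), (35, 36, "tok")]
entities = [(0, 9, "PERSON"), (21, 35, "ORG")]


def resolve(text, annotations):
    """Follow each pointer back into the base text."""
    return [(label, text[start:end]) for start, end, label in annotations]


resolved_entities = resolve(base_text, entities)
resolved_tokens = resolve(base_text, tokens)
```

Either layer can be replaced, versioned, or supplemented by an alternative analysis without touching the base text or the other layer.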

10
Advantages

avoids unwieldy documents
allows for versioning, alternative annotations
XML mechanisms support complex inter-document
linkage, linking various media
XSLT enables selecting from, transforming, and adding to
multiple documents to create a new document
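
The kind of merge that XSLT would perform declaratively can be sketched procedurally. The example below uses Python's standard library rather than XSLT, and assumes a hypothetical stand-off layer of non-overlapping character spans; it combines the base text and the layer into a new inline-tagged document.

```python
import xml.etree.ElementTree as ET

base_text = "Athens hosted LREC"
# Hypothetical stand-off layer: non-overlapping (start, end, tag) spans, end exclusive.
layer = [(0, 6, "place"), (14, 18, "event")]


def merge(text, spans):
    """Build a new XML document with the spans inlined as child elements."""
    root = ET.Element("s")
    cursor = 0
    last = None  # most recently added child element
    for start, end, tag in sorted(spans):
        gap = text[cursor:start]
        if last is None:
            root.text = gap
        else:
            last.tail = gap
        last = ET.SubElement(root, tag)
        last.text = text[start:end]
        cursor = end
    rest = text[cursor:]
    if last is None:
        root.text = rest
    else:
        last.tail = rest
    return root


merged = ET.tostring(merge(base_text, layer), encoding="unicode")
```

The source documents stay as they were; the merged document is a derived view that can be regenerated whenever a layer changes.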

11
Data Models

XML support for easy transduction of tags makes a
common tag set less of an issue
But...must have a common underlying data model
formalized description of data objects
composition, attributes, class membership,
applicable procedures, etc.
relations among these, independent of
instantiation in any particular form

12
The data model...

must be able to capture structure and relations
in diverse types of data and annotations
impacts the design of annotation schema, encoding
formats, data and tool architectures
is the most important current need for
corpus-based work

13
Existing models

TIPSTER
object-oriented
designed for use in IE
ATLAS
annotation graph formalism
designed for use in speech

Design strongly influenced by background
assumptions that may not scale up

14
Abstraction

an annotation is a one- or two-way link between
an annotation object, and
a point or span (or a list/set of points or
spans) within a base data set
Links may or may not have a semantics
Points and spans may be objects, or sets/lists of
objects
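
One possible rendering of this abstraction as a data structure, as a sketch only: the class and field names below are assumptions made for illustration and do not come from any of the systems discussed.

```python
from dataclasses import dataclass
from typing import Tuple, Union


@dataclass(frozen=True)
class Point:
    offset: int


@dataclass(frozen=True)
class Span:
    start: int
    end: int  # exclusive


Target = Union[Point, Span]


@dataclass(frozen=True)
class Annotation:
    """A link between an annotation object and one or more targets in the base."""
    content: dict                 # the annotation object (hypothetical shape)
    targets: Tuple[Target, ...]   # a point, a span, or a list of either
    relation: str = ""            # optional link semantics; empty if none


def covered(base, annotation):
    """Return the stretches of base data that the annotation's spans point at."""
    return [base[t.start:t.end] for t in annotation.targets if isinstance(t, Span)]


base = "he gave the idea up"
# A discontinuous target: the particle verb "gave ... up" as two spans.
verb = Annotation({"cat": "V", "lemma": "give up"}, (Span(3, 7), Span(17, 19)))
boundary = Annotation({"type": "clause-start"}, (Point(0),))
```

Allowing a list of targets is what lets a single annotation cover discontinuous material such as the two-part verb above.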

15
Observations

assumes fundamental linearity of objects in the
base
time line (speech)
sequence of characters, words, sentences, etc.
pixel data
etc.
the granularity of the data representation and
encoding is critical

Targets may be individual objects or sets or
lists of objects, so information with more than
one dimension is accommodated

16
Implications

annotation scheme must be mappable to the
structures defined for annotation objects
encoding scheme must be able to capture the
object structure and relations expressed in the
model (e.g., class membership and inheritance)
requires sophisticated means to specify linkage
consider logistics of identifying spans by
enclosing them in start and end tags (enabling
hierarchical grouping of objects in the data),
vs. explicit addressing of start and end points
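
The two ways of identifying spans can be compared directly. The sketch below, which assumes character offsets as the addressing scheme and uses an invented example sentence, converts an inline encoding (start and end tags) into explicit start and end points.

```python
import xml.etree.ElementTree as ET

# Inline encoding: spans are delimited by start and end tags, which nest.
inline = "<s><np>the <n>corpus</n></np> <v>grows</v></s>"


def to_standoff(xml_string):
    """Return (plain text, [(start, end, tag), ...]) with explicit offsets."""
    spans = []
    pieces = []
    position = 0

    def walk(elem):
        nonlocal position
        start = position
        if elem.text:
            pieces.append(elem.text)
            position += len(elem.text)
        for child in elem:
            walk(child)
            if child.tail:
                pieces.append(child.tail)
                position += len(child.tail)
        spans.append((start, position, elem.tag))

    walk(ET.fromstring(xml_string))
    # Outer spans first when two spans share a start offset.
    return "".join(pieces), sorted(spans, key=lambda s: (s[0], -s[1]))


text, spans = to_standoff(inline)
```

The inline form gets hierarchical grouping for free from tag nesting; with explicit addressing the hierarchy has to be recovered from span containment, but overlapping spans become representable.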

17
Implications...

must be possible to represent objects and
relations in some form that is both usable by a
variety of tools and prevents information loss
ideally, in a variety of formats suitable to
different tools and applications

18
Recommendation

Form a group to study this, consisting of
representatives for
different areas of LE (text, speech, etc.)
different languages, geographical location
different media
different user needs
Information Retrieval and Computer Science

19
Tools and Tool Architectures

must support multi-lingual, multi-modal data
must be flexible
adaptable to different annotation schemes,
different applications
must be extensible
must be reusable

20
Existing systems

MULTEXT (1994)
developed fundamental data and tool architecture
for corpora used in subsequent systems
tool modularity, pipeline tool architecture
API
SGML encoding standard for linguistic annotation
(CES)
concept of "stand-off" annotation
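
A toy sketch of a modular, pipelined tool architecture over stand-off data. The stage names and the dictionary used as the shared document are assumptions for illustration; this is not the MULTEXT API.

```python
import re


def tokenize(doc):
    """Add a token layer as (start, end) offsets into the unchanged text."""
    doc["tokens"] = [m.span() for m in re.finditer(r"\w+|[^\w\s]", doc["text"])]
    return doc


def mark_capitalized(doc):
    """Toy second stage: flag tokens that start with an upper-case letter."""
    text = doc["text"]
    doc["capitalized"] = [(s, e) for s, e in doc["tokens"] if text[s].isupper()]
    return doc


def run_pipeline(text, stages):
    """Pass one document through each stage in order."""
    doc = {"text": text}
    for stage in stages:
        doc = stage(doc)
    return doc


result = run_pipeline("LREC met in Athens.", [tokenize, mark_capitalized])
```

Each stage reads earlier layers and adds its own, so stages can be swapped or reordered as long as the layers they depend on are present.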

21
LT XML (1999), University of Edinburgh
grew out of MULTEXT
views XML files as either
flat stream of markup and text
tree-structured XML
powerful query language
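
The two views of an XML file can be illustrated with Python's standard library (this is not the LT XML API, and the sample document is invented): the same input read as a flat stream of markup events, and as a tree.

```python
import io
import xml.etree.ElementTree as ET

xml_doc = "<text><s>one <w>two</w></s><s>three</s></text>"

# View 1: a flat stream of markup events, read in document order.
source = io.BytesIO(xml_doc.encode("utf-8"))
stream = [(event, elem.tag)
          for event, elem in ET.iterparse(source, events=("start", "end"))]

# View 2: a tree that can be navigated and queried as a whole.
tree = ET.fromstring(xml_doc)
sentences = ["".join(s.itertext()) for s in tree.findall("s")]
```

The stream view suits single-pass tools over very large files; the tree view suits queries that need context from elsewhere in the document.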

22
GATE (Sheffield)
implements TIPSTER data and tool architecture
object model for data and annotation
modular tool design, very extensible
ATLAS (2000) (NIST)
still in development
layered data and tool architecture similar to
previous systems
annotation graph formalism instantiated in XML
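
A simplified sketch of the annotation graph idea: nodes anchored to a time line, with labeled arcs between them. The node times, arc layout, and function name are invented for illustration; this is neither the ATLAS API nor its XML instantiation.

```python
# Nodes carry time offsets in seconds (illustrative values); arcs carry labels.
nodes = {0: 0.00, 1: 0.32, 2: 0.75, 3: 1.10}

# Each arc is (from node, to node, annotation type, label).
arcs = [
    (0, 1, "word", "the"),
    (1, 2, "word", "corpus"),
    (2, 3, "word", "grows"),
    (0, 2, "phrase", "NP"),
    (2, 3, "phrase", "VP"),
]


def arcs_within(start_time, end_time, kind):
    """Labels of arcs of one type lying entirely inside a time interval."""
    return [label for src, dst, typ, label in arcs
            if typ == kind and nodes[src] >= start_time and nodes[dst] <= end_time]


# Which words fall inside the NP arc (node 0 to node 2)?
np_src, np_dst = 0, 2
words_in_np = arcs_within(nodes[np_src], nodes[np_dst], "word")
```

Different annotation types share nodes rather than nesting inside one another, so layers that do not form a single hierarchy can coexist on the same time line.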

23
Agreement on tools/systems

tool architecture
"plug-and-play"
modular
layered design
physical storage representation
intermediate data representation (model)
API to enable application development
query capability
stand-off data architecture

24
Details to work out

data model
level to which the notion of modularity should extend
gross function, or minimal function?
best means to accommodate different languages,
modalities
engine-based approach, with language- or
medium-specific knowledge as data?

25
Tool Support Components

resources are large
compression and indexing required for a usable
system
compression is easy
excellent compression techniques for XML data
indexing is trickier
good techniques for full-text search exist
but...may not scale up to more complex data
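
A small sketch of both support components, using only Python's standard library and invented sample data: general-purpose compression applied to repetitive markup, and a minimal inverted index for full-text search.

```python
import zlib
from collections import defaultdict

# Repetitive markup, as in a large tagged corpus, compresses well.
xml_data = ("<w pos='NN'>corpus</w><w pos='VB'>grows</w>" * 500).encode("utf-8")
compressed = zlib.compress(xml_data)
ratio = len(compressed) / len(xml_data)

# A minimal inverted index for full-text search: word -> set of document ids.
docs = {1: "the corpus grows", 2: "the tools scale", 3: "corpus tools"}
index = defaultdict(set)
for doc_id, text in docs.items():
    for word in text.split():
        index[word].add(doc_id)

# Documents containing both words.
hits = index["corpus"] & index["tools"]
```

A word-to-document index of this kind answers "which documents contain these words" but says nothing about structure, such as which noun phrases contain a given word, which is where the scaling concern arises.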

26
Non-traditional data

Documents with diagrams, engineering drawings
Illustrated books, with body text and
illustration intermingled or overlaid
Manuscripts in which the physical details of the
calligraphy and media matter
Interlinked texts, including output of machine
translation systems, speech transcription
efforts, lexicographic endeavors
Databases of phonetic phenomena
Personal and public information spaces: hard disk
folder structures, mailing list archives,
personal email archives, voice mailboxes, etc.
Dialogue
etc.

27
Recommendations

develop architectures that abandon the notion of
a single distinguished time line
adopt ideas from the database community
work on semi-structured data
work that views XML documents as a collection of
documents with additional tags and relations
between tags

28
Conclusion

design tools and resources not based on needs of
a particular research community
open architecture approach
build on existing standards, emerging consensus
(widely) distributed development
involve other relevant communities (IR, CS)
