Title: Requirements, Tools, and Architectures for Annotated Corpora
1 Requirements, Tools, and Architectures for Annotated Corpora
Data Architectures and Software Support for Large
Corpora Towards an American National Corpus
- Nancy Ide, Vassar College
- Chris Brew, Ohio State University
2 Resources are expensive!
- funders expect to amortize the cost of resource creation over several projects
- researchers don't want to reinvent the wheel
- want to be able to accommodate uses for corpora and tools that may not yet be envisaged
3
- cross-disciplinary acceptance is no longer optional
- we need
  - reusability to avoid unnecessary labor and cost
  - flexibility and extensibility to accommodate different applications, different modes and media, different approaches, and potential future uses
4 Areas for consideration
- Annotation formats
  - format of annotations themselves
- Encoding formats
  - markup scheme used to identify and delineate elements in the data
- Data architecture
  - organization of data in terms of document structure, linkage
- Tools architecture
  - framework for tool interoperability
- Tool support components
  - facilities to enable tools to work efficiently
5 Annotation Formats
- need not be identical to achieve commonality
- must work toward specifications that enable mapping among annotations of the same type
- EAGLES/ISLE guidelines
  - layered model
    - universally agreed-upon and applicable specifications at the bottom
    - modules for specific languages, applications, and/or theoretical approaches at higher levels
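The layered idea above can be illustrated with a small sketch: two annotation schemes keep their own fine-grained tags (the higher level) but both map onto a shared set of coarse categories (the bottom level). Both tag tables and all function names here are hypothetical and are not taken from the EAGLES/ISLE guidelines.

```python
# Hypothetical tag tables: each project keeps its own fine-grained tags,
# but every tag maps to a shared coarse category (the common bottom layer).
COARSE_FROM_SCHEME_A = {
    "NN": "noun", "NNS": "noun",
    "VB": "verb", "VBD": "verb",
    "JJ": "adjective",
}
COARSE_FROM_SCHEME_B = {
    "N-sg": "noun", "N-pl": "noun",
    "V-inf": "verb", "V-past": "verb",
    "ADJ": "adjective",
}


def to_coarse(tag, table):
    """Return the shared category for a scheme-specific tag, or None if unmapped."""
    return table.get(tag)


def agree(tag_a, tag_b):
    """True when both tags are mapped and fall in the same shared category."""
    coarse_a = to_coarse(tag_a, COARSE_FROM_SCHEME_A)
    coarse_b = to_coarse(tag_b, COARSE_FROM_SCHEME_B)
    return coarse_a is not None and coarse_a == coarse_b
```

With tables like these, annotations from the two schemes can be compared or merged at the shared level without either project changing its own tag set.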
6 Encoding Formats
- standardized formats required for
  - data interchange
  - enabling easy human-readable display and access
- may or may not serve as direct input to tools
  - but must be capable of capturing all information that is input and output of tools
7 XML
- international standard, web compatible
- used in several corpus-handling applications
  - LT XML (Edinburgh)
  - ATLAS (NIST)
  - XCES (EAGLES)
  - American National Corpus
- provides good tools for linkage, search and extraction, validation and error reduction
8 Data Architectures
- must support
  - full range of annotation types
  - alternative annotations and versions
  - different languages
  - different media and modalities (e.g., text, speech signal, audio, video, image)
  - potentially complex linkage among documents, parts of documents, and different modalities
9 "Stand-off" Data Architecture
- annotations maintained in separate documents that point back to the original
- yields a hyper-document composed of the original text and all annotations
- increasingly accepted as the appropriate architecture for language resources
  - MULTEXT, LT NSL and LT XML, ATLAS, CES and XCES, ANC
10 Advantages
- avoids unwieldy documents
- allows for versioning, alternative annotations
- XML mechanisms support complex inter-document linkage, linking various media
- XSLT enables selecting, transforming, adding to multiple documents to create a new document
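A minimal sketch of the stand-off idea, assuming character offsets as the addressing mechanism: the base text is left untouched, a token layer points into it by offset, and a second layer points at token ids. The element and attribute names are invented for illustration and do not follow CES, XCES, or any other scheme named on these slides.

```python
import xml.etree.ElementTree as ET

# Base data: never modified by annotation.
BASE_TEXT = "Resources are expensive!"

# Hypothetical stand-off layers, each of which could live in its own file.
TOKEN_LAYER = """
<layer type="token">
  <tok id="t1" start="0" end="9"/>
  <tok id="t2" start="10" end="13"/>
  <tok id="t3" start="14" end="23"/>
</layer>
"""

POS_LAYER = """
<layer type="pos">
  <pos target="t1" tag="noun"/>
  <pos target="t2" tag="verb"/>
  <pos target="t3" tag="adjective"/>
</layer>
"""


def build_view(base_text, token_xml, pos_xml):
    """Combine the base text and both layers into one in-memory view."""
    tokens = {}
    for tok in ET.fromstring(token_xml.strip()).iter("tok"):
        start, end = int(tok.get("start")), int(tok.get("end"))
        tokens[tok.get("id")] = {"text": base_text[start:end], "span": (start, end)}
    # The second layer refers to token ids rather than to the text itself.
    for pos in ET.fromstring(pos_xml.strip()).iter("pos"):
        tokens[pos.get("target")]["pos"] = pos.get("tag")
    return tokens


view = build_view(BASE_TEXT, TOKEN_LAYER, POS_LAYER)
```

Because each layer is a separate document, an alternative part-of-speech layer could be swapped in without touching the base text or the token layer.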
11 Data Models
- XML support for easy transduction of tags makes a common tag set less of an issue
- But... must have a common underlying data model
  - formalized description of data objects
    - composition, attributes, class membership, applicable procedures, etc.
  - relations among these, independent of instantiation in any particular form
12 The data model...
- must be able to capture structure and relations in diverse types of data and annotations
- impacts the design of annotation schema, encoding formats, data and tool architectures
- is the most important current need for corpus-based work
13 Existing models
- TIPSTER
  - object-oriented
  - designed for use in IE
- ATLAS
  - annotation graph formalism
  - designed for use in speech
Design strongly influenced by background assumptions that may not scale up
14 Abstraction
- an annotation is a one- or two-way link between
  - an annotation object, and
  - a point or span (or a list/set of points or spans) within a base data set
- Links may or may not have a semantics
- Points and spans may be objects, or sets/lists of objects
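The abstraction above can be sketched as a small data structure. This is one possible reading, not a proposed standard: spans are assumed to be half-open character offsets, a point is a span whose start equals its end, and the class and method names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

Span = Tuple[int, int]  # half-open [start, end) offsets; start == end is a point


@dataclass
class Annotation:
    """An annotation object linked to one or more targets in the base data."""
    label: str
    targets: List[Span]
    features: Dict[str, str] = field(default_factory=dict)

    def covered_text(self, base: str) -> List[str]:
        """Follow the link from the annotation to the base data."""
        return [base[start:end] for start, end in self.targets]


class AnnotationStore:
    """Holds annotations and supports lookup in the other direction."""

    def __init__(self, base: str) -> None:
        self.base = base
        self.annotations: List[Annotation] = []

    def add(self, annotation: Annotation) -> None:
        self.annotations.append(annotation)

    def at(self, offset: int) -> List[Annotation]:
        """Follow the link from a position in the base data to its annotations."""
        return [
            ann for ann in self.annotations
            if any(start <= offset < end for start, end in ann.targets)
        ]


base = "She looked the word up"
# A discontinuous target: one annotation object linked to a list of spans.
verb = Annotation("phrasal-verb", [(4, 10), (20, 22)], {"lemma": "look up"})
store = AnnotationStore(base)
store.add(verb)
```

Here `covered_text` and `at` together give the two-way link; dropping `at` would leave a one-way link from annotation to data.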
15 Observations
- assumes fundamental linearity of objects in the base
  - time line (speech)
  - sequence of characters, words, sentences, etc.
  - pixel data
  - etc.
- the granularity of the data representation and encoding is critical
Targets may be individual objects or sets or lists of objects, so information with more than one dimension is accommodated
16 Implications
- annotation scheme must be mappable to the structures defined for annotation objects
- encoding scheme must be able to capture the object structure and relations expressed in the model (e.g., class membership and inheritance)
- requires sophisticated means to specify linkage
  - consider logistics of identifying spans by enclosing them in start and end tags (enabling hierarchical grouping of objects in the data), vs. explicit addressing of start and end points
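The contrast between enclosing tags and explicit addressing can be made concrete with a short conversion sketch. It takes a small inline-tagged fragment and derives the plain text plus explicit start and end offsets for every element. The sample markup and the function name are invented for illustration.

```python
import xml.etree.ElementTree as ET


def inline_to_offsets(xml_string):
    """Return (plain_text, spans) where each span is (tag, start, end)."""
    pieces = []
    spans = []
    position = 0

    def walk(element):
        nonlocal position
        start = position
        if element.text:
            pieces.append(element.text)
            position += len(element.text)
        for child in element:
            walk(child)
            # Text after a child's end tag belongs to the parent element.
            if child.tail:
                pieces.append(child.tail)
                position += len(child.tail)
        spans.append((element.tag, start, position))

    walk(ET.fromstring(xml_string))
    return "".join(pieces), spans


INLINE = "<s><np>Resources</np> <vp>are <adj>expensive</adj></vp>!</s>"
text, spans = inline_to_offsets(INLINE)
```

The hierarchy that nesting expressed implicitly (the `adj` element inside `vp`) survives only as span containment in the offset form, which is one of the trade-offs the slide asks the reader to weigh.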
17 Implications...
- must be possible to represent objects and relations in some form that is both usable by a variety of tools and prevents information loss
  - ideally, in a variety of formats suitable to different tools and applications
18 Recommendation
- Form a group to study this, consisting of representatives for
  - different areas of LE (text, speech, etc.)
  - different languages, geographical locations
  - different media
  - different user needs
  - Information Retrieval and Computer Science
19 Tools and Tool Architectures
- must support multi-lingual, multi-modal data
- must be flexible
  - adaptable to different annotation schemes, different applications
- must be extensible
- must be reusable
20 Existing systems
- MULTEXT (1994)
  - developed fundamental data and tool architecture for corpora used in subsequent systems
  - tool modularity, pipeline tool architecture
  - API interface
  - SGML encoding standard for linguistic annotation (CES)
  - concept of "stand-off" annotation
21
- LT XML (1999), U of Edinburgh
  - grew out of MULTEXT
  - views XML files as either
    - flat stream of markup and text
    - tree-structured XML
  - powerful query language
22
- GATE (Sheffield)
  - implements TIPSTER data and tool architecture
  - object model for data and annotation
  - modular tool design, very extensible
- ATLAS (2000) (NIST)
  - still in development
  - layered data and tool architecture similar to previous systems
  - annotation graph formalism instantiated in XML
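As a rough illustration of the annotation graph idea mentioned for ATLAS, the sketch below models nodes that may carry a time offset and labeled arcs between them, with one arc layer per annotation type. It is a simplified reading of the formalism, not the ATLAS API; the class names, the layer names, and the time values are all made up.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Arc:
    src: str    # id of the node where the annotated stretch begins
    dst: str    # id of the node where it ends
    layer: str  # annotation type, e.g. "word" or "phrase"
    label: str  # the annotation content


class AnnotationGraph:
    def __init__(self) -> None:
        self.node_times: Dict[str, Optional[float]] = {}
        self.arcs: List[Arc] = []

    def add_node(self, node_id: str, time: Optional[float] = None) -> None:
        """Register a node; the time offset is optional."""
        self.node_times[node_id] = time

    def add_arc(self, src: str, dst: str, layer: str, label: str) -> Arc:
        for node_id in (src, dst):
            if node_id not in self.node_times:
                raise KeyError(f"unknown node: {node_id}")
        arc = Arc(src, dst, layer, label)
        self.arcs.append(arc)
        return arc

    def labels(self, layer: str) -> List[str]:
        return [arc.label for arc in self.arcs if arc.layer == layer]

    def extent(self, arc: Arc) -> Tuple[Optional[float], Optional[float]]:
        return (self.node_times[arc.src], self.node_times[arc.dst])


graph = AnnotationGraph()
graph.add_node("n0", 0.0)
graph.add_node("n1")        # no time offset recorded for this node
graph.add_node("n2", 0.9)
graph.add_arc("n0", "n1", "word", "hello")
graph.add_arc("n1", "n2", "word", "world")
greeting = graph.add_arc("n0", "n2", "phrase", "greeting")
```

Several layers share the same nodes, so a phrase-level arc and the word-level arcs it spans stay aligned without either layer embedding the other.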
23 Agreement on tools/systems
- tool architecture
  - "plug-and-play"
  - modular
- layered design
  - physical storage representation
  - intermediate data representation (model)
  - API to enable application development
- query capability
- stand-off data architecture
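A toy sketch of the modular, pipeline style of tool architecture described on the preceding slides: each tool is an independent function that reads a shared document, adds its own annotation layer, and leaves the text alone. The document layout, the tool names, and the "long word" rule are all invented for illustration.

```python
def tokenizer(doc):
    """Add a 'token' layer of (start, end) offsets for whitespace-separated words."""
    text = doc["text"]
    tokens = []
    position = 0
    for word in text.split():
        start = text.index(word, position)
        end = start + len(word)
        tokens.append((start, end))
        position = end
    doc["layers"]["token"] = tokens
    return doc


def long_word_tagger(doc):
    """Add a 'long_word' layer; depends only on the 'token' layer, not on the tokenizer."""
    doc["layers"]["long_word"] = [
        (start, end) for start, end in doc["layers"]["token"] if end - start > 5
    ]
    return doc


def run_pipeline(text, tools):
    """Apply each tool in order to a document holding text and stand-off layers."""
    doc = {"text": text, "layers": {}}
    for tool in tools:
        doc = tool(doc)
    return doc


doc = run_pipeline("tools must be reusable", [tokenizer, long_word_tagger])
```

Since tools communicate only through named layers, a different tokenizer could be plugged in without changing `long_word_tagger`, as long as it writes the same `token` layer.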
24 Details to work out
- data model
- level to which the notion of modularity should extend
  - gross function, or minimal function?
- best means to accommodate different languages, modalities
  - engine-based approach, language- or medium-specific knowledge as data?
25 Tool Support Components
- resources are large
  - compression and indexing required for a usable system
- compression is easy
  - excellent compression techniques for XML data
- indexing is trickier
  - good techniques for full-text search exist
  - but... may not scale up to more complex data
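The two support components can be sketched with the Python standard library. General-purpose `zlib` compression stands in here for the XML-specific techniques the slide alludes to, and the index is a plain full-text inverted index. The sample documents and function names are hypothetical.

```python
import re
import zlib
from collections import defaultdict


def compress_xml(xml_text):
    """Compress an XML document with a general-purpose compressor."""
    return zlib.compress(xml_text.encode("utf-8"))


def decompress_xml(blob):
    return zlib.decompress(blob).decode("utf-8")


def build_index(docs):
    """Map each lowercased word to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, xml_text in docs.items():
        # Strip markup first: this index sees only the text content.
        plain = re.sub(r"<[^>]+>", " ", xml_text)
        for word in re.findall(r"\w+", plain.lower()):
            index[word].add(doc_id)
    return index


DOCS = {
    "d1": "<s><w>large</w> <w>corpus</w></s>",
    "d2": "<s><w>annotated</w> <w>corpus</w></s>",
}
index = build_index(DOCS)
```

An index like this answers word queries but knows nothing about element structure or cross-document links, which is the kind of limitation behind the slide's remark that full-text techniques may not scale up to more complex data.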
26 Non-traditional data
- Documents with diagrams, engineering drawings
- Illustrated books, with body text and illustration intermingled or overlaid
- Manuscripts in which the physical details of the calligraphy and media matter
- Interlinked texts, including output of machine translation systems, speech transcription efforts, lexicographic endeavors
- Databases of phonetic phenomena
- Personal and public information spaces: hard disk folder structures, mailing list archives, personal email archives, voice mailboxes, etc.
- Dialogue
- etc.
27 Recommendations
- develop architectures that abandon the notion of a single distinguished time line
- adopt ideas from the database community
  - work on semi-structured data
  - work that views XML documents as a collection of documents with additional tags and relations between tags
28 Conclusion
- design tools and resources not based on needs of a particular research community
- open architecture approach
- build on existing standards, emerging consensus
- (widely) distributed development
- involve other relevant communities (IR, CS)