MASC: The Manually Annotated Sub-Corpus of American English

Slides: 16
Provided by: nanc182
Learn more at: http://www.lrec-conf.org

Transcript and Presenter's Notes

1
MASC: The Manually Annotated Sub-Corpus of American English
  • Nancy Ide, Collin Baker, Christiane Fellbaum,
    Charles Fillmore, Rebecca Passonneau

2
MASC
  • Manually Annotated Sub-Corpus
  • NSF-funded project to provide a sharable,
    reusable annotated resource with rich linguistic
    annotations
  • Vassar, ICSI, Columbia, Princeton
  • texts from diverse genres
  • manual annotations or manually-validated
    annotations for multiple levels
  • WordNet senses
  • FrameNet frames and frame elements
  • shallow parses
  • named entities
  • Enables linking WordNet senses and FrameNet
    frames into more complex semantic structures
  • Enriches semantic and pragmatic information
  • detailed inter-annotator agreement measures

3
Contents
  • Texts drawn from the Open ANC
  • Several genres
  • Written (travel guides, blog, fiction, letters,
    newspaper, non-fiction, technical, journal,
    government documents)
  • Spoken (face-to-face, academic, telephone)
  • Free of license restrictions, redistributable
  • Download from ANC website
  • All MASC data and annotations will be freely
    downloadable

4
Annotation Process
  • Smaller portions of the sub-corpus manually
    annotated for specific phenomena
  • Maintain representativeness
  • Include as many annotations of different types as
    possible
  • Apply (semi)-automatic annotation techniques to
    determine the reliability of their results
  • Study inter-annotator agreement on
    manually-produced annotations
  • Determine benchmark of accuracy
  • Fine-tune annotator guidelines
  • Consider if accurate annotations for one
    phenomenon can improve performance of automatic
    annotation systems for another
  • E.g., validated WN sense tags and noun chunks may
    improve automatic semantic role labeling

5
Process (continued)
  • Apply iterative process to maximize performance
    of automatic taggers
  • Manual annotation
  • Retrain automatic annotation software
  • Improved annotation software can later be applied
    to the entire ANC
  • Provide more accurate automatically-produced
    annotation of full corpus

6
Composition Relative to Whole OANC
[Figure not preserved in transcript; surviving label: WordNet annotations]

7
MASC Core
  • Includes
  • 25K-word subset fully annotated (all words) for
    FrameNet frames and WordNet senses
  • 40K-word corpus annotated by the Unified Linguistic
    Annotation project
  • PropBank, NomBank, Penn Treebank, Penn Discourse
    Treebank, TimeBank
  • Small subset of WSJ with many annotations
  • Other annotations rendered into GrAF for
    compatibility

8
Representation
  • ISO TC37 SC4 Linguistic Annotation Framework
  • Graph of feature structures (GrAF)
  • isomorphic to other feature structure-based
    representations (e.g. UIMA CAS)
  • Each annotation in a separate stand-off document
    linked to primary data or other annotations
  • Merge annotations with ANC API
  • Output in any of several formats
  • XML
  • non-XML for use with systems such as NLTK and
    concordancing tools
  • UIMA CAS
  • Input to GraphViz
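
The stand-off idea on this slide can be illustrated with a short Python sketch. It is not the ANC API and does not read or write real GrAF files; the class names `Node` and `Annotation`, the functions `merge` and `to_dot`, the sample sentence, and the layer labels are all hypothetical and chosen only for illustration. Each annotation layer is kept separate from the primary text and points at it by character offsets, and merging simply collects the layers per node before writing a Graphviz DOT string.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Node:
    """A graph node anchored to a character span of the primary text."""
    node_id: str
    start: int
    end: int


@dataclass
class Annotation:
    """A feature structure attached to a node, stored in its own layer."""
    node_id: str
    layer: str
    features: dict = field(default_factory=dict)


def merge(text, nodes, layers):
    """Combine several stand-off layers into one record per node."""
    merged = {n.node_id: {"text": text[n.start:n.end]} for n in nodes}
    for layer in layers:
        for ann in layer:
            merged[ann.node_id][ann.layer] = dict(ann.features)
    return merged


def to_dot(merged):
    """Render merged records as Graphviz DOT, one labelled node per record."""
    lines = ["digraph annotations {"]
    for node_id, record in merged.items():
        span_text = record["text"]
        layer_names = ", ".join(sorted(k for k in record if k != "text"))
        lines.append(f'  {node_id} [label="{span_text} ({layer_names})"];')
    lines.append("}")
    return "\n".join(lines)


# Primary data is never modified; every layer refers to it by offsets.
primary = "Nancy visited Princeton."
nodes = [Node("n1", 0, 5), Node("n2", 6, 13), Node("n3", 14, 23)]

# Two independent layers (illustrative labels only).
chunks = [
    Annotation("n1", "chunk", {"label": "NP"}),
    Annotation("n2", "chunk", {"label": "VP"}),
    Annotation("n3", "chunk", {"label": "NP"}),
]
entities = [
    Annotation("n1", "ne", {"type": "person"}),
    Annotation("n3", "ne", {"type": "location"}),
]

merged = merge(primary, nodes, [chunks, entities])
dot = to_dot(merged)
print(dot)
```

Because each layer lives in its own list, a new annotation type can be added later without touching the primary text or the existing layers, which is the property the stand-off design is meant to provide.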

9
WordNet annotation
  • Updating WSD systems to use WordNet version 3.0
  • Pedersen's SenseRelate
  • Mihalcea et al.'s SenseLearner
  • Apply to automatically assign WN sense tags to
    all content words (nouns, verbs, adjectives, and
    adverbs) in the entire OANC
  • Manually validate a set of words from whole OANC
  • Manually validate all words in 25K FN-annotated
    subset

10
FrameNet Annotation
  • Full manual annotation of the 25K subset in the
    FrameNet full-text manner
  • Application of automatic semantic role labeling
    software over entire MASC
  • Improve automatic semantic role labeling (ASRL)
  • Use active learning
  • ASRL system results evaluated to determine where
    the most errors occur
  • Extra manual annotation done to improve
    performance
  • Draw from entire OANC, possibly even other
    sources for examples
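
The error-driven step described above (evaluate the labeler, find where it errs most, annotate more there) might be sketched as follows. This is only an assumption about how such a ranking could be done, not the project's actual procedure; the function name `rank_by_error_rate` and the placeholder category and role names are hypothetical.

```python
from collections import defaultdict


def rank_by_error_rate(examples, min_count=1):
    """Rank categories by how often the automatic label disagrees with gold.

    `examples` is an iterable of (category, gold_label, predicted_label).
    Categories seen fewer than `min_count` times are skipped.
    Returns (category, error_rate) pairs, worst first, ties broken by name.
    """
    totals = defaultdict(int)
    errors = defaultdict(int)
    for category, gold, predicted in examples:
        totals[category] += 1
        if gold != predicted:
            errors[category] += 1
    rates = {
        category: errors[category] / totals[category]
        for category in totals
        if totals[category] >= min_count
    }
    return sorted(rates.items(), key=lambda item: (-item[1], item[0]))


# Placeholder data: (category, gold role, predicted role).
examples = [
    ("frame_A", "role_1", "role_1"),
    ("frame_A", "role_2", "role_1"),
    ("frame_B", "role_1", "role_1"),
    ("frame_B", "role_2", "role_2"),
    ("frame_C", "role_1", "role_2"),
]

ranking = rank_by_error_rate(examples)
print(ranking)
```

The categories at the top of the ranking are the natural candidates for extra manual annotation before the labeler is retrained.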

11
Alignment of Lexical Resources
  • Concurrent project investigating how and to what
    extent WordNet and FrameNet can be aligned
  • MASC annotations of 25K for FrameNet frames and
    frame elements and WordNet senses provide a
    ready-made testing ground

12
Interannotator agreement
  • Use a suite of metrics that measure different
    characteristics
  • Interannotator agreement coefficients such as
    Cohen's kappa
  • Average F-measure to determine proportion of the
    annotated data all annotators agree on
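
The two kinds of measure named on this slide can be made concrete with a minimal Python sketch. Cohen's kappa follows its standard two-annotator definition. The slide does not say how the average F-measure is computed, so averaging a symmetric F-measure over all annotator pairs is an assumption here, as are the function names and the toy data.

```python
from collections import Counter
from itertools import combinations


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: both annotators pick the same label independently.
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


def f_measure(items_a, items_b):
    """Symmetric F-measure between two sets of annotated items."""
    if not items_a and not items_b:
        return 1.0
    return 2 * len(items_a & items_b) / (len(items_a) + len(items_b))


def average_f_measure(annotator_sets):
    """Mean F-measure over all pairs of annotators (assumed definition)."""
    pairs = list(combinations(annotator_sets, 2))
    if not pairs:
        raise ValueError("need at least two annotators")
    return sum(f_measure(a, b) for a, b in pairs) / len(pairs)


# Toy sense labels from two annotators over four tokens.
kappa = cohens_kappa(["s1", "s1", "s2", "s2"], ["s1", "s2", "s2", "s2"])

# Toy annotated spans (start, end) from three annotators.
spans = [
    {(0, 5), (6, 13)},
    {(0, 5), (14, 23)},
    {(0, 5), (6, 13)},
]
avg_f = average_f_measure(spans)
print(kappa, avg_f)
```

Kappa corrects raw agreement for chance and needs every annotator to label the same items, while the F-measure compares which items were annotated at all, so the two numbers capture different characteristics of the same data.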

13
IAA
  • Determine impact of these two measures
  • consider the relation between the agreement
    coefficient values / F-measure and potential
    users of the planned annotations
  • Simultaneous investigations of interannotator
    agreement and measurable results of using
    different annotations of the same data provide a
    stronger picture of the integrity of annotated
    data (Passonneau et al. 2005; Passonneau et al.
    2006)

14
Overall Goal
  • Continually augment MASC with contributed
    annotations from the research community
  • Discourse structure, additional entities, events,
    opinions, etc.
  • Distribution of effort and integration of
    currently independent resources such as the ANC,
    WordNet, and FrameNet will enable progress in
    resource development
  • Less cost
  • No duplication of effort
  • Greater degree of accuracy and usability
  • Harmonization

15
Conclusion
  • MASC will provide a much-needed resource for
    computational linguistics research aimed at the
    development of robust language processing systems
  • MASC's availability should have a major impact on
    the speed with which similar resources can be
    reliably annotated
  • MASC will be the largest semantically annotated
    corpus of English in existence
  • WN and FN annotation of the MASC will immediately
    create a massive multi-lingual resource network
  • Both WN and FN linked to corresponding resources
    in other languages
  • No existing resource approaches this scope