MASC: The Manually Annotated Sub-Corpus of American English

Slides: 16

Provided by: nanc182

Learn more at: http://www.lrec-conf.org

1
MASC: The Manually Annotated Sub-Corpus of
American English

Nancy Ide, Collin Baker, Christiane Fellbaum,
Charles Fillmore, Rebecca Passonneau

2
MASC

Manually Annotated Sub-Corpus
NSF-funded project to provide a sharable,
reusable annotated resource with rich linguistic
annotations
Vassar, ICSI, Columbia, Princeton
texts from diverse genres
manual annotations or manually-validated
annotations for multiple levels
WordNet senses
FrameNet frames and frame elements
shallow parses
named entities
Enables linking WordNet senses and FrameNet
frames into more complex semantic structures
Enriches semantic and pragmatic information
detailed inter-annotator agreement measures

3
Contents

Texts drawn from the Open ANC
Several genres
Written (travel guides, blog, fiction, letters,
newspaper, non-fiction, technical, journal,
government documents)
Spoken (face-to-face, academic, telephone)
Free of license restrictions, redistributable
Download from ANC website
All MASC data and annotations will be freely
downloadable

4
Annotation Process

Smaller portions of the sub-corpus manually
annotated for specific phenomena
Maintain representativeness
Include as many annotations of different types as
possible
Apply (semi-)automatic annotation techniques to
determine the reliability of their results
Study inter-annotator agreement on
manually-produced annotations
Determine benchmark of accuracy
Fine-tune annotator guidelines
Consider if accurate annotations for one
phenomenon can improve performance of automatic
annotation systems for another
e.g., validated WN sense tags and noun chunks may
improve automatic semantic role labeling

5
Process (continued)

6
Composition Relative to Whole OANC

WordNet annotations

7
MASC Core

8
Representation

ISO TC37 SC4 Linguistic Annotation Framework
Graph of feature structures (GrAF)
isomorphic to other feature structure-based
representations (e.g. UIMA CAS)
Each annotation in a separate stand-off document
linked to primary data or other annotations
Merge annotations with ANC API
Output in any of several formats
XML
non-XML for use with systems such as NLTK and
concordancing tools
UIMA CAS
Input to GraphViz
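
The representation described on this slide can be illustrated with a minimal sketch. It is not the GrAF XML format and does not use the ANC API. The class and function names (`Region`, `Node`, `merge`, `covered_text`, `to_dot`), the sample sentence, and the feature values are all hypothetical. The sketch assumes each annotation layer is kept separately, that nodes carry a feature structure (a plain dict here), and that nodes point either at character offsets in the primary data or at other annotation nodes.

```python
from dataclasses import dataclass, field


@dataclass
class Region:
    """A span of the primary text, given as character offsets (assumed convention)."""
    start: int
    end: int


@dataclass
class Node:
    """A graph node carrying a feature structure (a plain dict in this sketch)."""
    node_id: str
    features: dict
    regions: list = field(default_factory=list)   # links to primary data
    children: list = field(default_factory=list)  # edges to other node ids


def merge(*layers):
    """Combine separately stored annotation layers into one graph keyed by node id."""
    graph = {}
    for layer in layers:
        for node in layer:
            if node.node_id in graph:
                raise ValueError(f"duplicate node id: {node.node_id}")
            graph[node.node_id] = node
    return graph


def covered_text(graph, node_id, primary):
    """Return the primary-text spans a node covers, following edges to other nodes."""
    node = graph[node_id]
    spans = [primary[r.start:r.end] for r in node.regions]
    for child_id in node.children:
        spans.extend(covered_text(graph, child_id, primary))
    return spans


def to_dot(graph):
    """Serialize the annotation edges as DOT text, which GraphViz can read."""
    lines = ["digraph annotations {"]
    for node in graph.values():
        for child_id in node.children:
            lines.append(f'  "{node.node_id}" -> "{child_id}";')
    lines.append("}")
    return "\n".join(lines)


# Hypothetical primary data and two stand-off layers (tokens, named entities).
primary = "Nancy visited Vassar."
tokens = [
    Node("tok1", {"pos": "NNP"}, regions=[Region(0, 5)]),
    Node("tok2", {"pos": "VBD"}, regions=[Region(6, 13)]),
    Node("tok3", {"pos": "NNP"}, regions=[Region(14, 20)]),
]
entities = [Node("ne1", {"type": "org"}, children=["tok3"])]

graph = merge(tokens, entities)
print(covered_text(graph, "ne1", primary))
print(to_dot(graph))
```

Because the entity layer refers to token nodes and never copies text, either layer can be replaced or extended without touching the primary data, which is the property the stand-off design is meant to give.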

9
WordNet annotation

Updating WSD systems to use WordNet version 3.0
Pedersen's SenseRelate
Mihalcea et al.'s SenseLearner
Apply to automatically assign WN sense tags to
all content words (nouns, verbs, adjectives, and
adverbs) in the entire OANC
Manually validate a set of words from whole OANC
Manually validate all words in 25K FN-annotated
subset
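
To make the idea of automatic sense tagging concrete, here is a toy sketch of a simplified Lesk-style tagger, which picks the sense whose gloss overlaps most with the surrounding context. This is a swapped-in illustration only: it is not the SenseRelate or SenseLearner algorithm, and the sense identifiers and glosses in `TOY_INVENTORY` are made up rather than taken from WordNet 3.0.

```python
# Small stopword list, an assumption made for this sketch only.
STOPWORDS = {"a", "an", "the", "of", "or", "and", "in", "to", "that", "for", "on", "by"}

# Hypothetical sense inventory: identifiers and glosses are invented, not WordNet entries.
TOY_INVENTORY = {
    "bank": [
        ("bank#river", "sloping land beside a body of water"),
        ("bank#finance", "financial institution that accepts deposits and lends money"),
    ],
}


def content_tokens(text):
    """Lowercase, split on whitespace, and drop stopwords."""
    return {w for w in text.lower().split() if w not in STOPWORDS}


def simplified_lesk(word, context, inventory):
    """Return the sense id whose gloss shares the most content words with the context."""
    context_words = content_tokens(context) - {word}
    best_sense, best_overlap = None, -1
    for sense_id, gloss in inventory[word]:
        overlap = len(context_words & content_tokens(gloss))
        if overlap > best_overlap:  # ties keep the first-listed sense
            best_sense, best_overlap = sense_id, overlap
    return best_sense


tag = simplified_lesk("bank", "she sat on the bank and watched the water", TOY_INVENTORY)
print(tag)
```

Output of this kind is what the manual validation step would then check: an annotator confirms or corrects each automatically assigned tag.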

10
FrameNet Annotation

11
Alignment of Lexical Resources

Concurrent project investigating how and to what
extent WordNet and FrameNet can be aligned
MASC annotations of the 25K subset for FrameNet frames
and frame elements and WordNet senses provide a
ready-made testing ground

12
Interannotator agreement

Use a suite of metrics that measure different
characteristics
Interannotator agreement coefficients such as
Cohen's kappa
Average F-measure to determine proportion of the
annotated data all annotators agree on
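
The two kinds of metric named above can be sketched as follows. Cohen's kappa is computed in its standard two-annotator form, (observed agreement - chance agreement) / (1 - chance agreement). The F-measure part is an assumption: the slide does not say how the average is taken, so this sketch averages the F-measure over all annotator pairs, treating each annotator's output as a set of annotated spans. The function names and sample data are hypothetical.

```python
from collections import Counter
from itertools import combinations


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items in the same order."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: probability both annotators pick the same label independently.
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1 - expected)


def pairwise_f(spans_a, spans_b):
    """F-measure between two sets of annotated spans (symmetric in its arguments)."""
    if not spans_a and not spans_b:
        return 1.0
    return 2 * len(spans_a & spans_b) / (len(spans_a) + len(spans_b))


def average_f(annotations):
    """Mean pairwise F-measure over all annotator pairs (assumed averaging scheme)."""
    pairs = list(combinations(annotations, 2))
    return sum(pairwise_f(a, b) for a, b in pairs) / len(pairs)


# Hypothetical sense labels from two annotators for four tokens.
kappa = cohen_kappa(["s1", "s1", "s2", "s2"], ["s1", "s2", "s2", "s2"])

# Hypothetical span annotations (character offsets) from three annotators.
avg_f = average_f([{(0, 5), (14, 20)}, {(14, 20)}, {(0, 5), (14, 20)}])

print(round(kappa, 3), round(avg_f, 3))
```

Kappa corrects for chance agreement over a fixed set of items, while the F-measure also applies when annotators mark different spans, which is one reason to report both.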

13
IAA

Determine impact of these two measures
consider the relation between the agreement
coefficient values / F-measure and potential
users of the planned annotations
Simultaneous investigations of interannotator
agreement and measurable results of using
different annotations of the same data provide a
stronger picture of the integrity of annotated
data (Passonneau et al. 2005; Passonneau et al.
2006)

14
Overall Goal

Continually augment MASC with contributed
annotations from the research community
Discourse structure, additional entities, events,
opinions, etc.
Distribution of effort and integration of
currently independent resources such as the ANC,
WordNet, and FrameNet will enable progress in
resource development
Less cost
No duplication of effort
Greater degree of accuracy and usability
Harmonization

15
Conclusion

MASC will provide a much-needed resource for
computational linguistics research aimed at the
development of robust language processing systems
MASC's availability should have a major impact on
the speed with which similar resources can be
reliably annotated
MASC will be the largest semantically annotated
corpus of English in existence
WN and FN annotation of the MASC will immediately
create a massive multi-lingual resource network
Both WN and FN linked to corresponding resources
in other languages
No existing resource approaches this scope