1
ANC/OANC, MASC, LAF/GrAF, and ULA
  • Nancy Ide
  • Vassar College

2
ANC and OANC
  • ANC
  • So far 22 million words across several genres
  • Written (travel guides, blog, fiction, letters,
    newspaper, non-fiction, technical, journal,
    government documents)
  • Spoken (face-to-face, academic, telephone)
  • Available through LDC ($75)
  • OANC
  • 15 million word subset, free of license
    restrictions, redistributable
  • Download from ANC website
  • Will add more data (20m words) this summer
  • All part of OANC
  • WSJ that is open, annotated by TimeBank,
    PropBank, NomBank, PTB, PDTB (34 documents)

3
ANC Annotations
  • ANC is distributed with stand-off annotations for
    POS and lemma (2 different tagsets), noun and
    verb chunks
  • Automatically produced, no validation
  • Recently added
  • Co-reference for portion of Slate Magazine
  • CLAWS C5 and C7
  • Download from ANC site
    (http://AmericanNationalCorpus.org)
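
The stand-off idea on this slide can be sketched in a few lines of Python. This is only an illustration, not the ANC file format: the `Annotation` class, the layer names, and the sample sentence are assumptions made for the example. The point it shows is that the text is never modified; each annotation only points at character offsets.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    start: int   # character offset where the span begins (inclusive)
    end: int     # character offset where the span ends (exclusive)
    layer: str   # hypothetical layer name, e.g. "pos" or "lemma"
    value: str   # the label assigned to the span


# The primary text stays untouched; annotations live beside it.
text = "Dogs bark loudly."

annotations = [
    Annotation(0, 4, "pos", "NNS"),
    Annotation(0, 4, "lemma", "dog"),
    Annotation(5, 9, "pos", "VBP"),
    Annotation(5, 9, "lemma", "bark"),
]


def covered_text(text, ann):
    """Return the stretch of primary text an annotation points at."""
    return text[ann.start:ann.end]


def select_layer(annotations, name):
    """Pick out one annotation layer, leaving the others alone."""
    return [a for a in annotations if a.layer == name]
```

Because layers are independent, a second tagset can be added as one more layer without touching the existing ones.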

4
ANC Annotations
  • Want to add annotations for a wide variety of
    linguistic phenomena
  • Use any software we can get our hands on
  • Freely available
  • Contributed or allowed use
  • E.g. (hopefully) BBN NER software
  • Contributed annotations
  • Multiple annotations of the same type
  • E.g. syntactic annotation: Charniak, Collins,
    Minipar, CMU Link parser
  • Adding annotations as funding allows

5
ANC Process
[Diagram labels: Automatically annotate; ANC processing; Merge some or all annotations; ANC Tool]

6
Merging
  • ANC does not merge on linguistic grounds (i.e.,
    no unification, a la GLARF)
  • Merging involves simply combining information
    referring to common elements (spans, nodes, etc.)
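
A minimal sketch of this kind of merge, assuming each annotation layer is a Python dict keyed by `(start, end)` character spans. The function name and the sample layers are hypothetical; no linguistic reconciliation is attempted, features are just pooled per span.

```python
from collections import defaultdict


def merge_by_span(*layers):
    """Pool the features of every layer that refers to the same span.

    Nothing is unified or checked for consistency. If two layers use
    the same feature name on the same span, the later one overwrites
    the earlier one, so competing layers need distinct feature names.
    """
    merged = defaultdict(dict)
    for layer in layers:
        for span, features in layer.items():
            merged[span].update(features)
    return dict(merged)


# Hypothetical layers over the text "Dogs bark loudly."
pos = {(0, 4): {"pos": "NNS"}, (5, 9): {"pos": "VBP"}}
chunks = {(0, 4): {"chunk": "NP"}, (5, 16): {"chunk": "VP"}}

merged = merge_by_span(pos, chunks)
```

Spans that appear in only one layer are carried over unchanged, which is why the result here has three entries rather than two.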

7
MASC
  • Manually Annotated Sub-Corpus
  • NSF-funded project to
  • Validate token and sentence boundaries, noun
    chunks, verb chunks, POS in a 5 million? word
    sub-corpus of the ANC
  • Budget cut by over 50%, so we are not sure of
    final size
  • Manually or semi-automatically annotate for
    WordNet senses and FrameNet frames
  • ICSI, Princeton
  • Inter-annotator (IA) agreement studies
  • Columbia
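
The slides do not say which agreement measure the studies use. As one common possibility, Cohen's kappa for two annotators can be sketched as below; the function name and the sense labels are made up for illustration.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa: (observed - expected) / (1 - expected)."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Fraction of items on which the two annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected by chance from each annotator's label frequencies.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[l] * counts_b[l] for l in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label
    return (observed - expected) / (1 - expected)


# Hypothetical sense labels from two annotators on four tokens.
annotator_1 = ["s1", "s1", "s2", "s2"]
annotator_2 = ["s1", "s1", "s2", "s1"]
kappa = cohen_kappa(annotator_1, annotator_2)
```

Here the annotators agree on 3 of 4 items (0.75) against a chance agreement of 0.5, which gives a kappa of 0.5.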

8
MASC

9
LAF/GrAF
  • Linguistic Annotation Framework (LAF) developed
    in ISO TC37 SC4
  • LAF defines a framework involving an abstract
    model for annotations that serves as a pivot
    into and out of which other formats are mapped
  • Abstract model: a graph of typed feature
    structures representing stand-off annotations
  • GrAF
  • The XML instantiation of the LAF abstract model
  • ANC is represented using GrAF
  • All annotations are stand-off
  • Multiple annotations of the same type (e.g. POS)
  • Annotations of multiple types
  • POS, noun and verb chunks, adding WordNet,
    FrameNet, dependency parse
  • We have an IBM Innovation Award to map GrAF to
    UIMA CAS
  • The graph of feature structures is the model
    underlying design of CAS
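
To make the abstract model concrete, here is a rough Python sketch of a graph whose nodes carry feature structures and whose leaves point into the primary text by offset. It is not the GrAF XML schema: the class names, the flat-dict feature structures, and the sample sentence are all assumptions for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    node_id: str
    features: dict = field(default_factory=dict)  # feature structure (flat here)
    span: tuple = None  # (start, end) offsets, set only on leaf nodes


@dataclass
class Graph:
    nodes: dict = field(default_factory=dict)
    edges: list = field(default_factory=list)  # (parent_id, child_id) pairs

    def add_node(self, node_id, features=None, span=None):
        self.nodes[node_id] = Node(node_id, dict(features or {}), span)

    def add_edge(self, parent, child):
        self.edges.append((parent, child))

    def children(self, node_id):
        return [c for p, c in self.edges if p == node_id]

    def span_of(self, node_id):
        """Own span for a leaf, otherwise the extent of all descendants."""
        node = self.nodes[node_id]
        if node.span is not None:
            return node.span
        spans = [self.span_of(c) for c in self.children(node_id)]
        if not spans:
            raise ValueError(f"node {node_id} is not anchored in the text")
        return (min(s for s, _ in spans), max(e for _, e in spans))


# Hypothetical annotations over "Dogs bark loudly."
g = Graph()
g.add_node("t1", {"pos": "NNS"}, span=(0, 4))
g.add_node("t2", {"pos": "VBP"}, span=(5, 9))
g.add_node("np", {"cat": "NP"})
g.add_node("s", {"cat": "S"})
g.add_edge("np", "t1")
g.add_edge("s", "np")
g.add_edge("s", "t2")
```

Only leaves touch the text; every higher node gets its extent through edges, which is what lets annotations of different types share one graph.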

10
OANC, MASC, ULA
  • OANC: 15m
  • MASC Core: 50K
  • ULA: 40K
  • Extra 10K: fiction, court transcript, blog

11
GrAF and ULA
  • As proof of concept, transduced WSJ annotations
    for NomBank, TimeBank, PropBank, PTB, PDTB into
    GrAF
  • Separate stand-off documents
  • Merged annotations
  • Generated GraphViz output
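
Producing GraphViz input from an annotation graph mostly means writing out nodes and edges in the DOT language. The sketch below builds the DOT text by hand; the function name, node ids, and labels are hypothetical, and it is not the code used for the ULA experiment.

```python
def to_dot(nodes, edges, name="annotations"):
    """Render a labeled directed graph as GraphViz DOT source text.

    nodes: dict mapping node id -> display label
    edges: list of (parent_id, child_id) pairs
    """
    lines = [f"digraph {name} {{"]
    for node_id, label in nodes.items():
        escaped = label.replace('"', '\\"')  # keep quotes inside labels legal
        lines.append(f'  {node_id} [label="{escaped}"];')
    for parent, child in edges:
        lines.append(f"  {parent} -> {child};")
    lines.append("}")
    return "\n".join(lines)


# Hypothetical fragment of a merged annotation graph.
nodes = {"s": "S", "np": "NP", "t1": "Dogs/NNS"}
edges = [("s", "np"), ("np", "t1")]
dot = to_dot(nodes, edges)
```

The resulting string can be saved to a `.dot` file and rendered with the standard `dot` command-line tool.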

Paper at LAW I, Prague, 2007

12
Merging Annotations
  • Involves simply combining the graphs for each
    annotation
  • Graph algorithms can be applied to collapse
    identically-labeled nodes with edges to common
    subgraphs
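
One way to picture the collapsing step: give every node a signature made of its label plus the already-merged ids of its children, and reuse a node whenever the signature has been seen before. This is a sketch under simplifying assumptions (acyclic graphs, plain string labels, leaves labeled by their span); the function and node names are invented for the example.

```python
def merge_graphs(*graphs):
    """Combine graphs, collapsing same-label nodes over the same subgraph.

    Each input graph maps node id -> (label, [child ids]).
    Leaves have an empty child list. Graphs are assumed acyclic.
    """
    canonical = {}  # (label, merged child ids) -> merged node id
    merged = {}     # merged node id -> (label, [merged child ids])

    def intern(graph, node_id, seen):
        if node_id in seen:
            return seen[node_id]
        label, children = graph[node_id]
        child_ids = tuple(intern(graph, c, seen) for c in children)
        signature = (label, child_ids)
        if signature not in canonical:
            new_id = f"n{len(canonical)}"
            canonical[signature] = new_id
            merged[new_id] = (label, list(child_ids))
        seen[node_id] = canonical[signature]
        return seen[node_id]

    for graph in graphs:
        seen = {}  # node ids are local to each input graph
        for node_id in graph:
            intern(graph, node_id, seen)
    return merged


# Two hypothetical annotations that both posit an NP over the first token.
ptb = {
    "t1": ("tok:0-4", []),
    "t2": ("tok:5-9", []),
    "np": ("NP", ["t1"]),
    "s": ("S", ["np", "t2"]),
}
chunker = {"a": ("tok:0-4", []), "c": ("NP", ["a"])}

merged_graph = merge_graphs(ptb, chunker)
```

The chunker's NP collapses into the PTB NP because both carry the same label and point at the same leaf, so the merged graph has four nodes rather than six.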

13
Transduction
[Diagram: PTB, PropBank, NomBank, PDTB, and TimeBank annotations transduced to GrAF]

14
GraphViz Output

15
ULA/MASC Corpus
  • PTB annotations transduced to GrAF
  • Merged PTB's POS labels with Penn tagging
    produced by automatic tagger in GATE
  • Using the merged result to validate the tags
  • Will also do for NP, VP
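
Once both tag sets sit on the same spans, validation can start from a simple disagreement report. The sketch below assumes each tagging is a dict from `(start, end)` span to tag; the function name and the sample tags are hypothetical.

```python
def find_disagreements(gold, auto):
    """List spans where the two taggings differ or only one has a tag.

    Returns (span, gold_tag, auto_tag) triples in text order;
    a missing tag is reported as None.
    """
    spans = sorted(set(gold) | set(auto))
    return [
        (span, gold.get(span), auto.get(span))
        for span in spans
        if gold.get(span) != auto.get(span)
    ]


# Hypothetical tags over "Dogs bark loudly"
ptb_tags = {(0, 4): "NNS", (5, 9): "VBP", (10, 16): "RB"}
gate_tags = {(0, 4): "NNS", (5, 9): "NN", (10, 16): "RB"}

to_review = find_disagreements(ptb_tags, gate_tags)
```

Only the flagged spans need a human look, which keeps manual validation focused on the cases where the sources conflict.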

16
FrameNet example

17
Contributions
  • Any annotation of OANC we get (automatic or
    manual), we will transduce to GrAF and make
    available
  • Please contribute annotations
  • Web interface for contributions
  • Web interface: create-a-corpus
  • Allows user to choose from among available
    annotations and available OANC texts, merge,
    produce output in format of choice
  • Choose
  • how to combine (e.g. if two tokens are one in
    other scheme)
  • strong vs. weak merging (?)
  • Could e.g. generate a PropBank annotation
    referring to spans instead of PTB elements
  • May also allow various annotation formats -> GrAF
  • We already have an API for transducing all the
    ULA annotation formats plus others (e.g. FrameNet)
    to GrAF
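
The "two tokens are one in the other scheme" case can be handled by comparing character offsets instead of token indices. A minimal sketch, assuming both tokenizations are lists of `(start, end)` spans over the same text; the function name and the example are illustrative only.

```python
def align_tokens(coarse, fine):
    """Map each coarse token span to the fine token spans it contains."""
    alignment = {}
    for c_start, c_end in coarse:
        alignment[(c_start, c_end)] = [
            (s, e) for s, e in fine if c_start <= s and e <= c_end
        ]
    return alignment


# Text: "can't stop"
# One scheme keeps "can't" whole; the other splits it into "ca" + "n't".
whole_words = [(0, 5), (6, 10)]
split_clitics = [(0, 2), (2, 5), (6, 10)]

alignment = align_tokens(whole_words, split_clitics)
```

With this mapping in hand, a merge can either attach both fine-grained annotations to the coarse token or keep the finer segmentation, which is the kind of choice the interface would expose.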

18
Contributions
  • Also contribute texts
  • American English
  • Post-1990
  • Unrestricted for re-distribution
  • Web interface for that too

19
Contributions
  • Software for annotation
  • documentHandler interfaces for additional output
    formats from ANCTool
  • Derived data
  • Frequency lists, bigrams, trigrams (some already
    there)
  • Anything else
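
Derived data such as frequency lists and n-gram counts can be produced with very little code. The following sketch uses only the Python standard library; the function name and the toy token list are assumptions, and it is not the script used for the lists already on the ANC site.

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Count every run of n consecutive tokens."""
    # zip over n staggered copies of the list yields the n-grams.
    return Counter(zip(*(tokens[i:] for i in range(n))))


tokens = "the dog saw the dog run".split()

unigrams = ngram_counts(tokens, 1)
bigrams = ngram_counts(tokens, 2)
trigrams = ngram_counts(tokens, 3)
```

`Counter.most_common()` then gives a frequency list sorted from most to least frequent.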

20
Desideratum
  • OANC texts, other freely re-distributable texts
    annotated by many projects/researchers for many
    phenomena
  • We are providing the infrastructure, you provide
    the annotations!

21
SIGANN
  • New ACL Special Interest Group for Annotation
  • SIGANN Shared Corpus: ULA corpus or MASC 50K
    core
  • Solicit annotations of all types from community