Flexible Interfaces in the Application - PowerPoint PPT Presentation

Slides: 32
Provided by: nes5
1
Flexible Interfaces in the Application of
Language Technology to an eScience Corpus
C.J. Rupp, Ann Copestake, Simone Teufel,
Benjamin Waldron
Computer Laboratory, University of Cambridge
2
Outline
  • Two key interfaces
  • SciXML: XML markup for the logical structure of
    research papers
  • SAF: Standoff Annotation Formalism for diverse
    linguistic information
  • Both are coded in XML and designed for flexibility,
  • but what that means is distinct in the two cases.

3
SciBorg Architecture
[Architecture diagram. Component labels: OSCAR, RASP, ERG/PET, POS tagging, WSD, RMRS merge, anaphora, rhetorical analysis, tasks, standoff annotation]
4
SciBorg Corpus
  • A corpus of Chemistry research papers from 3
    publishers:
  • The Royal Society of Chemistry (RSC),
  • The Nature Publishing Group (NPG), and
  • The International Union of Crystallography (IUCr).
  • Provided in the publishers' XML markup, but with
    distinct markup schemes.

5
Conversion to SciXML
[Diagram: RSC papers, PLOS Biology papers, Nature papers, IUCr papers, and Biology and CL (pdf) papers, all converted to SciXML]
6
SciXML Interface Requirements
  • Extensible
  • So we can add additional publications
  • Neutral
  • So as not to raise any IP issues
  • Compatible with existing software
  • Expressive enough
  • For adequate rendering in applications

7
Rendering Issues
  • We assume the application will display the paper
  • Probably in Hypertext
  • We must retain enough information to do this
    effectively
  • Previous versions of SciXML have focused on the
    logical structure of scientific papers.

8
The Development of SciXML
  • Developed for a medical corpus (2000)
  • Extracted from HTML web pages
  • Extended for a Computational Linguistics corpus
  • First from LaTeX
  • Then from PDF via OCR
  • Now defined as a RELAX NG schema

9
Legacy Issues
  • The original SciXML schema had to interpret
    formatting.
  • Lacking any organisation by function
  • Dictating a flat paragraph structure
  • Collecting all floats and notes in end lists
  • But excluding text formatting

10
Adapted from Publishers' Markup
  • List and Table formats
  • Inline text formatting
  • Functional paragraph types (e.g. Theorem)
  • Position markers for floats

11
Conversion by XSLT
  • Most constructs can be handled quite simply
  • <xsl:template match="sec">
  •   <DIV DEPTH="{@level}">
  •     <xsl:apply-templates/>
  •   </DIV>
  • </xsl:template>
  • Making the script virtually a stylesheet
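
The same recursive pattern can be mirrored in Python for readers less familiar with XSLT. This is only a sketch, not the project's converter: it assumes the publisher markup nests `sec` elements carrying a `level` attribute, as the template suggests, and it copies every other element unchanged, which a real stylesheet would not do. The function name `convert` and the sample input are made up for illustration.

```python
import xml.etree.ElementTree as ET

def convert(node):
    """Recursively copy a tree, renaming <sec level="n"> to <DIV DEPTH="n">."""
    if node.tag == "sec":
        out = ET.Element("DIV", {"DEPTH": node.get("level", "")})
    else:
        # Assumption: anything that is not a section is copied as-is.
        out = ET.Element(node.tag, dict(node.attrib))
    out.text, out.tail = node.text, node.tail
    for child in node:  # counterpart of <xsl:apply-templates/>
        out.append(convert(child))
    return out

# Hypothetical publisher input with two nested sections.
source = ET.fromstring('<sec level="1"><sec level="2"><p>Text</p></sec></sec>')
result = ET.tostring(convert(source), encoding="unicode")
print(result)
```

Each template rule corresponds to one branch of the `if`, which is why a conversion script written this way reads much like a stylesheet.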

12
Schema Development
  • Both the XSLT stylesheet and RNG Schema have been
    developed on a naïve basis.
  • Coding conversion for constructs that occur in
    the corpus
  • Eventually we have a big enough bag of tricks to
    make extension quite painless.

13
SciXML Constructs
  • Paper Identifiers
  • Unique identifiers, titles and authors
  • Sections
  • Divisions embed recursively with headers
  • Inline text markup
  • Font settings and LaTeX inclusion
  • Paragraph structure
  • Paragraph elements and sub paragraph boundaries
    in lists, abstracts, captions, etc.

14
SciXML Constructs
  • Citations and Cross References
  • Citations are significant, but we also need
    textual cross references, compound references,
    footnote markers, float markers.
  • Equations and examples
  • (Linguistic) examples and equation environments
  • Lists, tables and figures
  • Lists, including definitions lists, tables,
    figures, and various other sections for
    (external) data.
  • Bibliography
  • The bibliography section is important for
    citation tracking

15
RNG Schema (Fragment)
  • <define name="PAPER.ELEMENT">
  •   <element name="PAPER">
  •     <ref name="METADATA.ELEMENT"/>
  •     <optional><ref name="PAGE.ELEMENT"/></optional>
  •     <ref name="TITLE.ELEMENT"/>
  •     <optional><ref name="AUTHORLIST.ELEMENT"/></optional>
  •     <optional><ref name="ABSTRACT.ELEMENT"/></optional>
  •     <element name="BODY">
  •       <zeroOrMore>
  •         <ref name="DIV.ELEMENT"/>
  •       </zeroOrMore>
  •     </element>
  •     <optional>
  •       <element name="ACKNOWLEDGMENTS">
  •         <zeroOrMore>
  •           <choice>
  •             <ref name="REF.ELEMENT"/>
  •             <ref name="INLINE.ELEMENT"/>
  •           </choice>
  •         </zeroOrMore>
  •       </element>
  •     </optional>
  •     <optional><ref name="REFERENCELIST.ELEMENT"/></optional>
  •     <optional><ref name="AUTHORNOTELIST.ELEMENT"/></optional>
  •     <optional><ref name="FOOTNOTELIST.ELEMENT"/></optional>
  •     <optional><ref name="FIGURELIST.ELEMENT"/></optional>
  •     <optional><ref name="TABLELIST.ELEMENT"/></optional>
  •   </element>
  • </define>
  • <define name="REFERENCELIST.ELEMENT">

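The top-level content model in this fragment is a fixed sequence of required and optional children, so it can be approximated by a regular expression over child element names. The sketch below is not a RELAX NG validator. It assumes that each `X.ELEMENT` pattern defines an element named `X` (the fragment does not show this) and that the five list patterns are optional children of `PAPER` following `ACKNOWLEDGMENTS`. The function name `paper_children_ok` is hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# Child sequence of PAPER as read from the fragment: required names are bare,
# optional ones are wrapped in (...)?. Each name is followed by one space.
PAPER_MODEL = re.compile(
    r"METADATA (PAGE )?TITLE (AUTHORLIST )?(ABSTRACT )?BODY (ACKNOWLEDGMENTS )?"
    r"(REFERENCELIST )?(AUTHORNOTELIST )?(FOOTNOTELIST )?(FIGURELIST )?(TABLELIST )?"
)

def paper_children_ok(paper):
    """Check only the order and presence of PAPER's direct children."""
    sequence = "".join(child.tag + " " for child in paper)
    return paper.tag == "PAPER" and PAPER_MODEL.fullmatch(sequence) is not None

# Made-up instances: the first respects the order, the second swaps two children.
good = ET.fromstring("<PAPER><METADATA/><TITLE/><BODY><DIV/></BODY></PAPER>")
bad = ET.fromstring("<PAPER><TITLE/><METADATA/><BODY/></PAPER>")
print(paper_children_ok(good), paper_children_ok(bad))
```

Adding a new optional section to the schema corresponds to adding one more optional group to the expression, which is the sense in which such a model is easy to extend.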
16
Language Technology in SciBorg
  • The goal is Information Extraction from Chemistry
    research papers.
  • Various analysis components interfacing
  • Different levels of analysis
  • Different analysis methods
  • Specialised and General analysers
  • But a common semantic representation: RMRS
    (Robust Minimal Recursion Semantics)
  • And a common interface structure: SAF

17
Multiple Analysis Components
  • PET/ERG: deep analysis using detailed (HPSG)
    grammars and lexicons
  • RASP: robust shallow parsing with a statistically
    trained grammar
  • Each strand has a tokeniser, tagger and parser
  • OSCAR-3: analyses Chemistry terms and notation

18
Getting the Text out of SciXML
  • Only some spans of marked up text contain
    linguistic text.
  • Using SciXML we can divide elements into
  • Text (<P>), Markup (<IT>), Non-Text elements
    (<SUP>).
  • The analysers process, ignore and skip these,
    respectively.
  • We also use OSCAR-3 to detect data sections
    without significant text portions.
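
A minimal sketch of this three-way split, in Python with the standard library. It assumes only the three example elements named on the slide (`P`, `IT`, `SUP`); the real SciXML element classes are larger. The sample paragraph is invented, and unknown elements are skipped here purely for simplicity.

```python
import xml.etree.ElementTree as ET

TEXT = {"P"}        # processed: their content is linguistic text
MARKUP = {"IT"}     # ignored: drop the tag, keep the words inside
NON_TEXT = {"SUP"}  # skipped: content is not part of the running text

def linguistic_text(node):
    """Collect the running text under a text element."""
    parts = [node.text or ""]
    for child in node:
        if child.tag in MARKUP:
            parts.append(linguistic_text(child))
        # NON_TEXT children (and, in this sketch, unknown ones) add nothing.
        parts.append(child.tail or "")  # text after a child belongs to the parent
    return "".join(parts)

# Invented example paragraph.
body = ET.fromstring(
    "<BODY><P>The yield was <IT>calculated</IT> twice<SUP>3</SUP>.</P></BODY>"
)
sentences = [linguistic_text(e) for e in body.iter() if e.tag in TEXT]
print(sentences)
```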

19
SciBorg Parsing Architecture
[Parsing architecture diagram. Component labels: SciXML, OSCAR, Sentence splitter, Tokeniser for RASP, POS tagging, RASP parser, Tokeniser for ERG, PET parser, SAF Lattice]
20
SAF Interface Requirements
  • Support results from different analysis
    components.
  • Allow the combination of complementary results
  • But they will assign conflicting structures
  • Ambiguity is common
  • Analyses will form a graph or lattice (cf. chart
    parsing and word lattices)

21
Motivating Standoff
  • XML can only combine linguistic and formatting
    markup if they share the same tree structure
  • calculated for C11H18O3
  • <IT>calculated for</IT> C<SB>11</SB>H<SB>18</SB>O<SB>3</SB>
  • <v>calculated</v> <pp>for <ne>C11H18O3</ne></pp>

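The conflict in this example can be checked mechanically: two markups fit into one XML tree only if no span of one crosses a span of the other. The sketch below assigns character offsets in the plain string by hand; the helper name `crosses` and the dictionaries are illustrative, not part of SAF.

```python
def crosses(a, b):
    """True if spans a and b overlap without one containing the other."""
    (s1, e1), (s2, e2) = sorted([a, b])
    return s1 < s2 < e1 < e2

text = "calculated for C11H18O3"

# Spans as (start, end) character offsets into `text`.
formatting = {"IT": (0, 14)}                                  # "calculated for"
linguistic = {"v": (0, 10), "pp": (11, 23), "ne": (15, 23)}   # verb, PP, named entity

conflicts = [
    (f_name, l_name)
    for f_name, f_span in formatting.items()
    for l_name, l_span in linguistic.items()
    if crosses(f_span, l_span)
]
print(conflicts)
```

The italic span and the prepositional phrase share the word "for" but neither contains the other, so they cannot both be elements of a single tree.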
22
Standoff Annotation
  • A common solution is to separate the flow of text
    from the annotations representing its analysis
  • The connection is formed by indexing at some
    consistent common level
  • SAF supports character offset indexing and XPoint
    indexing

23
Character Offset Indexing
  • Formatted text: Come here!
  • Raw text: "<p>Come <i>here</i>!</p>"
  • Unicode character points
  • .<.p.>.C.o.m.e. .<.i.>.h.e.r.e.<./.i.>.!.<./.p.>.
  • 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
    20 21 22 23 24
  • Tokens
  • <token from='3' to='7' value='Come'/>
  • <token from='11' to='15' value='here'/>
  • <token from='19' to='20' value='!'/>

24
XPoint Indexing
Root (/)
  P (/1)
    text (/1/1)      .C.o.m.e. .
    I (/1/2)
      text (/1/2/1)  .h.e.r.e.
    text (/1/3)      .!.
25
Index Conversion
  • We currently use both character offset and XPoint
    indexing.
  • The choice is influenced by the XML parser.
  • This implies maintaining a conversion table for a
    (SciXML) file.
  • /1/3/0 <-> 19

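One way to picture such a conversion table is to build it in a single pass over the raw markup. The sketch below is an assumption about how this could be done, not the SciBorg implementation: it handles only plain start tags, end tags, empty tags and text (no comments, CDATA, or entities), and it counts text nodes as children, as the XPoint slide does. The name `xpoint_table` is made up.

```python
import re

# Either a tag (groups 1-3: closing slash, name, self-closing slash) or a text run.
TOKEN = re.compile(r"<(/?)([^<>/\s]+)[^<>]*?(/?)>|([^<]+)")

def xpoint_table(raw):
    """Map 'node-path/point' strings to character offsets in `raw`."""
    table = {}
    path = []       # child indices of the currently open elements
    counters = [0]  # children seen so far at each open level
    for m in TOKEN.finditer(raw):
        closing, _name, empty, text = m.groups()
        if text is not None:
            counters[-1] += 1
            node = "/" + "/".join(map(str, path + [counters[-1]]))
            for i in range(len(text) + 1):   # one point before and after each character
                table[f"{node}/{i}"] = m.start() + i
        elif closing:
            path.pop()
            counters.pop()
        else:
            counters[-1] += 1
            if not empty:
                path.append(counters[-1])
                counters.append(0)
    return table

raw = "<p>Come <i>here</i>!</p>"
table = xpoint_table(raw)
inverse = {offset: point for point, offset in table.items()}

# Slicing the raw string between two points of one text node recovers a token.
print(raw[table["/1/2/1/0"]:table["/1/2/1/4"]])   # here
```

Keeping both `table` and `inverse` lets a component that works with character offsets exchange annotations with one that works with XPoints.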
26
Standards for Standoff Annotation
  • MAF: ISO standard for morphological annotation
  • SMAF: an emergent standard extending this to the
    sentence level, e.g. for parser input
  • SAF: includes all annotations for a paper in one
    file

27
Types of SAF Annotation
  • Sentence segments
  • <annot type='sentence' id='s133' from='42065'
    source='v4987' target='v5154' to='43039'
    value='calculated for C11H18O3.'/>
  • Tokens
  • <annot type='token' id='t5151' from='42988'
    to='43030' deps='s133' source='v5150'
    target='v5151' value='calculated'/>
  • <annot type='token' id='t5152' from='43031'
    to='43034' deps='s133' source='v5151'
    target='v5152' value='for'/>
  • <annot type='token' id='t5153' from='43035'
    to='43043' deps='s133' source='v5152'
    target='v5153' value='C11H18O3'/>

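The `source` and `target` attributes name lattice vertices, so each annotation is an edge. A small Python sketch of reading the token edges and walking the lattice follows. The `<saf>` wrapper element and the helper `paths` are assumptions for illustration; the attribute values are copied from the token annotations above.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Token annotations from the slide, inside a hypothetical <saf> wrapper.
SAF = """<saf>
<annot type='token' id='t5151' from='42988' to='43030' deps='s133'
       source='v5150' target='v5151' value='calculated'/>
<annot type='token' id='t5152' from='43031' to='43034' deps='s133'
       source='v5151' target='v5152' value='for'/>
<annot type='token' id='t5153' from='43035' to='43043' deps='s133'
       source='v5152' target='v5153' value='C11H18O3'/>
</saf>"""

# Each annotation is an edge from its source vertex to its target vertex.
edges = defaultdict(list)
for annot in ET.fromstring(SAF).iter("annot"):
    if annot.get("type") == "token":
        edges[annot.get("source")].append((annot.get("target"), annot.get("value")))

def paths(edges, start, end):
    """Yield the token values along every path from start to end (acyclic lattice assumed)."""
    if start == end:
        yield []
        return
    for target, value in edges[start]:
        for rest in paths(edges, target, end):
            yield [value] + rest

readings = list(paths(edges, "v5150", "v5153"))
print(readings)
```

With a single tokeniser there is one path. If a second tokeniser contributed competing edges between the same vertices, `paths` would return one reading per route through the lattice.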
28
Types of SAF Annotation
  • Part of Speech (POS) Tags
  • <annot type='pos' id='p5151' deps='t5151'
    source='v5150' target='v5151' value='VVN'/>
  • <annot type='pos' id='p5152' deps='t5152'
    source='v5151' target='v5152' value='IF'/>
  • <annot type='pos' id='p5153' deps='t5153'
    source='v5152' target='v5153' value='NP1'/>
  • OSCAR (NER) mark up
  • <annot from="/1/5/6/27/51/2/83.1"
    to="/1/5/6/27/51/2/88/1.1" type="oscar"
    id="o554"><slot name="type">compound</slot><slot
    name="surface">C11H18O3</slot><slot
    name="provenance">formulaRegex</slot></annot>

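Because a POS annotation points at its token through `deps`, the two layers can be joined by a lookup on `id`. The sketch below assumes that `deps` holds a single id, as in these examples, and again uses a hypothetical `<saf>` wrapper; offsets are omitted from the token annotations for brevity.

```python
import xml.etree.ElementTree as ET

# Token and POS annotations from the slides (token offsets left out here).
SAF = """<saf>
<annot type='token' id='t5151' source='v5150' target='v5151' value='calculated'/>
<annot type='token' id='t5152' source='v5151' target='v5152' value='for'/>
<annot type='token' id='t5153' source='v5152' target='v5153' value='C11H18O3'/>
<annot type='pos' id='p5151' deps='t5151' source='v5150' target='v5151' value='VVN'/>
<annot type='pos' id='p5152' deps='t5152' source='v5151' target='v5152' value='IF'/>
<annot type='pos' id='p5153' deps='t5153' source='v5152' target='v5153' value='NP1'/>
</saf>"""

# Index every annotation by id, then follow deps from each POS tag to its token.
annots = {a.get("id"): a for a in ET.fromstring(SAF).iter("annot")}
tagged = [
    (annots[a.get("deps")].get("value"), a.get("value"))
    for a in annots.values()
    if a.get("type") == "pos"
]
print(tagged)
```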
29
Types of SAF Annotation
  • RMRS analyses
  • <rmrs cfrom='42329' cto='43303'>
  • <label vid='420'/>
  • <ep cfrom='43258' cto='43288'><gpred>proper_q_rel</gpred>
    <label vid='409'/><var sort='x' vid='410'/></ep>
  • <ep cfrom='43258' cto='43288'><gpred>named_rel</gpred>
    <label vid='411'/><var sort='x' vid='410'/></ep>
  • <rarg><rargname>RSTR</rargname><label
    vid='409'/><var sort='h' vid='412'/></rarg>
  • <rarg><rargname>BODY</rargname><label
    vid='409'/><var sort='h' vid='413'/></rarg>
  • <rarg><rargname>CARG</rargname><label
    vid='411'/><constant>c11h18o3</constant></rarg>
  • <hcons hreln='qeq'><hi><var sort='h'
    vid='412'/></hi><lo><label vid='411'/></lo></hcons>
  • </rmrs>

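In this representation, predicates and their arguments are separate elements linked by label ids. A short sketch of reconnecting them follows, using a trimmed copy of the fragment above with only the `CARG` argument kept. It relies on nothing beyond the element names visible on the slide and is not a full RMRS reader.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the slide's RMRS fragment.
RMRS = """<rmrs cfrom='42329' cto='43303'>
<label vid='420'/>
<ep cfrom='43258' cto='43288'><gpred>proper_q_rel</gpred>
  <label vid='409'/><var sort='x' vid='410'/></ep>
<ep cfrom='43258' cto='43288'><gpred>named_rel</gpred>
  <label vid='411'/><var sort='x' vid='410'/></ep>
<rarg><rargname>CARG</rargname><label vid='411'/>
  <constant>c11h18o3</constant></rarg>
</rmrs>"""

root = ET.fromstring(RMRS)

# Predicate name for each elementary predication, keyed by its label id.
preds = {ep.find("label").get("vid"): ep.findtext("gpred") for ep in root.iter("ep")}

# Attach each constant argument to the predicate that shares its label.
cargs = {
    preds[rarg.find("label").get("vid")]: rarg.findtext("constant")
    for rarg in root.iter("rarg")
    if rarg.findtext("rargname") == "CARG"
}
print(preds, cargs)
```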
30
SAF Flexibility
  • The standoff supports a variety of annotation
    types
  • Which communicate between different levels of
    analysis
  • And between different analysis paths
  • Hence it is also the main route for communication
    in the architecture

31
SciXML Flexibility
  • A common representation for the logical structure
    and essential formatting of research papers
  • Conversion from various publishers' markup
    schemes
  • And, also, from HTML, LaTeX and PDF
  • Applied to several disciplines