Title: Relations between multiple annotations: Representation, Inferences, Context Specification, and Unifi
1Relations between multiple annotations
Representation, Inferences, Context
Specification, and Unification
- Andreas Witt
- Dieter Metzing
- Jens Pönninghaus
- Daniela Goecke
www.text-technology.de
2Contents
- Project description
- Approaches to Multiple Annotations
- multiple Levels
- multiple Layers
- Representation
- Inferences
- Context Specification
- Unification
3Project description
- The project "Secondary information structuring and comparative discourse analysis" (Sekimo) is part of the DFG-Forschergruppe 437 "Text-technological modelling of information"
- Within this project a corpus is annotated on different (linguistic) levels
- Aim of the project: inferring, describing, and modelling relations between these levels
5Standard Methodology
- A corpus is annotated according to a given tag set
- The tag set is defined in a document grammar (e.g. the TEI-DTD)
- In general, different tag sets exist for annotating different kinds of documents (e.g. poems, encyclopedias) or different kinds of information (e.g. linguistic information)
- In particular, a linguistic annotation can depend on
- theoretical assumptions
- constituent structure,
- functional structure, or
- a (more) specific theory
- the language
- research questions
6Problems of the standard methodology
- Levels of description are neglected
- or
- Different levels of annotation are mixed up
Difficulties
- Multiple hierarchies within one document
7General Solutions (cf. TEI Guidelines)
- CONCUR: an optional feature of SGML (not available in XML) which allows multiple hierarchies to be marked up concurrently in the same document
- milestone elements: empty elements which mark the boundaries between elements in a non-nesting structure
- fragmentation of an item: the division of what logically is a single element into two or more parts, each of which nests properly within its context
- virtual joins: the recreation of a virtual element from fragments of text (requires a separate interpretation)
- redundant encoding of information in multiple forms
8Multiple hierarchies and language data
- Hypertext linking techniques are used for connecting multiple layers of annotation, e.g.
- Within the EU project NITE an annotation format has been developed which allows for specifying links between separate annotation layers
- The annotation graphs (AGs) format uses a (possibly abstract) timeline as linking layer
- Modified versions of the AGs are applied by
- the TASX-Annotator
- the EXMARaLDA-Project
9Alternative Methodology
- XML-based multi-layer annotation
- Technically, each layer becomes a separate and independent XML document
- The same text is annotated several times
- Advantages
- seems to be the only way to annotate multiple hierarchies without workarounds
- each document instance uses its own DTD (or Schema), i.e. annotation formats are not mixed up
- at any time a new annotation can be produced
- transformation tools to the NITE and the TASX format exist (Master's thesis by Jan F. Maas)
11Layer vs. Level
- We distinguish annotation level vs. annotation layer
- Annotation level refers to an abstract level of analysis
- Annotation layer refers to the realisation of an annotation in e.g. XML
- Examples of annotation levels: morphology in a linguistic grammar, text structure (sections, paragraphs, ...), layout (lines and pages), thematic structure, rhetorical structure
- Sometimes one layer contains several levels (e.g. HTML), but a level can also be distributed over several layers
12Annotation Process
- Given
- the textual representation of language material (text)
- the text is regarded as primary data
- For each annotation layer the primary data is copied
- The (copy of the) primary text is annotated according to a schema (e.g. a DTD)
- Annotation can be prepared
- in any XML editor (e.g. XMetaL, XML-Spy, psgml-emacs)
- with a special-purpose annotation tool
13Sample annotation with a web-based, special-purpose annotation tool
This tool is used only for flat XML structures, i.e. XML annotations with non-nested elements
14Example: XML annotation with the Emacs editor (usable for deep and flat annotations)
15Multi-layer annotation tool (master's thesis by Stefan Michel, work in progress)
16Multiple Annotations
- Drawbacks
- redundant
- the separate documents are independent (i.e. not connected)
- But
- since the documents contain exactly the same text, the text can function as the link
- Solution
- a common representation format for all separate XML documents
18Prolog-Representation
- The Prolog representation is based on work by Renear, Huitfeldt, Dubin and Sperberg-McQueen
- Original representation for an XML element: node/2, i.e. the predicate node has two arguments
- the position in the document tree
- a value, e.g. element(corpus)
- Extension: node/2 is replaced by node/5
- The 3 new arguments
- annotation layer
- starting point of the annotated text
- end point of the annotated text
19Conversion from XML to Prolog (xml2prolog)
- Implemented in Python
- Input: 1 or more XML documents
- Result: collection of Prolog facts
- Example
- the element <Root> is represented as the fact
- node(AnnotationLayer, 0, 42332, 1, element(Root)).
- the attribute att="val" of the element <Root> is represented as the fact
- attr(AnnotationLayer, 0, 42332, 1, 'att', 'val').
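The conversion can be pictured with a minimal Python sketch. It is not the project's xml2prolog.py: the function name xml_to_facts, the list notation for the position in the tree, and the quoting of layer and element names as Prolog atoms are assumptions made for illustration. Start and end points count characters of the primary data.

```python
import xml.etree.ElementTree as ET


def xml_to_facts(xml_string, layer):
    """Return node/5 and attr/6 facts (as strings) for one annotation layer.

    Hypothetical sketch: offsets are character positions in the primary
    data, i.e. in the text of the document with all markup removed.
    """
    root = ET.fromstring(xml_string)
    facts = []

    def walk(elem, path, offset):
        start = offset
        offset += len(elem.text or "")
        for i, child in enumerate(elem, start=1):
            offset = walk(child, path + [i], offset)
            offset += len(child.tail or "")
        pos = "[" + ",".join(map(str, path)) + "]"
        facts.append(
            f"node('{layer}', {start}, {offset}, {pos}, element('{elem.tag}'))."
        )
        # No escaping of quotes in names or values: enough for a sketch only.
        for name, value in elem.attrib.items():
            facts.append(
                f"attr('{layer}', {start}, {offset}, {pos}, '{name}', '{value}')."
            )
        return offset

    walk(root, [1], 0)
    return facts


if __name__ == "__main__":
    sample = "<Root att='val'><w>tree</w> <w>house</w></Root>"
    for fact in xml_to_facts(sample, "words"):
        print(fact)
```

Because every layer annotates the same primary data, facts produced from different XML documents share one coordinate system, which is what makes the later comparison of layers possible.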
20xml2prolog.py
- Some options for the transformation process
- compare: the primary data of the XML files are compared; if the primary data is not identical, the first difference is shown
- pcdata/pcdatanodes: character data can be included
- aggressive: whitespace is added or removed anywhere in the document if whitespace is the reason for differences of the primary data
- filter: some elements in some files should be filtered (including their textual content), e.g. <script> within HTML documents
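The idea behind the compare option can be sketched in Python as follows. The function names are hypothetical and the code is not taken from xml2prolog.py. The helper same_ignoring_whitespace only tests whether whitespace is the sole reason for a difference; it does not repair the documents the way the aggressive option is described.

```python
import xml.etree.ElementTree as ET


def primary_data(xml_string):
    """Primary data of a layer: the text with all markup removed."""
    return "".join(ET.fromstring(xml_string).itertext())


def first_difference(a, b):
    """Index of the first differing character, or None if a == b."""
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i
    if len(a) != len(b):
        return min(len(a), len(b))
    return None


def same_ignoring_whitespace(a, b):
    """True if the two texts differ at most in their whitespace."""
    return "".join(a.split()) == "".join(b.split())


if __name__ == "__main__":
    words = primary_data("<s><w>tree</w> <w>house</w></s>")
    morphs = primary_data("<p><m>tree</m> <m>hose</m></p>")
    print(first_difference(words, morphs))
```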
21Example
22Example (Collection of Prolog-Facts)
23Example (Collection of Prolog-Facts)
Figure labels: annotation layer, start and end point, nodes in DOM tree, element names, attribute-value pair, data contents
25Relations between annotation Layers
- Relations are inferred automatically
- Special Prolog predicates have been implemented to compare the annotation layers
- Example (Identity)
- <w>tree</w>
- <m>tree</m>
- <syll>tree</syll>
26Relations between Annotations
Cf. Durusau and Brook O'Donnell (2002) and Durand (1999)
Possible configurations of the ranges of two elements a and b (the horizontal alignment of the two ranges is not preserved):
1. <a>....................</a> <b>......</b>
2. <a>....................</a> <b>.........</b>
3. <a>....................</a> <b>.....................</b>
4. <a>....................</a> <b>................................</b>
5. <a>....................</a> <b>..........................................</b>
6. <a>....................</a> <b>......</b>
7. <a>....................</a> <b>....................</b>
8. <a>....................</a> <b>...............................</b>
etc.
27Relations between annotation layers
Relations (each visualised in the figure by the range of element a and the range of element b):
- identity
- independence
- inclusion
- start point identity
- end point identity
- end point is starting point
- overlap
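A minimal Python sketch of how two ranges could be classified with the relation names listed above. The slides do not give formal definitions, so the tests and their order (the most specific label wins, for instance "start point identity" before "inclusion") are assumptions. Ranges are (start, end) character offsets.

```python
def relation(a, b):
    """Classify the relation between two ranges a and b.

    Hypothetical definitions; a and b are (start, end) tuples with
    start < end, measured in characters of the primary data.
    """
    (a_start, a_end), (b_start, b_end) = a, b
    if a == b:
        return "identity"
    if a_end == b_start or b_end == a_start:
        return "end point is starting point"
    if a_end < b_start or b_end < a_start:
        return "independence"
    if a_start == b_start:
        return "start point identity"
    if a_end == b_end:
        return "end point identity"
    a_contains_b = a_start < b_start and b_end < a_end
    b_contains_a = b_start < a_start and a_end < b_end
    if a_contains_b or b_contains_a:
        return "inclusion"
    return "overlap"


if __name__ == "__main__":
    print(relation((0, 10), (2, 5)))
    print(relation((0, 6), (4, 10)))
```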
28Comparison of annotation layers
- We distinguish two kinds of relations between elements
- relations between single instances of an element (relations)
- relations between all occurrences of instances of an element (meta-relations)
- Prolog programs have been developed to infer both kinds of relations
29Prolog Implementation
- Aims
- statistics on annotation layers
- relations between occurrences of elements
- meta-relations
30Example deep annotation (HPSG)
31Statistics of the annotation according to HPSG
?- get_statistics.
Please enter layer name or type "q" to exit, "h" for help
hpsg.
Statistics for hpsg
Number of Nodes: 14, Number of different Elements: 5
Number of Attributes: 1, Number of different A/V-pairs: 4
------------------------------------------
Different elements and their occurrences
hpsg               1
nodesAndLabels     3
nonannotated-text  4
phrase             2
punctuation        4
------------------------------------------
Attribute   occurrences   different values
type        5             4
For information on occurrences of Attribute-Value-Pairs enter Attribute name or type q to quit.
type.
( edgeCOMP,1 ) , ( edgeHD,2 ) , ( np,1 ) , ( np-no,1 )
32Relations between occurrences of elements
- Query: How often does a certain relation between elements hold?
- chk_relation(Relation,Element1,Layer1,Element2,Layer2,L).
- Relation: a relation between elements (e.g. identity, overlap, or endA_is_starting_pointB)
- Element1: element name of annotation Layer1
- Element2: element name of annotation Layer2
- L: result list
- It is also possible to infer examples and
counter-examples of a certain relation
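What such a query computes can be sketched in Python. This is not the Prolog predicate itself: the fact format (layer, start, end, element name) and the definitions of the relation tests are assumptions for illustration. The result list collects the pairs of ranges for which the relation holds, so its length answers the "how often" question.

```python
# Hypothetical relation tests on (start, end) ranges a (Layer1) and b (Layer2).
RELATIONS = {
    "identity": lambda a, b: a == b,
    "included_B_in_A": lambda a, b: a != b and a[0] <= b[0] and b[1] <= a[1],
    "included_A_in_B": lambda a, b: a != b and b[0] <= a[0] and a[1] <= b[1],
    "overlap": lambda a, b: a[0] < b[0] < a[1] < b[1]
    or b[0] < a[0] < b[1] < a[1],
}


def chk_relation(relation, element1, layer1, element2, layer2, facts):
    """Return all pairs of ranges (a, b) for which the relation holds."""
    test = RELATIONS[relation]
    spans1 = [(s, e) for (lay, s, e, name) in facts
              if lay == layer1 and name == element1]
    spans2 = [(s, e) for (lay, s, e, name) in facts
              if lay == layer2 and name == element2]
    return [(a, b) for a in spans1 for b in spans2 if test(a, b)]


if __name__ == "__main__":
    # Invented sample facts: (layer, start, end, element name).
    facts = [
        ("hpsg", 0, 10, "phrase"),
        ("hpsg", 11, 20, "phrase"),
        ("dialogue", 0, 10, "turn"),
        ("dialogue", 11, 15, "turn"),
    ]
    hits = chk_relation("identity", "phrase", "hpsg", "turn", "dialogue", facts)
    print(len(hits), hits)
```

Pairs for which the test fails could be collected in the same way to obtain counter-examples.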
33Example: Relations between elements of the HPSG annotation and the elements of a dialogue annotation
34Ex. Relations between HPSG-phrases and X
?- chk_relation(Relation,phrase,hpsg,X,dialogue,L).
Relation identity X _G160 L
Relation included_B_in_A X _G160 L
Relation included_A_in_B X _G160 L phrase, dialogue, 2, phrase, 2, dialogue, 1 ...
Relation overlap_A X _G160 L
Yes
35Meta-relations
- If a certain relation holds for all instances of an element, we define a meta-relation
- identity: at every occurrence of an element A in Layer1 an element B in Layer2 exists which spans the same range of characters
- inclusion
- at every occurrence of an element A in Layer1 an element B in Layer2 exists which is included or is identical
- the meta-relation identity does not hold
- overlap: at every occurrence of an element A in Layer1 an element B in Layer2 exists which overlaps with A
- mixed: no meta-relations exist
36Meta-relations (cntd.)
- identity: for all occurrences, the following configuration can be found
- <a>....................</a> <b>....................</b>
- inclusion: for all occurrences, one of the following configurations can be found
- <a>....................</a> <b>................................</b>
- <a>....................</a> <b>..........................................</b>
- <a>....................</a> <b>.......................................</b>
- <a>....................</a> <b>....................</b>
- overlap: for all occurrences, the following configuration can be found
- <a>....................</a> <b>....................</b>
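The definitions of the meta-relations can be restated as a small Python sketch. It is not the Prolog implementation. It assumes that inclusion means "B covers A" (as the configurations above suggest), that the meta-relations are tested in the order identity, inclusion, overlap, and that an element without occurrences yields "mixed".

```python
def meta_relation(spans_a, spans_b):
    """Name the meta-relation between all occurrences of A and of B.

    spans_a, spans_b: lists of (start, end) ranges of the elements
    A (Layer1) and B (Layer2). Definitions are assumptions.
    """
    def identical(a, b):
        return a == b

    def covered(a, b):  # b includes a or is identical to it
        return b[0] <= a[0] and a[1] <= b[1]

    def overlaps(a, b):
        return a[0] < b[0] < a[1] < b[1] or b[0] < a[0] < b[1] < a[1]

    def holds_everywhere(test):
        # For every occurrence of A there must be some matching B.
        return all(any(test(a, b) for b in spans_b) for a in spans_a)

    if not spans_a:
        return "mixed"
    if holds_everywhere(identical):
        return "identity"
    if holds_everywhere(covered):
        return "inclusion"
    if holds_everywhere(overlaps):
        return "overlap"
    return "mixed"


if __name__ == "__main__":
    print(meta_relation([(0, 4), (5, 10)], [(0, 10)]))
```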
38Context specification 1: Motivation
- Often, general meta-relations do not hold
- In these cases, the elements can be classified according to structural properties within their layer
- This allows specific meta-relations to be constructed
- A format to express the structural properties, called Context Specification Document (CSD), has been developed
39Context specification 2: Realization
- Subclassification of element nodes via tree walking automata (TWA)
- Underlying path language for the construction of TWA: Caterpillar expressions (cf. Brüggemann-Klein and Wood, 2000)
- moves: up, right, left, firstChild, lastChild
- tests: isFirst, isLast, isLeaf, isRoot
- test for element names
- Kleene-star operator
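How the moves and tests could drive a tree walk can be sketched in Python. The sketch is an assumption-laden illustration, not the CSD format or the project's automata: an expression is a plain list of atoms, the Kleene star is written as the tuple ("star", sub_expression), and the small tree at the end is invented.

```python
class Node:
    """Element node with a parent link and ordered children."""

    def __init__(self, name, children=()):
        self.name = name
        self.parent = None
        self.children = list(children)
        for child in self.children:
            child.parent = self


def _siblings(node):
    return node.parent.children if node.parent else [node]


def _index(node):
    return next(i for i, s in enumerate(_siblings(node)) if s is node)


def _sibling(node, step):
    i, sibs = _index(node) + step, _siblings(node)
    return sibs[i] if 0 <= i < len(sibs) else None


MOVES = {
    "up": lambda n: n.parent,
    "left": lambda n: _sibling(n, -1),
    "right": lambda n: _sibling(n, +1),
    "firstChild": lambda n: n.children[0] if n.children else None,
    "lastChild": lambda n: n.children[-1] if n.children else None,
}
TESTS = {
    "isFirst": lambda n: _index(n) == 0,
    "isLast": lambda n: _index(n) == len(_siblings(n)) - 1,
    "isLeaf": lambda n: not n.children,
    "isRoot": lambda n: n.parent is None,
}


def run(expr, nodes):
    """Return the set of nodes reached from `nodes` by walking `expr`."""
    current = set(nodes)
    for atom in expr:
        if isinstance(atom, tuple):  # ("star", sub_expression)
            seen, frontier = set(current), set(current)
            while frontier:  # repeat the sub-expression zero or more times
                frontier = run(atom[1], frontier) - seen
                seen |= frontier
            current = seen
        elif atom in MOVES:
            moved = (MOVES[atom](n) for n in current)
            current = {m for m in moved if m is not None}
        elif atom in TESTS:
            current = {n for n in current if TESTS[atom](n)}
        else:  # test for an element name
            current = {n for n in current if n.name == atom}
    return current


# Invented sample tree: NP( COMP( NF ), HD( VN ) )
nf, vn = Node("NF"), Node("VN")
comp, hd = Node("COMP", [nf]), Node("HD", [vn])
root = Node("NP", [comp, hd])
```

An element belongs to a subclass if the walk starting at that element ends in a non-empty set, e.g. `run(["up", "COMP"], {nf})` selects NF nodes whose parent is a COMP element.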
40Sample application
Figure: HPSG tree with the node labels NP, COMP, HD, NF, NP.NO, COMP, HD, VN, PGen over the tokens "k e N", "shucchou", "no"
41Context specification 3: Subclassification
Figure: the tree of the sample application (NP, COMP, HD, NF, NP.NO, COMP, HD, VN, PGen over "k e N", "shucchou", "no") with the annotations "Relation holds for all Comp Elements" and "Relation holds only for a subset"
43Unification of annotation layers I
- Two document layers can be merged
- This process has also been implemented in Prolog
- The predicate (semt) receives four arguments
- layer1 (to be unified)
- layer2 (to be unified)
- list of elements which should be deleted in the process of unification
- The result of the merger (again a collection of Prolog facts) is written to a new file specified in the fourth argument
- The new database contains a copy of all layers in the input database plus the result layer
- In case the unification results in a layer where the elements would not be properly nested, a second result layer (a difference list) is created
44Unification of annotation layers II
- The result database is re-converted to XML using a Python program
- If no difference list exists, the result of the merging of two layers can be linearised as an XML document straightforwardly
- In case the result fact base contains a difference list, two different linearisations can be generated
- the default processing uses milestone elements to mark the borders of incompatible elements
- alternatively, the technique of fragmentation of elements can be invoked
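The two linearisation strategies can be illustrated with a short Python sketch. It is not the project's re-conversion program. The milestone names (name-start, name-end), the function names, and the restriction to a flat sequence of host elements are assumptions, and text is not escaped.

```python
def with_milestones(text, hosts, crossing):
    """Linearise flat host elements; mark crossing elements by milestones.

    hosts, crossing: lists of (start, end, name) character ranges.
    hosts must not overlap each other (a flat layer, for simplicity).
    """
    inserts = []
    for start, end, name in hosts:
        inserts.append((end, 0, f"</{name}>"))    # close tags come first
        inserts.append((start, 2, f"<{name}>"))   # open tags come last
    for start, end, name in crossing:
        inserts.append((start, 1, f"<{name}-start/>"))  # empty elements
        inserts.append((end, 1, f"<{name}-end/>"))
    out, pos = [], 0
    for offset, _, markup in sorted(inserts):
        out.append(text[pos:offset])
        out.append(markup)
        pos = offset
    out.append(text[pos:])
    return "".join(out)


def fragment(span, host_spans):
    """Split a range at every host boundary that lies strictly inside it."""
    start, end = span
    cuts = sorted({p for s, e in host_spans for p in (s, e) if start < p < end})
    points = [start, *cuts, end]
    return list(zip(points, points[1:]))


if __name__ == "__main__":
    text = "tree house"
    hosts = [(0, 4, "w"), (5, 10, "w")]
    print(with_milestones(text, hosts, [(2, 7, "x")]))
    print(fragment((2, 7), [(s, e) for s, e, _ in hosts]))
```

With milestones the crossing element survives only as two empty boundary markers, whereas fragmentation yields several properly nesting parts that have to be joined again when the element is interpreted.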
45Architecture
Figure labels: XML documents; via Python; Inference/Query; Rules; External information; Unification of annotation levels; Rules; Generation of XML from the fact base; via Python; XML documents