Introduction to ANNIE

1
Introduction to ANNIE
http://gate.ac.uk/ http://nlp.shef.ac.uk/

Diana Maynard
University of Sheffield
March 2004

2
What is ANNIE?

ANNIE is a vanilla information extraction system comprising a set of core processing resources (PRs):
Tokeniser
Sentence Splitter
POS tagger
Gazetteers
Semantic tagger (JAPE transducer)
Orthomatcher (orthographic coreference)

3
ANNIE Pipeline

4
Other Processing Resources

There are also many additional processing resources which are not part of ANNIE itself but which come with the default installation of GATE:
Gazetteer collector
PRs for Machine Learning
Various exporters
Annotation set transfer
etc.

5
Creating a new application from ANNIE

Typically a new application will use most of the core components from ANNIE
The tokeniser, sentence splitter and orthomatcher are largely language-, domain- and application-independent
The POS tagger is language-dependent but domain- and application-independent
The gazetteer lists and JAPE grammars may act as a starting point but will almost certainly need to be modified
You may also require additional PRs (either existing or new ones)

6
Modifying gazetteers

Gazetteers are plain text files containing lists of names
Each gazetteer set has an index file listing all the lists, plus features of each list (majorType, minorType and language)
Lists can be modified either internally using Gaze, or externally in your favourite editor
Gazetteers can also be mapped to ontologies
To use Gaze and the ontology editor, you need to download the relevant CREOLE files
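
As a rough illustration of the index file described above, the sketch below reads entries that pair a list file with its majorType, minorType and language. The colon-separated layout (file:majorType:minorType:language), the sample entries and the class name GazetteerEntry are assumptions made for illustration, not GATE classes; check the layout against the index file shipped with your GATE installation.

```java
import java.util.ArrayList;
import java.util.List;

/** One entry of a gazetteer index file (hypothetical helper, not a GATE class). */
final class GazetteerEntry {
    final String listFile;
    final String majorType;
    final String minorType; // empty when the entry does not give one
    final String language;  // empty when the entry does not give one

    GazetteerEntry(String listFile, String majorType, String minorType, String language) {
        this.listFile = listFile;
        this.majorType = majorType;
        this.minorType = minorType;
        this.language = language;
    }

    /** Parses one line, assumed to look like "file:majorType:minorType:language". */
    static GazetteerEntry parse(String line) {
        String[] parts = line.trim().split(":", -1);
        if (parts.length < 2 || parts[0].isEmpty()) {
            throw new IllegalArgumentException("Expected at least file:majorType, got: " + line);
        }
        String minor = parts.length > 2 ? parts[2] : "";
        String lang = parts.length > 3 ? parts[3] : "";
        return new GazetteerEntry(parts[0], parts[1], minor, lang);
    }

    /** Parses every non-blank line of an index file. */
    static List<GazetteerEntry> parseAll(List<String> lines) {
        List<GazetteerEntry> entries = new ArrayList<>();
        for (String line : lines) {
            if (!line.trim().isEmpty()) {
                entries.add(parse(line));
            }
        }
        return entries;
    }
}
```

Calling GazetteerEntry.parse on a made-up entry such as "company.lst:organization:company:en" yields the list file name plus the three features that the index attaches to every name in that list.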

7
JAPE grammars

A semantic tagger consists of a set of rule-based JAPE grammars run sequentially
JAPE is a pattern-matching language
The LHS of each rule contains patterns to be matched
The RHS contains details of annotations (and optionally features) to be created
More complex rules can also be created

8
Input specifications

The head of each grammar phase needs to contain certain information:
Phase name
Inputs
Matching style
e.g.
Phase: location
Input: Token Lookup Number
Options: control = appelt

9
Matching algorithms and Rule Priority

3 styles of matching:
Brill (fire every rule that applies)
First (shortest rule fires)
Appelt (use of priorities)
Appelt priority is applied in the following order:
Starting point of a pattern (earliest first)
Longest pattern
Explicit priority (default -1)
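
The ordering above can be sketched as a comparator over candidate matches. This is a minimal illustration of the three criteria listed on the slide, not GATE's actual implementation; the types CandidateMatch and AppeltChooser are hypothetical, and what happens when all three criteria tie is left out.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

/** A candidate rule match over a span of text (hypothetical type, not part of GATE). */
final class CandidateMatch {
    final String ruleName;
    final int start;    // offset where the matched pattern begins
    final int end;      // offset where the matched pattern ends
    final int priority; // explicit rule priority; -1 when none is declared

    CandidateMatch(String ruleName, int start, int end, int priority) {
        this.ruleName = ruleName;
        this.start = start;
        this.end = end;
        this.priority = priority;
    }

    int length() {
        return end - start;
    }
}

final class AppeltChooser {
    /** Earliest start first, then longest match, then highest explicit priority. */
    static final Comparator<CandidateMatch> APPELT_ORDER =
        Comparator.comparingInt((CandidateMatch m) -> m.start)
            .thenComparing(Comparator.comparingInt((CandidateMatch m) -> m.length()).reversed())
            .thenComparing(Comparator.comparingInt((CandidateMatch m) -> m.priority).reversed());

    /** Returns the candidate that would fire under the ordering above, if any. */
    static Optional<CandidateMatch> choose(List<CandidateMatch> candidates) {
        return candidates.stream().min(APPELT_ORDER);
    }
}
```

Note that explicit priority is consulted last: a rule with a high Priority value still loses to a rule that matches a longer span from the same starting point.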

10
NE Rule in JAPE

Rule: Company1
Priority: 25
(
  ( {Token.orthography == upperInitial} ) //from tokeniser
  {Lookup.kind == companyDesignator} //from gazetteer lists
):match
-->
:match.NamedEntity = { kind = company, rule = Company1 }

11
LHS of the rule

LHS is expressed in terms of existing annotations, and optionally features and their values
Any annotation to be used must be included in the input header
Any annotation not included in the input header will be ignored (e.g. whitespace)
Each annotation is enclosed in curly braces
Each pattern to be matched is enclosed in round brackets and has a label attached
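
The effect of the input header can be pictured as a filter applied before matching. The sketch below is a simplified model with plain strings standing in for annotation types; InputFilter is a hypothetical helper and SpaceToken is used here only as an example of a whitespace annotation type.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

/** Simplified model of an input header (hypothetical helper, not a GATE class). */
final class InputFilter {
    /** Keeps only the annotation types named in the input header, preserving their order. */
    static List<String> visibleTypes(List<String> annotationTypes, Set<String> inputHeader) {
        List<String> visible = new ArrayList<>();
        for (String type : annotationTypes) {
            if (inputHeader.contains(type)) {
                visible.add(type);
            }
        }
        return visible;
    }
}
```

With an input header of Token and Lookup, a sequence Token, SpaceToken, Token, Lookup is seen by the rules as Token, Token, Lookup, which is why a pattern of two adjacent tokens can match across whitespace.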

12
Macros

Macros look like the LHS of a rule but have no label:
Macro: NUMBER
(({Digit}))
They are used in rules by enclosing the macro name in round brackets:
( (NUMBER) ):match
It is conventional to name macros in uppercase letters
Macros hold across an entire set of grammar phases
13
Contextual information

Contextual information can be specified in the same way, but has no label
Contextual information will be consumed by the rule
({Annotation1})
({Annotation2}):match
({Annotation3})
-->

14
RHS of the rule

LHS and RHS are separated by -->
Label matches that on the LHS
Annotation to be created follows the label
({Annotation1}):match
--> :match.NE = {feature1 = value1, feature2 = value2}

15
Using phases

Grammars usually consist of several phases, run sequentially
With Appelt or First matching, only one rule within a single phase can fire on a given stretch of text
Temporary annotations may be created in early phases and used as input for later phases
Annotations from earlier phases may need to be combined or modified
A definition file (conventionally called main.jape) lists the phases to be used, in order
Only the definition file needs to be loaded

16
More complex JAPE rules

Any Java code can be used on the RHS of a rule
This is useful for e.g. feature percolation, ontology population, accessing information not readily available, comparing feature values, deleting existing annotations etc.
There are examples of these in the user guide and in the ANNIE NE grammars
Most JAPE rules end up being complex!

17
Using JAPE for other tasks

JAPE grammars are not just useful for NE annotation
They can be a quick and easy way of performing any kind of task where patterns can be easily recognised and a finite-state approach is possible, e.g. transforming one style of markup into another, or deriving features for the learning algorithms

18
Example rule for deriving features

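Rule: Entity
(
  {Gpe} | {Organization} | {Person} | {Location} | {Facility}
):entity
-->
{
  // the annotations bound to the "entity" label on the LHS
  gate.AnnotationSet entityAS = (gate.AnnotationSet)bindings.get("entity");
  gate.Annotation entityAnn = (gate.Annotation)entityAS.iterator().next();
  // record the original annotation type as a feature
  gate.FeatureMap features = Factory.newFeatureMap();
  features.put("type", entityAnn.getType());
  // add an Entity annotation covering the same span
  outputAS.add(entityAnn.getStartNode(), entityAnn.getEndNode(), "Entity", features);
}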

19
Finding Examples