Title: Experiences with UIMA in NLP teaching and research

1
Experiences with UIMA in NLP teaching and research

Manuela Kunze,
Dietmar Rösner

University of Magdeburg, Knowledge Based Systems
and Document Processing
2
Overview

What is UIMA?
First Experiments
NLP Teaching
Conclusion

3
UIMA: Unstructured Information Management Architecture

a software architecture for developing and deploying unstructured information management (UIM) applications
UIM application: a software system that analyses large volumes of unstructured information to discover, organize, and deliver relevant knowledge to the end user
software architecture which specifies component interfaces, data representations, ...
http://www.research.ibm.com/UIMA/

4
UIMA: Unstructured Information Management Architecture
Collection Reader: interfaces to a collection of data items (e.g., documents) to be analyzed. Collection Readers return CASes that contain the documents to analyze, possibly along with additional metadata.
Analysis Engine: takes a CAS, analyzes its contents, and produces an enriched CAS. Analysis Engines can be recursively composed of other Analysis Engines (called an Aggregate Analysis Engine). Aggregates may also contain CAS Consumers.
CAS Initializer: may be used by a Collection Reader to populate a CAS from a document. An example of a CAS Initializer is an HTML parser that de-tags an HTML document and also inserts paragraph annotations (determined from <P> tags in the original HTML) into the CAS.
CAS Consumer: consumes the enriched CAS that was produced by the sequence of Analysis Engines before it, and produces an application-specific data structure, such as a search engine index or database.
CAS: Common Analysis Structure; CPE: Collection Processing Engine
Ferrucci et al.: Unstructured Information Management Architecture (UIMA) SDK User's Guide and Reference
5
UIMA: Unstructured Information Management Architecture

Analysis Engine (AE)
a component that analyzes artifacts (e.g. documents) and infers information about them
consists of two parts:
Java classes (typically packaged as one or more JAR files) and
AE descriptors (one or more XML files) containing the configuration settings for the Analysis Engine as well as a description of the AE's input and output requirements.

6
UIMA: Unstructured Information Management Architecture

[Diagram: the XML and Java parts of an analysis engine]
XML descriptor: describes the analysis engine (annotator class, input parameters, output annotations, external resources with interface and resources); linked to a type system
type system (XML): defines annotation types (name; features: begin, end, ...); used to create the Annotation Interface
Java: defines an annotator, which uses the Annotation Interface

7
UIMA: Unstructured Information Management Architecture

Aggregate Analysis Engine
combines different analysis engines within one Analysis Engine

Ferrucci et al.: Unstructured Information Management Architecture (UIMA) SDK User's Guide and Reference
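
The composition idea, where an aggregate is itself an analysis engine that runs other engines over a shared CAS, can be sketched in plain Java. This is a simplified illustration only: `Cas`, `Engine` and `Aggregate` are hypothetical stand-ins written for this sketch and are not types from the UIMA SDK, and the two toy engines are assumptions.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical stand-ins for illustration only; none of these types belong to the UIMA SDK.
final class PipelineSketch {

    /** Stand-in for a CAS: the document text plus the annotation labels added so far. */
    static final class Cas {
        final String text;
        final List<String> annotations = new ArrayList<String>();
        Cas(String text) { this.text = text; }
    }

    /** Stand-in for an analysis engine: takes a CAS and enriches it in place. */
    interface Engine {
        void process(Cas cas);
    }

    /** Stand-in for an aggregate: an engine that runs its delegate engines in order. */
    static final class Aggregate implements Engine {
        private final List<Engine> delegates;
        Aggregate(Engine... delegates) { this.delegates = Arrays.asList(delegates); }
        @Override public void process(Cas cas) {
            for (Engine e : delegates) {
                e.process(cas);
            }
        }
    }

    // Toy time detector: one or two digits followed by am/pm.
    private static final Pattern TIME = Pattern.compile("\\d{1,2}(am|pm)");

    /** Runs a two-engine aggregate over the text and returns the labels it produced. */
    static List<String> run(String text) {
        Engine museum = c -> { if (c.text.contains("Museum")) c.annotations.add("Museum"); };
        Engine time = c -> { if (TIME.matcher(c.text).find()) c.annotations.add("Time"); };
        Cas cas = new Cas(text);
        new Aggregate(museum, time).process(cas);
        return cas.annotations;
    }

    public static void main(String[] args) {
        System.out.println(run("The Egyptian Museum is open 9am-5pm daily"));
    }
}
```

Because `Aggregate` implements `Engine`, an aggregate can be passed as a delegate to another aggregate, which mirrors the recursive composition described for Analysis Engines.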
8
Overview

Introduction
First Experiments
NLP Teaching
Conclusion

9
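First Experiments: UIMA vs. GATE

baseline:
2 persons, 2 systems, 1 corpus and 1 extraction task
skills/experiences of the persons:
[Table: prior experience of Person 1 and Person 2 with UIMA, GATE and Eclipse/Java; the cells held symbol glyphs (extracted as J, J, K, K, K, J) whose assignment to cells is not recoverable]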
10
Task of the Experiment

process a corpus of websites to detect and extract information relevant for tourists:
opening times of museums, prices of hotels, ...
corpus:
30 tourism web sites of Egypt
additional 20 web sites of Washington, New York, London
output:
Prolog facts for a reasoner
Questions:
Which museum is now open?

11
Evaluation Topics/Points

ease of getting acquainted with the system?
quality of documentation: completeness, clarity, up-to-dateness, ...?
tutorials, use cases, ...?
processing and linguistic resources?
lexica, gazetteer lists, tools
tools for resource maintenance and extension?
quality: self-explanatory, robust, comfortable
speed of processing?
single document vs. large corpora?
limitations, suggestions for improvement?
support for im-/export of a variety of document formats?

12
Excerpts from the Corpus

The Egyptian Museum is open the hours 9am-5pm daily
The Military Museum is open the hours Summer 8am-5:30pm winter 8am-4:30pm
Palace Museum is open the hours 8am-5:30pm (summer) 8am-4:30pm (winter)
10am-2pm, 6pm-9pm Sat-Wed 6pm-9pm Fri

13
UIMA: Application

several annotators (like a pipeline)

example text: ... Fraunces Tavern Museum 54 Pearl St. - 1-212-425-1778 Tuesday-Friday, 12pm-5pm
annotation layers shown in the diagram: time pattern, interval of times, restrictions, museum pattern, museum information
techniques shown in the diagram: regular expressions; a window covering two time intervals and a restriction; a window covering a museum and opening hours
output, Prolog facts:
museumopen('Fraunces Tavern Museum', '2005-12-01T12:00:00', '2005-12-01T17:00:00').
museumopen('Fraunces Tavern Museum', '2005-12-02T12:00:00', '2005-12-02T17:00:00').
museumopen('Fraunces Tavern Museum', '2005-12-03T12:00:00', '2005-12-03T17:00:00').
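
A minimal sketch of the time-pattern step, written in plain Java without the UIMA SDK: a regular expression finds one opening-hours interval and the result is printed as a `museumopen` fact. The class name, the pattern and the timestamp format with colons are assumptions made for illustration; they are not the authors' annotators, and the date has to be supplied by the caller.

```java
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: find an interval such as "12pm-5pm" and emit a Prolog fact.
final class OpeningHoursSketch {

    // Assumed pattern: hour, optional ":mm", am/pm, a hyphen, then the same again.
    private static final Pattern INTERVAL = Pattern.compile(
        "(\\d{1,2})(?::(\\d{2}))?(am|pm)\\s*-\\s*(\\d{1,2})(?::(\\d{2}))?(am|pm)");

    /** Converts a 12-hour clock reading to "HH:mm:ss". */
    static String to24h(String hour, String minute, String ampm) {
        int h = Integer.parseInt(hour) % 12;   // 12am and 12pm both start from 0
        if (ampm.equals("pm")) {
            h += 12;                           // pm readings move to 12..23
        }
        int m = (minute == null) ? 0 : Integer.parseInt(minute);
        return String.format(Locale.ROOT, "%02d:%02d:00", h, m);
    }

    /**
     * Returns a museumopen/3 fact for the first interval found in the text,
     * or null if no interval is found. Quotes in museum names are not escaped here.
     */
    static String toPrologFact(String museum, String isoDate, String text) {
        Matcher m = INTERVAL.matcher(text);
        if (!m.find()) {
            return null;
        }
        String from = to24h(m.group(1), m.group(2), m.group(3));
        String to = to24h(m.group(4), m.group(5), m.group(6));
        return String.format(Locale.ROOT, "museumopen('%s', '%sT%s', '%sT%s').",
            museum, isoDate, from, isoDate, to);
    }

    public static void main(String[] args) {
        // prints: museumopen('Fraunces Tavern Museum', '2005-12-01T12:00:00', '2005-12-01T17:00:00').
        System.out.println(toPrologFact("Fraunces Tavern Museum", "2005-12-01",
            "Tuesday-Friday, 12pm-5pm"));
    }
}
```

Handling the weekday restriction ("Tuesday-Friday") and seasonal variants such as "(summer)" would need further patterns and is left out of this sketch.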
14
UIMA: Results

information annotated in the documents:
names of museums, hotels
times, time intervals
time restrictions
prices, intervals of prices (hotel prices)
keywords for museum category
names of pharaohs (annotated with a correction of misspellings)
information about hotels and museums is exported into Prolog facts and into a short textual summary
templates filled with the detected information:
hotels: Price information about Cosmopolitan Hotel: 157
museums:
Fraunces Tavern Museum
Open from 12:00:00 to 17:00:00
Restriction: Tuesday-Friday

15
UIMA vs. GATE: Conclusion

no final judgement on whether to use GATE or UIMA
depends on
your task:
task description
expected results
which processing resources are necessary
your preferences for the interface:
prefer the Eclipse environment (or other Java editors)
prefer a comfortable GUI

16
UIMA vs. GATE: Conclusion

GATE:
tools available
comfortable GUI
UIMA:
plain framework
simplified definition of (complex) result structures
simplified pre- and postprocessing of annotations
both are extensible, e.g. for processing German documents

17
'German' Extension of Processing Resources

XDOC document suite:
tools for processing German documents
tools implemented in Common Lisp
for UIMA:
Java reimplementation of the tools
several analysis engines

18
XDOC in UIMA

annotation of:
part-of-speech (Morphix, heuristics)
semantic categories
named entities (vehicles, cities, ...)
a coarse approach to the classification of PPs using a maxent library

19
UIMA: Evaluation

documentation?
processing and linguistic resources?
tools for resource maintenance and extension?
speed of processing?
single docs vs. large corpora?
limitations, suggestions for improvement?
im-/export of document formats?

documentation:
good
illustrative examples (tutorial)
completeness: some topics are described only very briefly
prior experience with Eclipse and Java programming is helpful

20
UIMA: Evaluation

processing and linguistic resources?

annotators only from the tutorial:
sentence annotation
word annotation
date/time annotators
examples for using regular expressions etc.
external resources can be integrated:
lexical resources as external resources (text files)
existing processing resources: implementation of an interface is necessary

21
UIMA: Evaluation

tools for resource maintenance and extension?

specific Eclipse component editors or simple text editors

22
UIMA: Evaluation

speed of processing?

faster than GATE?
the CPE gives detailed information about the processing time for each module

23
UIMA: Evaluation

single docs vs. large corpora?

Collection Reader: document(s) from a directory

24
UIMA: Evaluation

limitations, suggestions for improvement?

no limitations:
everything is possible, but the user has to do the implementation or interfacing
wish:
more processing and linguistic resources within the distribution

25
UIMA: Evaluation

im-/export of document formats?

import: CAS Initializer
export: CAS Consumer
transforms annotations into any other format
export of:
document annotations
only annotations
required: Java application

26
Overview

Introduction
First Experiments
NLP Teaching
Conclusion

27
NLP Teaching

course: Information Extraction
aim of the course: to acquaint our students with information extraction as a basic NLP technology
UIMA, GATE
students: computer science, data and knowledge engineering
skills of the students: programming in Java

28
NLP Teaching

different corpora:
news about the FIFA world cup 2006 in Germany,
descriptions of drugs,
announcements of new books, ...
tasks for students:
to develop different analysis engines and combine them for the annotation of
URLs,
email addresses,
names of players,
results of games, ...
using regular expressions, external resources, maximum entropy models
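
A regex-based annotator for URLs and email addresses, as in the student tasks, could look roughly like the following plain-Java sketch. `Span` is a hypothetical stand-in for an annotation with a type name and begin/end offsets; it is not a UIMA class, and the two patterns are deliberately simple assumptions rather than complete URL or email grammars.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of regex-based annotation; no UIMA SDK types are used.
final class RegexAnnotatorSketch {

    /** A minimal annotation: a type name plus begin/end character offsets. */
    static final class Span {
        final String type;
        final int begin;
        final int end;
        Span(String type, int begin, int end) {
            this.type = type;
            this.begin = begin;
            this.end = end;
        }
        /** Returns the document text covered by this annotation. */
        String coveredText(String doc) {
            return doc.substring(begin, end);
        }
    }

    // Assumed, simplified patterns for illustration.
    static final Pattern URL = Pattern.compile("https?://[^\\s<>\"]+");
    static final Pattern EMAIL =
        Pattern.compile("[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,}");

    /** Creates one Span of the given type for every match of the pattern in the document. */
    static List<Span> annotate(String type, Pattern pattern, String doc) {
        List<Span> spans = new ArrayList<Span>();
        Matcher m = pattern.matcher(doc);
        while (m.find()) {
            spans.add(new Span(type, m.start(), m.end()));
        }
        return spans;
    }

    public static void main(String[] args) {
        String doc = "Tickets: http://example.org/wm2006 or mail info@example.org today";
        for (Span s : annotate("URL", URL, doc)) {
            System.out.println(s.type + " " + s.begin + "-" + s.end + " " + s.coveredText(doc));
        }
        for (Span s : annotate("Email", EMAIL, doc)) {
            System.out.println(s.type + " " + s.begin + "-" + s.end + " " + s.coveredText(doc));
        }
    }
}
```

Storing only offsets and a type name keeps the annotation independent of the text, which is the same begin/end idea that annotation features express.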

29
NLP Teaching
30
UIMA: A Student's View

easy to handle:
Java programming (environment)
problems of students:
to understand the dependencies between the several descriptors
helpful for teaching (future work):
a 'comparator' of different solutions of students
which solution is the best, relative to a 'master' solution
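
One possible reading of such a 'comparator' is to score each student's annotations against the master solution with precision, recall and F1. The slides do not say how the comparison should work, so the exact-match scoring below and the encoding of an annotation as a "type:begin-end" string are assumptions of this sketch.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical comparator sketch: exact-match scoring against a master solution.
final class SolutionComparatorSketch {

    /**
     * Returns {precision, recall, f1} of the student's annotations against the
     * master's. An annotation counts as correct only if the identical string
     * (type and offsets) also occurs in the master solution.
     */
    static double[] score(Set<String> master, Set<String> student) {
        Set<String> hits = new HashSet<String>(student);
        hits.retainAll(master);  // annotations present in both solutions
        double precision = student.isEmpty() ? 0.0 : (double) hits.size() / student.size();
        double recall = master.isEmpty() ? 0.0 : (double) hits.size() / master.size();
        double f1 = (precision + recall == 0.0)
            ? 0.0
            : 2 * precision * recall / (precision + recall);
        return new double[] {precision, recall, f1};
    }

    public static void main(String[] args) {
        Set<String> master = new HashSet<String>(Arrays.asList(
            "URL:9-34", "Email:43-59", "Player:70-77", "Result:80-83"));
        Set<String> student = new HashSet<String>(Arrays.asList(
            "URL:9-34", "Email:43-59", "Player:60-66"));
        double[] s = score(master, student);
        System.out.println("precision=" + s[0] + " recall=" + s[1] + " f1=" + s[2]);
    }
}
```

Ranking the students' solutions by F1 would then give one answer to "which solution is the best"; a more lenient comparator could also accept overlapping spans instead of exact matches.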

31
Overview

Introduction
First Experiments
NLP Teaching
Conclusion

32
Conclusion

UIMA:
easy to learn and to handle
supports the management of
different annotations
different processing resources
integration of external resources (processing resources as well as lexical resources)
splitting of 'processing steps':
reader, initializer, analysis engine, consumer
'wish-list':
a kind of JAPE transducer
interface to GATE's processing resources is available
'comparator' for evaluation of solutions

34
XDOC in UIMA
35
Introduction
really?
36
Introduction

first experiments with UIMA
processing tourism web sites, news about the FIFA world cup 2006 in Germany, ...
integration of tools from the XDOC document suite
using UIMA in a course on Information Extraction

37
Introduction

November 2005: Version 1.2.3 of UIMA is available

"IBM's Unstructured Information Management
Architecture (UIMA) is an architecture and
software framework for creating, discovering,
composing and deploying a broad range of
multi-modal analysis capabilities and integrating
them with search technologies."
