Experiences with UIMA in NLP teaching and research

Provided by: IWS
1
Experiences with UIMA in NLP teaching and research
  • Manuela Kunze,
  • Dietmar Rösner

University of Magdeburg, Knowledge Based Systems
and Document Processing
2
Overview
  • What is UIMA?
  • First Experiments
  • NLP Teaching
  • Conclusion

3
UIMA: Unstructured Information Management
Architecture
  • a software architecture for developing and
    deploying unstructured information management
    (UIM) applications
  • UIM application: a software system that
    analyses large volumes of unstructured information
    to discover, organize, and deliver relevant
    knowledge to the end user
  • software architecture which specifies
  • component interfaces, data representations, ...
  • http://www.research.ibm.com/UIMA/

4
UIMA: Unstructured Information Management
Architecture
  • Collection Reader: interfaces to a collection of
    data items (e.g., documents) to be analyzed.
    Collection Readers return CASes that contain the
    documents to analyze, possibly along with
    additional metadata.
  • CAS Initializer: may be used by a Collection
    Reader to populate a CAS from a document. An
    example of a CAS Initializer is an HTML parser
    that de-tags an HTML document and also inserts
    paragraph annotations (determined from <P> tags
    in the original HTML) into the CAS.
  • Analysis Engine: takes a CAS, analyzes its
    contents, and produces an enriched CAS. Analysis
    Engines can be recursively composed of other
    Analysis Engines (called an Aggregate Analysis
    Engine). Aggregates may also contain CAS
    Consumers.
  • CAS Consumer: consumes the enriched CAS that was
    produced by the sequence of Analysis Engines
    before it, and produces an application-specific
    data structure, such as a search engine index or
    database.
  • CAS: Common Analysis Structure
  • CPE: Collection Processing Engine

Ferrucci et al.: Unstructured Information
Management Architecture (UIMA) SDK User's Guide
and Reference
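The hand-over between the component roles on this slide (Collection Reader, CAS Initializer, Analysis Engine, CAS Consumer) can be sketched in plain Java. This is only an illustration and does not use the UIMA SDK: `SketchCas`, `SketchReader`, `SketchEngine`, `SketchConsumer` and `SketchPipeline` are hypothetical names, and the real CAS is far richer than a string plus a list.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Hypothetical stand-in for a CAS: the document text plus the annotations added so far.
final class SketchCas {
    final String text;
    final List<String> annotations = new ArrayList<>();

    SketchCas(String text) {
        this.text = text;
    }
}

// The component roles, reduced to one method each (these are not the UIMA SDK interfaces).
interface SketchReader { Iterator<SketchCas> read(); }
interface SketchEngine { void process(SketchCas cas); }
interface SketchConsumer { void consume(SketchCas cas); }

final class SketchPipeline {
    // Reader -> engines in the given order -> consumer, one CAS at a time.
    static void run(SketchReader reader, List<SketchEngine> engines, SketchConsumer consumer) {
        Iterator<SketchCas> it = reader.read();
        while (it.hasNext()) {
            SketchCas cas = it.next();
            for (SketchEngine engine : engines) {
                engine.process(cas);
            }
            consumer.consume(cas);
        }
    }

    // Toy CAS Initializer: strips tags so that the engines see plain text.
    static SketchCas initFromHtml(String html) {
        return new SketchCas(html.replaceAll("<[^>]+>", " ").trim());
    }
}
```

A reader can be written as a lambda that maps raw HTML strings through `initFromHtml`, and a consumer as a lambda that appends each enriched CAS to an index or a list.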
5
UIMA: Unstructured Information Management
Architecture
  • Analysis Engine (AE)
  • a component that analyzes artifacts (e.g.
    documents) and infers information about them
  • consists of two parts
  • Java classes (typically packaged as one or more
    JAR files) and
  • AE descriptors (one or more XML files), which contain
  • the configuration settings for the Analysis
    Engine as well as
  • a description of the AE's input and output
    requirements.

6
UIMA: Unstructured Information Management
Architecture
  • describe analysis engine (XML)
  • annotator class
  • input parameter
  • output of annotations
  • external resources
  • interface
  • resources
  • define an annotator (Java)
  • uses the Annotation Interface
  • creates annotations
  • analysis engine linked to a type system
  • define annotation type
  • name
  • features (begin, end, ...)
7
UIMA: Unstructured Information Management
Architecture
  • Aggregate Analysis Engine
  • combines different analysis engines within one
    Analysis Engine

Ferrucci et al.: Unstructured Information
Management Architecture (UIMA) SDK User's Guide
and Reference
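The idea of an aggregate, an engine that is itself built from other engines, corresponds to the composite pattern. A minimal plain-Java sketch, assuming a hypothetical `MiniEngine` interface rather than the UIMA SDK (where the aggregate is declared in an XML descriptor, not coded by hand):

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical engine interface: each engine appends its results to a shared list.
interface MiniEngine {
    void process(String text, List<String> annotations);
}

// An aggregate is itself an engine, so aggregates can be nested inside aggregates.
final class MiniAggregate implements MiniEngine {
    private final List<MiniEngine> delegates;

    MiniAggregate(MiniEngine... delegates) {
        this.delegates = Arrays.asList(delegates);
    }

    @Override
    public void process(String text, List<String> annotations) {
        // Delegates run in the fixed order in which they were given.
        for (MiniEngine delegate : delegates) {
            delegate.process(text, annotations);
        }
    }
}
```

Because `MiniAggregate` implements `MiniEngine`, callers do not need to know whether they hold a single engine or a whole sub-pipeline.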
8
Overview
  • Introduction
  • First Experiments
  • NLP Teaching
  • Conclusion

9
First Experiments: UIMA vs. GATE
  • baseline
  • 2 persons, 2 systems, 1 corpus and 1 extraction
    task
  • skills/experiences of the persons

Skills table (columns: UIMA, GATE, Eclipse/Java; rows: Person 1, Person 2); smiley ratings in extraction order: ☺ ☺ 😐 😐 😐 ☺
10
Task of the Experiment
  • process a corpus of websites
  • to detect and extract information relevant for
    tourists
  • opening times of museums, prices of hotels, ...
  • corpus
  • 30 tourism web sites about Egypt
  • additional 20 web sites about Washington, New York,
    London
  • output
  • Prolog facts for a reasoner
  • Questions
  • Which museum is now open?

11
Evaluation Topics/Points
  • ease of getting acquainted with system?
  • quality of documentation: completeness, clarity,
    up-to-dateness, ...?
  • tutorials, use cases, ...?
  • processing and linguistic resources?
  • lexica, gazetteer lists, tools
  • tools for resource maintenance and extension?
  • quality: self-explanatory, robust, comfortable
  • speed of processing?
  • single document vs. large corpora?
  • limitations, suggestions for improvement?
  • support for im-/export of a variety of document
    formats?

12
Excerpts from the Corpus
  • The Egyptian Museum is open the hours 9am-5pm
    daily
  • The Military Museum is open the hours Summer
    8am-5:30pm winter 8am-4:30pm
  • Palace Museum is open the hours 8am-5:30pm
    (summer) 8am-4:30pm (winter)
  • 10am-2pm, 6pm-9pm Sat-Wed 6pm-9pm Fri

13
UIMA Application
  • several annotators (like a pipeline)
  • input text: ... Fraunces Tavern Museum 54 Pearl St. -
    1-212-425-1778 Tuesday-Friday, 12pm-5pm
  • time pattern (regular expressions)
  • interval of times
  • restrictions
  • window covering two time intervals and a
    restriction
  • museum pattern (regular expressions)
  • museum information
  • window covering a museum and opening hours
  • output: Prolog facts
    museumopen('Fraunces Tavern Museum', '2005-12-01T12:00:00', '2005-12-01T17:00:00').
    museumopen('Fraunces Tavern Museum', '2005-12-02T12:00:00', '2005-12-02T17:00:00').
    museumopen('Fraunces Tavern Museum', '2005-12-03T12:00:00', '2005-12-03T17:00:00').
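The step from a matched opening-hours string to a Prolog fact like those on this slide can be sketched as follows. This is not the authors' annotator: the class `OpeningHoursSketch`, its regular expression and the ISO-8601-style timestamps with colons are assumptions made for illustration, and day restrictions such as "Tuesday-Friday" are ignored.

```java
import java.time.LocalDate;
import java.time.LocalDateTime;
import java.time.LocalTime;
import java.time.format.DateTimeFormatter;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class OpeningHoursSketch {
    // One clock time such as "12pm", "9am" or "5:30pm" (three capturing groups).
    private static final String TIME = "(\\d{1,2})(?::(\\d{2}))?(am|pm)";
    // Two clock times joined by a hyphen form an interval.
    private static final Pattern INTERVAL = Pattern.compile(TIME + "\\s*-\\s*" + TIME);
    private static final DateTimeFormatter STAMP =
            DateTimeFormatter.ofPattern("yyyy-MM-dd'T'HH:mm:ss");

    // Converts a 12-hour clock reading to a LocalTime (12am -> 00:00, 12pm -> 12:00).
    static LocalTime toTime(String hour, String minute, String meridiem) {
        int h = Integer.parseInt(hour) % 12;
        if (meridiem.equals("pm")) {
            h += 12;
        }
        int m = (minute == null) ? 0 : Integer.parseInt(minute);
        return LocalTime.of(h, m);
    }

    // Returns a museumopen/3 fact for the first interval in the text, or null if none is found.
    static String toPrologFact(String museum, LocalDate day, String text) {
        Matcher m = INTERVAL.matcher(text);
        if (!m.find()) {
            return null;
        }
        LocalDateTime from = day.atTime(toTime(m.group(1), m.group(2), m.group(3)));
        LocalDateTime to = day.atTime(toTime(m.group(4), m.group(5), m.group(6)));
        // A single quote inside a Prolog atom is written as two single quotes.
        String atom = museum.replace("'", "''");
        return String.format("museumopen('%s', '%s', '%s').",
                atom, STAMP.format(from), STAMP.format(to));
    }
}
```

Calling `toPrologFact` once per day that satisfies the restriction would give one fact per open day, as on the slide.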
14
UIMA Results
  • information annotated in the documents
  • names of museums, hotels
  • times, time intervals
  • time restrictions
  • prices, intervals of prices (hotel prices)
  • keywords for museum category
  • names of pharaohs (annotated with a correction of
    misspellings)
  • information about hotels and museums is exported
    into Prolog facts and into a short textual
    summary
  • templates filled with the detected information
  • hotels: price information about Cosmopolitan
    Hotel: 157
  • museums
  • Fraunces Tavern Museum
  • Open from 12:00:00 to 17:00:00
  • Restriction: Tuesday-Friday

15
UIMA vs. GATE: Conclusion
  • no final judgement on whether to use GATE or UIMA
  • depends on
  • your task
  • task description
  • expected results
  • which processing resources are necessary
  • your preferences for interface
  • prefer the Eclipse environment (or other Java
    editors)
  • prefer a comfortable GUI

16
UIMA vs. GATE: Conclusion
  • GATE
  • tools available
  • comfortable GUI
  • UIMA
  • plain framework
  • simplified definition of (complex) result
    structures
  • simplified pre- and postprocessing of annotations
  • both are extensible
  • e.g. for processing German documents

17
'German' Extension of Processing Resources
  • XDOC document suite
  • tools for processing German documents
  • tools implemented in Common Lisp
  • for UIMA
  • Java reimplementation of the tools
  • several analysis engines

18
XDOC in UIMA
  • annotation of
  • part-of-speech (Morphix, heuristics)
  • semantic categories
  • named entities (vehicles, cities, ...)
  • a coarse approach for classification of PP
  • using maxent library

19
UIMA Evaluation
  • documentation?
  • good
  • illustrative examples (tutorial)
  • completeness: some parts are described only very
    briefly
  • experience with Eclipse and Java programming is
    advantageous
  • processing and linguistic resources?
  • tools for resource maintenance and extension?
  • speed of processing?
  • single docs vs. large corpora?
  • limitations, suggestions for improvement?
  • im-/export of document formats?

20
UIMA Evaluation
  • processing and linguistic resources?
  • annotators only from tutorial
  • sentence annotation
  • word annotation
  • date/time annotators
  • examples for using regular expressions etc.
  • external resources can be integrated
  • lexical resources as external resources (text
    files)
  • existing processing resources
  • implementation of an interface is necessary
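A lexical resource kept as a plain text file can be loaded and applied with very little code. The sketch below is a hypothetical gazetteer, not a UIMA external-resource implementation: the class name `GazetteerSketch`, the one-entry-per-line format and the `#` comment convention are all assumptions.

```java
import java.io.BufferedReader;
import java.io.Reader;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

final class GazetteerSketch {
    private final List<String> entries = new ArrayList<>();

    // Reads one entry per line; blank lines and lines starting with '#' are skipped.
    GazetteerSketch(Reader source) {
        new BufferedReader(source).lines()
                .map(String::trim)
                .filter(line -> !line.isEmpty() && !line.startsWith("#"))
                .forEach(entries::add);
    }

    // Returns "begin-end:entry" for every case-insensitive occurrence of an entry.
    // Offsets assume that lower-casing does not change the length of the text.
    List<String> lookup(String text) {
        String haystack = text.toLowerCase(Locale.ROOT);
        List<String> hits = new ArrayList<>();
        for (String entry : entries) {
            String needle = entry.toLowerCase(Locale.ROOT);
            for (int i = haystack.indexOf(needle); i >= 0; i = haystack.indexOf(needle, i + 1)) {
                hits.add(i + "-" + (i + needle.length()) + ":" + entry);
            }
        }
        return hits;
    }
}
```

Passing a `FileReader` loads the list from disk; a `StringReader` is convenient for trying the lookup without a file.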

21
UIMA Evaluation
  • tools for resource maintenance and extension?
  • specific Eclipse component editors or
  • simple text editors

22
UIMA Evaluation
  • speed of processing?
  • faster than GATE?
  • the CPE reports detailed processing times
    for each module

23
UIMA Evaluation
  • single docs vs. large corpora?
  • Collection Reader
  • document(s) from a directory

24
UIMA Evaluation
  • limitations, suggestions for improvement?
  • no limitations
  • everything is possible, but the user has to do
    the implementation or interfacing
  • wish
  • more processing and linguistic resources within
    the distribution

25
UIMA Evaluation
  • im-/export of document formats?
  • import: CAS Initializer
  • export: CAS Consumer
  • transforms annotations into any other format
  • export of
  • document plus annotations
  • annotations only
  • requires a Java application

26
Overview
  • Introduction
  • First Experiments
  • NLP Teaching
  • Conclusion

27
NLP Teaching
  • course: Information Extraction
  • aim of the course: to make our students
    acquainted with information extraction as a basic
    NLP technology
  • UIMA, GATE
  • students: computer science, data and knowledge
    engineering
  • skills of the students: Java programming

28
NLP Teaching
  • different corpora
  • news about the FIFA World Cup 2006 in Germany,
  • descriptions of drugs,
  • announcements of new books, ...
  • tasks for students
  • to develop different analysis engines and combine
    them for annotation of
  • URLs,
  • email addresses,
  • names of players,
  • results of games, ...
  • using regular expressions, external resources,
    maximum entropy models
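For the URL and email tasks, a regular-expression annotator can be very small. The following is a teaching sketch under stated assumptions, not a course solution: `UrlEmailSketch` is a hypothetical class, and both patterns are deliberately simple and will miss or over-match some real-world cases.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class UrlEmailSketch {
    // Deliberately simple patterns; they do not cover every valid URL or address.
    private static final Pattern URL = Pattern.compile("https?://[^\\s<>\"]+");
    private static final Pattern EMAIL =
            Pattern.compile("[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,}");

    // Collects "label:covered text" for every match of one pattern.
    private static void collect(Pattern pattern, String label, String text, List<String> out) {
        Matcher m = pattern.matcher(text);
        while (m.find()) {
            out.add(label + ":" + m.group());
        }
    }

    // Runs both patterns over the text, URLs first, then email addresses.
    static List<String> annotate(String text) {
        List<String> out = new ArrayList<>();
        collect(URL, "URL", text, out);
        collect(EMAIL, "Email", text, out);
        return out;
    }
}
```

Player names and game results are harder to capture with patterns alone, which is where the external resources and maximum entropy models mentioned on the slide come in.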

29
NLP Teaching
30
UIMA: A Student's View
  • easy to handle
  • Java programming (environment)
  • problems of students
  • to understand the dependencies between the
    several descriptors
  • helpful for teaching (future work)
  • a 'comparator' for different student solutions
  • which solution is the best, relative to a 'master'
    solution
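One way such a 'comparator' could score a student solution against a master solution is precision, recall and F1 over exactly matching annotations. The slides only state the wish, so the metric, the class `ComparatorSketch` and the string encoding of annotations (for example "Time:5-8") are assumptions.

```java
import java.util.HashSet;
import java.util.Set;

final class ComparatorSketch {
    // Returns {precision, recall, f1}. Annotations are compared as exact strings,
    // so an annotation counts as correct only if type and offsets both match.
    static double[] score(Set<String> master, Set<String> student) {
        Set<String> correct = new HashSet<>(student);
        correct.retainAll(master);
        double precision = student.isEmpty() ? 0.0 : (double) correct.size() / student.size();
        double recall = master.isEmpty() ? 0.0 : (double) correct.size() / master.size();
        double f1 = (precision + recall == 0.0)
                ? 0.0
                : 2 * precision * recall / (precision + recall);
        return new double[] {precision, recall, f1};
    }
}
```

Ranking the student solutions by F1 would then answer "which solution is the best"; partial-overlap matching would be a gentler alternative to exact matching.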

31
Overview
  • Introduction
  • First Experiments
  • NLP Teaching
  • Conclusion

32
Conclusion
  • UIMA
  • easy to learn and to handle
  • supports the management of
  • different annotations
  • different processing resources
  • integration of external resources (processing
    resources as well as lexical resources)
  • splitting of 'processing steps'
  • reader, initializer, analysis engine, consumer
  • 'wish-list'
  • a kind of JAPE transducer
  • interface to GATE's processing resources is
    available
  • 'comparator' for evaluation of solutions

34
XDOC in UIMA
35
Introduction
really?
36
Introduction
  • first experiments with UIMA
  • processing tourism web sites, news about the FIFA
    World Cup 2006 in Germany, ...
  • integration of tools from the XDOC document suite
  • using UIMA in a course on Information Extraction

37
Introduction
  • November 2005: Version 1.2.3 of UIMA is available
  • "IBM's Unstructured Information Management
    Architecture (UIMA) is an architecture and
    software framework for creating, discovering,
    composing and deploying a broad range of
    multi-modal analysis capabilities and integrating
    them with search technologies."