About This Presentation
Title:

Experiences with UIMA from a User

Slides: 42
Provided by: IWS
Learn more at: http://wdok.cs.uni-magdeburg.de

1
Experiences with UIMA from a User's Perspective
  • Dietmar Rösner,
  • Manuela Kunze,
  • Hany Mahgoub

University of Magdeburg, Knowledge Based Systems
and Document Processing
2
Overview
  • Introduction
  • GATE
  • UIMA
  • Conclusion

3
Introduction
  • November 2005: Version 1.2.3 of UIMA is available
  • "IBM's Unstructured Information Management
    Architecture (UIMA) is an architecture and
    software framework for creating, discovering,
    composing and deploying a broad range of
    multi-modal analysis capabilities and integrating
    them with search technologies."

4
Introduction
really?
5
Introduction
  • similarity/comparison of GATE and UIMA
  • frameworks
  • results are documents with annotations
  • pipeline processing
  • steps
  • task definition
  • one corpus

6
Evaluation Topics/Points
  • ease of getting acquainted with the system?
  • quality of documentation: completeness, clarity,
    up-to-dateness?
  • tutorials, use cases?
  • processing and linguistic resources?
  • lexica, gazetteer lists, tools
  • tools for resource maintenance and extension?
  • quality: self-explanatory, robust, comfortable
  • speed of processing?
  • single docs vs. large corpora?
  • limitations, suggestions for improvement?
  • support for im-/export of a variety of document
    formats?

7
Task of the Experiment
  • process a corpus of websites
  • to detect and extract information relevant for
    tourists
  • e.g., opening times of museums, prices of hotels
  • corpus
  • 30 tourism web sites of Egypt
  • additional 20 web sites of Washington, New York,
    London
  • output
  • Prolog facts for a reasoner
  • Questions
  • Which museum is now open?
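
A question such as "Which museum is open now?" amounts to checking whether a point in time falls inside an opening interval. The following is a minimal Python sketch of that check, not the reasoner used in the experiment. The list `OPENING_FACTS` and the function `open_museums` are hypothetical names standing in for the Prolog facts, and the chosen date is an assumption. The two intervals follow the opening hours quoted in the corpus excerpts, using the winter hours for the Military Museum.

```python
from datetime import datetime

# Hypothetical in-memory stand-in for the exported facts:
# (museum name, opening time, closing time) for one assumed day.
OPENING_FACTS = [
    ("Egyptian Museum", datetime(2005, 12, 1, 9, 0), datetime(2005, 12, 1, 17, 0)),
    ("Military Museum", datetime(2005, 12, 1, 8, 0), datetime(2005, 12, 1, 16, 30)),
]


def open_museums(facts, now):
    """Return the names of all museums whose opening interval contains `now`."""
    return [name for name, start, end in facts if start <= now < end]
```

Calling `open_museums(OPENING_FACTS, datetime(2005, 12, 1, 16, 45))` would list only the Egyptian Museum, because the Military Museum has already closed at 16:30.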

8
Excerpts from the Corpus
  • The Egyptian Museum is open the hours: 9am-5pm
    daily
  • The Military Museum is open the hours: Summer
    8am-5:30pm, winter 8am-4:30pm
  • Palace Museum is open the hours: 8am-5:30pm
    (summer), 8am-4:30pm (winter)
  • 10am-2pm, 6pm-9pm Sat-Wed; 6pm-9pm Fri

9
Overview
  • Introduction
  • GATE
  • UIMA
  • Conclusion

10
GATE: General Architecture for Text Engineering
  • a suite of tools for language processing and
    information extraction
  • rule-based modular IE system (ANNIE)
  • language and domain-independent processing
    resources
  • open and extensible architecture
  • aims to provide uniform access to various
    linguistic and ontological resources
  • http://gate.ac.uk/

11
GATE: General Architecture for Text Engineering
  • a software infrastructure for NLP researchers
    based on three main elements
  • an architecture: describing the components
    composing a language processing system
  • a framework: can be used as a basis for building
    such systems
  • a graphical development environment: a set of
    tools and components for language engineers

12
GATE: General Architecture for Text Engineering
  • GATE is distributed with an IE system called ANNIE
  • relies on finite state algorithms and the Java
    Annotation Pattern Engine (JAPE) language
  • comprises a set of core Processing Resources
    (PRs):
  • Tokeniser
  • Gazetteers
  • POS tagger
  • Sentence Splitter
  • Semantic Tagger (JAPE transducer)
  • Orthomatcher (orthographic coreference)

13
GATE: ANNIE
Cunningham et al.: Developing Language
Processing Components with GATE Version 3 (a
User Guide)
14
GATE Application
  • several Processing Resources: Tokenizer, Hash
    Gazetteer (with new/extended gazetteer lists),
    JAPE Transducer
  • example input: "... The Military Museum Summer
    8am-5:30pm Winter 9pm-5pm"
  • ANNIE English Tokenizer
  • Gazetteer lists: names of museums, fragments of
    times and restrictions
  • JAPE Transducer: JAPE rules to annotate intervals
    of times and restrictions, and museums
15
Museum information in JAPE
    Rule: egyptmuseums
    (
      ({SpaceToken})
      ({Token.kind == word})
      ({SpaceToken})
      {Lookup.majorType == org_base}  // from gazetteer lists
      ({SpaceToken})?
      (({Token.kind == punctuation})({Token.kind == word})({SpaceToken}))
      ({timeinfo})  // annotation by JAPE transducer
    )
    :museum
    -->
    :museum.sight = {rule = "egyptmuseums"}
  • timeinfo is defined by JAPE rules and detects
    patterns like
  • 9am-5pm, 6pm-9pm
  • 8am-4:30pm, 8:30am-4:30pm, 8:30am-4pm
  • 5:00PM-7:00PM, 10:00am-5:00pm
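
The time-interval patterns listed above can be approximated with a regular expression. The sketch below is a Python illustration of the idea, not the authors' JAPE grammar. The names `TIME`, `INTERVAL`, `to_minutes` and `find_intervals` are hypothetical, and the expression accepts times written with or without a colon.

```python
import re

# One clock time such as "9am", "8:30am", "830am" or "5:00PM":
# hour, optional minutes (with or without a colon), then am/pm.
TIME = r"(\d{1,2}):?(\d{2})?\s*([ap]m)"
# Two clock times joined by a hyphen form an interval.
INTERVAL = re.compile(TIME + r"\s*-\s*" + TIME, re.IGNORECASE)


def to_minutes(hour, minute, meridiem):
    """Convert a 12-hour clock time to minutes after midnight."""
    h = int(hour) % 12                # 12am -> 0, 12pm -> 0 (+12 below)
    if meridiem.lower() == "pm":
        h += 12
    return h * 60 + int(minute or 0)


def find_intervals(text):
    """Return (start, end) pairs in minutes after midnight for each interval found."""
    result = []
    for m in INTERVAL.finditer(text):
        h1, m1, ap1, h2, m2, ap2 = m.groups()
        result.append((to_minutes(h1, m1, ap1), to_minutes(h2, m2, ap2)))
    return result
```

For instance, `find_intervals("open the hours 9am-5pm daily")` returns one pair covering 09:00 to 17:00. Restrictions such as "summer" or "Sat-Wed" are left out of this sketch.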

16
GATE: Presentation of Results
  • type and location of every extracted annotation
    in the document
  • annotations: museum information
17
GATE Results
  • information annotated in the documents
  • names of museums, hotels
  • names of tourist places in Egypt
  • times, time intervals
  • time restrictions
  • prices, intervals of prices (hotel prices and
    museum prices)
  • names of pharaohs, queens

18
GATE Evaluation
  • documentation?
  • good
  • illustrative examples (tutorial), but not enough,
    especially about JAPE rules
  • can be used without knowledge of Java programming
  • but experience with Java programming is an
    advantage when using Java in JAPE rules

19
GATE Evaluation
  • processing and linguistic resources?
  • many processing resources available (ANNIE):
    tokenisers, POS taggers, parsers, gazetteers,
    sentence splitter
  • additional PRs: gazetteer collector, PRs for
    Machine Learning, various exporters, annotation
    set transfer, etc.

20
GATE Evaluation
  • tools for resource maintenance and extension?
  • editor for gazetteer lists
  • corpus manager
  • text editor and debugger for JAPE rules

21
GATE Evaluation
  • speed of processing?
  • there is no measurement of processing time in the
    GATE tool

22
GATE Evaluation
  • single docs vs. large corpora?
  • corpus pipeline vs. document pipeline

23
GATE Evaluation
  • limitations, suggestions for improvement?
  • no limitations
  • everything is possible, but it is not necessary
    to implement it yourself
  • for a start: processing and linguistic resources
    are available within the distribution

24
GATE Evaluation
  • im-/export of document formats?
  • import: supports a variety of document formats
    (HTML, RTF, email, SGML and plain text)
  • in all cases the format is analysed and converted
    into a single unified model of annotation
  • export: documents, corpora and annotations in
    databases of various sorts
  • required: Java application (CREOLE)

25
Overview
  • Introduction
  • GATE
  • UIMA
  • Conclusion

26
UIMA: Unstructured Information Management
Architecture
  • a software architecture for developing and
    deploying unstructured information management
    (UIM) applications
  • UIM application: a software system that analyses
    large volumes of unstructured information to
    discover, organize, and deliver relevant
    knowledge to the end user
  • a software architecture which specifies
    component interfaces, data representations, ...
  • http://www.research.ibm.com/UIMA/

27
UIMA: Unstructured Information Management
Architecture
  • Collection Reader: interfaces to a collection of
    data items (e.g., documents) to be analyzed.
    Collection Readers return CASes that contain the
    documents to analyze, possibly along with
    additional metadata.
  • CAS Initializer: may be used by a Collection
    Reader to populate a CAS from a document. An
    example of a CAS Initializer is an HTML parser
    that de-tags an HTML document and also inserts
    paragraph annotations (determined from <P> tags
    in the original HTML) into the CAS.
  • Analysis Engine: takes a CAS, analyzes its
    contents, and produces an enriched CAS. Analysis
    Engines can be recursively composed of other
    Analysis Engines (called an Aggregate Analysis
    Engine). Aggregates may also contain CAS
    Consumers.
  • CAS Consumer: consumes the enriched CAS that was
    produced by the sequence of Analysis Engines
    before it, and produces an application-specific
    data structure, such as a search engine index or
    database.
  • CAS: Common Analysis Structure; CPM: Collection
    Processing Manager
Ferrucci et al.: Unstructured Information
Management Architecture (UIMA) SDK User's Guide
and Reference
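
The component roles described on this slide can be illustrated with a toy pipeline. The following Python sketch mirrors only the data flow (Collection Reader, CAS Initializer, Analysis Engines, CAS Consumer). It does not use the UIMA SDK or its Java API, and every name in it (`Cas`, `collection_reader`, `museum_annotator`, `time_annotator`, `aggregate`, `cas_consumer`, `run_pipeline`) is hypothetical.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Cas:
    """Toy stand-in for a CAS: the document text plus annotations over it."""
    text: str
    annotations: list = field(default_factory=list)  # (type, begin, end) tuples


def collection_reader(documents):
    """Yield one Cas per HTML string; the de-tagging plays the CAS Initializer role."""
    for html in documents:
        yield Cas(text=re.sub(r"<[^>]+>", " ", html))


def museum_annotator(cas):
    """Analysis engine: annotate capitalised word sequences ending in 'Museum'."""
    for m in re.finditer(r"(?:[A-Z][a-z]+ )+Museum", cas.text):
        cas.annotations.append(("Museum", m.start(), m.end()))
    return cas


def time_annotator(cas):
    """Analysis engine: annotate clock times such as '9am' or '5:30pm'."""
    for m in re.finditer(r"\d{1,2}(?::\d{2})?[ap]m", cas.text):
        cas.annotations.append(("Time", m.start(), m.end()))
    return cas


def aggregate(*engines):
    """Compose several engines into one, like an aggregate analysis engine."""
    def run(cas):
        for engine in engines:
            cas = engine(cas)
        return cas
    return run


def cas_consumer(cases):
    """Turn the enriched Cas objects into (type, covered text) pairs."""
    return [(t, cas.text[b:e]) for cas in cases for t, b, e in cas.annotations]


def run_pipeline(documents):
    engine = aggregate(museum_annotator, time_annotator)
    return cas_consumer(engine(cas) for cas in collection_reader(documents))
```

Each annotator only adds annotations and passes the same object on, which is the property that lets engines be composed into an aggregate.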
28
UIMA: Unstructured Information Management
Architecture
  • Analysis Engine (AE)
  • a component that analyzes artifacts (e.g.
    documents) and infers information about them
  • consists of two parts:
  • Java classes (typically packaged as one or more
    JAR files) and
  • AE descriptors (one or more XML files), which
    contain the configuration settings for the
    Analysis Engine as well as a description of the
    AE's input and output requirements

Ferrucci et al.: Unstructured Information
Management Architecture (UIMA) SDK User's Guide
and Reference
29
UIMA Application
  • several annotators (like a pipeline), based on
    regular expressions
  • example input: "... Fraunces Tavern Museum 54
    Pearl St. - 1-212-425-1778 Tuesday-Friday,
    12pm-5pm"
  • time pattern: interval of times
  • restrictions
  • window covering two time intervals and a
    restriction
  • museum pattern
  • window covering a museum and opening hours:
    museum information
  • output as Prolog facts:
    museumopen('Fraunces Tavern Museum ', '2005-12-01T12:00:00', '2005-12-01T17:00:00').
    museumopen('Fraunces Tavern Museum ', '2005-12-02T12:00:00', '2005-12-02T17:00:00').
    museumopen('Fraunces Tavern Museum ', '2005-12-03T12:00:00', '2005-12-03T17:00:00').
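
The export step, which turns a detected opening interval plus a weekday restriction into one Prolog fact per day, can be sketched as follows. This Python illustration is not the authors' CAS Consumer. The function `museum_open_facts`, its parameters, the date range and the timestamp layout with colons are assumptions, and only the predicate name `museumopen` is taken from the slide.

```python
from datetime import date, datetime, time, timedelta


def museum_open_facts(name, first_day, last_day, opens, closes, weekdays):
    """Build one museumopen/3 fact per day in [first_day, last_day].

    `weekdays` holds the allowed days in date.weekday() numbering
    (Monday = 0 ... Sunday = 6); days outside the set are skipped.
    """
    quoted = name.replace("'", "''")  # escape quotes inside a Prolog atom
    facts = []
    day = first_day
    while day <= last_day:
        if day.weekday() in weekdays:
            start = datetime.combine(day, opens).isoformat()
            end = datetime.combine(day, closes).isoformat()
            facts.append(f"museumopen('{quoted}', '{start}', '{end}').")
        day += timedelta(days=1)
    return facts
```

A restriction such as "Tuesday-Friday" would be passed as `{1, 2, 3, 4}`. Mapping the textual restriction to that set is a separate step not shown here.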
30
UIMA Results
  • information annotated in the documents
  • names of museums, hotels
  • times, time intervals
  • time restrictions
  • prices, intervals of prices (hotel prices)
  • keywords for museum category
  • names of pharaohs (annotated with a correction of
    misspellings)
  • hotel and museum information is exported into
    Prolog facts and into a short textual summary
  • templates filled with the detected information
  • hotels: Price information about Cosmopolitan
    Hotel: 157
  • museums:
  • Fraunces Tavern Museum
  • Open from 12:00:00 to 17:00:00
  • Restriction: Tuesday-Friday
31
UIMA Evaluation
  • documentation?
  • good
  • illustrative examples (tutorial)
  • completeness: some topics are described only
    very briefly
  • prior knowledge of Java and Eclipse is helpful

32
UIMA Evaluation
  • processing and linguistic resources?
  • annotators only from the tutorial: sentence
    annotation, word annotation, date/time
    annotators, examples for using regular
    expressions, etc.
  • external resources can be integrated
  • lexical resources as external resources (text
    files)
  • existing processing resources: implementation of
    an interface is necessary

33
UIMA Evaluation
  • tools for resource maintenance and extension?
  • specific Eclipse component editors or simple
    text editors

34
UIMA Evaluation
  • speed of processing?
  • faster than GATE?
  • the CPE provides detailed information about
    processing time for each module

35
UIMA Evaluation
  • single docs vs. large corpora?
  • Collection Reader: document(s) from a directory
  • extensions can be added as preprocessing (CAS
    Initializer), e.g., extraction of text fragments
    from an HTML document

36
UIMA Evaluation
  • limitations, suggestions for improvement?
  • no limitations
  • everything is possible, but implementation or
    interfacing must be done by the user
  • wish: more processing and linguistic resources
    within the distribution

37
UIMA Evaluation
  • im-/export of document formats?
  • import: CAS Initializer
  • export: CAS Consumer
  • transform annotations into any other format
  • export of document and annotations, or of
    annotations only
  • required: Java application

38
Overview
  • Introduction
  • GATE
  • UIMA
  • Conclusion

39
Conclusion
  • intended use
  • GATE: academic/scientific application
  • tools available
  • comfortable GUI
  • UIMA: more commercial
  • plain framework
  • simplified definition of (complex) result
    structures
  • simplified pre- and postprocessing of annotations
  • in sum: incommensurable

40
Conclusion
  • both are extensible
  • no final judgement on whether to use GATE or UIMA
  • depends on
  • your task
  • task description
  • expected results
  • which processing resources are necessary
  • your preferences for interface
  • prefer the Eclipse environment (or other Java
    editors)
  • prefer a comfortable GUI
  • or use both

41
Conclusion
  • found in the UIMA Forum:
  • "I see UIMA and GATE as complementary rather than
    competitive, and each can gain from the strengths
    of the other.
  • GATE was originally developed as a research
    tool, and has features suited to rapid
    prototyping of text processing code, like JAPE (a
    language for defining finite-state transducers
    over annotations on a document).
  • UIMA is more targeted at robust deployment of
    applications, with strong typing of feature
    structures and better support for distributed
    processing. We're currently working on writing a
    translation layer to allow UIMA analysis
    components to be used in GATE and vice-versa.
    It's not in a releasable state just yet, but we
    hope to release something in the near future.
    Keep your eye on http://gate.ac.uk/ for details."
  • Ian Roberts (GATE developer)