Title: GATE, a General Architecture for Text Engineering


1
  • GATE, a General Architecture for Text Engineering
  • http://gate.ac.uk/ http://nlp.shef.ac.uk/
  • Hamish Cunningham
  • Department of Computer Science, University of
    Sheffield
  • ENST, Paris, 20/1/2003
  • Natural Language Engineering in Sheffield
  • One of the largest Human Language Technology
    groups in the EU
  • 50 staff in Language and Speech Processing and
    25 in Information Retrieval, including 6
    professors
  • A focus on scientific method in AI (participate
    in all the leading quantitative evaluation
    programmes in the US)
  • A focus on engineering high-quality open-source
    software for applications and demonstrators

2
  • GATE, a General Architecture for Text Engineering
  • GATE is:
  • An architecture: a macro-level organisational
    picture for LE software systems.
  • A framework: for programmers, GATE is an
    object-oriented class library that implements the
    architecture.
  • A development environment: for language
    engineers, computational linguists et al., GATE is
    a graphical development environment bundled with
    a set of tools for doing e.g. Information
    Extraction.
  • Free software (LGPL). Mature, robust software (in
    development since 1995). Download at
    http://gate.ac.uk/download
  • Comes with:
  • Some free components... ...and wrappers for
    other people's components
  • Tools for: evaluation; visualise/edit;
    persistence; IR; IE; dialogue; ontologies; etc.

3
  • Applications and languages
  • GATE has been used for a variety of applications,
    including:
  • MUMIS: automatic creation of semantic indexes
    for multimedia programme material
  • MUSE: a multi-genre IE system
  • EMILLE: a 70 million word corpus of Indic
    languages
  • Metadata for Medline (at Merck)
  • Creation of metadata for Semantic Web Services
    documentation using NLG
  • HSE: summarisation of health and safety
    information from company reports
  • OldBaileyIE: NE recognition on 17th century Old
    Bailey Court reports
  • AKT: language technology in knowledge management
  • AMITIES: call centre automation
  • Digital libraries / e-philology for ancient
    languages researchers
  • Various Medical Informatics and database
    technology projects
  • IE in Romanian, Bulgarian, Greek, Bengali,
    Spanish, Swedish, German, Italian, and French
    (Arabic, Chinese and Russian next year)

4
Some users
  • At the time of writing, a representative sample
    of GATE users includes:
  • Longman Pearson publishing, UK
  • BT Exact Technologies, UK
  • Merck KgAa, Germany
  • Canon Europe, UK
  • Knight Ridder (the second biggest US news
    publisher)
  • BBN Technologies, US
  • Sirma AI Ltd., Bulgaria
  • Resco AB, Sweden/Finland/Germany
  • Glaxo Smith Kline Plc: drug-based navigation of
    Medline abstracts
  • Master Foods NV: extraction of commodities events
    from news
  • the American National Corpus project, US
  • Imperial College, London, the University of
    Manchester, Queen Mary College, UMIST, the
    University of Karlsruhe, Vassar College, ISI /
    the University of Southern California and a large
    number of other UK, US and EU Universities
  • the Perseus Digital Library project, Tufts
    University, US.

5
  • Architectural principles
  • Non-prescriptive, theory neutral (strength and
    weakness)
  • Re-use, interoperation, not reimplementation
    (e.g. diverse XML support, integration of tools
    like Protégé, Jena and Weka)
  • (Almost) everything is a component, and
    component sets are user-extendable
  • Component-based development
  • An OO way of chunking software: Java Beans
  • GATE components: CREOLE, modified Java Beans
    (Collection of REusable Objects for Language
    Engineering)
  • The minimal component: 10 lines of Java, 10
    lines of XML, 1 URL.
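
To make the "bean with a few lines of Java" idea concrete, here is a minimal sketch in plain Java. It does not use the GATE class library: the `Component` interface, the `UpperCaser` bean and the `ComponentRegistry` are hypothetical names invented for illustration, and the registry only stands in for the idea that component sets are user-extendable.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical contract for a component; an assumption, not the GATE API.
interface Component {
    void execute();
}

// A bean-style component: no-arg constructor, get/set properties, one execute method.
class UpperCaser implements Component {
    private String input = "";
    private String output = "";

    public UpperCaser() {}

    public String getInput() { return input; }
    public void setInput(String input) { this.input = input; }
    public String getOutput() { return output; }

    @Override
    public void execute() {
        output = input.toUpperCase(Locale.ROOT);
    }
}

// Hypothetical registry: maps a name to a component class so new components can be added.
class ComponentRegistry {
    private final Map<String, Class<? extends Component>> classes = new LinkedHashMap<>();

    void register(String name, Class<? extends Component> cls) {
        classes.put(name, cls);
    }

    // Instantiates the component through its no-arg constructor, as bean tooling would.
    Component create(String name) {
        Class<? extends Component> cls = classes.get(name);
        if (cls == null) {
            throw new IllegalArgumentException("Unknown component: " + name);
        }
        try {
            return cls.getDeclaredConstructor().newInstance();
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Cannot instantiate " + name, e);
        }
    }
}

class BeanComponentSketch {
    public static void main(String[] args) {
        ComponentRegistry registry = new ComponentRegistry();
        registry.register("upper", UpperCaser.class);

        UpperCaser component = (UpperCaser) registry.create("upper");
        component.setInput("gate");
        component.execute();
        System.out.println(component.getOutput());
    }
}
```

Creating components by name through a no-arg constructor is what lets a tool discover and configure them without compile-time knowledge of the class; in this sketch the registry plays the role that a descriptor file would play in a real system.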

6
  • GATE Language Resources
  • GATE LRs are documents, ontologies, corpora,
    lexicons, ...
  • Documents / corpora
  • GATE documents are loaded from local files or
    the web...
  • Diverse document formats: text, HTML, XML,
    email, RTF, SGML.
  • Processing Resources
  • Algorithmic components known as PRs: beans with
    execute methods.
  • All PRs can handle Unicode data by default.
  • Clear distinction between code and data (simple
    repurposing).
  • 20-30 freebies with GATE
  • e.g. Named entity recognition; WordNet; Protégé;
    Ontology; OntoGazetteer; DAML+OIL export;
    Information Retrieval based on Lucene
7
Visual Resources
8
Displaying Coreference Information
9
Displaying Syntactic Information
10
Lexicon Support: WordNet example
11
A Language Analysis Example
12
Building IE Components in GATE (1): the ANNIE
system, a reusable and easily extendable set of
components
13
  • Building IE Components in GATE (2)
  • JAPE: a Java Annotation Patterns Engine
  • Light, robust regular-expression-based
    processing
  • Cascaded finite state transduction
  • Low-overhead development of new components
  • Example rule:
      Rule: Company1
      Priority: 25
      (
        ( {Token.orthography == upperInitial} )+
        {Lookup.kind == companyDesignator}
      ):companyMatch
      -->
      :companyMatch.NamedEntity =
        { kind = "company", rule = "Company1" }
14
  • Performance Evaluation
  • At document level: annotation diff
  • At corpus level: corpus benchmark tool,
    tracking system performance over time
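
A document-level comparison of this kind is usually summarised as precision, recall and F-measure over a hand-annotated key set and a system response set. The sketch below assumes strict matching (same type and same offsets) and encodes each annotation as a "type:start:end" string. Both choices are assumptions made for the example and are not a description of how the GATE tools are implemented.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

class AnnotationDiffSketch {
    // Returns {precision, recall, F1} for a response set scored against a key set.
    static double[] score(Set<String> key, Set<String> response) {
        Set<String> correct = new HashSet<>(response);
        correct.retainAll(key); // annotations present in both sets

        double precision = response.isEmpty() ? 0.0 : (double) correct.size() / response.size();
        double recall = key.isEmpty() ? 0.0 : (double) correct.size() / key.size();
        double f1 = (precision + recall == 0.0)
            ? 0.0
            : 2 * precision * recall / (precision + recall);
        return new double[] {precision, recall, f1};
    }

    public static void main(String[] args) {
        Set<String> key = new HashSet<>(Arrays.asList(
            "Person:0:5", "Org:10:14", "Org:20:28", "Date:30:34"));
        Set<String> response = new HashSet<>(Arrays.asList(
            "Person:0:5", "Org:10:14", "Org:40:44"));
        double[] s = score(key, response);
        System.out.printf("P=%.3f R=%.3f F1=%.3f%n", s[0], s[1], s[2]);
    }
}
```

Running the same scoring over every document in a corpus after each change to the system, and comparing the figures with an earlier run, is the idea behind tracking performance over time.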

15
Regression Testing: Corpus Benchmark Tool
16
The Semantic Web and GATE
  • GATE is being used for development of
    (semi-)automatic methods for
  • linking web pages to Ontologies using
    Information Extraction
  • learning and evolving Ontologies via IE and
    lexical semantic network traversal.

17
Populating Ontologies with IE
18
Protégé and Ontology Management
19
Information Retrieval Support: based on the
Lucene IR engine
20
Editing Multilingual Data
  • GATE Unicode Kit (GUK)
  • Java provides no special support for text input
    (this may change)
  • Support for defining additional Input
    Methods (IMs)
  • currently 30 IMs for 17 languages
  • Pluggable in other applications

21
Processing Multilingual Data: all the
visualisation and editing tools for ML LRs use
enhanced Java facilities
22
Dialogue Systems
  • GATE is being used in the Amities project for
    automating call centres
  • Creation of dialogue processing server
    components to run in the Galaxy Communicator
    architecture
  • Easy adaptation of the portable IE components to
    work on noisy ASR output
  • Robustness and speed of GATE components vital
    for real-time dialogue systems

23
The MUMIS project
  • Multimedia Indexing and Searching Environment
  • Composite index of a multimedia programme from
    multiple sources in different languages
  • ASR, video processing, information extraction
    (Dutch, English, German), merging, user interface
  • University of Twente/CTIT, University of
    Sheffield, University of Nijmegen, DFKI, MPI,
    ESTEAM AB, VDA
  • Yorick Wilks, Hamish Cunningham, Horacio Saggion,
    Kalina Bontcheva, Diana Maynard, Oana Hamza,
    Cristian Ursu

24
The Whole Picture
[Diagram. Recoverable labels: Ontology, Lexicon; IE
for DE, NL and EN; Formal Text; Final Annotations;
Text Sources; Video and Audio Signal; Speech Signals;
ASR; Transcriptions; Multimedia Data Base; Query;
User Interface; Results]
25
User Interface
26
Play
27
  • Conclusion
  • GATE: an infrastructure that lowers the overhead
    of creating and embedding robust NLP components
  • Further information: http://gate.ac.uk/
  • Online demos, tutorials and documentation
  • Software downloads
  • Talks and papers