Where the Web Went Wrong - PowerPoint PPT Presentation

Slides: 22
Provided by: ham48
Title: Where the Web Went Wrong


1
Where the Web Went Wrong
http://gate.ac.uk/  http://nlp.shef.ac.uk/
Hamish Cunningham, Dept. Computer Science, University of Sheffield
Graz, May 2004
2
Contents
  • The Web, presentation, and syndication
  • A Semantic Web for eCulture
  • annoy half the audience
  • annoy the other half
  • eCulture, metadata and human language
  • motivation
  • Information Extraction: quantified language
    computing
  • MUMIS, GATE, ...
  • Cultural memory is not a luxury

3
Syndication and Mediation
  • The web promotes diversity, but also
    fragmentation
  • Original web: separate content and presentation
    ("this is a header", not "set in 20 point bold
    font")
  • Now: many incompatible/inaccessible interfaces
  • Memory Institutions (museums, libraries,
    archives) need to:
  • pool their impact: syndication in networked
    communities
  • support repurposable content
  • Therefore data must be presentation independent
  • Candidate technologies: DC, CIDOC, XML, RSS,
    RDF, OWL (semantic web)...
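
A minimal sketch of what "presentation independent" can mean in practice: one record is stored without any layout, and separate functions render it as an RSS-style item for syndication or as plain text for display. The record, its field values and the example.org URL are hypothetical, and the item element is only a fragment, not a complete feed.

```python
import xml.etree.ElementTree as ET

# A hypothetical museum record, stored once with no layout information.
record = {
    "title": "Bronze astrolabe",
    "creator": "Unknown maker",
    "link": "http://example.org/objects/42",
    "description": "Planispheric astrolabe, brass and bronze.",
}


def to_rss_item(rec):
    """Render the record as an RSS-style <item> fragment for syndication."""
    item = ET.Element("item")
    for field in ("title", "link", "description"):
        ET.SubElement(item, field).text = rec[field]
    return ET.tostring(item, encoding="unicode")


def to_plain_text(rec):
    """Render the same record as two lines of plain text."""
    return f"{rec['title']} ({rec['creator']})\n{rec['description']}"


if __name__ == "__main__":
    print(to_rss_item(record))
    print(to_plain_text(record))
```

Because neither renderer changes the record, a third presentation (for example an HTML page) could be added without touching the stored data.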

4
Semantic Web (1)
  • Memory Institutions (museums, libraries,
    archives) host massively diverse content
  • "Fortunately, the differences are primarily at the
    level of data structure and syntax. Significant
    conceptual overlaps exist between the descriptive
    schema used by memory institutions; elemental
    concepts such as objects, people, places, events,
    and the interrelationships between them are
    almost universal." (Building semantic bridges
    between museums, libraries and archives: The
    CIDOC Conceptual Reference Model, T. Gill, April
    2004)
  • Therefore we can add a semantic metadata layer to
    provide generalised inter-institution resource
    location
  • Syndication and mediation for free!
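
As an illustration of such a semantic metadata layer, the sketch below maps institution-specific field names onto the shared concepts named in the quotation (object, person, place, event) and then searches across institutions through the shared view. All field names and records are hypothetical; a real mapping would target a model such as the CIDOC CRM rather than a flat dictionary.

```python
# Hypothetical local field names for three kinds of institution; only the
# shared concepts (object, person, place, event) come from the slide.
CROSSWALK = {
    "museum": {"artefact": "object", "maker": "person", "findspot": "place"},
    "library": {"title": "object", "author": "person", "pub_place": "place"},
    "archive": {"item": "object", "correspondent": "person", "occasion": "event"},
}


def to_shared(source, record):
    """Translate a local record into the shared concept vocabulary."""
    mapping = CROSSWALK[source]
    return {mapping[key]: value for key, value in record.items() if key in mapping}


def find_person(name, holdings):
    """Search every institution's holdings for a person via the shared layer."""
    hits = []
    for source, record in holdings:
        shared = to_shared(source, record)
        if shared.get("person") == name:
            hits.append((source, shared))
    return hits


if __name__ == "__main__":
    holdings = [
        ("museum", {"artefact": "Sketchbook", "maker": "A. Example"}),
        ("library", {"title": "Collected Letters", "author": "A. Example"}),
        ("archive", {"item": "Ledger", "correspondent": "B. Other"}),
    ]
    for source, shared in find_person("A. Example", holdings):
        print(source, shared)
```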

5
Semantic Web (2): good news and bad news
  • The good news: SW is the focus of AI and metadata work
  • The bad news: AI always fails
  • How does the machine tell the difference between
    "Mother Teresa is a saint" and "Tony Blair is a
    saint"? (Or, who tells Google which statement is
    important?)
  • Other web users do, by linking (also cf. Amazon)
  • Two solutions to the AI problem:
  • allow curators and users to build their own
    (simple specific models can succeed, but the cost
    may be too high)
  • use recommender systems to make the user a
    curator's assistant (researchers and students may
    barter for access)
  • Any route to searchable content!

6
IT context: the Knowledge Economy and Human
Language
  • Gartner, December 2002:
  • taxonomic and hierarchical knowledge mapping and
    indexing will be prevalent in almost all
    information-rich applications
  • through 2012 more than 95% of human-to-computer
    information input will involve textual language
  • A contradiction:
  • to deal with the information deluge we need
    formal knowledge in semantics-based systems
  • our archived history is in informal and ambiguous
    natural language
  • The challenge: to reconcile these two phenomena

7
HLT: Closing the Loop
KEY
  • MNLG: Multilingual Natural Language Generation
  • OIE: Ontology-aware Information Extraction
  • AIE: Adaptive IE
  • CLIE: Controlled Language IE
[Diagram labels: Human Language; Controlled Language;
Formal Knowledge (ontologies and instance bases);
Semantic Web, Semantic Grid, Semantic Web Services;
connecting processes (M)NLG, OIE, (A)IE, CLIE]
8
Information Extraction
  • Information Extraction (IE) pulls facts and
    structured information from the content of large
    text collections.
  • Contrast: IE and Information Retrieval
  • NLP history: from NLU to IE
  • Progress driven by quantitative measures
  • MUC: Message Understanding Conferences
  • ACE: Automatic Content Extraction
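
The slide does not say which quantitative measures it means; evaluations of this kind commonly report precision, recall and F-measure, so the sketch below assumes those. It scores a set of predicted items against a gold-standard set. The entity labels in the usage example are hypothetical.

```python
def precision_recall_f1(predicted, gold):
    """Score predicted items against gold-standard items (exact match)."""
    predicted, gold = set(predicted), set(gold)
    correct = len(predicted & gold)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # Balanced F-measure: harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


if __name__ == "__main__":
    gold = {("Dr. Head", "PERSON"), ("We Build Rockets", "ORG"), ("Tuesday", "DATE")}
    predicted = {("Dr. Head", "PERSON"), ("Tuesday", "DATE"), ("rocket", "ORG")}
    print(precision_recall_f1(predicted, gold))
```

Exact matching is the simplest scoring rule; real evaluations often also give partial credit for overlapping spans, which this sketch leaves out.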

9
IE Example
  • The shiny red rocket was fired on Tuesday. It is
    the brainchild of Dr. Big Head. Dr. Head is a
    staff scientist at We Build Rockets Inc.
  • NE: "rocket", "Tuesday", "Dr. Head", "We Build
    Rockets"
  • CO: "it" = rocket; "Dr. Head" = "Dr. Big Head"
  • TE: the rocket is "shiny red" and Head's
    "brainchild".
  • TR: Dr. Head works for We Build Rockets Inc.
  • ST: rocket launch event with various participants
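
The five layers above (in MUC terminology, roughly: named entities, coreference, template elements, template relations and scenario templates) can be pictured as progressively more structured data. The sketch below writes the slide's analysis out by hand; nothing is extracted automatically, and the key names, the works_for relation label and the participant list under ST are assumptions made for illustration.

```python
# Hand-written analysis of the slide's example text (not extracted by code).
analysis = {
    "NE": ["rocket", "Tuesday", "Dr. Head", "We Build Rockets"],
    # CO: each mention points to the mention it corefers with.
    "CO": {"it": "rocket", "Dr. Big Head": "Dr. Head"},
    # TE: attributes attached to an entity.
    "TE": {"rocket": {"description": "shiny red", "brainchild_of": "Dr. Head"}},
    # TR: relations between entities, as (subject, relation, object).
    "TR": [("Dr. Head", "works_for", "We Build Rockets Inc.")],
    # ST: an event template; the participant list is an assumption.
    "ST": {
        "event": "rocket launch",
        "date": "Tuesday",
        "participants": ["rocket", "Dr. Head", "We Build Rockets Inc."],
    },
}


def resolve(mention, coref):
    """Follow coreference links until a canonical mention is reached."""
    seen = set()
    while mention in coref and mention not in seen:
        seen.add(mention)
        mention = coref[mention]
    return mention


def employer_of(person, result):
    """Answer 'who does X work for?' using the CO and TR layers together."""
    person = resolve(person, result["CO"])
    return [obj for subj, rel, obj in result["TR"]
            if subj == person and rel == "works_for"]


if __name__ == "__main__":
    print(resolve("it", analysis["CO"]))
    print(employer_of("Dr. Big Head", analysis))
```

The point of the last function is that a question phrased with one name ("Dr. Big Head") can only be answered once coreference has tied it to the name used in the relation ("Dr. Head").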

10
Performance levels
  • (Extensive quantitative evaluation since early
    90s: mainly on text, ASR; now also video OCR)
  • Vary according to text type, domain, scenario,
    language
  • NE: up to 97% (tested in English, Spanish,
    Japanese, Chinese, others)
  • CO: 60-70% resolution
  • TE: 80%
  • TR: 75-80%
  • ST: 60% (but human level may be only 80%)

11
Ontology-based IE
"XYZ was established on 03 November 1978 in
London. It opened a plant in Bulgaria in ..."
[Diagram: Ontology and KB. Classes: Location,
Company, City, Country. Relations: type, HQ,
partOf, establOn. Value: 03/11/1978]
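
One way to picture the output of ontology-based IE for the example sentence is as a small set of subject-relation-object triples using the class and relation names from the diagram. The triples below are a hand-written guess at what the diagram shows. The partOf relation is left out because the transcript does not show what it connects, and the match helper is a hypothetical stand-in for a real knowledge-base query.

```python
# Hand-written triples using the diagram's vocabulary (type, HQ, establOn).
# Which nodes each edge connects is an assumption based on the sentence.
triples = {
    ("XYZ", "type", "Company"),
    ("XYZ", "HQ", "London"),
    ("XYZ", "establOn", "03/11/1978"),
    ("London", "type", "City"),
    ("Bulgaria", "type", "Country"),
}


def match(store, s=None, p=None, o=None):
    """Return triples matching the pattern; None acts as a wildcard."""
    return sorted(
        t for t in store
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    )


if __name__ == "__main__":
    # Which companies have their headquarters in something typed as a City?
    for company, _, _ in match(triples, p="type", o="Company"):
        for _, _, place in match(triples, s=company, p="HQ"):
            if match(triples, s=place, p="type", o="City"):
                print(company, "has its HQ in the city of", place)
```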
12
Classes, instances & metadata
Gordon Brown met George Bush during his two day visit.
<metadata>
  <DOC-ID>http// 1.html</DOC-ID>
  <Annotation>
    <s_offset> 0 </s_offset>
    <e_offset> 12 </e_offset>
    <string>Gordon Brown</string>
    <class>Person</class>
    <inst>Person12345</inst>
  </Annotation>
  <Annotation>
    <s_offset> 18 </s_offset>
    <e_offset> 32 </e_offset>
    <string>George Bush</string>
    <class>Person</class>
    <inst>Person67890</inst>
  </Annotation>
</metadata>
[Figures: classes + instances before and after;
label: Bush]
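
A minimal sketch of producing and reading stand-off annotations in the shape shown on this slide (a metadata element holding Annotation records with start and end offsets, the matched string, a class and an instance identifier). The element names follow the slide; the doc-1 identifier is hypothetical, and offsets are computed from the sentence as transcribed here, so they may differ from the figures on the slide.

```python
import xml.etree.ElementTree as ET


def build_metadata(doc_id, text, mentions):
    """Build stand-off metadata; mentions are (string, class, instance id)."""
    root = ET.Element("metadata")
    ET.SubElement(root, "DOC-ID").text = doc_id
    for surface, cls, inst in mentions:
        start = text.find(surface)
        if start < 0:
            continue  # mention does not occur in the text
        ann = ET.SubElement(root, "Annotation")
        ET.SubElement(ann, "s_offset").text = str(start)
        ET.SubElement(ann, "e_offset").text = str(start + len(surface))
        ET.SubElement(ann, "string").text = surface
        ET.SubElement(ann, "class").text = cls
        ET.SubElement(ann, "inst").text = inst
    return root


def read_annotations(root):
    """Turn each Annotation element into a plain dictionary."""
    return [{child.tag: child.text for child in ann}
            for ann in root.findall("Annotation")]


if __name__ == "__main__":
    text = "Gordon Brown met George Bush during his two day visit."
    root = build_metadata("doc-1", text, [
        ("Gordon Brown", "Person", "Person12345"),
        ("George Bush", "Person", "Person67890"),
    ])
    print(ET.tostring(root, encoding="unicode"))
    for ann in read_annotations(root):
        # The offsets point back into the untouched source text.
        print(ann["inst"], text[int(ann["s_offset"]):int(ann["e_offset"])])
```

Keeping the annotations separate from the text means several layers (names, events, translations) can describe the same document without altering it.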
13
An example: the MUMIS project
  • Multimedia Indexing and Searching Environment
  • Composite index of a multimedia programme from
    multiple sources in different languages
  • ASR, video processing, Information Extraction
    (Dutch, English, German), merging, user interface
  • University of Twente/CTIT, University of
    Sheffield, University of Nijmegen, DFKI, MPI,
    ESTEAM AB, VDA
  • An important experimental result: multiple
    sources for same events can improve extraction
    quality
  • PrestoSpace: applications in news and sports
    archiving

14
Semantic Query
Not: "goal Beckham" (includes e.g. missed goals,
or "this was not a goal"). Instead: goal events
with scorer David Beckham
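
The contrast can be sketched with a toy index. Each commentary line below carries a hand-written event annotation; the lines, event type names and role names are all hypothetical. A keyword query for "goal" and "Beckham" returns every line, while a query over the annotations returns only goal events whose scorer is David Beckham.

```python
# Hypothetical commentary lines, each with a hand-written event annotation.
events = [
    {"text": "Beckham scores a superb goal from the free kick",
     "type": "goal", "scorer": "David Beckham"},
    {"text": "Beckham shoots but misses the goal",
     "type": "missed_shot", "player": "David Beckham"},
    {"text": "The referee rules that this was not a goal, Beckham protests",
     "type": "disallowed_goal", "player": "David Beckham"},
    {"text": "Owen heads in a goal after Beckham's cross",
     "type": "goal", "scorer": "Michael Owen"},
]


def keyword_query(items, *words):
    """Return items whose text contains every keyword (case-insensitive)."""
    return [e for e in items
            if all(w.lower() in e["text"].lower() for w in words)]


def semantic_query(items, event_type, **roles):
    """Return items of the given event type whose roles match exactly."""
    return [e for e in items
            if e["type"] == event_type
            and all(e.get(role) == value for role, value in roles.items())]


if __name__ == "__main__":
    print(len(keyword_query(events, "goal", "Beckham")), "keyword hits")
    for e in semantic_query(events, "goal", scorer="David Beckham"):
        print("semantic hit:", e["text"])
```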
15
The results: England win!
16
GATE, a General Architecture for Text Engineering
is...
  • An architecture: A macro-level organisational
    picture for LE software systems.
  • A framework: For programmers, GATE is an
    object-oriented class library that implements the
    architecture.
  • A development environment: For language engineers,
    a graphical development environment.
  • GATE comes with...
  • Free components, and wrappers for other people's
    stuff
  • Tools for: evaluation; visualise/edit;
    persistence; IR; IE; dialogue; ontologies; etc.
  • Free software (LGPL) at http://gate.ac.uk/download/
  • Used by thousands of people at hundreds of sites

17
A bit of a nuisance (GATE users)
  • Thousands of users at hundreds of sites. A
    representative sample:
  • the American National Corpus project
  • the Perseus Digital Library project, Tufts
    University, US
  • Longman Pearson publishing, UK
  • Merck KGaA, Germany
  • Canon Europe, UK
  • Knight Ridder, US
  • BBN (leading HLT research lab), US
  • SMEs inc. Sirma AI Ltd., Bulgaria
  • Stanford, Imperial College, London, the
    University of Manchester, UMIST, the University
    of Karlsruhe, Vassar College, the University of
    Southern California and a large number of other
    UK, US and EU Universities
  • UK and EU projects inc. MyGrid, CLEF, dotkom,
    AMITIES, Cub Reporter, EMILLE, Poesia...
  • GATE team projects. Past:
  • Conceptual indexing: MUMIS: automatic semantic
    indices for sports video
  • MUSE, cross-genre entity finder
  • HSL, Health-and-safety IE
  • Old Bailey: collaboration with HRI on 17th
    century court reports
  • Multiflora: plant taxonomy text analysis for
    biodiversity research e-science
  • ACE / TIDES: Arabic, Chinese NE
  • JHU summer w/s on semtagging
  • EMILLE: S. Asian languages corpus
  • hTechSight: chemical eng. K. portal
  • Present:
  • Advanced Knowledge Technologies: 12m UK five
    site collaborative project
  • SEKT: Semantic Knowledge Technology
  • PrestoSpace: MM Preservation/Access
  • KnowledgeWeb: Semantic Web
  • Future:
  • New eContent project: LIRICS

18
GATE infrastructure for semantic metadata
extraction
  • Combines learning and rule-based methods (new
    work on mixed-initiative learning)
  • Allows combination of IE and IR
  • Enables use of large-scale linguistic resources
    for IE, such as WordNet
  • Supports ontologies as part of IE applications -
    Ontology-Based IE
  • Supports languages from Hindi to Chinese, Italian
    to German

19
PrestoSpace Semantics Architecture
[Diagram labels: AV Signals; ASR, etc.; Signal md,
Transcriptions; Text Sources (EN, IT, ...); IE;
Formal Text; Final Annotations; Multilingual
Conceptual Q&A]
20
Memory is not a luxury
  • C21st: all the C20th mistakes but bigger &
    better?
  • If you don't know where you've been, how can you
    know where you're going?
  • Archives: ammunition in the war on ignorance
  • Ammunition is useless if you can't find it: new
    technology must make our history accessible to
    all, for all our futures

21
Links
  • This talk
  • http://gate.ac.uk/sale/talks/eculture-graz-may2004.ppt
  • Related projects