THE DIATHESIS NEWSPAPER DIGITIZATION SUITE

1
THE DIATHESIS NEWSPAPER DIGITIZATION SUITE
  • Foundation for Research and Technology - Hellas (FORTH)
  • Institute of Computer Science
  • Centre for Cultural Informatics

Heraklion, Crete, Greece
Martin Doerr, Georgios Markakis, Maria Theodoridou
2
About DIATHESIS
  • Diathesis is a newspaper digitization suite whose
    primary purpose is the digitization,
    classification and dissemination of archival
    newspaper material.
  • It was originally used for the digitization of
    the Vikelaia Municipal Library's newspaper
    collection (1890-1960) in Heraklion, Crete. It
    has since evolved into an independent
    digitization suite.
  • It has also been used in other projects
    (Filekpedeytiki Etairia, Athens, Greece; The
    AYGHI newspaper).

3
The Problem
  • Historical newspapers are one of the most
    significant sources of information for researchers
    because of the wealth of information they provide
    on every aspect of everyday political, social and
    intellectual life.
  • Access to this type of archival material is
    usually obstructed by the following factors:
      • To protect the archival material from
        potential damage, some archives prohibit
        access to the largest part of their collection.
      • Direct contact with the original archival
        material constitutes a potential health hazard
        (due to dust and fungi).
      • The lack of indexes to newspapers, combined
        with the vastness of the information they
        contain, makes research a very time-consuming
        task.
  • Many archives adopted digitization of newspapers
    as a straightforward method to deal with the
    above problems. Digitized material is easier to
    preserve and much easier to distribute via the
    Web.
  • However, conversion of archival material into a
    digital image format (e.g. JPEG, TIFF, PDF or
    DJVU) does not solve the problem of rapid access
    to this material.
  • Digitization alone is inadequate unless it
    provides a means of accessing the digitized
    material quickly and accurately (also known as
    the searchability issue).

4
Current State-of-the-Art Newspaper Digitization Practices
  • Currently there are three main approaches for
    rendering newspaper archival material searchable:
      • The Physical Features Based Approach.
      • The OCR Based Full Text Indexing Approach.
      • The Conceptual Classification (Ontology Based)
        Approach.

5
The physical features based classification approach.
  • Newspapers are classified using a basic set of
    metadata regarding physical features of the
    original material (issue number, date of
    publication, newspaper name, number of pages,
    etc.).
  • Advantages
      • Simple to implement.
  • Disadvantages
      • The end user cannot conduct full-text
        searches at the article or issue level.
      • The final outcome of the digitization effort
        resembles a browsing mechanism more than a
        search tool.
      • There is no explicitly defined conceptual
        structure of the archive.
  • Institutions
      • ANNO (Austrian Newspapers Online) project
        (http://deposit.ddb.de/online/exil/exil.htm).
      • "Exilpresse digital. Deutsche Exilzeitschriften
        1933-1945" project
        (http://deposit.ddb.de/online/exil/exil.htm).
      • Denmark: Digitaliserede danske aviser 1759-1865
        (http://www.statsbiblioteket.dk).
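
As a rough illustration of the physical features based approach, the sketch below stores one record per issue and supports browsing only. The field names, the `browse` helper and the sample catalogue are all hypothetical; nothing here is taken from an actual system.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class IssueRecord:
    """Physical-feature metadata for one digitized issue (hypothetical fields)."""
    newspaper_name: str
    issue_number: int
    publication_date: date
    page_count: int


def browse(records, newspaper_name, year):
    """Return the issues of one title published in a given year, oldest first."""
    hits = [r for r in records
            if r.newspaper_name == newspaper_name
            and r.publication_date.year == year]
    return sorted(hits, key=lambda r: r.publication_date)


# Invented sample data, for illustration only.
CATALOGUE = [
    IssueRecord("Example Gazette", 12, date(1901, 3, 4), 4),
    IssueRecord("Example Gazette", 11, date(1901, 2, 25), 4),
    IssueRecord("Example Courier", 7, date(1901, 3, 4), 8),
]

issues_1901 = browse(CATALOGUE, "Example Gazette", 1901)
print([r.issue_number for r in issues_1901])
```

Note that nothing in such a record describes what the articles say, which is why this approach behaves like a browsing mechanism.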

6
The OCR based Full Text Indexing Approach.
  • Automatic digitization approaches make use of OCR
    analysis of digitized newspapers. Full-text
    indexing techniques are currently considered the
    state of the art in newspaper digitization,
    mainly for the following reasons:
      • Creating a searchable full-text index via OCR
        is much faster than creating metadata
        manually.
      • Separation of searchability and readability.
      • Searches can be conducted at the page, issue
        or article level.
      • The search is conducted via keywords, in a
        manner familiar to the average user of
        contemporary Web search engines.
      • Efficient content dissemination over the Web.
  • Disadvantages
      • Well-known precision/recall issues.
      • Newspaper archives are not as chaotic as the
        Web.
      • The search of information in OCR-based
        information retrieval systems is conceptually
        blind.
      • The import process is a computationally
        expensive procedure.
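
A minimal sketch of full-text indexing over OCR output, assuming a plain in-memory inverted index. The article identifiers and OCR strings are invented, and a production system would normally rely on a dedicated search engine rather than code like this.

```python
import re
from collections import defaultdict

# Invented OCR output keyed by hypothetical article identifiers.
OCR_ARTICLES = {
    "issue-0012/article-3": "The harbour of Heraklion was reopened to trade.",
    "issue-0012/article-4": "Municipal elections were held in the town.",
    "issue-0013/article-1": "Trade in the harbour doubled after the elections.",
}


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[^\W\d_]+", text.lower())


def build_index(articles):
    """Map each word to the set of article ids whose OCR text contains it."""
    index = defaultdict(set)
    for article_id, text in articles.items():
        for word in tokenize(text):
            index[word].add(article_id)
    return index


def search(index, query):
    """Return the ids of articles containing every word of the query."""
    words = tokenize(query)
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


INDEX = build_index(OCR_ARTICLES)
print(sorted(search(INDEX, "harbour trade")))
print(sorted(search(INDEX, "commerce")))
```

The second query returns nothing even though two articles are about commerce: the index matches character strings, not concepts, which is the sense in which such retrieval is conceptually blind.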

7
The OCR based Full Text Indexing Approach.
  • Institutions adopting this approach:
      • British Library online newspaper archive
        (http://www.uk.olivesoftware.com/).
      • The Brooklyn Daily Eagle online
        (http://www.brooklynpubliclibrary.org/eagle/).
      • Northern New York historical newspapers
        (http://news.nnyln.net/).
      • Utah Digital Newspapers
        (http://www.lib.utah.edu/digital/unews/).
      • Historical newspapers in Washington
        (http://www.secstate.wa.gov/history/newspapersname.aspx).
  • To mention just a few.

8
The conceptual classification approach.
  • The conceptual classification approach overcomes
    many of the above weaknesses by enabling the user
    to perform a knowledge engineering task on the
    already digitized material via the use of
    ontologies.
  • An ontology is "the specification of one's
    conceptualization of a knowledge domain".
  • Advantages
      • Ontologies are used to express a specific
        conceptual view over the digitized material.
      • The use of top-level ontologies guarantees, to
        a certain extent, semantic interoperability
        among different archives.
      • The user may classify a document with concepts
        that do not appear in the document itself.
  • Disadvantages
      • Given the density of information in a
        newspaper, production of metadata is a
        notoriously time-consuming task (the knowledge
        engineering bottleneck).
      • It is almost impossible to manually define, in
        a timely manner, all the semantic relations or
        entities contained in even a single article.
9
The DIATHESIS Approach: a hybrid approach
  • This system attempts to implement a realistic
    conceptual classification approach by combining
    the best elements of the three approaches
    mentioned above:
      • It permits searches at the newspaper issue
        level (newspaper name, issue number,
        publication date), as in the physical
        features based approach.
      • It permits searches at the article level via
        full-text queries, as in the OCR based Full
        Text Indexing Approach.
      • It permits searches at the article level via
        the semantic relationships assigned to each
        segment.
      • It permits searches that combine all of the
        above elements.
  • The system does not attempt to create a complete
    semantic structure that includes all the semantic
    relationships and entities (Actors, Places)
    described in the text. Instead, it focuses on
    creating a coherent semantic backbone that can
    easily be enriched with semantic relations.
  • DIATHESIS uses CIDOC CRM as its underlying
    ontology.
10
Aims of DIATHESIS
  • To render the digitized newspapers searchable at
    the document and article level.
  • To exploit OCR technology in order to enable
    full-text search in a newspaper collection.
  • To combine full-text search with user-defined
    metadata-based search at the document and article
    level in order to enhance the overall precision
    of the system.
  • To provide visualization facilities and an
    ergonomic interface for:
      • The timely completion of metadata according to
        a set of predefined thesauri hierarchies.
      • The browsing of the digitized newspaper
        collection given a set of predefined thesauri
        hierarchies.
  • To deal with issues of semantic interoperability
    of digitized material (conformance to
    international standards).
  • To create a robust semantic backbone that will
    allow the full implementation of the CIDOC CRM
    model.
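
To make the precision aim concrete, the sketch below computes precision and recall for a keyword-only result set and for the same set restricted by a metadata filter. All identifiers and result sets are invented; the numbers illustrate the definitions and are not measurements of any system.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for retrieved ids against the relevant ids."""
    retrieved, relevant = set(retrieved), set(relevant)
    true_positives = len(retrieved & relevant)
    precision = true_positives / len(retrieved) if retrieved else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall


# Invented example: articles that truly answer the user's question.
relevant = {"a1", "a2"}

# Keyword search alone also returns articles that merely mention the word.
full_text_hits = {"a1", "a2", "a3", "a4"}

# Articles an annotator tagged with the subject the user has in mind.
subject_tagged = {"a1", "a2", "a5"}

# Combining both criteria keeps only articles that satisfy each of them.
combined_hits = full_text_hits & subject_tagged

print(precision_recall(full_text_hits, relevant))
print(precision_recall(combined_hits, relevant))
```

In this invented case the metadata filter removes the two false hits, so precision rises from 0.5 to 1.0 while recall stays at 1.0. With incomplete annotations the filter could also remove relevant articles and lower recall.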

11
About CIDOC
  • What is the CIDOC Conceptual Reference Model?
      • An object-oriented ontology of about 80
        classes and 130 properties for cultural and
        natural history.
      • CRM instances can be encoded in many forms:
        RDBMS, ooDBMS, XML, RDF(S), OWL.
      • Accepted as ISO 21127 in June 2005.
  • The CRM
      • Is not a metadata standard.
      • Is meant to become our language for semantic
        interoperability.
      • Is a Conceptual Reference Model for analyzing
        and designing cultural information systems.
      • Is limited to the underlying semantics of
        database schemata and document structures used
        in cultural heritage and museum documentation.
      • Does not define the terminology used to
        document these data structures.
      • Does not say what cultural institutions should
        document.
      • Aims to explain the logic of what they
        actually do document.
13
An Example Hierarchy: E70 Stuff (Thing)
14
CIDOC Example (1): Modeling an Activity
[Diagram. Classes shown: E7 Activity, E65 Creation
Event. Instances shown: Crimea Conference, February
1945. Properties shown: P7 took place at, P11
participated in, P14 performed, P67 is referred to by,
P81 ongoing throughout, P82 at some time within, P86
falls within, P94 has created.]
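
One way to read this slide's example is as a set of subject, property, object statements. The sketch below encodes a few of them as plain tuples and follows a property chain. The class and property labels are CIDOC CRM names, but the connecting nodes marked ASSUMPTION do not appear in the transcript, and the tuple encoding itself is only one of many possible representations.

```python
# Each statement is a (subject, property, object) triple.
TRIPLES = [
    ("Crimea Conference", "has type", "E7 Activity"),
    # ASSUMPTION: the time-span node and the P4 link are not in the transcript.
    ("Crimea Conference", "P4 has time-span", "time-span-1"),
    ("time-span-1", "P82 at some time within", "February 1945"),
    # ASSUMPTION: the place node is not in the transcript.
    ("Crimea Conference", "P7 took place at", "Yalta"),
]


def objects(subject, prop):
    """Return every object linked to the subject by the given property."""
    return [o for s, p, o in TRIPLES if s == subject and p == prop]


def follow(start, *props):
    """Follow a chain of properties from a start node; return the end nodes."""
    nodes = [start]
    for prop in props:
        nodes = [o for n in nodes for o in objects(n, prop)]
    return nodes


print(follow("Crimea Conference", "P4 has time-span", "P82 at some time within"))
```

The same statements could equally be stored in a relational database or serialized as RDF, in line with the encodings the CRM allows.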
15
CIDOC Example (2): Describing a composite artifact
16
CIDOC-CRM DIATHESIS implementation: Issue/Segments Relationships
17
CIDOC-CRM DIATHESIS implementation: Issue Physical Features
18
CIDOC-CRM DIATHESIS implementation: Activity References
19
Thesauri Hierarchies
20
CIDOC based newspaper annotation
[Diagram labels: CIDOC CRM Core Ontology; Integration by
Factual Relations; real-world nodes (KOS); Documents in
Digital Libraries. Nodes shown: Donald Johanson,
Discovery of Lucy, Johanson's Expedition, Lucy, Hadar,
Ethiopia, Benaki Museum.]
21
The System Architecture: Software Components
[Diagram, divided into Client Side and Server Side.
Components shown: Apache Tomcat Application Server;
Newspaper Digitization Suite; Diathesis Administrator;
Diathesis Annotation Mechanism; DIATHESIS Web Search;
Database; SIS-TMS Thesaurus Management System.]
22
The System Architecture: Workflow View
23
The user interface
  • FEATURES
      • Fully Web Based.
      • Simple to use / Easy to learn.
      • Intelligent Upload / Download Mechanism.
      • Workflow Control.
      • Data Loss Prevention Mechanism (Temporary Local
        Storage and Data Recovery).
      • Flexible and Ergonomic Completion of Metadata
        Fields.
      • Automatic Highlighting of keywords in OCR Text
        (Actors, Places).
      • Use of SVG thesauri hierarchies for the timely
        completion of Vocabulary Reserved Metadata
        fields.
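
The keyword highlighting feature can be approximated with a regular expression built from a list of known names. This is a sketch under the assumption that highlighting means wrapping whole-word matches in a marker; the `highlight` function, the marker and the sample sentence are invented and say nothing about how the suite actually implements the feature.

```python
import re


def highlight(ocr_text, terms, marker="**"):
    """Wrap each whole-word occurrence of a known term in the marker."""
    if not terms:
        return ocr_text
    # Longest terms first so multi-word names win over shorter overlaps.
    ordered = sorted(terms, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(t) for t in ordered) + r")\b",
        re.IGNORECASE,
    )
    return pattern.sub(lambda m: f"{marker}{m.group(0)}{marker}", ocr_text)


# Hypothetical vocabulary of actors and places, and an invented sentence.
KNOWN_TERMS = {"Heraklion", "Crete", "Vikelaia Municipal Library"}
text = "The Vikelaia Municipal Library in heraklion, Crete, reopened."

print(highlight(text, KNOWN_TERMS))
```

Matching is case-insensitive and keeps the original spelling of the OCR text, so an OCR variant such as "heraklion" is still marked.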

24
The user interface
[Diagram labels: DIATHESIS; End User Search Mechanism;
Annotation Mechanism; Administrator; Search for
Subjects; Search for Issues; Usage Stats; Mass Import;
System Configuration.]
25
Demonstration: Annotation Interface
26
Demonstration: End User Search Mechanism
27
Future Directions
  • Enrich the metadata creation process with
    Information Extraction Techniques.
  • Expand the suite with complementary Deep Semantic
    Annotation Capabilities (Semantic Wiki)

[Diagram: three-phase roadmap]
  • PHASE 1: Material Preprocessing Phase.
  • PHASE 2: Shallow Semantic Annotation (metadata
    production phase).
  • PHASE 3: Deep Semantic Annotation (full CIDOC
    implementation phase).
  • Other labels in the diagram: DIATHESIS Semantic
    Wiki; Information Extraction Techniques.
28
Conclusions
  • OCR is a popular new technology in newspaper
    digitization practice. However, it cannot deal
    with a number of issues on its own.
  • Deep semantic annotation via Semantic Web
    technologies is a promising future trend. CIDOC
    CRM provides the theoretical means to achieve
    this. The problem is how to implement it:
    creating the deep semantic relationships that
    exist within even a single newspaper issue is a
    time-consuming, and therefore expensive, task.
  • The DIATHESIS digitization suite encapsulates a
    digitization strategy towards the creation of a
    vast semantic network of factual relationships
    between CIDOC entities, while effectively dealing
    with the following issues:
      • Digitization and storage of newspaper
        material.
      • Rendering digitized material searchable at the
        issue and article level via metadata, thesauri
        hierarchies and full-text queries.
      • Creating a semantic backbone that can be used
        by future implementations.
  • The next step: link the DIATHESIS semantic
    backbone with a Semantic Wiki.

29
  • Thank You!

geomark@ics.forth.gr, martin@ics.forth.gr