1
THE DIATHESIS NEWSPAPER DIGITIZATION SUITE

Foundation for Research and Technology
Institute of Computer Science
Centre for Cultural Informatics

Heraklion, Crete, Greece
Martin Doerr, Georgios Markakis, Maria Theodoridou
2
About DIATHESIS

DIATHESIS is a newspaper digitization suite whose
primary purpose is the digitization,
classification and dissemination of archival
newspaper material.
It was originally used for the digitization of
the Vikelaia Municipal Library's newspaper
collection (1890-1960) in Heraklion, Crete, and
has since evolved into an independent
digitization suite.
It has also been used in other projects
(Filekpedeytiki Etairia, Athens, Greece; the
AYGHI newspaper).

3
The Problem

Historical newspapers are one of the most
significant sources of information for
researchers, due to the wealth of information
they provide regarding every aspect of everyday
political, social and intellectual life.
Access to this type of archival material is
usually obstructed by the following factors:
In order to protect the archival material from
potential damage, some archives prohibit access
to the largest part of their collection.
Direct contact with the original archival
material constitutes a potential health hazard
(due to dust and fungi).
The lack of indexes to newspapers, combined with
the vastness of the information contained in
them, makes research a very time-consuming task.
Many archives have adopted digitization of
newspapers as a straightforward method to deal
with the above problems. Digitized material is
easier to preserve and much easier to distribute
via the Web.
However, conversion of archival material into a
digital image format (e.g. JPEG, TIFF, PDF or
DJVU) does not solve the problem of rapid access
to this material.
Digitization itself is inadequate if it does not
provide the means of accessing the digitized
material in a timely and accurate manner (also
known as the searchability issue).

4
Current State-of-the-Art Newspaper Digitization
Practices

Currently there are three main approaches for
rendering newspaper archival material searchable:
The Physical Features Based Approach.
The OCR Based Full Text Indexing Approach.
The Conceptual Classification (Ontology Based)
Approach.

5
The physical features based classification
approach.

Newspapers are classified using a basic set of
metadata regarding physical features of the
original material (issue number, date of
publication, newspaper name, number of pages,
etc.).
Advantages:
Simple to implement.
Disadvantages:
The end user is unable to conduct full-text
searches at the article or issue level.
The final outcome of the digitization effort is
closer to a browsing mechanism than to a search
tool.
There is no explicitly defined conceptual
structure of the archive.
Institutions:
ANNO, the Austrian newspapers online project
(http://deposit.ddb.de/online/exil/exil.htm).
"Exilpresse digital. Deutsche Exilzeitschriften
1933-1945" project
(http://deposit.ddb.de/online/exil/exil.htm).
Denmark: Digitaliserede danske aviser 1759-1865
(http://www.statsbiblioteket.dk).

6
The OCR based Full Text Indexing Approach.

This family of automatic approaches applies OCR
analysis to the digitized newspapers. Full Text
Indexing techniques are currently considered to
be the state of the art in the area of newspaper
digitization, mainly for the following reasons:
Creation of a searchable full-text index via OCR
is a much faster process than the manual
creation of metadata.
Separation of searchability and readability.
It is possible to conduct searches at the
page/issue/article level.
The search is conducted via keywords, in a manner
that is familiar to the average user of
contemporary Web search engines.
Efficient content dissemination over the Web.
Disadvantages:
Well-known precision/recall issues.
Newspaper archives are not as chaotic as the Web.
The search for information in OCR based
information retrieval systems is conceptually
blind.
The import process is a computationally expensive
procedure.
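
The full text indexing idea on this slide can be sketched as a small inverted index over article-level OCR text. This is only a minimal illustration under stated assumptions, not the implementation of DIATHESIS or of any system named in this presentation: the article identifiers, the sample OCR text and the function names are all invented for the example.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the OCR text and split it into alphanumeric tokens."""
    return re.findall(r"\w+", text.lower())


def build_index(articles):
    """Map each token to the set of article ids whose OCR text contains it."""
    index = defaultdict(set)
    for article_id, ocr_text in articles.items():
        for token in tokenize(ocr_text):
            index[token].add(article_id)
    return index


def search(index, query):
    """Return the ids of articles containing every query keyword (AND search)."""
    tokens = tokenize(query)
    if not tokens:
        return set()
    result = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        result &= index.get(token, set())
    return result


# Hypothetical OCR output, keyed by invented "issue/page/article" identifiers.
articles = {
    "1905-03-12/p1/a1": "The municipal council met in Heraklion yesterday",
    "1905-03-12/p2/a3": "Harbour works in Heraklion delayed by storm",
    "1905-03-13/p1/a2": "Council of ministers resigns",
}
index = build_index(articles)
```

In this sketch a query for "council" returns both the municipal council article and the council of ministers article. The index matches strings, not concepts, which is the sense in which keyword search is "conceptually blind".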

7
The OCR based Full Text Indexing Approach.

Institutions adopting this approach:
British Library online newspaper archive
(http://www.uk.olivesoftware.com/).
The Brooklyn Daily Eagle online
(http://www.brooklynpubliclibrary.org/eagle/).
Northern New York historical newspapers
(http://news.nnyln.net/).
Utah Digital Newspapers
(http://www.lib.utah.edu/digital/unews/).
Historical newspapers in Washington
(http://www.secstate.wa.gov/history/newspapersname.aspx).
To mention just a few.

8
The conceptual classification approach.

The conceptual classification approach overcomes
many of the above weaknesses by enabling the user
to perform a knowledge engineering task upon the
already digitized material via the use of
ontologies.
An ontology is "the specification of one's
conceptualization of a knowledge domain".
Advantages:
Ontologies are used to express a specific
conceptual view over the digitized material.
The use of top level ontologies guarantees to a
certain extent the semantic interoperability
among different archives.
The user may classify a document using concepts
that are not contained within the document
itself.
Disadvantages:
Given the density of information in a newspaper,
production of metadata is a notoriously
time-consuming task (knowledge engineering
bottleneck).
It is almost impossible to manually define all
the semantic relations or entities contained even
in a single article in a timely manner.

9
The DIATHESIS Approach: a hybrid approach

This system attempts to implement a realistic
conceptual classification approach by combining
the best elements of the three approaches
mentioned above:
It permits searches at the newspaper issue level
(newspaper name, issue number, publication date)
in a similar manner to the physical features
based approach.
It permits searches at the article level via
full text queries, in a similar manner to the
OCR based Full Text Indexing Approach.
It permits searches at the article level via
the semantic relationships assigned to each
segment.
It permits searches that combine all of the above
elements.
The system does NOT attempt to create a complete
semantic structure that includes all the semantic
relationships and entities (Actors, Places)
described in the text. Instead, it focuses on the
creation of a coherent semantic backbone that can
be easily enriched with semantic relations.
DIATHESIS uses CIDOC CRM as its underlying
ontology.
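
The combination of issue-level, full text and semantic criteria described on this slide could look roughly like the sketch below. It is a hedged illustration only: the `Segment` schema, the field names, the `hybrid_search` function and the sample data are assumptions made for the example and are not taken from the DIATHESIS code base.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class Segment:
    """One article-level segment of a digitized issue (hypothetical schema)."""
    newspaper: str
    issue_number: int
    published: date
    ocr_text: str
    subjects: set = field(default_factory=set)  # terms assigned by an annotator


def hybrid_search(segments, newspaper=None, published_from=None,
                  published_to=None, keywords=(), subject=None):
    """Return the segments that satisfy every criterion that was supplied."""
    hits = []
    for seg in segments:
        # Issue-level (physical features) criteria.
        if newspaper is not None and seg.newspaper != newspaper:
            continue
        if published_from is not None and seg.published < published_from:
            continue
        if published_to is not None and seg.published > published_to:
            continue
        # Full text criterion: every keyword must occur in the OCR text.
        text = seg.ocr_text.lower()
        if any(word.lower() not in text for word in keywords):
            continue
        # Semantic criterion: the segment must carry the requested subject.
        if subject is not None and subject not in seg.subjects:
            continue
        hits.append(seg)
    return hits


# Invented sample data.
segments = [
    Segment("Newspaper A", 101, date(1912, 5, 2),
            "Elections announced for June", {"Politics"}),
    Segment("Newspaper A", 102, date(1912, 5, 3),
            "Olive harvest figures published", {"Agriculture"}),
    Segment("Newspaper B", 7, date(1930, 1, 15),
            "Elections postponed after protests", {"Politics"}),
]
```

Each criterion is optional, so the same function covers an issue-only search, a keyword-only search, a subject-only search, or any combination of the three.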

10
Aims of DIATHESIS

To render the digitized newspapers searchable on
a document/article level basis.
To exploit the use of OCR technology in order to
enable full text search in a newspaper
collection.
To combine full text search with user-defined
metadata based search at the document and article
level in order to enhance the overall precision
of the system.
To provide visualization facilities and an
ergonomic interface for:
The timely completion of metadata according to a
set of predefined thesauri hierarchies.
The browsing of the digitized newspaper
collection given a set of predefined thesauri
hierarchies.
To deal with issues of semantic interoperability
of digitized material (conformance to
international standards).
To create a robust semantic backbone that will
allow the full implementation of the CIDOC CRM.
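
Browsing a collection through a thesaurus hierarchy, as mentioned in the aims above, can be sketched as expanding a term to all of its narrower terms before matching annotations. The thesaurus content, the article ids and the function names below are invented for illustration, and the sketch makes no claim about how the SIS-TMS thesaurus system stores or serves hierarchies.

```python
# Hypothetical thesaurus: each term maps to its direct narrower terms.
NARROWER = {
    "Politics": ["Elections", "Parliament"],
    "Elections": ["Municipal elections"],
    "Economy": ["Agriculture", "Trade"],
}


def expand(term, narrower=NARROWER):
    """Return the term together with all of its narrower terms, at any depth."""
    found = set()
    stack = [term]
    while stack:
        current = stack.pop()
        if current in found:
            continue  # guard against accidental cycles in the hierarchy
        found.add(current)
        stack.extend(narrower.get(current, []))
    return found


def browse(term, annotations, narrower=NARROWER):
    """Return sorted ids of articles annotated with the term or a narrower term."""
    wanted = expand(term, narrower)
    return sorted(a for a, terms in annotations.items() if wanted & set(terms))


# Invented annotations: article id -> thesaurus terms assigned to it.
annotations = {
    "a1": ["Municipal elections"],
    "a2": ["Trade"],
    "a3": ["Parliament", "Economy"],
}
```

Browsing for "Politics" in this sketch also finds the article annotated only with "Municipal elections", because the hierarchy makes it a narrower term.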

11
About CIDOC

What is the CIDOC Conceptual Reference Model?
An object-oriented ontology of about 80 classes
and 130 properties for cultural and natural
history.
CRM instances can be encoded in many forms:
RDBMS, ooDBMS, XML, RDF(S), OWL.
Accepted as ISO 21127 in June 2005.
The CRM:
Is not a metadata standard.
Is meant to become our language for semantic
interoperability.
Is a Conceptual Reference Model for analyzing
and designing cultural information systems.
Is limited to the underlying semantics of
database schemata and document structures used in
cultural heritage and museum documentation.
Does not define the terminology used to document
these data structures.
Does not say what cultural institutions should
document.
Aims to explain the logic of what they actually
do document.

13
An Example Hierarchy: E70 Stuff (Thing)
14
CIDOC Example (1): Modeling an Activity
[Diagram not transcribed. Recoverable labels:
E7 Activity "Crimea Conference"; E65 Creation
Event; date "February 1945"; properties P7 took
place at, P11 participated in, P14 performed,
P67 is referred to by, P81 ongoing throughout,
P82 at some time within, P86 falls within,
P94 has created.]
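
The activity example on this slide can be approximated as a handful of subject/property/object statements. The sketch below is illustrative only and uses plain Python tuples rather than any RDF library. The class and property labels are the ones visible on the slide, with one addition: P4 has time-span is a CIDOC CRM property that does not appear among the slide labels and is assumed here to connect the activity to its time-span. The nodes `actor-1`, `actor-2`, `place-1`, `time-span-1`, `document-1` and `creation-1` are placeholders, since the diagram's actual participants, place and document were not transcribed. P81 and P86 are omitted.

```python
# Classes of the nodes whose class labels appear on the slide.
TYPES = {
    "Crimea Conference": "E7 Activity",
    "creation-1": "E65 Creation Event",  # placeholder node name
}

# Statements are (subject, property, object) tuples.
STATEMENTS = [
    ("actor-1", "P11 participated in", "Crimea Conference"),
    ("Crimea Conference", "P7 took place at", "place-1"),
    ("Crimea Conference", "P4 has time-span", "time-span-1"),  # assumed link
    ("time-span-1", "P82 at some time within", "February 1945"),
    ("Crimea Conference", "P67 is referred to by", "document-1"),
    ("creation-1", "P94 has created", "document-1"),
    ("actor-2", "P14 performed", "creation-1"),
]


def objects(subject, prop, statements=STATEMENTS):
    """Return every object linked to `subject` by the property `prop`."""
    return [o for s, p, o in statements if s == subject and p == prop]


def subjects(prop, obj, statements=STATEMENTS):
    """Return every subject linked to `obj` by the property `prop`."""
    return [s for s, p, o in statements if p == prop and o == obj]


def when(activity, statements=STATEMENTS):
    """Follow activity -> time-span -> outer date bound."""
    dates = []
    for span in objects(activity, "P4 has time-span", statements):
        dates.extend(objects(span, "P82 at some time within", statements))
    return dates
```

Queries then become short path traversals: `when("Crimea Conference")` follows two properties to reach the date, and `subjects("P94 has created", "document-1")` finds the creation event of a document that refers to the activity.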
15
CIDOC Example (2): Describing a composite
artifact
16
CIDOC-CRM DIATHESIS implementation:
Issue/Segments Relationships
17
CIDOC-CRM DIATHESIS implementation: Issue
Physical Features
18
CIDOC-CRM DIATHESIS implementation: Activity
References
19
Thesauri Hierarchies
20
CIDOC based newspaper annotation
[Diagram not transcribed. Recoverable labels:
CIDOC CRM Core Ontology; Integration by Factual
Relations; real world nodes (KOS); Documents in
Digital Libraries; Donald Johanson; Johanson's
Expedition; Discovery of Lucy; Lucy; Hadar;
Ethiopia; Benaki Museum.]
21
The System Architecture: Software Components
[Diagram not transcribed. Recoverable labels:
Client Side; Server Side; Apache Tomcat
Application Server; Newspaper Digitization Suite;
DIATHESIS Administrator; DIATHESIS Annotation
Mechanism; DIATHESIS Web Search; Database; SIS-TMS
Thesaurus Management System.]
22
The System Architecture: Workflow View
23
The user interface

FEATURES
Fully Web Based.
Simple to use / Easy to learn.
Intelligent Upload / Download Mechanism.
Workflow Control.
Data Loss Prevention Mechanism (Temporary Local
Storage and Data Recovery).
Flexible and Ergonomic Completion of Metadata
Fields.
Automatic Highlighting of keywords in OCR Text
(Actors, Places).
Use of SVG thesauri hierarchies for the timely
completion of Vocabulary Reserved Metadata fields.
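
The keyword highlighting feature listed above could be approximated with a regular expression over the OCR text. This is a minimal sketch, assuming the list of actor and place names is already available (for example from the thesauri); the `highlight` function, its `marker` parameter and the sample sentence are invented here and do not describe the actual DIATHESIS interface.

```python
import re


def highlight(ocr_text, terms, marker="**"):
    """Wrap every whole-word, case-insensitive occurrence of a term in markers."""
    if not terms:
        return ocr_text
    # Try longer terms first so a multi-word name wins over one of its parts.
    ordered = sorted(terms, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(t) for t in ordered) + r")\b",
        re.IGNORECASE,
    )
    return pattern.sub(lambda m: marker + m.group(0) + marker, ocr_text)
```

A web interface would more likely wrap matches in markup than in asterisks; the marker is kept as plain text here so that the sketch stays independent of any rendering layer.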

24
The user interface
[Diagram not transcribed. Recoverable labels:
DIATHESIS; End User Search Mechanism; Annotation
Mechanism; Administrator; Search for Subjects;
Search for Issues; Usage Stats; Mass Import;
System Configuration.]
25
Demonstration: Annotation Interface
26
Demonstration: End User Search Mechanism
27
Future Directions

Enrich the metadata creation process with
Information Extraction Techniques.
Expand the suite with complementary Deep Semantic
Annotation Capabilities (Semantic Wiki).

PHASE 1: Material Preprocessing Phase.
PHASE 2: Shallow Semantic Annotation (metadata
production phase).
PHASE 3: Deep Semantic Annotation (full CIDOC
implementation phase).
[Other diagram labels: Information Extraction
Techniques; DIATHESIS Semantic Wiki.]
28
Conclusions

OCR technology is a popular recent addition to
newspaper digitization practice. However, it is
not capable of dealing with a plethora of issues.
Deep semantic annotation via Semantic Web
technologies is a promising future trend. CIDOC
CRM provides the theoretical means to achieve
this. The problem is how to implement it.
Creating the deep semantic relationships that
exist within the boundaries of a single newspaper
issue is a time-consuming, and therefore
expensive, task.
The DIATHESIS digitization suite encapsulates a
digitization strategy towards the creation of a
vast semantic network of factual relationships
between CIDOC entities, while effectively dealing
with the following issues:
Digitization and storage of newspaper material.
Rendering digitized material searchable at the
issue/article level via the use of metadata,
thesauri hierarchies and full text queries.
Creating a semantic backbone that can be used by
future implementations.
The next step: link the DIATHESIS semantic
backbone with a Semantic Wiki.