PreScan (Preservation Scanner): towards automating the ingestion process

1
PreScan (Preservation Scanner): towards automating
the ingestion process
  • FORTH-ICS
  • Presentation by Yannis Tzitzikas, Yannis
    Marketakis

2
Outline
  • Motivation and Background
  • The architecture of PreScan
  • Scanner
  • Metadata Extractor
  • Repository Manager
  • Controller
  • Time Performance
  • Related Works and Systems
  • Future Extensions
  • Software Releases

3
Motivation
  • The creation and maintenance of metadata is a
    laborious task that does not always pay off
    immediately.
  • There is a need for tools that automate as much
    as possible the creation and curation of
    preservation metadata.
  • PreScan is a tool (developed by FORTH-ICS during
    the third year of the project) for automating the
    ingestion phase.
  • It can bind together automatically extracted
    embedded metadata with manually provided
    metadata, and dependency management services.
  • In addition it offers some features for keeping
    the metadata repository up-to-date.

4
Background
  • Metadata can be stored either
  • internally, i.e. in the same file as the data
    (these are called embedded metadata), or
  • externally, i.e. in a separate file or
    repository (these are called detached metadata)
  • Both approaches have advantages and
    disadvantages.
  • One benefit of the embedded metadata is that they
    are transferred with the data and thus their
    access and manipulation is straightforward.
    However embedded metadata can create redundancies
    and this approach does not allow holding and
    managing all metadata together.
  • On the other hand, if the metadata are detached,
    then this means that they are stored in a special
    repository. This approach has less redundancy, we
    can support efficient metadata search, and we can
    manipulate them efficiently, e.g. we can perform
    bulk metadata updates. However, the way metadata
    are linked to data should be treated with care as
    inconsistencies may arise.

5
PreScan (Preservation Scanner)
  • Components
  • Scanner: scans the file system
  • Metadata extractor: extracts the embedded
    metadata of the scanned files
  • Repository manager: stores and manages
    these metadata
  • Controller: controls the entire process and
    the metadata life-cycle

6
Component: Scanner
  • It acts like the scanner of an antivirus program
  • The user defines
  • the folders that should be scanned
  • where metadata should be stored
  • It also allows the re-scanning of projects
  • Re-scanning is aware of
  • file movements/additions
  • human-provided metadata
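
The scanning step can be pictured as a recursive walk over the folders chosen by the user. The following is a minimal Python sketch under stated assumptions, not PreScan's actual code: the function name `scan_folders` and the fields of the returned records are hypothetical and were chosen only for illustration.

```python
import os
from typing import Dict, Iterable, List


def scan_folders(folders: Iterable[str]) -> List[Dict[str, object]]:
    """Walk each folder recursively and list every regular file found.

    Hypothetical helper: the name and the record layout are illustrative
    assumptions, not part of PreScan.
    """
    found: List[Dict[str, object]] = []
    for folder in folders:
        for dirpath, _dirnames, filenames in os.walk(folder):
            for name in sorted(filenames):
                path = os.path.join(dirpath, name)
                if not os.path.isfile(path):
                    continue  # skip broken symlinks and special files
                stat = os.stat(path)
                found.append({
                    "path": os.path.abspath(path),
                    "size": stat.st_size,
                    "modified": stat.st_mtime,
                })
    return found
```

Each record could then be handed to the metadata extractor; keeping the size and modification time makes it cheap to notice unchanged files on a later re-scan.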

7
Component: Extractor
Examples of extracted metadata
  • It extracts the embedded metadata of the scanned
    files.
  • Currently it relies on JHOVE, although more
    extractors could be plugged in.

Some of the supported file-types and metadata
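
The remark that more extractors could be plugged in suggests a simple registry design. The sketch below is one possible shape for it in Python, assuming an extractor is just a function from a file path to a dictionary of metadata. `ExtractorRegistry` and `extension_extractor` are hypothetical names, and no JHOVE API is used or implied.

```python
import os
from typing import Callable, Dict, List

# Assumption: an extractor takes a file path and returns flat metadata.
Extractor = Callable[[str], Dict[str, str]]


class ExtractorRegistry:
    """Hypothetical plug-in registry that runs every registered extractor."""

    def __init__(self) -> None:
        self._extractors: List[Extractor] = []

    def register(self, extractor: Extractor) -> None:
        self._extractors.append(extractor)

    def extract(self, path: str) -> Dict[str, str]:
        merged: Dict[str, str] = {}
        for extractor in self._extractors:
            try:
                result = extractor(path)
            except Exception:
                # A crashing extractor should not abort the whole scan.
                continue
            for key, value in result.items():
                merged.setdefault(key, value)  # earlier extractors win
        return merged


def extension_extractor(path: str) -> Dict[str, str]:
    """Toy extractor that reports only the lower-cased file extension."""
    ext = os.path.splitext(path)[1].lower().lstrip(".")
    return {"extension": ext} if ext else {}
```

Catching exceptions per extractor is a design choice made here so that one failing plug-in leaves the results of the others intact.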
8
Component: Controller
  • It controls the entire process and metadata
    life-cycle.
  • It offers a re-scan option that ensures that the
    manually provided metadata will not be lost after
    the next scan
  • To this end it tries to identify the new files,
    the files that were deleted and the files that
    changed location since the previous scan.
  • The identified file movements are shown to the
    user for confirmation, so that the manually
    provided metadata stay associated with the
    updated extracted metadata
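
The slides do not say how moved files are recognised. One common approach, assumed here purely for illustration, is to compare content hashes between the previous and the current scan. In the Python sketch below, `diff_scans` is a hypothetical function, and each scan is assumed to be a dictionary from file path to content hash.

```python
from typing import Dict, List, Tuple


def diff_scans(
    previous: Dict[str, str], current: Dict[str, str]
) -> Tuple[List[str], List[str], List[Tuple[str, str]]]:
    """Compare two scans given as {path: content_hash}.

    Returns (added, deleted, moved), where moved holds (old_path, new_path).
    A file whose content changed at an unchanged path is in neither list.
    """
    gone = sorted(p for p in previous if p not in current)
    appeared = sorted(p for p in current if p not in previous)

    # Index vanished paths by hash so that a newly appeared file with the
    # same content can be proposed as a move.
    gone_by_hash: Dict[str, List[str]] = {}
    for path in gone:
        gone_by_hash.setdefault(previous[path], []).append(path)

    added: List[str] = []
    moved: List[Tuple[str, str]] = []
    for path in appeared:
        candidates = gone_by_hash.get(current[path])
        if candidates:
            moved.append((candidates.pop(0), path))
        else:
            added.append(path)

    moved_old = {old for old, _new in moved}
    deleted = [p for p in gone if p not in moved_old]
    return added, deleted, moved
```

The `moved` pairs are only proposals: in line with the slide, they would be presented to the user, and the manually provided metadata would be re-attached to the new path only after confirmation.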

9
Component: Repository Manager
More on the available choices
  • The Repository Manager is responsible for
    storing, querying and updating the metadata
    records.
  • The metadata record of a file includes both the
    extracted and the human-provided metadata.
  • There is more than one choice regarding where
    these metadata are stored. The options that are
    currently supported are listed below (they are
    not mutually exclusive)
  • (SF) For each scanned file its metadata record is
    created and stored in a Specific Folder specified
    by the user.
  • (OF) For each scanned file its metadata record is
    created and stored in the Original Folder (the
    same folder as the scanned file).
  • (KB) The contents of the metadata records of the
    scanned files are stored in a Semantic Web-based
    Knowledge Base.
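
To make the SF and OF options concrete, the Python sketch below computes where a metadata record could be written for a given file. It covers only the two file-based options. The function name `record_paths`, the `.metadata.xml` suffix and the mirrored directory layout under the specific folder are all assumptions made for illustration, not PreScan's documented behaviour.

```python
import os
from typing import List, Optional


def record_paths(
    file_path: str,
    scan_root: str,
    use_of: bool = False,
    specific_folder: Optional[str] = None,
) -> List[str]:
    """Return the locations where the record of file_path would be stored.

    Both options may be active at once, since they are not mutually
    exclusive. The naming scheme is a hypothetical choice.
    """
    record_name = os.path.basename(file_path) + ".metadata.xml"
    targets: List[str] = []
    if use_of:
        # OF: next to the scanned file.
        targets.append(os.path.join(os.path.dirname(file_path), record_name))
    if specific_folder is not None:
        # SF: mirror the file's location relative to the scan root, so that
        # files with the same name in different folders do not collide.
        rel_dir = os.path.relpath(os.path.dirname(file_path), scan_root)
        targets.append(
            os.path.normpath(os.path.join(specific_folder, rel_dir, record_name))
        )
    return targets
```

For example, with both options enabled a file `data/img/a.tif` scanned from root `data` would get one record beside it and a second one under `img` inside the specific folder.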

10
Component: Repository Manager
More on the KB choice
  • Architecture of Ontologies and Metadata

The ontology of GapManager
11
Component: Repository Manager
More on the KB choice (cont.)
12
Component: Repository Manager
RDF Exporter
(Figure: RDF export diagram with the property labels P1 is identified by,
P2 has type, P3 has note, P4 has time span, P43 has dimension,
P90 has value, P91 has unit, S2B was source for, S11B was output of)
13
PreScan Time Performance
  • It takes about 10 hours to process 100 thousand
    files (roughly 0.36 seconds per file)

14
Synopsis
  • PreScan is quite similar in spirit to the
    crawlers of Web search engines. In our case we
    scan the file system, extract the embedded
    metadata and build an index. The difference is
    that we need to support (a) more advanced
    extraction services, (b) manual addition of
    metadata, (c) more expressive representation
    frameworks for keeping and exploiting the
    metadata (i.e. Semantic Web languages), (d)
    rescans that do not start from scratch but
    exploit the previous state of the index, and (e)
    associations with external sources (e.g.
    registries).
  • In brief, PreScan can help automate the
    ingestion process for file system-based archives.

15
Related Works and Systems
16
Future Steps
  • Extensions
  • Pluggable additional Metadata Extractors (for
    recognizing more formats)
  • Key Extension
  • Flexible generation of CIDOC CRM Digital instances

17
Software Releases
  • Alpha Release (June 2009)
  • Repository Manager (SF, OF) with XML output
  • Known bugs
  • The progress bar sometimes does not progress
    (although scanning progresses)
  • For some files the extractor crashes (this is a
    bug of the extractor, i.e. JHOVE)
  • Beta Release (September 2009)
  • Generation of instances of CIDOC CRM Digital
  • That would allow browsing the KB repository
    through the GUI of GapManager
  • URL
  • http://wiki.casparpreserves.eu/bin/view/Main/PreScan

18
Developers and Contact points
  • Main Developers
  • Yannis Marketakis, Makis Tzanakis
  • Contact person
  • Yannis Tzitzikas
  • PreScan Web Page
  • http://www.ics.forth.gr/PreScan

19
Thanks for your attention