Collection- and Knowledge-Based Persistent Archives at SDSC

1
Collection- and Knowledge-Based Persistent
Archives at SDSC
  • Bertram Ludäscher
  • LUDAESCH_at_SDSC.EDU
  • San Diego Supercomputer Center
  • University of California, San Diego

Work sponsored by the National Archives and
Records Administration (NARA), the Advanced Research
Projects Agency, and NHPRC
2
Content Overview
  • Part I
  • SDSC's Persistent Archives Initiative
    (material from Reagan Moore, Deputy
    Director SDSC, Data Intensive Computing
    Environments)
  • Example/Case Study
    Wrapping Websites into XML for
    Archival (material from Valter
    Crescenzi, visiting from U Roma 3)
  • Part II
  • From Collection-Based to Self-Validating
    Knowledge-Based Archives
    (joint work with Reagan Moore and Richard Marciano)
  • Running Example: The Senate Collection

3
Data Intensive Computing Environments
  • Staff
  • Reagan Moore
  • Chaitan Baru
  • Sheau Yen Chen
  • Charles Cowart
  • Amarnath Gupta
  • George Kremenek
  • Bertram Ludäscher
  • Richard Marciano
  • Arcot Rajasekar
  • Abe Singer
  • Michael Wan
  • Ilya Zaslavsky
  • Bing Zhu
  • Students - GSRA
  • Martin Kuhl
  • Liying Sui
  • Yang Yu
  • Valter Crescenzi
  • Students - Undergrad Interns
  • Peter Shin
  • Roman Olshanowsky
  • Shabbar Tambawala
  • Pratik Mukhopadhyay
  • /- NN

4
Part I
  • Overview
  • SDSC's Persistent Archival Approach
  • Case Study
  • Wrapping Websites into XML for Archival

5
Persistent Archive Goals/Objectives
  • Manage digital objects for the life of the
    republic
  • Maintain ability to discover and access digital
    objects while the supporting hardware and
    software systems evolve

6
Example Archive Components
  • Storage system for storing the digital objects
  • e.g. HPSS (tape silos with disk cache) or SANs
    (Storage Area Networks)
  • Database for managing a collection that
    represents the digital objects
  • e.g. an object-relational DBMS
  • Web server for discovering and displaying the
    digital objects
  • e.g. CGI scripts with helper applications

7
Example Archive
  • Assyrian clay tablets
  • Provided long term storage, but limited volume
  • Characterize an archive by the bandwidth it
    provides for transporting data into the future
    and its archival capacity

8
Risks and Challenges of Persistent Archives
  • Each of the software and hardware systems may
    become obsolete
  • the storage media may degrade
  • the storage system may become obsolete
  • the database backups may become obsolete, with no
    way to recover the collection (structure)
  • the digital object formats may become obsolete,
    with no helper application that can read them

9
Good News: Persistent Archives are Possible
  • The Archivist (archival engineer) is in control
  • Archivist gets to decide on the persistence
    policies
  • how to minimize risk
  • how to minimize cost
  • when to use new technology

10
Persistent Archive Bandwidth (Migration Speed)
  • Ability/necessity to transport data is
  • (size of the archive) / (media lifetime)
  • the larger the archive and/or the shorter the
    lifetime,
  • the higher the required bandwidth
    (migration speed)
  • Example: size(archive) = 100 TB, media_lifetime =
    5 years
  • ability/necessity to migrate 20 TB/year to
    avoid data loss!
  • Clay tablets provided a long media lifetime, but
    a very small storage capacity
  • effective bandwidth was a byte/day
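
The bandwidth rule on this slide can be checked with a few lines of Python. This is only a sketch of the arithmetic; the function name `required_migration_rate` is invented here and is not part of any SDSC tool.

```python
def required_migration_rate(archive_size_tb: float, media_lifetime_years: float) -> float:
    """Return the minimum sustained migration rate in TB/year.

    Implements the slide's rule: (size of the archive) / (media lifetime).
    """
    if media_lifetime_years <= 0:
        raise ValueError("media lifetime must be positive")
    return archive_size_tb / media_lifetime_years


# The slide's example: a 100 TB archive on media that lasts 5 years.
rate = required_migration_rate(100, 5)
print(f"{rate:.0f} TB/year")  # prints "20 TB/year"
```

Halving the media lifetime or doubling the archive size doubles the rate that has to be sustained.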

11
Concept 1
  • Persistent Archive is a Migration Mechanism
  • Since the amount of data is increasing
    exponentially, the archive capacity must increase
    correspondingly
  • Migrate to new technology to get to higher
    sustainable Archive Bandwidths

12
Data Scales
  • Megabyte - one million bytes
  • Digital content of a book
  • Gigabyte - one thousand MBs
  • Terabyte - one thousand GBs
  • Digital content of a film
  • Petabyte - one thousand TBs
  • Amount of tape produced in 1994
  • Exabyte - one thousand PBs
  • Data produced per year

13
Archive Bandwidth
  • If you wait too long to migrate, you will be
    unable to read the data from the archive before
    the media degrades
  • There is a maximum capacity for any choice of
    archive media, for a given capital investment in
    media read devices

14
Archive Capacities for a 2-tape Drive System
15
SDSC Archive
  • Currently store 240 TeraBytes, with the capacity
    of the system being 500 TeraBytes
  • 16 tape drives in 3 silos
  • Migrated digital holdings through
  • three different storage systems
  • five different computer systems
  • six different types of tape media

16
Reasons for Migration
  • Avoid data loss
  • manage degradation of media
  • Minimize cost
  • minimize the number of tapes that must be managed
  • recover floor space
  • Keep pace with data growth
  • provide higher Archive Bandwidth

17
Migration Costs
  • Media costs are fixed
  • price of each new tape technology cartridge is
    the same as the previous cartridge, but the
    capacity is doubled (so far)
  • Cost is then
  • 1 + 1/2 + 1/4 + 1/8 + ... = 2 times the original
    price
  • (additional assumption: labor cost is minimized
    by using a tape robot)
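
The cost series can be illustrated numerically. The sketch below assumes what the slide states (constant cartridge price, capacity doubling with each generation) and additionally assumes a fixed archive size, so that each migration needs half as many cartridges as the one before. The function name is hypothetical.

```python
def cumulative_media_cost(generations: int) -> float:
    """Total media cost in units of the original purchase.

    Generation 1 is the original purchase (cost 1); every later
    generation costs half of the previous one.
    """
    if generations < 1:
        raise ValueError("need at least one generation")
    return sum(0.5 ** k for k in range(generations))


for n in (1, 2, 4, 10):
    print(n, cumulative_media_cost(n))
```

After four generations the total is 1.875 times the original price, and it never exceeds 2 however many migrations follow.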

18
Concept 2
  • Automation of all processes is essential if costs
    are to be minimized
  • eliminate manual manipulation of tapes
  • robots
  • eliminate manual manipulation of digital objects
  • data handling systems
  • eliminate manual discovery of digital objects
  • information catalogs

19
Data Archive
(Diagram labels: Ingest Services; Management; Access Services; Access platform; Data repositories; Ingestion platform; Interoperability Standards; Interoperability Protocols)
20
ERA Concept model
21
Concept 3
  • Persistent Archive is an Interoperability System
  • persistence requires migration over time onto new
    technology
  • while the migration occurs, a persistent archive
    must be able to interoperate with both the old
    technology and the new technology
  • employ XML-based interoperability
    (mediation) technology

22
Implicit Concepts for Persistent Archive
  • Infrastructure independence
  • Non-proprietary formatting
  • Collection management
  • Data set access
  • Authentication
  • Presentation
  • Information models
  • XML as a (meta-) information markup language
  • Example GML - Graphics markup language
  • Support for ingestion, management, access
  • Accessioning workbench, archive, access workbench

23
XML as a Standard Information Markup Language
  • XML representation of metadata attributes
  • standardization of DTDs - MOA II DTD for text
  • standardization of markup language
  • XML based representation of collection structure
  • attributes defining the physical layout of a
    schema into relational tables (foreign keys,
    attribute data types, ...)
  • XML databases: XML-organized data collections
  • commercial systems: Excelon, TAMINO, Oracle8i,
    ...
  • XML-based queries (XQuery, Quilt, XQL, XMAS, ...)
  • XML based Topic Maps
  • represent relationships between collection domain
    concepts, collection attributes
  • navigational access of intra- and
    intercollection concept spaces

24
Archival Example: E-mail Collection
  • Test of the scalability of the technology
  • archived a one-million record E-mail collection
    (1999)
  • Ingestion
  • tagged E-mail using XML syntax (6 required,
    13 optional, 1000 user-defined tags): wrapping
    of raw data
  • created description of the collection
  • aggregated E-mail into containers, stored in an
    archive
  • retrieved collection description, created
    database, and optimized for query
  • Total time was 27 hours (used 10 Mbit/sec
    Ethernet)
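
To make the tagging step concrete, here is a minimal sketch of wrapping one raw message into XML with the Python standard library. The slides do not list the required or optional tags, so the element names (`email`, `from`, `to`, `subject`, `body`) and the sample message are assumptions made for illustration only.

```python
import email
import xml.etree.ElementTree as ET

# Invented sample message; not taken from the archived collection.
RAW_MESSAGE = (
    "From: alice@example.org\n"
    "To: bob@example.org\n"
    "Subject: Test\n"
    "\n"
    "Hello Bob.\n"
)


def tag_email(raw: str, header_names=("From", "To", "Subject")) -> ET.Element:
    """Wrap selected headers and the body of a raw message in XML elements."""
    msg = email.message_from_string(raw)
    root = ET.Element("email")
    for name in header_names:
        value = msg.get(name)
        if value is not None:  # treat a missing header as an optional tag
            ET.SubElement(root, name.lower()).text = value
    ET.SubElement(root, "body").text = msg.get_payload()
    return root


print(ET.tostring(tag_email(RAW_MESSAGE), encoding="unicode"))
```

The raw bytes stay untouched; the tags only add attributes that a collection description can later refer to.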

25
Collection-Based Persistent Archive
(Diagram labels: Ingest Services; Management; Access Services; Information Repository; Attribute-based Query; Attributes, Semantics; SDLIP; Information; XML DTD; Data Handling System (SRB / FTP / HTTP); Data; Fields, Containers, Folders; Storage (Replicas, Persistent IDs); Grids; Feature-based Query; MCAT/HDF)
26
Collection-Based Persistent Archive Processes
27
What Types of Interoperability are Needed?
  • Data management (digital objects)
  • ability to work with multiple types of storage
    systems, across separate administration domains
  • Information management (attributes)
  • ability to define a collection independent of
    database choice
  • ability to migrate collection onto new databases
  • Knowledge management (relationships)
  • ability to manage relationships and high-level
    domain concepts
  • ability to map concepts to collection attributes

28
Simplest Definitions
  • Data
  • digital object, i.e., the object representation
    as a bit stream
  • Information
  • any tagged data, where tags are treated as
    information attributes
  • attributes may be tagged data within the digital
    object, or tagged data that is associated with
    the digital object
  • Knowledge
  • higher-order concepts and relationships between
    attributes
  • relationships can be procedural, temporal,
    structural, spatial, functional, ... and
    described in a Logic formalism (semantic
    networks, description logics, conceptual graphs,
    ...) which is often rule-based (e.g. Datalog,
    Frame-Logic)

30
Types of Knowledge Relationships
  • Logical / semantic
  • e.g. Digital Library cross-walks
  • Temporal / procedural
  • e.g. Workflow systems
  • Spatial / structural
  • e.g. GIS systems
  • Functional / algorithmic
  • e.g. scientific feature analysis

31
Knowledge-Based Persistent Archive (more Part
II)
(Diagram labels: Ingest Services; Management; Access Services; Knowledge or Topic-Based Query / Browse; Knowledge Repository for Rules; Relationships Between Concepts; XTM DTD; Knowledge; Rules - KQL; Topic Maps / Buckets / Model-based Access; Information Repository; Attribute-based Query; Attributes, Semantics; SDLIP; Information; XML DTD; Data Handling System (SRB / FTP / HTTP); Data; Fields, Containers, Folders; Storage (Replicas, Persistent IDs); Grids; Feature-based Query; MCAT/HDF)
32
Creating Archivable Forms
  • Archivable form for digital objects
  • infrastructure-independent/self-describing...
  • ... mechanism for describing the digital object
    format
  • ... mechanism for displaying the digital object
    based upon the digital object description
  • no proprietary formats
  • information content directly tagged
  • based on XML (for information tagging,
    interoperability, ...)
  • Example Archival of Web information

33
Wrapping Websites into XML for Archival
  • ... based on material by

Valter Crescenzi
Visiting Scholar, San Diego Supercomputer Center,
University of California, San Diego
crescenz_at_sdsc.edu
Dipartimento di Informatica ed Automazione,
Universita' di Roma Tre
crescenz_at_dia.uniroma3.it
34
Outline
  • General issues for archival of websites
  • Specific aspects of archival of websites
  • gathering information
  • data extraction
  • Extraction of information content from web pages
  • first results
  • effort needed to extract information from web
    pages
  • recommendations to minimize wrapping effort

35
Website Archival Issues
  • How to access?
  • locally ( /htdocs/ )
  • remotely ( http:// )
  • What to archive?
  • a binary image of the site
  • (explicit) information content only
  • behavior?
  • as is (e.g., existing cgi scripts)
  • equivalent functionality

36
Gathering Information
  • Def.: A website is completely crawlable if all
    pages can be automatically copied to a local
    archive
  • Fact: Many (most?) websites are not completely
    crawlable!
  • example: service-oriented websites
    (maps.yahoo.com, ... )
  • Many mirroring tools are available if the site is
    crawlable, i.e. through index pages
  • If the site's URLs can be enumerated, a
    specialized mirroring tool may easily be developed

37
Focus on Crawlable Data-intensive Websites
(Diagram: websites placed by Data-Centricity / Structuredness (low to high) against Complexity Of Applications (low to high). Categories: Data-intensive Sites; Web-Based Information System; Service-Oriented Sites; Web-Presence Sites. Examples shown: Senate, NeuronDB, www.amazon.com, www.hotmail.com, maps.yahoo.com, GridPortal)
38
Information Extraction Approach
  • Information content may be extracted from
    documents through dedicated software modules
    called wrappers

(Diagram: Web pages in (X)HTML → wrapper → information content, e.g. in XML)
39
Extraction of Information Content
  • Wrappers
  • brittle
  • difficult to maintain
  • mean time to failure quite short!
  • manually coded or automatically generated
  • manually: very expensive, but the only possibility
    if sources lack structure
  • automatically: cheap, but needs structure
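
A hand-coded wrapper in miniature may help show why wrappers are brittle. The sketch below is not the Minerva formalism used in the case study; it is a plain regular-expression wrapper over an invented page layout, and the output element names are assumptions.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical page layout, invented for illustration only.
SAMPLE_HTML = """
<html><body>
<b>S. 345</b><br>
Sponsor: <a href="#">Sen Allard</a> (introduced 02/03/1999)<br>
</body></html>
"""

# The pattern hard-codes the layout: bold bill number, then a sponsor link,
# then the introduction date in parentheses.
BILL_PATTERN = re.compile(
    r"<b>(?P<number>[^<]+)</b>.*?"
    r"Sponsor:\s*<a[^>]*>(?P<sponsor>[^<]+)</a>\s*"
    r"\(introduced\s+(?P<date>[\d/]+)\)",
    re.DOTALL,
)


def wrap_bill(html: str) -> ET.Element:
    """Extract the information content of one page as an XML element."""
    match = BILL_PATTERN.search(html)
    if match is None:
        # Any layout change lands here: the wrapper fails instead of guessing.
        raise ValueError("page does not match the expected layout")
    bill = ET.Element("bill", {"number": match.group("number").strip()})
    ET.SubElement(bill, "sponsor").text = match.group("sponsor").strip()
    ET.SubElement(bill, "date_introduced").text = match.group("date")
    return bill


print(ET.tostring(wrap_bill(SAMPLE_HTML), encoding="unicode"))
```

Renaming "Sponsor:" to "SPONSOR:" in the page is enough to make this wrapper raise, which is the short mean time to failure the slide refers to.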

40
Case Study for Extraction of Information CONTENT
The "Senate Web Site"
  • Bill Status
  • Summary Info
  • Site is crawlable
  • through URL enumeration
  • through indices

41
Sample XML Output File
42
Why Wrapping HTML → XML?
  • What is being archived?
  • presentation (maybe)
  • content (most likely)
  • Technology/infrastructure independence →
    "better" persistent archival format
  • XML can express information model and schema
    information (while HTML only provides fixed
    structural and presentational information)

43
Example Minerva Wrapper Specification
S out.println("encoding'ISO-8859-1' ?")
out.println("'BillSummary.dtd'") out.println("")
Bill\sSummary\s\sStatus\sfor\st
he\sCongress\sCongress

BillNumber
((nbsp)"("")")?
out.println(" "Congress""
) out.println("
"BillNumber"")
(NewStyle OldStyle)
TitleList Status ...
44
Steps of the Manual Wrapping Process
  • Initial wrapper specification has been derived
    from a few sample bills from the 106th congress
  • Refinement using more samples (including from
    other congresses)
  • changes in the HTML layout starting from the 105th
    have been discovered!!
  • many manual fixes were needed (irregularities in
    the structure)
  • use of a random bill URL generator for testing
    the wrapper: two more fixes

45
Changes in the HTML Layout
H.R.4236 (Major Legislation)
Public
Law 104-333 (11/12/96)
SPONSOR href""Rep Young, D. (introduced
09/27/96)
HTML code from a bill of the 104th congress
H.J.RES.25
Sponsor href""Rep Livingston, Bob (introduced
1/9/1997)
Latest Major Action 2/3/1997
Became Public Law No 105-1.

Title Making technical corrections to the
Omnibus Consolidated Appropriations
Act, 1997 (Public Law 104-208), and
for other purposes.
Corresponding code from a bill of the 105th
congress
46
Level of Development Effort (Senate Website)
  • Manual Approach
  • Around 350 lines of wrapper code
  • 250 lines of grammar specification
  • 100 lines to specify output format
  • One full day to write the specification
  • One full day to test and debug it
  • Automatic Approach
  • automatic wrapper generator fails!

47
Wrapping a Structured Website NeuronDB
  • NeuronDB is a well structured site which
    presents information content of an underlying
    database about neurons
  • A wrapper has been manually coded for comparing
    the efforts needed to wrap this structured site
    and the Senate site
  • The automatic wrapper generator was able to
    successfully build a wrapper without user
    interaction

49
Level of Development Effort (Neuron DB)
  • Manual Approach
  • Around 220 lines of code
  • 140 lines of grammar specification
  • 80 lines to specify output format
  • Half day to write the specification
  • The wrapper extracts all information content
  • Automatic Approach
  • automatic wrapper generator succeeds

50
Automatic Wrapper Generation Succeeds for NeuronDB
  • The wrapper generator toolkit is part of an
    ongoing project at the Terza Universita di
    Roma called RoadRunner
  • The wrapper has been automatically generated
    looking at similar pages without user
    interactions
  • wrapper generation takes a few seconds

51
NeuronDB: the inferred schema
  • A common schema (expressed as regular expression)
    is inferred for input pages

A B ( C ) ( D ( ( E F ( G )? ) )? ( H ( I
)? ) ( J K ( L )? ) ) ( M N )
  • The schema is enriched with the extraction
    rules needed to actually wrap sources
  • This is a kind of physical schema of the HTML
    layout, not a logical schema

52
NeuronDB Result of the data extraction
53
Results (1)
  • Make sites crawlable for remote archival
  • e.g. archival backdoor
  • Extracting information from web sites may be very
    expensive depending on the (ir)regularities of
    pages
  • The more regular the structure, the cheaper
    the writing of wrappers
  • Automatic approaches are feasible for well
    structured web pages!

54
Results (2)
  • Well-Structured Sites
  • not only for data archival but also to minimize
    cost for web site maintenance and management
  • XHTML can help (simplifies XHTML → XML)
  • Separation of content and presentation
  • XML + XSL(T)

55
References (1)
  • G. Mecca, P. Atzeni. Cut and Paste. Journal of
    Computer and System Sciences, Special Issue on
    PODS'97, 1999
  • DOM: The Document Object Model.
    http://www.w3.org/DOM/
  • D. W. Embley, D. M. Campbell, Y. S. Jiang, S. W.
    Liddle, Y. Ng, D. Quass, and R. D. Smith.
    A conceptual-modeling approach to extracting
    data from the web. In ER'98.
  • N. Kushmerick. Wrapper Induction: Efficiency and
    expressiveness. Artificial Intelligence, 118, 2000
  • V. Crescenzi, G. Mecca. Grammars Have Exceptions.
    Information Systems, Special Issue on
    Semistructured Data, 1998

56
References (2)
  • B. Adelberg. NoDoSE: a tool for
    semi-automatically extracting structured
    and semistructured data from text
    documents. In SIGMOD'98.
  • The Neuron DB Web Site:
    http://senselab.med.yale.edu/senselab/NeuronDB/
  • S. Grumbach, G. Mecca. In Search of the Lost
    Schema. In Proceedings of Intern. Conference on
    Database Theory (ICDT'99), 1999
  • The Senate Web Site: http://thomas.loc.gov/
  • The Tidy Utility:
    http://www.w3.org/People/Raggett/tidy/
  • The W3C XHTML activity: http://www.w3.org/MarkUp/
  • Extensible Markup Language (XML):
    http://www.w3.org/XML/
  • Extensible Stylesheet Language (XSL):
    http://www.w3.org/Style/XSL/

57
Overview
  • Part I
  • SDSC's Persistent Archives Initiative
    (material from Reagan Moore, Deputy
    Director SDSC, Data Intensive Computing
    Environments)
  • Example/Case Study
    Wrapping Websites into XML for
    Archival (material from Valter
    Crescenzi, visiting from U Roma 3)
  • Part II
  • From Collection-Based to Self-Validating
    Knowledge-Based Archives
    (joint work with Reagan Moore and Richard Marciano)
  • Running Example: The Senate Collection

58
WARM UP XML (eXtensible Markup Language)
  • origins: HTML + SGML (ISO Standard, 1986, 600 pp)
  • W3C standard (26 pp): XML syntax + DTDs
  • XML = HTML - presentational tags
    + user-defined DTD (tags + nesting)
  • a metalanguage for defining other languages
    (e.g. via DTDs)
  • XML is more like SGML than HTML
  • XML = SGML - complexity, document perspective
    + simplicity, data exchange perspective

59
XML as a Self-Describing Data Exchange Format
  • can be easily understood by our friend (...
    even using CP/M edlin)
  • can be parsed easily
  • contains its own structure (parse tree) in the
    data
  • allows the application programmer to
    rediscover schema and content/semantics (to
    which extent???)
  • may include its own schema description (e.g.,
    DTD, XML Schema)
  • meta-language definition of specific
    languages (XYZ-ML)
  • allows separation of marked-up content from
    presentation (style sheets)
  • many tools (and many more to come -- (re)use
    code) parsers, validators, query languages,
    storage,
  • standards (good for interoperation, integration,
    etc)
  • generic standards (XML, DTDs, XML Schema,
    XPath,...)
  • community/industry standards (specific markup
    languages)

60
Mind Your Vocabulary: Identifying Vocabularies
with XML Namespaces
  • My element may not be your element
  • geometry context: line
  • chemistry context: oxygen
  • SGML/XML context: ....
  • use XML namespaces to identify the vocabulary
  • ... when I say semantics, I can make clear
    whether I am talking as a logician (needs
    additional specification: mathematical logic,
    philosophy, AI, ...) or a linguist, or a
    psychologist, etc.

61
XML Namespaces
  • mechanism for globally unique tag names
  • xmlns:h="http://www.w3.org/HTML/1998/html4"
  • Book Review
  • ...
  • XML: A Primer
  • ...
  • mix of different tag vocabularies without
    confusion
  • namespaces only identify the vocabulary
    additional mechanisms required for structure and
    meaning of tags
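
As a small illustration of mixing vocabularies, the sketch below parses a document in which two different `title` elements coexist. The HTML namespace URI is the one shown on the slide; the second URI (`http://example.org/reviews`), the prefix `r`, and the document itself are assumptions made for this example.

```python
import xml.etree.ElementTree as ET

# Two vocabularies both define "title"; the namespace prefix keeps them apart.
DOC = """
<h:html xmlns:h="http://www.w3.org/HTML/1998/html4"
        xmlns:r="http://example.org/reviews">
  <h:title>Book Review</h:title>
  <r:title>XML: A Primer</r:title>
</h:html>
""".strip()

NS = {
    "h": "http://www.w3.org/HTML/1998/html4",
    "r": "http://example.org/reviews",  # hypothetical review vocabulary
}

root = ET.fromstring(DOC)
html_title = root.find("h:title", NS).text
review_title = root.find("r:title", NS).text
print(html_title, "|", review_title)
```

The parser only tells the two `title` elements apart; what each one means still has to come from a DTD, a schema, or documentation, as the last bullet notes.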

62
Information Hierarchy (Simplest Definitions)
  • Data
  • digital object, i.e., the object representation
    as a bit stream
  • Information
  • any tagged data, where tags are treated as
    information attributes
  • attributes may be tagged data within the digital
    object, or tagged data that is associated with
    the digital object
  • Knowledge
  • higher-order concepts and relationships between
    attributes
  • relationships can be procedural, temporal,
    structural, spatial, functional, ... and
    described in a Logic formalism (semantic
    networks, description logics, conceptual graphs,
    ...) which is often rule-based (e.g. Datalog,
    Frame-Logic)

63
Types of Knowledge Relationships
  • Logical / semantic digression
    semantics ?semantics
  • e.g. Digital Library cross-walks
  • Temporal / procedural
  • e.g. Workflow systems
  • Spatial / structural
  • e.g. GIS systems
  • Functional / algorithmic
  • e.g. scientific feature analysis

64
Knowledge-Based Persistent Archive
(Diagram labels: Ingest Services; Management; Access Services; Knowledge or Topic-Based Query / Browse; Knowledge Repository for Rules; Relationships Between Concepts; XTM DTD; Knowledge; Rules - KQL; Topic Maps / Buckets / Model-based Access; Information Repository; Attribute-based Query; Attributes, Semantics; SDLIP; Information; XML DTD; Data Handling System (SRB / FTP / HTTP); Data; Fields, Containers, Folders; Storage (Replicas, Persistent IDs); Grids; Feature-based Query; MCAT/HDF)
65
Data Handling System
(Diagram: SDSC Storage Resource Broker and Meta-data Catalog. Labels: Application; Resource; Third-party copy; User; Remote Proxies; MCAT; Dublin Core; DataCutter; Application Meta-data)
66
Ingestion Processes for Collection Creation
(Diagram labels: Accession Template; Closure; Concept/Attribute; Attribute Inverse Indexing; Information Generation; Knowledge Generation; Attribute Tagging; Attribute Selection; Occurrence Tagging; View Management; Data Organization; Collection)
67
Running Example Senate Collection
  • from the 106th US Congress database
  • keeps track of Senate bills, resolutions, and
    amendments
  • raw format: 99 RTF (Rich Text Format) files on
    CD-ROM (provided by NARA)
  • one file per senator

68
Examples of Implied Knowledge: Senate Legislative
Activities
  • Structural knowledge
  • Pertinent information embedded in document
    headers
  • Procedural knowledge
  • Naming convention
  • Senator represented by last name
  • Senator represented by last name and state
  • Senator represented by last name, first name, and
    state
  • Collection knowledge
  • Referenced senators include senators no longer in
    the senate

69
Knowledge Generation
  • Accessioning Template
  • Defines the concepts under which the data objects
    will be tagged and organized
  • Attribute selection
  • Define the attributes that represent the
    information content associated with the domain
    concepts
  • Tag attributes using minimal constraint language,
    such as XML or XMLSchema
  • Evaluate closure of mined attributes compared to
    expected attributes
  • Refine concept map

70
Information Generation
  • Create occurrence index
  • (Occurrence, attribute, value)
  • This is needed to be able to recreate original
    form of digital object
  • Analyze completeness of information
  • Inverse index of attribute values
  • Identifies unexpected values - consistency
  • Analyze closure of collection
  • Are additional attributes needed to represent
    inverse index value ranges?
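
The occurrence index and the inverse index of attribute values can be sketched in a few lines. The triples below are invented sample data in the (occurrence, attribute, value) shape the slide names; the helper names are hypothetical.

```python
from collections import defaultdict

# Invented sample triples: (occurrence id, attribute, value).
occurrences = [
    (1, "sponsor", "Allard"),
    (2, "sponsor", "Allard"),
    (3, "sponsor", "Chafee"),
    (4, "date_introduced", "02/03/1999"),
]


def build_inverse_index(triples):
    """Map attribute -> value -> list of occurrence ids."""
    index = defaultdict(lambda: defaultdict(list))
    for occurrence_id, attribute, value in triples:
        index[attribute][value].append(occurrence_id)
    return index


def unexpected_values(index, attribute, expected):
    """Values of an attribute that fall outside the expected set."""
    return sorted(set(index[attribute]) - set(expected))


index = build_inverse_index(occurrences)
print(unexpected_values(index, "sponsor", {"Allard"}))
```

Listing the distinct values per attribute is what exposes unexpected values, and the occurrence ids point back to where each one appeared in the original objects.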

71
Data Organization
  • Archive preferred views of collection
  • Original data
  • XML tagged representation
  • Minimal representation of consolidated
    information
  • Noise-free version based upon occurrence tags
  • Object-relational database version
  • Archive occurrence tagged view
  • Archive ingestion procedures that transform
    collection from the original digital objects to
    the preferred views

72
Information Management Projects
  • Digital Libraries
  • NSF Digital Library Initiative, Phase II - UCSB,
    Stanford
  • Digital Embryo digital library - GMU
  • NPACI Digital Sky - Caltech 2MASS sky survey
  • CDL - AMICO
  • NSF NSDL - UCAR / DLESE
  • Grid Environments
  • NASA Information Power Grid - NASA Ames
  • DOE Data Visualization Corridor - LLNL
  • DOE Particle Physics Data Grid - Stanford,
    Caltech
  • NSF Grid Physics Network - U Fl
  • Persistent Archives
  • NARA Persistent Archive
  • NHPRC - Scalable archives

73
Research and Development Activities - FY00
  • Demonstration of scalable systems
  • Expansion of persistent archive Framework
  • Knowledge-based persistent archives
  • Demonstration of archivable forms for new types
    of data
  • Web, GIS, compound documents, collections
  • Knowledge and anomaly processing
  • Tightness of fit of XML DTDs
  • Self validating archives as a preservation
    strategy

74
Research Challenges
  • Infrastructure independence
  • Progress on archivable form creation
  • Digital paper
  • Finding aids for a million collections
  • Concept spaces that support identification of
    collection
  • Product authentication
  • Tracking all updates, movements, media
    migrations, collection instantiations
  • Choice of Archival Markup Language
  • Tracking of E-commerce implementations
  • Knowledge management systems
  • Workflow, ingestion processing steps, system
    evolution procedures, finding aid concept spaces

75
Digital Archives
  • Problem
  • How to achieve long-term preservation of
    information (for the archivist: records) and
    sustained access?
  • Challenges and Opportunities
  • fight archive obsolescence in the presence of
  • rapidly changing storage, data formats, software
    environment, hardware, ...
  • Approaches
  • Time out (do nothing: assume hardware,
    software, data formats, etc. all work 400 years
    from now ...)
  • Emulation (emulate hardware and software
    infrastructure)
  • Migration (migrate to new infrastructure)
  • Factors
  • What do you need to archive? (records, data,
    programs, ...?)
  • determine usefulness and cost of emulation
    vs. migration
  • archival of electronic records → data-centric
    migration

76
What is it That We Try to Archive??
  • What constitutes a record?
  • beats me...
  • but there are hierarchies of information /
    abstractions
  • data ... information ... knowledge ...
    wisdom?
  • instance ... schema ... model ... metamodel ...
    metametamodel ...
  • object serialization ... data structure ... data
    model ... meta model
  • What is the nature of the information?
  • data ... functions/programs
  • extensional data ... intensional/virtual/derived
    data (facts/rules)
  • Managing complexity using layers
  • protocol stacks (e.g. ISO/OSI, SemanticWeb,
    Semantic Mediation)
  • going up: abstract, correlate, aggregate,
    index, ... the lower levels

77
Archival Processes and Functions
  • Data Submission/Accessioning
  • loop: information producer ↔ archival
    engineer (ok: archivist)
  • Ingestion
  • a sequence of information-preserving
    transformations is applied to submitted "raw
    data": the ingestion network
  • Migration
  • ... as time goes by ...
  • ... migrate to new physical media, maybe data
    formats, information model ...
  • "easy migration" = "good" archival format +
    model
  • Instantiation/Access
  • revive/reanimate the archive: queryable
    collection/database
  • Goal: preserve information!
  • (ok: just records ...)

78
Archival Example Senate Collection
  • What you see
  • is maybe NOT what you get (a not so well
    documented format)

79
Senate Collection Example
  • Rich Text Format (a documented Microsoft
    format)

\pard\parM \pard\b S. 345\b0\parM
\pard\qr DATE INTRODUCED 02/03/1999\parM
\pard SPONSOR Allard\parM \i\qc OFFICIAL
TITLE\i0\parM \pard A bill to amend the Animal
Welfare Act to remove the limitation that permits
\ interstate movement of live birds, for the
purpose of fighting, to States in which \ animal
fighting is lawful.\parM \i\qc LATEST
STATUS\i0\par\pardM \pard\plain
\fi-1900\li1900\nowidctlpar\adjustrightFeb 3,
1999\tab Read twice and\ referred to the
Committee on Agriculture.\parM \pardM
  • can be wrapped into XML

S. 345 bold"off"DATE INTRODUCED 02/03/1999 bold"off"SPONSOR Allard bold"off" italic"off"OFFICIAL TITLE bold"off" italic"off"A bill to amend the
Animal Welfare Act to remove the lim\ itation
that permits interstate movement of live birds,
for the purpose of fighting\ , to States in which
animal fighting is lawful. bold"off" italic"off"LATEST STATUS
Feb 3, 1999tabRead twice and
referred to the Committee on Agriculture\ .g
80
Senate Collection Example
  • the XML can be lifted from the presentation
    level

S. 345 bold"off"DATE INTRODUCED 02/03/1999 bold"off"SPONSOR Allard bold"off" italic"off"OFFICIAL TITLE bold"off" italic"off"A bill to amend the
Animal Welfare Act to remove the lim\ itation
that permits interstate movement of live birds,
for the purpose of fighting\ , to States in which
animal fighting is lawful. bold"off" italic"off"LATEST STATUS
Feb 3, 1999tabRead twice and
referred to the Committee on Agriculture\ .g
  • to the information level


SENATE AGRICULTURE
02/03/1999te_introduced
Feb 3,
1999
Read twice and referred to the
Committee on Agriculture

A bill to amend the Animal
Welfare Act to remove the limitation that permits
interstate movement of live birds, for the
purpose of fighting, to States in which animal
fighting is lawful.
Allard, Wayne CO

81
XML as an Archival Format
  • Information level schema as an XML DTD

bills (bill) committees?, congressional_record?, cosponsors?,
date_introduced?,
digest?, latest_status_list?, official_title?,
sponsor?, statement_of_purpose?,
submitted_by?, submitted_for?) bill_name CDATA REQUIRED (committee) (cosponsor) T latest_status_list (latest_status) latest_status (ls_date, ls_txt) abstract (PCDATA) (PCDATA) (PCDATA) T co_name (PCDATA) CDATA IMPLIED (PCDATA) (PCDATA) (PCDATA)
82
Open Archival Information System (OAIS)
Information Model
  • An AIP (archival information package) contains
  • content information (CI) (represented as
    info_objects), and
  • preservation description information (PDI)
  • (A)IP: (archival) information package
  • DI: descriptive information
  • PI: packaging information (ISO-9660 for CD
    directories)
  • CI: content information
  • PDI: preservation description information
  • PR: provenance (origin, processing history)
  • CON: context (relation to external information)
  • REF: reference (identifies the CI, e.g., ISBN, URI)
  • FIX: fixity (e.g., checksum over CI)
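
A minimal sketch of an AIP with its preservation description follows. The class and field names are invented for illustration and are not defined by OAIS; the choice of SHA-256 for the fixity checksum is likewise an assumption, since the slide only says "checksum over CI".

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class PreservationDescription:
    provenance: str  # PR: origin, processing history
    context: str     # CON: relation to external information
    reference: str   # REF: identifier of the content information
    fixity: str      # FIX: hex SHA-256 digest of the content (assumed algorithm)


@dataclass(frozen=True)
class ArchivalInformationPackage:
    content: bytes                 # CI: content information
    pdi: PreservationDescription   # PDI


def make_aip(content: bytes, provenance: str, context: str, reference: str):
    """Build an AIP and record a checksum over its content information."""
    digest = hashlib.sha256(content).hexdigest()
    pdi = PreservationDescription(provenance, context, reference, digest)
    return ArchivalInformationPackage(content, pdi)


def verify_fixity(aip: ArchivalInformationPackage) -> bool:
    """Recompute the checksum and compare it with the recorded fixity value."""
    return hashlib.sha256(aip.content).hexdigest() == aip.pdi.fixity


aip = make_aip(b"S. 345 ...", "RTF files on CD-ROM", "106th Congress", "S. 345")
print(verify_fixity(aip))
```

Re-running `verify_fixity` after each media migration is one way to detect that the content bits changed along the way.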

83
Archival Ingestion Networks
Transformation t is information preserving if
it is reversible, i.e., if there is an inverse
t_inv such that for all d in dom(t): t_inv(t(d)) = d.
  • Example
  • d1, d2, ... (HTML) → wrapper → d1', d2', ...
    (XML)
  • d1', d2', ... (XML) → inverse wrapper (XSLT) →
    d1, d2, ... (HTML)
  • asking for exact inverse often not practical
  • consider e.g. normalized HTML or restrict to
    higher level representations
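The reversibility condition, and the weaker "up to normalization" variant, can be tested on sample documents. The Python sketch below uses toy stand-ins: wrap, unwrap, normalize, and the sample strings are hypothetical and do not correspond to the actual wrappers used for the collection.

```python
def normalize(html):
    """Collapse runs of whitespace; a stand-in for 'normalized HTML'."""
    return " ".join(html.split())

def wrap(html):
    """Toy wrapper t: drops layout whitespace and renames a tag."""
    return normalize(html).replace("<b>", "<name>").replace("</b>", "</name>")

def unwrap(xml):
    """Toy inverse wrapper t_inv: maps the tag back."""
    return xml.replace("<name>", "<b>").replace("</name>", "</b>")

def preserves(t, t_inv, docs, canon=lambda d: d):
    """Check canon(t_inv(t(d))) == canon(d) for every sample document d."""
    return all(canon(t_inv(t(d))) == canon(d) for d in docs)

docs = ["<b>Allard,   Wayne</b>\n", "<b>Chafee, John</b>"]

exact = preserves(wrap, unwrap, docs)               # exact inverse
modulo = preserves(wrap, unwrap, docs, normalize)   # inverse up to normalization
print(exact, modulo)
```

Here the exact check fails because the wrapper discards layout whitespace, while the check on normalized documents succeeds. A test on samples can only refute reversibility, not prove it for all of dom(t).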

84
Ingestion Network: Senate Collection
85
From XML-Based to Knowledge-Based Archives
  • Collection-based archival with XML: save data "as
    is" plus...
  • ... separate content from presentation
  • ... tag your data (take a lift in the info
    hierarchy)
  • ... use a self-describing, semistructured data
    format (XML)
  • Knowledge-based archival: now add ...
  • ... conceptual level information
  • ... integrity constraints
  • ... explanations/derivation rules
  • archiving only results y = f(x) vs. archiving the
    rules/function "f" (e.g. f = the
    Florida procedure...)
  • employ knowledge representation languages

86
Knowledge-Based Archival: Senate Example
  • Data provider says
  • Please archive all records of legislative
    activities of the 106th Senate!
  • Integrity constraints, e.g.
  • (1) senators_with_file = UNION(sponsor,
    cosponsors, submitted_by)
  • (2) senators = sponsors ∪ co-sponsors
  • Violation
  • the rhs is a SUPERSET of the lhs!
  • Exceptions
  • (Chafee, John), (Gramm, Phil), (Miller, Zell)
  • (Possible) Explanations
  • senators who joined (Miller), passed away (Chafee),
    were forgotten (Gramm)!?
  • Checking ICs
  • IF sponsor(X), not senator(X) THEN
    ADD(exception_log, missing_senator_info(X))
  • IF condition THEN action
  • Action: LOG, WARN, ABORT, ...
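The IF condition THEN action rule can be sketched as a set difference in Python. The function name check_sponsors, the entry format, and the toy data sets are assumptions for illustration; only the three exception names and the sponsor Allard appear in the slides.

```python
def check_sponsors(senators, sponsors, action="LOG"):
    """IF sponsor(X), not senator(X) THEN act on missing_senator_info(X)."""
    exception_log = []
    for x in sorted(sponsors - senators):
        entry = ("missing_senator_info", x)
        if action == "ABORT":
            raise ValueError(f"integrity constraint violated: {entry}")
        exception_log.append(entry)
        if action == "WARN":
            print("warning:", entry)
    return exception_log

# Toy data: senators for whom a file exists vs. names seen as sponsors.
senators_with_file = {"Allard, Wayne"}
sponsors = {"Allard, Wayne", "Chafee, John", "Gramm, Phil", "Miller, Zell"}

log = check_sponsors(senators_with_file, sponsors)
print(log)
```

With action set to "ABORT" the first violation stops ingestion, which corresponds to treating the constraint as hard rather than as something to log and explain later.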

87
Maximizing Self-Containedness ...
  • Self-validating archives add ...
  • ... "executable knowledge" (rules)
  • "helping (bugging?) the data provider"
  • add the functionality and meaning of DTD
    (+ Schema + IC ...) validation to the AIP
  • package the validator!
  • Self-instantiating archives add ...
  • ... "executable ingestion process"
  • helping the archival engineer (aka archivist)
  • here is looking over your shoulder
  • add the functionality of database
    transformations to the AIP
  • package the transformers!
  • BUT packaging validators and transformers
    increases infrastructure dependence!

88
Maximize Self-Containedness ...While
Minimizing Infrastructure Dependence
  • Basic Idea use a language of executable
    specifications for self-validation and
    self-instantiation!
  • Use Bootstrapping for Self-Validating
    Self-Instantiating Archives
  • Example: DTD Validator in Logic (F-Logic,
    Datalog, ...)
  • specify
  • false IF PX, not (P1.X)Y.
  • false IF PX, not (P2.X)Y.
  • false IF PX, not P_- _.
  • false IF PXN-_, not N1, not N2.
  • ...
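Rules of the form "false IF ..." are denials: an archive is valid exactly when no rule derives false. The Python sketch below imitates that style over a flat set of facts. It does not reproduce the F-Logic rules of the slide. The fact layout (node id, tag, parent id), the two rule functions, and the sample facts are assumptions; the content model for latest_status is taken from the Senate DTD.

```python
# Facts: (node_id, tag, parent_id) triples describing an XML tree.
# The root has parent_id None.

def child_tags(facts, node):
    """Tags of all direct children of the given node."""
    return [tag for (_, tag, parent) in facts if parent == node]

def rule_unknown_child(facts, model):
    """false IF a node has a child that its parent's content model does not allow."""
    tags = {nid: tag for (nid, tag, _) in facts}
    return [("unknown_child", nid) for (nid, tag, parent) in facts
            if parent is not None and tag not in model.get(tags[parent], set())]

def rule_missing_child(facts, required):
    """false IF a node lacks a child that its content model requires."""
    return [("missing_child", nid, need)
            for (nid, tag, _) in facts
            for need in sorted(required.get(tag, set()))
            if need not in child_tags(facts, nid)]

def validate(facts, model, required):
    """Collect every derivation of 'false'; an empty list means valid."""
    return rule_unknown_child(facts, model) + rule_missing_child(facts, required)

MODEL = {"latest_status_list": {"latest_status"},
         "latest_status": {"ls_date", "ls_txt"}}
REQUIRED = {"latest_status": {"ls_date", "ls_txt"}}

facts_ok = [(1, "latest_status_list", None), (2, "latest_status", 1),
            (3, "ls_date", 2), (4, "ls_txt", 2)]
facts_bad = [(1, "latest_status_list", None), (2, "latest_status", 1),
             (3, "ls_date", 2), (4, "sponsor", 2)]

ok = validate(facts_ok, MODEL, REQUIRED)
violations = validate(facts_bad, MODEL, REQUIRED)
print(ok)
print(violations)
```

Because the rules are plain data plus small functions, they could be packaged alongside the facts they check, which is the point of a self-validating archive.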

89
XML Extensions as General Constraint Languages
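  • Assume an archival language $A$ for IPs (e.g.
    $A = \mathrm{XML}$)
  • Def. $C$ is a constraint language for $A$ if, for all
    $\varphi \in C$, the set of valid archives
    $V(\varphi) = \{\, a \in A \mid a \models \varphi \,\}$ is decidable.
  • Example: $C = \mathrm{XML\_DTD}$, $\varphi = \mathrm{Senate\_DTD}$
  • Def. $C'$ subsumes $C$ ($C' \succeq C$) w.r.t. $A$ if, for
    all $\varphi \in C$, there is an encoding
    $\mathrm{enc}(\varphi) \in C'$ s.t. for all $a \in A$:
  • $a \models \varphi$ iff $a \models \mathrm{enc}(\varphi)$
  • Proposition
  • $\mathrm{XML\_Schema} \succeq \mathrm{XML\_DTD}$
  • $\mathrm{F\_Logic}, \mathrm{Datalog} \succeq \mathrm{XML\_DTD}$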

90
Summary: Towards Bootstrapping Knowledge-Based
Archives
  • enable addition of semantic annotations
    ("knowledge") via logic rules to AIPs
  • add executable specifications of semantics:
    AIP + KP (knowledge package,
    i.e., logic rules)
  • self-validating archive
  • add executable specifications of the ingestion
    network: AIP + IN (ingestion network, ...more
    logic rules)
  • self-instantiating archive
  • bootstrapping knowledge-based archive with
    DTD/Schema/IC validation and ingestion
    transformations all expressed in a declarative
    logic program
  • Outlook (from the to-do list): build a prototype
    BARON: Bootstrapping Archive of Rules,
    Ontologies, and Ingestion Networks

Baron von Münchhausen, pulling himself out of the
swamp
91
References
  • Towards Self-Validating Knowledge-Based
    Archives, Bertram Ludäscher, Richard Marciano,
    Reagan Moore, 11th Workshop on Research Issues in
    Data Engineering (RIDE), Heidelberg, IEEE
    Computer Society, April 2001, SDSC TR-2001-1,
    January 18, 2001.
  • Knowledge-Based Persistent Archives, Reagan
    Moore, SDSC TR-2001-7, January 18, 2001
  • The Senate Legislative Activities Collection
    (SLA) a Case Study Infrastructure Research to
    Support Preservation Strategies, Richard
    Marciano, Bertram Ludäscher, Reagan Moore, SDSC
    TR-2001-5, January 18, 2001
  • Reference Model for an Open Archival Information
    System (OAIS), Draft Recommendation, Consultative
    Committee for Space Data Systems, CCSDS
    650.0-R-1, May 1999.
  • Digital Rosetta Stone A Conceptual Model for
    Maintaining Long-term Access to Digital
    Documents, Alan R. Heminger, Steven B. Robertson

92
ADDITIONAL MATERIAL AHEAD
93
Collection-Based Archival with XML
  • Archival Formats: Desiderata
  • standardized, open, as simple as possible,
  • self-contained and self-describing
  • XML provides a good framework for archival
  • Data/Instance Level: records/objects/tuples
  • content information (CI)
  • Schema/Class Level
  • collection structure metadata, types
  • packaging information (PI) and descriptive
    information (DI)
  • Missing in Action...
  • conceptual level information: relationships
    between collection attributes/classes, integrity
    constraints, derived knowledge, ...
  • parts in CON, PI, but need for knowledge packages
    (KPs)
  • Knowledge-Based Archival

94
Getting your hands dirty with logic rules
  • Some logic rules for reassembling the doc
    structure (lexical scopes) from the OAV (or
    rather AOV)

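% Attr holds value Attr_val from line LN up to the next occurrence at LN1.
attr_interval(Attr, SID, Attr_val, LN, LN1) :-
    oav(Attr, (SID, LN), Attr_val),
    oav(Attr, (SID, LN1), _), LN1 > LN,
    not attr_between(Attr, SID, LN, LN1).
% Some occurrence of Attr lies strictly between LN and LN1.
attr_between(Attr, SID, LN, LN1) :-
    oav(Attr, (SID, LN), _), oav(Attr, (SID, LN1), _),
    oav(Attr, (SID, LN2), _),
    LN < LN2, LN2 < LN1.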
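The interval rules can be mirrored in Python to see what they compute. This is a sketch under assumptions: the facts are plain tuples (attr, (sid, ln), value), each attribute occurs at most once per line, and the sample facts and the identifier "s1" are hypothetical.

```python
def attr_intervals(oav):
    """Return (attr, sid, value, ln, ln1) for consecutive occurrences of attr.

    ln1 is the next line after ln at which attr occurs within the same sid,
    i.e. there is no occurrence strictly between ln and ln1.
    """
    occurrences = {}
    for attr, (sid, ln), value in oav:
        occurrences.setdefault((attr, sid), []).append((ln, value))

    result = []
    for (attr, sid), occ in sorted(occurrences.items()):
        occ.sort()
        # Pair each occurrence with its immediate successor.
        for (ln, value), (ln1, _) in zip(occ, occ[1:]):
            if ln1 > ln:
                result.append((attr, sid, value, ln, ln1))
    return result

# Hypothetical facts: formatting attributes toggled at given line numbers.
facts = [
    ("bold", ("s1", 1), "on"),
    ("bold", ("s1", 5), "off"),
    ("bold", ("s1", 9), "on"),
    ("italic", ("s1", 3), "on"),
]

intervals = attr_intervals(facts)
print(intervals)
```

Sorting and pairing neighbours replaces the negated attr_between subgoal: two occurrences are adjacent in the sorted list exactly when nothing lies between them.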
95
Summary: what is the declarative (logic) approach?
  • Use of declarative database and knowledge
    representation formalisms for...
  • adding knowledge packages to AIPs
  • capture context known at the time of archival
    using conceptual models of collections, integrity
    constraints, virtual relations,
  • applying them at ingestion, migration, and
    instantiation/access time
  • (wrapping, transforming, querying
    collections)