1
Introduction to the Open Archives Initiative
Protocol for Metadata Harvesting
  • Timothy W. Cole (t-cole3_at_uiuc.edu), Mathematics
    Librarian
  • William H. Mischo (w-mischo_at_uiuc.edu),
    Engineering Librarian
  • Thomas G. Habing (thabing_at_uiuc.edu), Research
    Programmer
  • Grainger Engineering Library Information Center
  • University of Illinois at Urbana-Champaign
  • Presented 27 May 2003
  • in conjunction with JCDL 2003, Houston, TX
  • http://dli.grainger.uiuc.edu/Publications/TWCole/JCDL-OAI

2
Today's Agenda (Part 1)
  • Overview of OAI (Mischo)
  • What it is, where it comes from, what it's used
    for
  • Relation to HTTP, XML, Dublin Core, Z39.50
  • Basic Concepts & Definitions (Cole)
  • OAI verbs
  • OAI transactions
  • Protocol details & architecture options
  • Illustrations
  • Implementation Guidelines for Repositories (Cole)
  • Tools & program layout options
  • Metadata generation / mapping
  • Optional protocol elements
  • Error handling & deleted records

3
Today's Agenda (Part 2)
  • Tools, testing, problems (Cole)
  • XML & OAI validation tools
  • Common problems
  • Implementation Guidelines for Harvesters (Mischo)
  • How to harvest
  • Harvesting policies & strategies
  • Harvester Technologies
  • Advanced topics (Cole)
  • Communities
  • OAI Static Repository
  • OAI & SOAP
  • Where do you go from here?

4
OAI as a tool
  • All about moving metadata around
  • Designed to be a building block, useable by many
    different communities
  • Can facilitate (in some cases enable) services &
    functions
  • Assumes widely distributed content,
    but centralized indexing(!) services
  • Build once, use for many applications
  • Focus of OAI is interoperability

5
Metadata vs. Information Resources
  • Resource refers to information objects or digital
    representations of information objects
  • Metadata item is a collection of properties about
    a resource (e.g. title, author, etc.)
  • Metadata record is a metadata item expressed in a
    specific syntax according to an XSD
  • OAI focuses on metadata, with the implicit
    understanding that metadata contains useful links
    to the source information object(s)

6
OAI Antecedents
  • Call to other E-Print archives (July 1999)
  • Paul Ginsparg, Rick Luce, Herbert Van de
    Sompel
  • Mobilize core group to work towards achieving
    a universal service for author self-archived
    scholarly literature.
  • Santa Fe Mtgs. (Oct. 1999 & June 2000)
  • OAI-PMH version history
  • First Alpha Release, Sept. 2000
  • 1.0 (Beta) Release January 2001
  • 1.1 (Beta 2) Release July 2001
  • 2.0 (Production) Release June 2002

7
Original OAI Organization
  • OAI Executive
  • Carl Lagoze & Herbert Van de Sompel
  • OAI Steering Committee
  • Co-Chairs: Dan Greenstein, Cliff Lynch
  • OAI Technical Committee
  • Funded by NSF, DLF & CNI
  • Seeks to be user community driven
  • Adopters (selective list)
  • NSDL, NDLTD, Open Archives Forum (EU), JISC/DNER
    (UK)
  • E-Prints.Org, DLXS, DSpace, ContentDM, ENCompass

8
OAI Protocol for Metadata Harvesting
  • Harvesting approach to interoperability at
    metadata level
  • Divides world into Metadata Providers & Service
    Providers
  • Builds on HTTP, XML, Dublin Core
  • http://www.openarchives.org/

9
Harvesting/Federation vs. Broadcast
  • Competing approaches to interoperability
  • Distributed/Broadcast searching: search and
    discovery over remote services and data
  • Harvesting is when data/metadata is transferred
    from the remote source to the destination where
    the services are located (e.g. Union catalogs)
  • OAI designed to make it easy for providers
  • Low barrier design
  • OAI focuses on harvesting

10
Data and Service Providers
  • Data Providers (Repositories) refer to entities
    who possess resources & metadata and are willing
    to share metadata with others via well-defined
    OAI protocols
  • Service Providers (Harvesters) are entities who
    harvest metadata from Data Providers in order to
    supply higher-level services to users (e.g.
    search & discovery)
  • OAI uses these denotations for its client/server
    model (data = server, service = client)

11
Reliance on HTTP & XML
  • OAI-PMH is a REpresentational State Transfer
    (REST) protocol (unlike RPC, SOAP)
  • OAI requests and responses are sent via the HTTP
    protocol
  • OAI Requests are encoded as HTTP GET or POST
    operations
  • OAI Responses are valid XML documents
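
A minimal sketch of how a request could be encoded either way, using only the Python standard library. The helper names build_get_url and build_post_body are hypothetical, and the identifier value is illustrative. The sketch assumes POST requests are sent form-encoded (application/x-www-form-urlencoded).

```python
from urllib.parse import urlencode


def build_get_url(base_url, verb, args=None):
    """Return the URL for an OAI-PMH request sent as HTTP GET."""
    params = {"verb": verb}
    params.update(args or {})
    # urlencode percent-encodes reserved characters such as ':' and '/'.
    return base_url + "?" + urlencode(params)


def build_post_body(verb, args=None):
    """Return the form-encoded body for the same request sent as HTTP POST."""
    params = {"verb": verb}
    params.update(args or {})
    return urlencode(params).encode("ascii")


url = build_get_url(
    "http://www.anarchive.org/cgi-bin/OAI",
    "GetRecord",
    {"identifier": "oai:test:123", "metadataPrefix": "oai_dc"},
)
```

Either form carries the same key/value pairs, so a repository can parse both with one routine.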

12
XML Namespaces and Schema
  • Consistency and data quality are ensured by
    using XML Schema Definitions (XSD) for all
    responses
  • XML Namespaces are used where necessary to
    clearly define which parts of the responses are
    actual metadata and which support the Metadata
    Harvesting Protocol

13
OAI-PMH Use of Dublin Core
  • DC is OAI's lowest common denominator
  • OAI supports & encourages use of other,
    community-driven metadata schemas
  • Typically, metadata provider stores metadata in
    best schema as dictated by material resources
  • Crosswalk (semantic mapping) to simpler schemas
  • Semantic mapping at metadata delivery (rather
    than at time of search)
  • As with Z39.50, can't search for what's not there

14
As Compared to Z39.50
                       Z39.50           OAI
Content (Objects)      Distributed      Distributed
World View             Bibliographic    Bibliographic
Object Presentation    Data provider    Data provider
Searching is           Distributed      Centralized
Search done by         Data provider    Service provider
Metadata searched is   Up to date       Stale
Semantic Mapping       When searching   Metadata delivery

15
What OAI Is Not
  • Not search
  • Not database
  • Not metadata
  • Not OAIS

16
What OAI is good for
  • Where content is widely distributed, in different
    kinds of non-Z39.50 enabled locations
  • Metadata provider more lightweight than Z39.50
  • Metadata provider scales well; Service provider
    scales according to search capability
  • Metadata is sufficient for services desired
  • Normalization, deduping, augmentation desired
  • Not mutually exclusive
  • Portals can use both Z39.50 & OAI

17
The NSDL metadata repository
  • The metadata repository is a resource for service
    providers. It holds information about every
    collection and item known to the NSDL.
[Diagram labels: Users, Services, Metadata repository, Collections]
From "The NSDL Metadata Strategy", a presentation by William Y. Arms and
Diane I. Hillmann. Available:
http://nsdl.comm.nsdlib.org/allprojects01/metastrategy.ppt

18
NSDL Metadata Strategy
  • Support eight standard formats
  • Collect all existing metadata in these
    formats
  • Provide crosswalks to Dublin Core
  • Expose records in the metadata repository for
    service providers to harvest
  • Concentrate human effort on collection-level
    metadata
  • Use automatic generation to augment
    item-level metadata

From "The NSDL Metadata Strategy", a presentation by William Y. Arms and
Diane I. Hillmann. Available:
http://nsdl.comm.nsdlib.org/allprojects01/metastrategy.ppt

19
IMLS Digital Collections & Content
  • Build a registry of all National Leadership Grant
    collections with digital content.
  • Assist and guide NLG projects in making
    item-level metadata sharable using OAI.
  • Build a repository and search & discovery tools
    for integrated access to the content of NLG
    collections (unique metadata schema?).
  • Research best practices for sharing metadata
    about diverse digital content and for supporting
    the interests of diverse user communities.

20
http://imlsdcc.grainger.uiuc.edu/
21
Open Language Archives Community (OLAC)
  • Supports the OLAC Protocol for Metadata
    Harvesting based on OAI
  • Includes metadata extensions to DC
  • Supports Qualified DC refinements and encodings
    and unique OLAC attribute code to hold
    restricted element values
  • Also supports OLAC Static Repository Gateway
    based on OAI Static Repository (still alpha)
  • Developing an OLAC Repository Editor for
    creating a metadata provider

22
Basic Concepts & Definitions
  • OAI verbs
  • OAI transactions
  • Protocol Details
  • Architecture Options
  • Illustrations

23
How OAI Works
  • OAI VERBS
  • Identify
  • ListMetadataFormats
  • ListSets
  • ListIdentifiers
  • ListRecords
  • GetRecord

[Diagram: the Service Provider (harvester) sends an HTTP request (OAI verb)
to the Metadata Provider (repository), which returns an HTTP response
(valid XML)]

24
Identify
  • Purpose
  • Return general information about the archive and
    its policies (e.g., datestamp granularity)
  • Parameters
  • None
  • Sample URL
  • http://www.anarchive.org/cgi-bin/OAI?verb=Identify

25
ListSets
  • Purpose
  • Provide a listing of sets in which records may be
    organized (may be hierarchical, overlapping, or
    flat)
  • Parameters
  • None
  • Sample URL
  • http://www.anarchive.org/cgi-bin/OAI?verb=ListSets

26
ListMetadataFormats
  • Purpose
  • List metadata formats supported by the archive as
    well as their schema locations and namespaces
  • Parameters
  • identifier: for a specific record (O)
  • Sample URL
  • http://www.anarchive.org/cgi-bin/OAI?verb=ListMetadataFormats

27
ListIdentifiers
  • Purpose
  • List headers for all items corresponding to the
    specified parameters
  • Parameters
  • from: start date (O)
  • until: end date (O)
  • set: set to harvest from (O)
  • metadataPrefix: metadata format to list
    identifiers for (R)
  • resumptionToken: flow control mechanism (X)
  • Sample URL
  • http://www.anarchive.org/cgi-bin/OAI?verb=ListIdentifiers&metadataPrefix=oai_dc

28
GetRecord
  • Purpose
  • Returns the metadata for a single item in the
    form of an OAI record
  • Parameters
  • identifier: unique id for item (R)
  • metadataPrefix: metadata format for the record
    (R)
  • Sample URL
  • http://www.anarchive.org/cgi-bin/OAI?verb=GetRecord&identifier=oai:test:123&metadataPrefix=oai_dc

29
ListRecords
  • Purpose
  • Retrieves metadata records for multiple items
  • Parameters
  • from: start date (O)
  • until: end date (O)
  • set: set to harvest from (O)
  • resumptionToken: flow control mechanism (X)
  • metadataPrefix: metadata format (R)
  • Sample URL
  • http://www.anarchive.org/cgi-bin/OAI?verb=ListRecords&metadataPrefix=oai_dc&from=2001-01-01
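
The parameter lists for the six verbs can be checked mechanically. The sketch below reads (R) as required, (O) as optional and (X) as exclusive, meaning resumptionToken may not be combined with any other argument. The table RULES and the function check_arguments are hypothetical names, and allowing resumptionToken on ListSets is an assumption taken from the later slide on incomplete lists.

```python
# verb -> (required arguments, optional arguments, accepts resumptionToken)
RULES = {
    "Identify": (set(), set(), False),
    "ListMetadataFormats": (set(), {"identifier"}, False),
    "ListSets": (set(), set(), True),
    "GetRecord": ({"identifier", "metadataPrefix"}, set(), False),
    "ListIdentifiers": ({"metadataPrefix"}, {"from", "until", "set"}, True),
    "ListRecords": ({"metadataPrefix"}, {"from", "until", "set"}, True),
}


def check_arguments(verb, args):
    """Return None if the request is legal, else an OAI error code."""
    if verb not in RULES:
        return "badVerb"
    required, optional, allows_token = RULES[verb]
    names = set(args)
    if allows_token and "resumptionToken" in names:
        # Exclusive: the token must be the only argument besides the verb.
        return None if names == {"resumptionToken"} else "badArgument"
    if not required <= names:
        return "badArgument"  # a required argument is missing
    if names - required - optional:
        return "badArgument"  # an argument this verb does not accept
    return None
```

Verb and argument names are case-sensitive here, so "ListRecord" or "metadataprefix" would be rejected.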

30
Protocol Details
  • OAI Transaction: an OAI request (HTTP) &
    corresponding OAI response (XML)
  • Optional: use resumptionToken & other flow
    control mechanisms to manage service load
  • Item Identifiers: Persistence & Uniqueness
  • Item Datestamps: date of last metadata change;
    supports selective harvesting

31
Examples of OAI Requests
  • http://www.language-archives.org/cgi-bin/olaca3.pl?verb=Identify
  • http://publications.uu.se/portal/OAI?verb=ListSets
  • http://www.language-archives.org/cgi-bin/olaca3.pl?verb=ListMetadataFormats
  • http://www.language-archives.org/cgi-bin/olaca3.pl?verb=ListIdentifiers&metadataPrefix=oai_dc&from=2002-12-01
  • http://www.language-archives.org/cgi-bin/olaca3.pl?verb=GetRecord&metadataPrefix=oai_dc&identifier=oai%3Aacl.sr.language-archives.org%3AA00-1006

32
An OAI Response
    <?xml version="1.0" encoding="UTF-8" ?>
    <OAI-PMH xmlns="..." xmlns:xsi="..." xsi:schemaLocation="...">
      <responseDate>2002-05-01T19:20:30Z</responseDate>
      <request verb="GetRecord" identifier="oai:arXiv:hep-th/9901001"
               metadataPrefix="oai_dc">http://an.oa.org/OAI-script</request>
      <GetRecord>
        <record>
          ...
        </record>
      </GetRecord>
    </OAI-PMH>

33
An OAI Record
    <header>
      <identifier>oai:arXiv:cs/0112017</identifier>
      <datestamp>2002-02-28</datestamp>
      <setSpec>cs</setSpec>
    </header>
    <metadata>
      <oai_dc:dc xmlns="...">
        <dc:title>Using Structural Metadata</dc:title>
      </oai_dc:dc>
    </metadata>
    <about>
      <provenance xmlns="...">
        ...
      </provenance>
    </about>
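
A hedged sketch of how a harvester might read such a record with Python's xml.etree.ElementTree. The slide elides the namespace URIs. The three used below are the ones published for OAI-PMH 2.0, oai_dc and Dublin Core, filled in here as an assumption so the sample parses. parse_record and SAMPLE are hypothetical names.

```python
import xml.etree.ElementTree as ET

# Namespace URIs are not shown on the slide; these are assumed values.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "oai_dc": "http://www.openarchives.org/OAI/2.0/oai_dc/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

SAMPLE = """<record xmlns="http://www.openarchives.org/OAI/2.0/">
  <header>
    <identifier>oai:arXiv:cs/0112017</identifier>
    <datestamp>2002-02-28</datestamp>
    <setSpec>cs</setSpec>
  </header>
  <metadata>
    <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
               xmlns:dc="http://purl.org/dc/elements/1.1/">
      <dc:title>Using Structural Metadata</dc:title>
    </oai_dc:dc>
  </metadata>
</record>"""


def parse_record(xml_text):
    """Split one OAI record into protocol fields and Dublin Core titles."""
    record = ET.fromstring(xml_text)
    header = record.find("oai:header", NS)
    return {
        # Protocol-level fields live in the OAI-PMH namespace.
        "identifier": header.findtext("oai:identifier", namespaces=NS),
        "datestamp": header.findtext("oai:datestamp", namespaces=NS),
        "sets": [e.text for e in header.findall("oai:setSpec", NS)],
        "deleted": header.get("status") == "deleted",
        # The actual metadata lives in the oai_dc / dc namespaces.
        "titles": [
            e.text
            for e in record.findall("oai:metadata/oai_dc:dc/dc:title", NS)
        ],
    }
```

The namespace prefixes are what let the parser tell protocol elements (header, datestamp) apart from the metadata payload (dc:title).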

34
Unique Identifiers
  • Each item must have a unique identifier
  • Identifiers must follow rules for valid URIs
  • Example
  • oai:<archiveId>:<recordId>
  • oai:etd.vt.edu:etd-1234567890
  • Each identifier must resolve to a single item and
    always to the same item
  • Can't reuse OAI item identifiers
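
As an illustration of the oai:<archiveId>:<recordId> pattern, the following sketch builds and splits identifiers. The function names are hypothetical, and the sketch assumes the archive id itself contains no colon.

```python
def make_oai_identifier(archive_id, record_id):
    """Compose an identifier of the form oai:<archiveId>:<recordId>."""
    if not archive_id or ":" in archive_id or not record_id:
        raise ValueError("invalid archive id or record id")
    return "oai:%s:%s" % (archive_id, record_id)


def split_oai_identifier(identifier):
    """Return (archive_id, record_id); raise ValueError if malformed."""
    parts = identifier.split(":", 2)  # the record id may itself contain ':'
    if len(parts) != 3 or parts[0] != "oai" or not parts[1] or not parts[2]:
        raise ValueError("not an oai: identifier: %r" % identifier)
    return parts[1], parts[2]
```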

35
Datestamps
  • Needed for every OAI record to support
    incremental harvesting
  • Must be updated when addition or modification or
    deletion made in order to ensure changes are
    correctly propagated to harvesters
  • Different from dates within the metadata; OAI
    datestamp is used only for harvesting
  • Can be either YYYY-MM-DD or YYYY-MM-DDThh:mm:ssZ
    (must be GMT timezone)
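
A small sketch of producing and reading both datestamp forms in Python. format_datestamp and parse_datestamp are hypothetical helpers. The sketch assumes the repository works with timezone-aware datetime values and converts them to UTC (GMT) before formatting.

```python
from datetime import datetime, timedelta, timezone

DAY = "%Y-%m-%d"
SECONDS = "%Y-%m-%dT%H:%M:%SZ"


def format_datestamp(moment, seconds=False):
    """Format an aware datetime as an OAI datestamp in UTC."""
    if moment.tzinfo is None:
        raise ValueError("datestamps need a timezone-aware datetime")
    utc = moment.astimezone(timezone.utc)
    return utc.strftime(SECONDS if seconds else DAY)


def parse_datestamp(text):
    """Parse either datestamp form into an aware UTC datetime."""
    fmt = DAY if len(text) == 10 else SECONDS
    return datetime.strptime(text, fmt).replace(tzinfo=timezone.utc)


# 14:20:30 at UTC-5 is 19:20:30 in UTC.
local = datetime(2002, 5, 1, 14, 20, 30, tzinfo=timezone(timedelta(hours=-5)))
stamp = format_datestamp(local, seconds=True)
```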

36
OAI Provider Architectures
[Diagram labels: Descriptive Metadata, OAI Administrative Metadata, OAI
Harvesters]

37
Architecture Options
  • Metadata items in database
  • If individual metadata items are stored in a
    database
  • Usually requires programmatic mapping to DC
  • Metadata items as XML files
  • If individual metadata items already in XML, can
    do without the database component, or can use
    database to cache and/or hold OAI administrative
    metadata
  • May use XSLT stylesheets to extract / map
    metadata
  • Metadata elements in HTML files
  • As with XML file system options
  • Static repository option (more later)

38
Technology Options
  • WWW Server (e.g., Apache, MS IIS)
  • Protocol may be implemented in many forms
  • CGI Script (Perl, C, Java)
  • Java Servlet
  • PHP
  • Metadata (e.g. database) access mechanism
    required
  • See www.openarchives.org for list of publicly
    available software templates
  • See www.SourceForge.Net for UIUC OAI tools

39
Illustrations
  • Identify
  • ListSets
  • ListMetadataFormats
  • ListIdentifiers
  • GetRecord oai_dc
  • GetRecord olac
  • ListRecords
  • Error

40
15 Minute Break

41
Implementation Guidelines for Repositories
  • Tools Required
  • Basic program layout (incl. object-oriented
    approaches)
  • Optional container elements
  • Metadata generation / mapping, data cleaning
  • Sets
  • resumptionToken, flow control, load-balancing
  • Denial-of-service prevention
  • Error handling
  • Deleted metadata records

42
Typical Pre-Requisites
  • Metadata & Web server
  • Code templates if available (available for many
    languages)
  • Basic Web programming environment
  • XML parsers (for non-trivial encoding)
  • Database access libraries/drivers (e.g. ODBC,
    JDBC)

43
Basic program layout
  • parse WWW request to extract parameters
  • if (verb == Identify): Validate arguments,
    ProcessIdentify
  • else if (verb == ListMetadataFormats): Validate
    arguments, ProcessListMetadataFormats
  • else if (verb == ListSets): Validate arguments,
    ProcessListSets
  • else if (verb == GetRecord): Validate arguments,
    ProcessGetRecord
  • else if (verb == ListIdentifiers): Validate
    arguments, ProcessListIdentifiers
  • else if (verb == ListRecords): Validate arguments,
    ProcessListRecords
  • else: ReportError (badVerb)
  • Re-usable subroutines to extract / clean up /
    transform metadata, generate standard error
    messages, etc.
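
The layout above can be sketched in Python with a table of handlers in place of the if/else chain. All names here (handle_request, process_identify, error_xml, HANDLERS) are hypothetical, only one verb is filled in, and the returned strings are placeholders rather than complete OAI responses.

```python
def error_xml(code):
    """Re-usable subroutine: placeholder for a standard error message."""
    return '<error code="%s"/>' % code


def process_identify(args):
    # Identify takes no arguments, so anything extra is a badArgument.
    if args:
        return error_xml("badArgument")
    return "<Identify>...</Identify>"


# One entry per verb; the other five verbs would be registered the same way.
HANDLERS = {"Identify": process_identify}


def handle_request(params, handlers=HANDLERS):
    """Dispatch parsed request parameters to the handler for the verb."""
    handler = handlers.get(params.get("verb"))
    if handler is None:
        return error_xml("badVerb")
    args = {k: v for k, v in params.items() if k != "verb"}
    return handler(args)
```

A dictionary keeps each verb's validation and processing together, which is one way to get the separation the object-oriented approaches aim for.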

44
Object-Oriented Approaches
  • Cleaner separation of protocol, database access
    and metadata generation
  • Example approaches
  • Each service request is handled by an object
  • Simpler incremental development
  • Protocol, Database and Metadata are objects
  • Greater portability of code
  • Inheritance from a basic OAI data provider

45
Provider Performance Issues
  • Database design impacts performance
  • Work required to map to DC
  • Use of resumptionTokens: a way to improve
    performance
  • Fetch only records needed to satisfy current
    request
  • Queries only retrieve needed records
  • resumptionTokens should retain state information
    for best performance and for idempotency

46
Optional Container Elements
  • <Identify><description>
  • Additional information about repository
  • oai-identifier, eprints, friends, branding,
    other
  • <ListSets><setDescription>
  • Additional information describing a set
  • <metadata>
  • Other metadata besides Dublin Core
  • rfc1807, marc21, oai_marc, mods, other
  • <about>
  • Meta-metadata, i.e. record level rights

47
Metadata Generation / Mapping
  • Approaches
  • Map from source to each metadata format
  • Use multiple crosswalks (may use XSLT) to
    transform to multiple metadata formats

source (e.g., DB)    dc         rfc1807
name                 title      title
author               creator    author
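
The field mapping above could be expressed as one crosswalk table per metadata format. This is a sketch only: CROSSWALKS and crosswalk are hypothetical names, the source field names "name" and "author" come from the table, and "oai_dc" is used as the key for the dc column on the assumption that it is the metadataPrefix in use.

```python
# source field -> target element, one mapping per metadata format
CROSSWALKS = {
    "oai_dc": {"name": "title", "author": "creator"},
    "rfc1807": {"name": "title", "author": "author"},
}


def crosswalk(source_row, metadata_prefix):
    """Map one source row (e.g. a database row) to the target format."""
    mapping = CROSSWALKS[metadata_prefix]
    # Unmapped or empty source fields are omitted rather than guessed at.
    return {
        mapping[field]: value
        for field, value in source_row.items()
        if field in mapping and value
    }


row = {"name": "Using Structural Metadata", "author": "A. Author", "shelf": "12"}
```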


48
Data Cleaning
  • Escape special XML characters (<, >, &, quotes)
  • Convert to UTF-8 version of Unicode
  • Convert entity references (e.g., &copy;)
  • Remove extraneous whitespace
  • URLs
  • Special characters (e.g., / and ?) must be
    encoded as escape sequences
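
The cleaning steps can be sketched with the Python standard library. clean_text, to_utf8 and encode_url_value are hypothetical helpers, and the Latin-1 default source encoding is an assumption; a real repository would use whatever encoding its source data actually has.

```python
import html
import re
from urllib.parse import quote
from xml.sax.saxutils import escape


def clean_text(value):
    """Prepare one metadata value for inclusion in an XML response."""
    text = html.unescape(value)               # &copy; becomes the real character
    text = re.sub(r"\s+", " ", text).strip()  # remove extraneous whitespace
    # Escape &, < and > plus both kinds of quotes.
    return escape(text, {'"': "&quot;", "'": "&apos;"})


def to_utf8(raw_bytes, source_encoding="latin-1"):
    """Re-encode bytes from the (assumed) source encoding as UTF-8."""
    return raw_bytes.decode(source_encoding).encode("utf-8")


def encode_url_value(value):
    """Percent-encode every reserved character in a URL argument value."""
    return quote(value, safe="")
```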

49
Sets: another option for selective harvesting
  • Optional; no well-defined semantics; depends
    completely on local data providers
  • Must provide setSpec & setName, may provide
    setDescription, for each Set in repository
  • Sets may be hierarchical (use :) & may overlap
  • Allows for harvesting of sub-collections
  • May be pre-defined by arrangement between data
    providers and service providers
  • E.g. Subject areas, years, author names (but must
    be pre-defined for ListSets)
  • Not a substitute for searching!

50
resumptionToken, flow control, load-balancing
  • Incomplete response: resumptionToken can be used
    to return partial results; the client is issued
    a token which may be presented to the server
    to receive more results
  • resumptionToken embeds state information,
    allowing OAI to be stateless even for incomplete
    response model
  • HTTP 503 Retry-After mechanism can be used to
    support server-side delaying of a client's
    request
  • HTTP 302 / 303 can be used for load balancing
  • HTTP 4xx can be used to deny a harvester

51
Typical options for resumptionTokens
  • resumptionTokens may have completeListSize,
    cursor, and expiration date attributes
  • Combine from/until/metadataPrefix/set and a
    record number indicator with delimiters into a
    sequential token. For example:
  • from!until!metadataPrefix!set!recordnumber
  • 2000-01-01!2001-01-01!!All!100
  • Use a session manager with automatic expiry. For
    example:
  • vtetd14june10amsession12
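
A sketch of the first option, the delimiter-joined token. encode_token and decode_token are hypothetical names, and the sketch assumes "!" never occurs inside any of the values. Because the token carries all the state, the server needs no session storage to resume the list.

```python
FIELDS = ("from", "until", "metadataPrefix", "set")


def encode_token(state, record_number):
    """Join the request state and the next record number with '!'."""
    parts = [state.get(name) or "" for name in FIELDS]
    return "!".join(parts + [str(record_number)])


def decode_token(token):
    """Recover the request state; raise ValueError for a bad token."""
    parts = token.split("!")
    if len(parts) != 5 or not parts[4].isdigit():
        raise ValueError("badResumptionToken")
    state = {name: (value or None) for name, value in zip(FIELDS, parts)}
    state["recordnumber"] = int(parts[4])
    return state
```

A repository would still have to URL-encode the token when it appears in a GET request.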

52
Denial-of-Service Prevention
  • Return only partial results and issue a
    resumption token for more
  • Use 503 retry-after HTTP errors to have clients
    try again after a specified back-off time
  • Use access control lists to limit who may access
    the archive
  • Invoke an explicit delay before sending back
    results

53
Error Handling
  • All protocol errors are in XML format
  • badVerb: illegal verb requested
  • badArgument: illegal parameter values or
    combinations
  • badResumptionToken, cannotDisseminateFormat,
    idDoesNotExist: parameters are in right format
    but are not legal under current conditions
  • noRecordsMatch, noMetadataFormats,
    noSetHierarchy: empty response exceptions
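
A minimal sketch of generating an error response as XML with xml.etree.ElementTree. error_response is a hypothetical helper, the namespace URI is the OAI-PMH 2.0 one filled in as an assumption, and the schema location attributes are left out for brevity, so the output would not pass full schema validation as is.

```python
import xml.etree.ElementTree as ET

OAI_NS = "http://www.openarchives.org/OAI/2.0/"  # assumed, not on the slide

ERROR_CODES = {
    "badVerb", "badArgument", "badResumptionToken",
    "cannotDisseminateFormat", "idDoesNotExist",
    "noRecordsMatch", "noMetadataFormats", "noSetHierarchy",
}


def error_response(code, message, base_url, response_date):
    """Build an OAI-PMH error document and return it as a string."""
    if code not in ERROR_CODES:
        raise ValueError("unknown OAI error code: %s" % code)
    root = ET.Element("OAI-PMH", {"xmlns": OAI_NS})
    ET.SubElement(root, "responseDate").text = response_date
    ET.SubElement(root, "request").text = base_url
    error = ET.SubElement(root, "error", {"code": code})
    error.text = message  # ElementTree escapes <, > and & on output
    return ET.tostring(root, encoding="unicode")
```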

54
Handling Metadata Record Deletions
  • deletedRecord: no, transient, or persistent
  • Archives may keep track of deleted records, by
    identifier and datestamp
  • All protocol result sets can indicate deleted
    records (possible to delete a record, but not
    item)
  • Best Practice: if deletions are being tracked,
    this information should be stored indefinitely so
    as to correctly propagate to service providers
    with varying harvesting schedules

55
Tools, Testing, Common Problems
  • Validation & Testing Tools
  • Repository Explorer (Virginia Tech)
  • OAI Registry
  • XML Schema Validator (e.g., XSV)
  • Reap command-line harvester
  • Common Problems
  • Incomplete / inconsistent metadata
  • New metadata format
  • No unique identifiers!
  • No datestamps!
  • XML responses not validating
  • Character encoding
  • Doesn't conform to XML Schema Definition

56
http://oai.dlib.vt.edu/cgi-bin/Explorer/oai2.0/testoai
57
Repository Explorer: Parameter Testing
58
Repository Explorer: Formatted View of Data
59
Repository Explorer: Raw XML Views of Data
60
Repository Explorer: Automatic Test Suite
61
Repository Explorer: Error in XML
62
OAI Registry
63
OAI Registry
64
XSV Schema Validator
65
XSV Example
  • Correct XML
  • XSV Result
  • Bad Character
  • XSV Result
  • Invalid Tag
  • XSV Result

66
Incomplete Metadata
  • Synthesize metadata fields based on a priori
    knowledge of the data
  • Example: publisher and language may be hard-coded
    for many archives
  • Omit fields that cannot be filled in correctly;
    better to have less information than incorrect
    information!

67
New metadata format
  • Find the description, namespace and formal name
    of the standard
  • Find an XML Schema description of the data format
  • If none exists, write one (consult other OAI
    people for assistance)
  • Create the mapping and test that it passes XML
    schema validation

68
No unique identifiers
  • Create an independent identifier mapping
  • Use row numbers for a database
  • Use filenames for data in files
  • Use encoded URL for Web pages
  • Use a hash from other fields
  • E.g. authoryearfirst word in title

69
No datestamps
  • Ignore the datestamp parameters and stamp all
    records with the current date
  • Incremental harvests not possible
  • Create a date table with the startup date for all
    entries, then update dates as entries added /
    changed
  • Most Important: any harvesting algorithm that is
    interoperably stable for an archive with real
    dates should be stable for an archive with
    synthesized dates

70
XML not validating
  • Check namespaces and schema
  • Use Repository Explorer in non-validating mode to
    check structure of XML, without looking at
    namespaces or schema
  • Validate schema by itself if it is non-standard
  • Look at XML produced by other repositories
  • Watch out for character encoding issues

71
Implementation Guidelines for Harvesters
  • How to Harvest
  • Selective Harvesting & Granularity
  • Sets
  • Error Recovery
  • Flow Control / Load Balancing / Redirection
  • Incomplete Lists
  • Policies
  • Tools

72
How To Harvest
  • Identify to get basic information
  • ListIdentifiers, followed by ListMetadataFormats
    for each record and then GetRecord for each
    id/metadata combination
  • No. of short HTTP requests: 1 + n + n × m
    (n = no. of identifiers, m = no. of metadata
    formats)
  • ListRecords for each metadata format required
  • No. of long HTTP requests: m (m = no. of metadata
    formats)
  • Response compression is indicated by
    <compression> in the Identify response
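
The two request counts can be compared with a few lines of Python. The function names are hypothetical, and the counts ignore any extra requests caused by resumptionTokens.

```python
def short_request_count(n, m):
    """ListIdentifiers, then ListMetadataFormats per record, then GetRecord
    per identifier/format combination: 1 + n + n * m requests."""
    return 1 + n + n * m


def long_request_count(m):
    """One ListRecords request per metadata format: m requests."""
    return m


# Example: 1000 identifiers and 2 metadata formats.
short_total = short_request_count(1000, 2)
long_total = long_request_count(2)
```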

73
Selective Harvesting: Datestamps
  • Day or seconds granularity, declared in the
    Identify response <granularity>
  • All repositories must support from and until
    parameters at the day granularity
  • This provides for incremental or differential
    harvesting strategies
  • Because records may change or be added during a
    harvest there should be a two-day overlap for
    incremental harvests

74
Sets
  • Sets provide another means of selective
    harvesting
  • ListSets to determine which or even if sets are
    supported
  • May ignore sets
  • Colons (:) in the setSpec values indicate
    hierarchy.

75
Error Recovery
  • Because of idempotency, harvesters can reissue the
    previous resumptionToken, if it hasn't expired
  • Especially useful when harvesting very large
    repositories in cases of network errors or
    disconnects
  • Some harvesters can take multiple days to harvest
    extremely large repositories

76
Flow Control / Load Balancing / Redirection
  • Repositories are free to utilize the various HTTP
    status codes which harvesters must be prepared to
    handle
  • 503 Service Unavailable: should include a
    Retry-After header
  • 403 Forbidden
  • 302 Found: should include a Location header to
    redirect to a new URL
  • Future harvesting requests should continue to use
    original URL if the baseURL in the <request>
    element remains the same
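
A sketch of how a harvester might decide what to do with each status code. next_action is a hypothetical function, headers is assumed to be a plain dict keyed by the usual header names, and only the seconds form of Retry-After is handled (an HTTP date falls back to a default wait).

```python
def next_action(status, headers, default_wait=60):
    """Map an HTTP status to ('parse' | 'wait' | 'redirect' | 'stop', detail)."""
    if status == 200:
        return ("parse", None)
    if status == 503:
        retry_after = headers.get("Retry-After", "")
        # Only the delay-in-seconds form is understood in this sketch.
        wait = int(retry_after) if retry_after.isdigit() else default_wait
        return ("wait", wait)
    if status in (302, 303):
        # Follow the redirect for this request only; keep the original
        # base URL for future harvests.
        return ("redirect", headers.get("Location"))
    # 403 and other 4xx codes: the repository is denying the harvester.
    return ("stop", None)
```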

77
Incomplete Lists
  • Harvesters can receive incomplete list responses
    to ListIdentifiers, ListRecords, and ListSets
    requests
  • Indicated by <resumptionToken> in response
  • Next list request is made using content of
    <resumptionToken> as value of argument
  • http://an.oai.org/script?verb=ListIdentifiers&resumptionToken=2001-01-02%3A2001-01-03%3A0
  • resumptionToken value must be correctly encoded
    for HTTP GET and POST
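
A hedged sketch of the loop a harvester could use to follow resumptionTokens. harvest_identifiers is a hypothetical name. fetch is assumed to be any function that takes a URL and returns the response body as text, which keeps the sketch independent of a particular HTTP library. The namespace URI is the OAI-PMH 2.0 one, filled in as an assumption.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

OAI = "{http://www.openarchives.org/OAI/2.0/}"  # assumed namespace


def harvest_identifiers(base_url, metadata_prefix, fetch):
    """Yield every identifier, following resumptionTokens until done."""
    params = {"verb": "ListIdentifiers", "metadataPrefix": metadata_prefix}
    while True:
        # urlencode takes care of encoding the token for HTTP GET.
        root = ET.fromstring(fetch(base_url + "?" + urlencode(params)))
        for header in root.iter(OAI + "header"):
            yield header.findtext(OAI + "identifier")
        token = root.find(".//" + OAI + "resumptionToken")
        if token is None or not (token.text or "").strip():
            return  # no token, or an empty one: the list is complete
        # resumptionToken is exclusive: send it with the verb only.
        params = {
            "verb": "ListIdentifiers",
            "resumptionToken": token.text.strip(),
        }
```

Since the same token can be reissued after a network error, a real harvester could wrap the fetch call in a retry.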

78
Policies
  • Use a schedule to harvest regularly
  • Store date when last harvested (before you start)
  • Use a two day overlap (or one day if your archive
    uses proper UTC datestamps)
  • New items may be added for the current day
  • Timezones create up to a day of lag if you ignore
    them
  • If the source uses correct UTC datestamps and
    second granularity then only 1 second of overlap
    is needed!
  • Each time a record is encountered, erase previous
    instances
  • Harvesters should supply HTTP User-Agent and From
    headers (practices for robots)
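
The overlap policy can be sketched as a function that computes the from argument for the next incremental harvest. next_from is a hypothetical name. The sketch assumes the harvester stored the start time of its last harvest as a timezone-aware datetime, and that the granularity string is the value reported in the repository's Identify response.

```python
from datetime import datetime, timedelta, timezone


def next_from(last_harvest_start, granularity="YYYY-MM-DD", overlap_days=2):
    """Return the 'from' value for the next incremental harvest."""
    start = last_harvest_start.astimezone(timezone.utc)
    if granularity == "YYYY-MM-DDThh:mm:ssZ":
        # Correct UTC datestamps at seconds granularity: 1 second of overlap.
        return (start - timedelta(seconds=1)).strftime("%Y-%m-%dT%H:%M:%SZ")
    # Day granularity: step back by the overlap (two days by default).
    return (start - timedelta(days=overlap_days)).date().isoformat()


last = datetime(2003, 5, 27, 10, 0, 0, tzinfo=timezone.utc)
```

Records harvested twice because of the overlap are harmless as long as each new copy replaces the previous instance.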

79
Technologies for Harvesters
  • To validate or not to validate
  • No, well-formed, strictly valid (checked against
    Schema)
  • Choice of parser: MSXML, Apache Project
  • Storing harvested metadata
  • As XML
  • Import to DB on the fly, batch afterwards
  • Indexing tools
  • DLXS, Encompass, Ex Libris, MySQL, SQL Server

80
Advanced Topics
  • Communities
  • SOAP version
  • Envisioned for near future
  • OAI Static Repository
  • http://www.openarchives.org/OAI/2.0/guidelines-static-repository.htm

81
OAI Communities
  • Shared Metadata Formats
  • Shared semantics
  • Layering over OAI
  • Closed OAI networks
  • OAI within the DL

82
Shared Metadata Formats
  • Use metadata formats accepted within a community
    to convey more specific information
  • Examples
  • E-Print format (under development)
  • ETD-MS for theses and dissertations
  • VRA Core for multimedia
  • IMS Metadata for educational material

83
Shared Semantics
  • Develop a shared understanding for the meanings
    of fields
  • Examples
  • Developing controlled vocabularies for fields
  • Using specific fields for external links (OAI
    recommends using identifier in DC for this)
  • Choosing from among existing standards (like
    language names)

84
SOAP & OAI
  • SOAP: Simple Object Access Protocol
  • XML envelope for remote procedure calls
  • Promoted by Microsoft, IBM, W3C
  • OAI community members are exploring provision of
    OAI services using SOAP rather than HTTP
  • May be more attractive to commercial and library
    software vendors

85
OAI Static Repository
86
Where to go from here?
  • DO I REALLY WANT TO DO THIS?
  • Do I have an accessible metadata source?
  • Do I have a server to host the OAI
    script/program?
  • Can I satisfy the requirements to be a data
    provider?
  • Can I write the code or modify a template or hire
    a programmer to do either?

87
Links
  • Open Archives Initiative
  • http://www.openarchives.org
  • OAI Metadata Harvesting Protocol
  • http://www.openarchives.org/OAI/openarchivesprotocol.htm
  • Virginia Tech DLRL OAI Projects
  • http://www.dlib.vt.edu/projects/OAI/
  • Repository Explorer
  • http://purl.org/net/oai_explorer
  • NDLTD
  • http://www.ndltd.org

88
More Links
  • ARC Cross-Archive Search Service
  • http://arc.cs.odu.edu/
  • XML Schema Validator
  • http://www.w3.org/2001/03/webdata/xsv
  • Dublin Core Metadata Initiative
  • http://www.dublincore.org
  • E-Prints DL-in-a-box
  • http://www.eprints.org
  • XML Tools at W3C
  • http://www.w3.org/XML/software