Title: Extracting XML from Unicorn with OAI and SRU
1. Extracting XML from Unicorn with OAI and SRU
- European Unicorn User Group Conference
- Glasgow Caledonian University
- September 7th-8th, 2006
Benoit PAUWELS, Université Libre de Bruxelles (ULB), Brussels
2. Agenda
- Introduction: Unicorn interfaces
- Part 1: An OAI frontend for Unicorn
- Part 2: An SRU frontend for Unicorn
- Short description of OAI and SRU protocols
- Overview of technical implementation
- Use cases and demos
3. Introduction
- OAI and SRU are open protocols that permit exchange of metadata between information systems
- Well-known Unicorn interfaces:
- Unicorn API server
- Unicorn Webcat/iBistro/iLink server
- Unicorn Z39.50 server
- All three follow a request/response model
4. Unicorn interfaces: API server
[Diagram: the client system (SirsiDynix character client, C WorkFlows client or Java Themes client) sends an API request over a TCP/IP socket to the API server on the Unicorn server, which reads the catalogue database (records and indexes) and returns an API response made of API data codes and values.]
- Communication protocol: TCP/IP socket
- Information exchange protocol: proprietary SirsiDynix API requests/responses
- Returned record structure: proprietary SirsiDynix format (data codes and values)
5. Unicorn interfaces: iLink
[Diagram: the client system sends an iLink request (a URL) over HTTP to the iLink web server on the Unicorn server, which reads the catalogue database (records and indexes) and returns an HTML page over HTTP.]
- Communication protocol: HTTP
- Information exchange protocol: URL requests / HTML responses
- Returned record structure: HTML
6. Unicorn interfaces: Z39.50
[Diagram: the client system sends a Z39.50 request to the Z39.50 server on the Unicorn server, which reads the catalogue database (records and indexes) and returns a Z39.50 response carrying MARC21 records.]
- Communication protocol: Z39.50-specific
- Information exchange protocol: Z39.50-specific
- Returned record structure: typically MARC21
7. Unicorn interfaces
- API: proprietary
- low interoperability level
- HTML: record data not well structured
- low reusability level
- Z39.50: protocol-specific
- more difficult to implement (steep learning curve)
- Z39.50 is stateful
- => Difficult to integrate into today's web services environments
- => communication: use HTTP
- => information exchange: use open protocols (like OAI and SRU)
- => record data structure: use XML (according to a well-defined XML Schema)
8. Two new Unicorn interfaces
- HTTP / Open / XML
- OAI-PMH: Open Archives Initiative Protocol for Metadata Harvesting
- SRU: Search and Retrieve via URL
9. OAI-PMH: the protocol
[Diagram: the service provider sends HTTP-embedded OAI requests to the data provider, where an OAI frontend on a web server sits in front of the document archive and returns HTTP-embedded OAI responses.]
10. OAI-PMH: the protocol
- A harvester collects metadata from archives
- Stateless protocol: a sequence of OAI requests/responses over HTTP
- Just harvesting, NOT searching
11. OAI-PMH: the protocol
- OAI requests
- HTTP GET/POST requests
- Syntax
- Base URL
- host, port and path of the OAI request handler
- key=value pairs
- Examples
- http://www.cible.ulb.ac.be:80/cgi-bin/OAI20/catalog?verb=Identify
- http://www.biomedcentral.com/oai/1.1/bmcoai.asp?verb=GetRecord&identifier=oai:bmc:1471-2105-1-1&metadataPrefix=oai_dc
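The request syntax above (base URL plus key=value pairs) can be sketched in a few lines of Python. This is only an illustration: the function name `oai_request_url` is hypothetical, and the identifier spelling `oai:ulbcat:245000` is an assumption about how the ULB identifiers are punctuated.

```python
from urllib.parse import urlencode

# Base URL taken from the slides; the port is left at the HTTP default.
BASE_URL = "http://www.cible.ulb.ac.be/cgi-bin/OAI20/catalog"


def oai_request_url(base_url, verb, **arguments):
    """Build an OAI-PMH GET URL: base URL, '?', then key=value pairs joined by '&'."""
    # The verb goes first for readability; other arguments are sorted by name.
    pairs = [("verb", verb)] + sorted(arguments.items())
    # urlencode percent-encodes reserved characters such as ':' in identifiers.
    return base_url + "?" + urlencode(pairs)


print(oai_request_url(BASE_URL, "Identify"))
print(oai_request_url(BASE_URL, "GetRecord",
                      identifier="oai:ulbcat:245000",  # assumed identifier form
                      metadataPrefix="marc21"))
```

Percent-encoding the colons is harmless: the server decodes them back before interpreting the identifier.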
12. OAI-PMH: the protocol
- OAI responses
- XML-encoded byte streams containing the records
- Record triplet
- header (unique OAI identifier)
- metadata
- about
- Metadata schemes
- XML Schema
- Minimum: unqualified Dublin Core
- Community-specific
- Example of a record (catkey 450000 from the ULB catalogue): oai_dc, marc21, umods
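To make the header/metadata/about triplet concrete, here is a minimal sketch of how a harvester might take one record apart. The sample record is invented for illustration (it is not the ULB record mentioned above), and `split_record` is a hypothetical helper name; only the XML namespaces are the standard OAI-PMH 2.0 and Dublin Core ones.

```python
import xml.etree.ElementTree as ET

NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}
DC_TITLE = "{http://purl.org/dc/elements/1.1/}title"

# Invented sample record: header and metadata present, no about section.
SAMPLE = """<record xmlns="http://www.openarchives.org/OAI/2.0/">
  <header>
    <identifier>oai:example:1</identifier>
    <datestamp>2006-08-01</datestamp>
  </header>
  <metadata>
    <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
               xmlns:dc="http://purl.org/dc/elements/1.1/">
      <dc:title>An invented title</dc:title>
      <dc:creator>Doe, Jane</dc:creator>
    </oai_dc:dc>
  </metadata>
</record>"""


def split_record(xml_text):
    """Return the main parts of one OAI record as a plain dictionary."""
    record = ET.fromstring(xml_text)
    header = record.find("oai:header", NS)
    metadata = record.find("oai:metadata", NS)
    return {
        "identifier": header.findtext("oai:identifier", namespaces=NS),
        "datestamp": header.findtext("oai:datestamp", namespaces=NS),
        "titles": [element.text for element in metadata.iter(DC_TITLE)],
        # 'about' is optional and repeatable, so count the occurrences.
        "about_count": len(record.findall("oai:about", NS)),
    }


print(split_record(SAMPLE))
```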
13. OAI-PMH: the protocol
- Simple: 6 OAI requests/responses
- Identify
- http://www.cible.ulb.ac.be/cgi-bin/OAI20/catalog?verb=Identify
- ListMetadataFormats (argument: identifier)
- http://www.cible.ulb.ac.be/cgi-bin/OAI20/catalog?verb=ListMetadataFormats
- ListSets
- http://www.cible.ulb.ac.be/cgi-bin/OAI20/catalog?verb=ListSets
- GetRecord (arguments: identifier, metadataPrefix)
- http://www.cible.ulb.ac.be/cgi-bin/OAI20/catalog?verb=GetRecord&identifier=oai:ulbcat:245000&metadataPrefix=marc21
14. OAI-PMH: the protocol
- Simple: 6 OAI requests/responses
- ListRecords (arguments: metadataPrefix, from, until, set)
- http://www.cible.ulb.ac.be/cgi-bin/OAI20/catalog?verb=ListRecords&metadataPrefix=oai_dc
- http://www.cible.ulb.ac.be/cgi-bin/OAI20/catalog?verb=ListRecords&metadataPrefix=mhld21&set=elper
- http://www.cible.ulb.ac.be/cgi-bin/OAI20/catalog?verb=ListRecords&metadataPrefix=marc21&from=2006-08-01
- ListIdentifiers (arguments: metadataPrefix, from, until, set)
- http://www.cible.ulb.ac.be/cgi-bin/OAI20/catalog?verb=ListIdentifiers&metadataPrefix=oai_dc
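A data provider has to check each incoming request against the six verbs and their arguments. The following is a rough sketch of such a check, not the code of the Perl frontend: it covers only the arguments named on the slides (it ignores resumptionToken, which OAI-PMH 2.0 also defines), and `check_request` is a hypothetical name. The error codes `badVerb` and `badArgument` are the ones defined by OAI-PMH.

```python
# verb -> (required arguments, optional arguments), as listed on the slides.
VERBS = {
    "Identify": (set(), set()),
    "ListMetadataFormats": (set(), {"identifier"}),
    "ListSets": (set(), set()),
    "GetRecord": ({"identifier", "metadataPrefix"}, set()),
    "ListRecords": ({"metadataPrefix"}, {"from", "until", "set"}),
    "ListIdentifiers": ({"metadataPrefix"}, {"from", "until", "set"}),
}


def check_request(params):
    """Return an OAI-PMH error code for a bad request, or None if it looks valid."""
    verb = params.get("verb")
    if verb not in VERBS:
        return "badVerb"
    required, optional = VERBS[verb]
    given = set(params) - {"verb"}
    missing_required = not (required <= given)
    unknown_argument = not (given <= (required | optional))
    if missing_required or unknown_argument:
        return "badArgument"
    return None


print(check_request({"verb": "GetRecord", "identifier": "oai:ulbcat:245000"}))
print(check_request({"verb": "ListRecords", "metadataPrefix": "oai_dc", "set": "elper"}))
```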
15. OAI frontend for Unicorn
- Implementation of the data provider functionality (2001)
- http://www.openarchives.org/tools/tools.html: pick a template and interface with Unicorn through Unicorn database tools
- Our choice: Object Oriented Perl frontend (H. Suleman, Virginia Tech)
17. OAI frontend for Unicorn
- Example: implementation of the GetRecord request
- http://www.cible.ulb.ac.be/cgi-bin/OAI20/catalog?verb=GetRecord&identifier=oai:ulbcat:245000&metadataPrefix=oai_dc
- 1. Get metadata from Unicorn for catkey 245000
- $record = `echo $catkey | catalogdump -of | filtermarc -iALL -od -Ds`;
- @dates = split(/\|/, `echo $catkey | selcatalog -iK -opr`);
- 2. Convert the ANSEL character set into ISO-LATIN-1
- 3. Map from MARC to oai_dc
- 4. Format into XML
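Steps 3 and 4 above (map MARC to oai_dc, then format as XML) might look roughly like the sketch below. It is written in Python rather than the Perl of the actual frontend, and the crosswalk table is an assumption made for illustration (a handful of common MARC-to-Dublin-Core pairings), not the mapping used at ULB. The input is assumed to be already decoded text, so step 2 is not covered.

```python
import xml.etree.ElementTree as ET

OAI_DC = "http://www.openarchives.org/OAI/2.0/oai_dc/"
DC = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("oai_dc", OAI_DC)
ET.register_namespace("dc", DC)

# Assumed crosswalk: (MARC tag, subfield code) -> Dublin Core element name.
CROSSWALK = {
    ("245", "a"): "title",
    ("100", "a"): "creator",
    ("260", "b"): "publisher",
    ("260", "c"): "date",
    ("650", "a"): "subject",
}


def marc_to_oai_dc(fields):
    """Map (tag, subfield, value) triples to an oai_dc XML string."""
    root = ET.Element(f"{{{OAI_DC}}}dc")
    for tag, code, value in fields:
        element_name = CROSSWALK.get((tag, code))
        if element_name is None:
            continue  # fields without a mapping are dropped
        ET.SubElement(root, f"{{{DC}}}{element_name}").text = value
    # ElementTree escapes '&', '<' and '>' in the values when serialising.
    return ET.tostring(root, encoding="unicode")


print(marc_to_oai_dc([("245", "a", "Relativity"), ("100", "a", "Einstein, Albert")]))
```

Building the tree with a library rather than concatenating strings means special characters in titles cannot break the XML.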
18. OAI frontend for Unicorn
- Example: implementation of the set parameter of the ListRecords request
- http://www.cible.ulb.ac.be/cgi-bin/OAI20/catalog?verb=ListRecords&metadataPrefix=oai_dc&set=elper
- Precompile each set as a file of catkeys
- name of file = name of set + _catkeys
- einstein_albert_catkeys
- elper_catkeys
- sd_catkeys
- all_catkeys
- through periodic execution of the mkoaisets custom report
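With sets precompiled as files, resolving the set parameter comes down to a file lookup. A minimal sketch, assuming one catkey per line and a directory of the caller's choosing (both are assumptions; the slides only give the file naming rule). `catkeys_for_set` is a hypothetical helper name.

```python
import re
import tempfile
from pathlib import Path


def catkeys_for_set(set_name, sets_dir):
    """Return the precompiled catkeys for a set, or None if the set is unknown."""
    # Accept only simple names so a request cannot point outside the directory.
    if not re.fullmatch(r"[A-Za-z0-9_]+", set_name):
        return None
    path = Path(sets_dir) / f"{set_name}_catkeys"
    if not path.is_file():
        return None
    return [line.strip() for line in path.read_text().splitlines() if line.strip()]


# Demo with a throwaway directory standing in for the real one.
with tempfile.TemporaryDirectory() as sets_dir:
    (Path(sets_dir) / "elper_catkeys").write_text("245000\n450000\n617924\n")
    print(catkeys_for_set("elper", sets_dir))
    print(catkeys_for_set("no_such_set", sets_dir))
```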
19. OAI frontend for Unicorn
- Example: implementation of the from/until parameters of the ListRecords request
- http://www.cible.ulb.ac.be/cgi-bin/OAI20/catalog?verb=ListRecords&metadataPrefix=oai_dc&from=2006-08-01&until=2006-08-31
- BRS index on creation/modification date?
- Every Unicorn record that gets created or modified is touched in the textedit and browsedit directories
- Custom report cadutext
- saves catkeys to <ud>/Savedkeys/adutext/rptid
- adds the line rptid|date|status to <ud>/Lastruns/cadutext
- Example: from=2006-08-01&until=2006-08-31
- obtain the report ids for all runs of cadutext after 2006-08-01 and before 2006-08-31 from the file <ud>/Lastruns/cadutext
- for each of these report ids, obtain the catkeys from <ud>/Savedkeys/adutext/rptid and save them to a randomnumber_catkeys file
- sort and uniq the randomnumber_catkeys file
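The from/until procedure above can be sketched as two small functions working on in-memory data. Several details are assumptions: the Lastruns lines are taken to be `rptid|date|status` with ISO dates, the date comparison is inclusive (as OAI-PMH defines from and until), and both function names are hypothetical.

```python
from datetime import date


def report_ids_in_range(lastruns_lines, start, end):
    """Pick the report ids whose run date falls within [start, end]."""
    ids = []
    for line in lastruns_lines:
        line = line.strip()
        if not line:
            continue
        rptid, run_date, _status = line.split("|")
        if start <= date.fromisoformat(run_date) <= end:
            ids.append(rptid)
    return ids


def merge_catkeys(saved_keys, report_ids):
    """Merge the catkeys saved by each report: the 'sort and uniq' step."""
    merged = set()
    for rptid in report_ids:
        merged.update(saved_keys.get(rptid, []))
    return sorted(merged, key=int)  # numeric sort, duplicates already removed


# Invented sample data standing in for the Lastruns and Savedkeys files.
lastruns = ["r1|2006-07-30|OK", "r2|2006-08-02|OK", "r3|2006-08-20|OK", "r4|2006-09-01|OK"]
saved = {"r2": ["10", "2"], "r3": ["2", "7"]}
ids = report_ids_in_range(lastruns, date(2006, 8, 1), date(2006, 8, 31))
print(ids, merge_catkeys(saved, ids))
```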
20. OAI frontend for Unicorn
- Limitations of the implementation
- ListRecords/ListIdentifiers
- The from and until parameters are not permitted if the set parameter is given on the request
- The from and until parameters are permitted if the set parameter is not given on the request, but their values must fall within a certain date range (currently set, arbitrarily, to between today minus 2 months and today)
- Deleted records
- Complete source code and documentation available on the API Repository (http://sirsiapi.org)
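The two ListRecords/ListIdentifiers restrictions on this slide amount to a small argument check. A sketch under stated assumptions: "2 months" is approximated as 61 days, dates are in YYYY-MM-DD form, and `check_selective_harvest` is a hypothetical name rather than a function of the actual frontend.

```python
from datetime import date, timedelta

WINDOW = timedelta(days=61)  # assumption: "2 months" approximated as 61 days


def check_selective_harvest(args, today):
    """Return 'badArgument' if from/until/set break the restrictions, else None."""
    has_dates = "from" in args or "until" in args
    # Restriction 1: from/until may not be combined with set.
    if "set" in args and has_dates:
        return "badArgument"
    # Restriction 2: from/until must lie between today minus the window and today.
    for key in ("from", "until"):
        if key in args:
            try:
                value = date.fromisoformat(args[key])
            except ValueError:
                return "badArgument"
            if not (today - WINDOW <= value <= today):
                return "badArgument"
    return None


today = date(2006, 9, 7)
print(check_selective_harvest({"from": "2006-08-01", "until": "2006-08-31"}, today))
print(check_selective_harvest({"set": "elper", "from": "2006-08-01"}, today))
```

Passing `today` in as a parameter keeps the check deterministic and easy to test.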
21. OAI frontend: use cases @ ULB
- Use case 1: Vlink, an OpenURL resolver system; joint project with Vrije Universiteit Brussel (VUB)
23. OAI frontend: use cases @ ULB
- Use case 1: Vlink, an OpenURL resolver system
- OpenURL sent from iLink
- http://bibdev.vub.ac.be/cgi-bin/openurlulb?sid=ULB:Webcat&id=oai:ulbcat:617924
- This OpenURL does not contain enough metadata for the specific item => Vlink does a fetch back to Unicorn through an OAI GetRecord request to obtain a full MARC21 bibliographic description
- http://www.cible.ulb.ac.be/cgi-bin/OAI20/catalog?verb=GetRecord&identifier=oai:ulbcat:617924&metadataPrefix=marc21
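The fetch-back step can be illustrated as a URL-to-URL transformation: read the id from the incoming OpenURL and turn it into a GetRecord request. This is a sketch of the idea, not Vlink's code; the exact punctuation of the OpenURL (`sid=...&id=oai:ulbcat:...`) is an assumption, and `fetch_back_url` is a hypothetical name.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

OAI_BASE = "http://www.cible.ulb.ac.be/cgi-bin/OAI20/catalog"


def fetch_back_url(openurl):
    """Derive the OAI GetRecord URL for the item identified in an OpenURL."""
    query = parse_qs(urlsplit(openurl).query)
    identifiers = query.get("id", [])
    if not identifiers:
        return None  # nothing to fetch back
    return OAI_BASE + "?" + urlencode({
        "verb": "GetRecord",
        "identifier": identifiers[0],
        "metadataPrefix": "marc21",  # full MARC21 description, as on the slide
    })


# Assumed form of the OpenURL sent from iLink.
print(fetch_back_url(
    "http://bibdev.vub.ac.be/cgi-bin/openurlulb?sid=ULB:Webcat&id=oai:ulbcat:617924"))
```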
24. OAI frontend: use cases @ ULB
- Use case 1: Vlink, an OpenURL resolver system
- Feed the Vlink Knowledge Base through OAI harvesting
25. OAI frontend: use cases @ ULB
- Use case 2: Unicat, the Virtual Union Catalog of Belgium
26. SRU: the protocol
[Diagram: the client system sends an SRU request over HTTP to the SRU frontend (web server) on the Unicorn server, which reads the catalogue database (records and indexes) and returns an SRU response in XML over HTTP.]
- Communication protocol: HTTP
- Information exchange protocol: SRU
- Returned record structure: XML
27. SRU: the protocol
- A client searches and retrieves metadata records from an archive
- Stateless protocol: a sequence of SRU requests/responses over HTTP
- Search and Retrieve (<-> OAI: harvesting)
28. SRU: the protocol
- SRU requests
- HTTP GET requests
- Syntax
- Base URL
- host, port and path of the SRU request handler
- key=value pairs
- 3 possible requests (operations)
- explain
- describes the facilities available at an SRU server
- used by clients to self-configure
- the returned explain record is in XML and follows the ZeeRex Schema
- Example: http://z3950.loc.gov:7090/voyager?version=1.1&operation=explain
- scan
- allows the client to request a range of the available terms at a given point within a list of indexed terms
- enables clients to present an ordered list of values and, if supported, how many hits there would be for a search on each term
- searchRetrieve
29. SRU: the protocol
- searchRetrieve operation
- searchRetrieve (principal) parameters
- version: version of the request; the current protocol version is 1.1
- query: query expressed in CQL
- startRecord: position, within the sequence of matched records, of the first record to be returned
- maximumRecords: number of records requested to be returned
- recordSchema: schema requested for the records to be returned
- stylesheet: URL of an XML stylesheet. The client requests that the server simply return this URL in the response.
- CQL
- "Traditionally, query languages have fallen into two camps: powerful, expressive languages, not easily readable nor writable by non-experts (e.g. SQL, PQF, and XQuery); or simple and intuitive languages not powerful enough to express complex concepts (e.g. CCL and Google). CQL tries to combine simplicity and intuitiveness of expression for simple, everyday queries, with the richness of more expressive languages to accommodate complex concepts when necessary." (http://www.loc.gov/standards/sru/cql)
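The searchRetrieve parameters listed above combine into a request URL as in the following sketch. The function name `search_retrieve_url` and its defaults are hypothetical; the point it illustrates is that the CQL query (with its spaces, quotes and equals signs) has to be percent-encoded before it goes into the URL.

```python
from urllib.parse import quote, urlencode


def search_retrieve_url(base_url, query, start_record=1, maximum_records=10,
                        record_schema=None, stylesheet=None):
    """Build an SRU 1.1 searchRetrieve URL from the principal parameters."""
    params = [
        ("version", "1.1"),
        ("operation", "searchRetrieve"),
        ("query", query),
        ("startRecord", start_record),
        ("maximumRecords", maximum_records),
    ]
    if record_schema:
        params.append(("recordSchema", record_schema))
    if stylesheet:
        params.append(("stylesheet", stylesheet))
    # quote_via=quote encodes spaces as %20 instead of '+'.
    return base_url + "?" + urlencode(params, quote_via=quote)


print(search_retrieve_url("http://bib49.ulb.ac.be:9000/Cible",
                          'title all "einstein albert"'))
```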
30. SRU: the protocol
- searchRetrieve operation
- Examples of CQL queries
- dinosaur
- title = "complete dinosaur"
- title exact "the complete dinosaur"
- dinosaur not reptile
- dinosaur and bird or dinobird
- publicationYear < 1980
- title all "complete dinosaur"
- the title contains all of the words "complete" and "dinosaur"
- title any "dinosaur bird reptile"
- the title contains any of the words "dinosaur", "bird" or "reptile"
- ribs prox/distance<=5 chevrons
- a more specific proximity query: "ribs" within 5 words of "chevrons"
31. SRU: the protocol
- searchRetrieve operation: examples
- http://bib49.ulb.ac.be:9000/Cible?version=1.1&operation=searchRetrieve&query=author=einstein
- http://bib49.ulb.ac.be:9000/Cible?version=1.1&operation=searchRetrieve&maximumRecords=10&startRecord=1&query=author=einstein
- http://bib49.ulb.ac.be:9000/Cible?version=1.1&operation=searchRetrieve&maximumRecords=10&startRecord=1&query=author=einstein&recordSchema=dc
- http://bib49.ulb.ac.be:9000/Cible?version=1.1&operation=searchRetrieve&maximumRecords=10&startRecord=1&query=author all "einstein albert"
- http://bib49.ulb.ac.be:9000/Cible?version=1.1&operation=searchRetrieve&maximumRecords=10&startRecord=1&query=title all "einstein albert"
- http://bib49.ulb.ac.be:9000/Cible?version=1.1&operation=searchRetrieve&maximumRecords=10&startRecord=1&query=title all "einstein albert"&stylesheet=http://bib49.ulb.ac.be/cibleCanevas.xsl
- http://bib49.ulb.ac.be:9000/Cible?version=1.1&operation=searchRetrieve&maximumRecords=10&startRecord=1&query=title all "einstein albert"&stylesheet=http://bib49.ulb.ac.be/cibleTypo3.xsl
32. SRU frontend for Unicorn
[Diagram: the client system sends an SRU request over HTTP to an SRU frontend (web server) on the Unicorn server, which reads the catalogue database (records and indexes) and returns an SRU response in XML over HTTP.]
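33. SRU frontend for Unicorn
[Diagram: the client system sends an SRU request over HTTP to an SRU/Z39.50 gateway (web server). The gateway forwards it as a Z39.50 request to the Z39.50 frontend on the Unicorn server, which reads the catalogue database (records and indexes). The Z39.50 response travels back to the gateway, which returns an SRU response in XML over HTTP.]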
34. SRU frontend for Unicorn
- SRU/Z39.50 gateway: YAZ Proxy (Index Data)
- Implemented at ULB in 7/2006 (2 days)
- config.xml
- <target name="cible" default="1">
- <url>bib7.ulb.ac.be:2200</url>
- <xi:include href="explain.xml"/>
- <cql2rpn>pqf.properties</cql2rpn>
- </target>
- <target name="slavko" default="1">
- <url>velma.library.mun.ca:2200</url>
- <xi:include href="explain.slavko.xml"/>
- <cql2rpn>pqf.slavko.properties</cql2rpn>
- </target>
- explain.xml
- ZeeRex XML record returned in response to the explain operation
- pqf.properties
- specifies the mapping of various CQL indexes, relations, etc. into Type-1 query attributes
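To give a feel for what the CQL-to-Type-1 mapping does, here is a toy translator from very simple CQL queries to PQF (the textual form of Type-1 queries used by YAZ). It is not how YAZ Proxy works internally and does not read pqf.properties: the index table is an assumption that uses the standard Bib-1 use attributes for title (4) and author (1003), only the `=`, `all` and `any` relations are handled, and `simple_cql_to_pqf` is a hypothetical name.

```python
import shlex

# Assumed index mapping: CQL index name -> Bib-1 use attribute.
USE_ATTRIBUTES = {"title": "1=4", "author": "1=1003"}


def simple_cql_to_pqf(cql):
    """Translate 'index relation term' into PQF; raise ValueError otherwise."""
    # Put spaces around '=' so that 'author=einstein' splits into three tokens.
    tokens = shlex.split(cql.replace("=", " = "))
    if len(tokens) != 3:
        raise ValueError("only 'index relation term' queries are supported")
    index, relation, term = tokens
    if index not in USE_ATTRIBUTES or relation not in ("=", "all", "any"):
        raise ValueError("unsupported index or relation")
    attr = "@attr " + USE_ATTRIBUTES[index]
    words = term.split()
    if relation == "=" or len(words) == 1:
        return f'{attr} "{term}"'
    # 'all' joins the words with @and, 'any' with @or (prefix notation).
    operator = "@and" if relation == "all" else "@or"
    pqf = f'{attr} "{words[-1]}"'
    for word in reversed(words[:-1]):
        pqf = f'{operator} {attr} "{word}" {pqf}'
    return pqf


print(simple_cql_to_pqf('title all "einstein albert"'))
print(simple_cql_to_pqf("author=einstein"))
```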
35. SRU frontend for Unicorn
- YAZ Proxy
- http://bib49.ulb.ac.be:9000/Cible?version=1.1&operation=searchRetrieve&maximumRecords=10&startRecord=1&query=title all "einstein albert"&stylesheet=http://bib49.ulb.ac.be/cibleTypo3.xsl
- http://bib49.ulb.ac.be:9000/Slavko?version=1.1&operation=searchRetrieve&maximumRecords=10&startRecord=1&query=title all "einstein albert"&stylesheet=http://bib49.ulb.ac.be/cibleTypo3.xsl
36. SRU frontend: use case @ ULB
- Seamless integration of catalog searches in a CMS
- Typo3
- Example
- HTML page containing the biography of the famous Belgian historian Henri Pirenne
- frame pointing to the following URL
- http://bib49.ulb.ac.be:9000/Cible?version=1.1&operation=searchRetrieve&maximumRecords=10&startRecord=1&query=pirenne%20and%20epub-dnu-&stylesheet=http://bib49.ulb.ac.be/cibleTypo3.xsl
- Project
- Unicorn contains descriptions of databases, websites, etc. with local thematic classification codes in field 653
- create thematic websites within our CMS, containing frames that list the available databases per theme