A Radioactive Metadata Record Approach for Interoperability Testing: Use of Special Diagnostic Records in the Context of Z39.50 and Online Library Catalogs

1
A Radioactive Metadata Record Approach for
Interoperability Testing: Use of Special
Diagnostic Records in the Context of Z39.50 and
Online Library Catalogs
Coalition for Networked Information Task Force
Meeting, April 2005, Washington, DC
  • William E. Moen <wemoen@unt.edu>
    School of Library and Information Sciences
    Texas Center for Digital Knowledge
    University of North Texas
    Denton, TX 72603

2
Overview
  • Interoperability
  • Radioactive MARC records and their power
  • MARC content designation utilization

3
IMLS funded projects
  • Z39.50 Interoperability Testbed, Phases 1 & 2
  • Improve Z39.50 semantic interoperability among
    libraries for information access and resource
    sharing
  • Establish and operate a testbed for interop
    testing of Z39.50 clients and servers with
    library catalogs (Phase 1)
  • Explore alternative approach using radioactive
    MARC records (Phase 2)
  • MARC Content Designation Utilization
  • Provide empirical evidence of MARC content
    designation use
  • Explore the evolution of MARC content designation
  • Develop methodological approach to understand the
    factors contributing to current levels of MARC
    content designation use

4
Factors affecting interoperability
  • Multiple and disparate systems
  • Information retrieval systems, search
    functionality, etc.
  • Multiple protocols
  • Z39.50, HTTP, SOAP, SRW/U, etc.
  • Multiple data formats, syntax, metadata schemes
  • MARC 21, UNIMARC, XML, ISBD/AACR2-based, Dublin
    Core
  • Multiple vocabularies, ontologies, disciplines
  • LCSH, MeSH, AAT
  • Multiple languages, multiple character sets
  • Indexing, word normalization, and word extraction
    policies

5
Levels of Z39.50 interoperability
  • Low-level protocol (syntactic)
  • Do Z-clients and Z-servers exchange protocol
    messages according to the standard?
  • High-level protocol (functional)
  • Do Z-clients and Z-servers support appropriate
    Z39.50 services for user tasks?
  • Semantic level
  • Can Z-clients and Z-servers and local IR systems
    preserve and act on meaning of IR tasks?
  • User Task level
  • Do systems support IR tasks of one or more user
    groups?

6
Threats to interoperability
  • Differences in implementation of the standard
  • Differences in local information retrieval
    systems
  • Search functionality
  • Indexing policies
  • These threats can be addressed by:
  • Z39.50 specifications and configuration (e.g.,
    profiles)
  • Enhancing local information retrieval systems
  • Recommendations for local indexing decisions

7
Z-Interop Phase 1
  • Test dataset
  • 400,000 MARC 21 records from OCLC
  • Z39.50 reference implementations
  • Z-client, Z-server, information retrieval system
  • Configured to the profile specifications
  • Test scenarios & searches
  • Searches with known result records from dataset
  • Benchmarks
  • Results of test searches against reference
    implementations

FOR MORE INFORMATION, VISIT THE PROJECT WEBSITE
http://www.unt.edu/zinterop/
8
Phase 1 interop testing
[Diagram labels]
  • Test dataset: loaded by vendor or library;
    indexed by vendor according to vendor's
    specifications
  • Reference Z39.50 client: configured to support
    profile specifications
  • Vendor Z39.50 server: configured by vendor for
    conformance to profile
  • Test searches
  • Retrieval results compared to retrieval
    benchmarks

9
Z-Interop Phase 2
  • Specially designed MARC records: Radioactive
    MARC records
  • Concept coined by Sebastian Hammer, Index Data
  • Records will be publicly available, possibly
    through OCLC
  • A set of test searches and an automatic testing
    script that issues searches, retrieves records,
    and produces reports on the search and retrieval
    results
  • Developed by Index Data
  • Will be released under GPL
  • A database of MARC documentation that enables the
    automatic identification of types of searches to
    issue
  • Developed by UNT
  • MARCdocs Database

11
Radioactive MARC records
  • Specially designed diagnostic records
  • Legitimate instance of MARC record structure
  • Fields/subfields contain content-rich tokens
  • A token is a string of characters that has a
    specific structure and semantics that will serve
    as words or other data values in specific
    fields/subfields.
  • Multiple sets of RadMARC records, distinguished
    by the amount of content designation populated

12
Structure of RadMARC tokens
  • A single alpha character for left-hand padding
  • Value: r
  • A single alpha character to indicate the format
    of the material being described or type of record
  • Value: Selected values as defined in MARC
    Leader/06 (Type of Record) or Leader/07
    (Bibliographic Level)
  • Three digits indicating the Field Tag
  • Value: Defined in MARC 21 specifications
  • A single integer to indicate the occurrence
    number of the Field Tag
  • Value: Sequential number starting with 1
  • A single alpha character to indicate the
    Subfield Code
  • Value: Defined in MARC 21 specifications
  • A single integer indicating the offset within
    the subfield
  • Value: Use the following scheme: 1 = first token
    in subfield, 2 = second token in subfield,
    3 = third token in subfield, etc.
  • A single alpha character for right-hand padding
  • Value: r
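
The seven-part structure above can be sketched as a small token builder. This is only an illustration, not the project's own code: the function name `build_token`, the restriction to lowercase letters, and the limit of one digit per counter are assumptions read off the slide.

```python
import re

PAD = "r"  # padding character used on both ends, per the slide


def build_token(record_type: str, tag: str, occurrence: int,
                subfield: str, offset: int) -> str:
    """Assemble a RadMARC-style token from its five inner components."""
    if not re.fullmatch(r"[a-z]", record_type):
        raise ValueError("record type must be a single lowercase letter")
    if not re.fullmatch(r"[0-9]{3}", tag):
        raise ValueError("field tag must be exactly three digits")
    if not re.fullmatch(r"[a-z]", subfield):
        raise ValueError("subfield code must be a single lowercase letter")
    for name, value in (("occurrence", occurrence), ("offset", offset)):
        # Assumption: the slide allots one digit to each counter, from 1.
        if not 1 <= value <= 9:
            raise ValueError(f"{name} must be between 1 and 9")
    return f"{PAD}{record_type}{tag}{occurrence}{subfield}{offset}{PAD}"


# Tokens for the first two positions of the first 245 $a in a books record.
title_tokens = [build_token("a", "245", 1, "a", n) for n in (1, 2)]
print(" ".join(title_tokens))
```

Because every token encodes its own field, occurrence, subfield, and position, a record found by searching on a token shows directly which part of the record the target system indexed.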

13
Token example
  • ra2451a1r
  • r - Left-hand padding
  • a - Type of record -- this is a books type record
  • 245 - Field tag
  • 1 - First occurrence of field in record
  • a - Subfield code
  • 1 - Offset within subfield, where 1 = first token
    in subfield
  • r - Right-hand padding
  • RadMARC example record
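
Reading a token back into its parts follows the same breakdown. The sketch below is a hypothetical helper (`parse_token` and `TokenParts` are illustrative names, not part of the RadioMARC tooling) that assumes lowercase letters and single-digit counters.

```python
import re
from typing import NamedTuple, Optional

# One named group per component listed in the token example.
TOKEN_RE = re.compile(
    r"r(?P<record_type>[a-z])(?P<tag>[0-9]{3})(?P<occurrence>[0-9])"
    r"(?P<subfield>[a-z])(?P<offset>[0-9])r"
)


class TokenParts(NamedTuple):
    record_type: str
    tag: str
    occurrence: int
    subfield: str
    offset: int


def parse_token(token: str) -> Optional[TokenParts]:
    """Return the decoded parts, or None if the string is not a token."""
    match = TOKEN_RE.fullmatch(token)
    if match is None:
        return None
    return TokenParts(
        match["record_type"],
        match["tag"],
        int(match["occurrence"]),
        match["subfield"],
        int(match["offset"]),
    )


print(parse_token("ra2451a1r"))
```

For `ra2451a1r` this yields record type `a`, tag `245`, occurrence 1, subfield `a`, offset 1, matching the breakdown on the slide.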

14
Test scripts
  • Automate interoperability testing and reporting
  • Test searches defined by Bath Profile and US
    National Z39.50 Profile
  • RadioMARC Perl module
  • Automatically generates Z39.50 queries with
    tokens as search terms
  • Sends searches to target servers known to contain
    copies of specific records
  • Generates reports based on whether the
    expected records are present in the result set
  • Sample output of testing
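
A minimal sketch of that test loop, written in Python rather than Perl and not taken from the RadioMARC module. The query is expressed in YAZ prefix query format (PQF); the attribute values are an assumption based on the Bath Profile's title keyword search and should be checked against the profile. The `search` callable, `fake_search`, and the record identifier are hypothetical stand-ins for a real Z39.50 client and target.

```python
from typing import Callable, Dict, Iterable, List

# Bib-1 attribute type -> value for a title keyword search (assumed values):
# use=4, relation=3, position=3, structure=2, truncation=100, completeness=1
TITLE_KEYWORD: Dict[int, int] = {1: 4, 2: 3, 3: 3, 4: 2, 5: 100, 6: 1}


def pqf_query(attributes: Dict[int, int], term: str) -> str:
    """Render an attribute set and a search term as a PQF string."""
    attrs = " ".join(f"@attr {t}={v}" for t, v in sorted(attributes.items()))
    return f'{attrs} "{term}"'


def run_test(search: Callable[[str], Iterable[str]], term: str,
             expected_ids: List[str]) -> dict:
    """Issue one search and report whether the expected records came back."""
    query = pqf_query(TITLE_KEYWORD, term)
    returned = set(search(query))
    missing = sorted(set(expected_ids) - returned)
    return {"query": query, "passed": not missing, "missing": missing}


def fake_search(query: str) -> List[str]:
    # Stand-in for a Z39.50 target; always returns one record identifier.
    return ["radmarc-0001"]


report = run_test(fake_search, "ra2451a1r", ["radmarc-0001"])
print(report)
```

Keeping the network call behind a plain callable lets the same report logic run against any client library or a canned result set.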

15
MARCdocs database
  • Pilot effort aimed at structuring MARC 21
    documentation into a relational database
  • Stores information about all content designation
    available in the MARC 21 Format for Bibliographic
    Data specifications
  • Stores additional information about
    profile-defined searches necessary to the
    automatic test scripts
  • Implementation uses MySQL and PHP
  • Example display from MARCdocs
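
One way to picture how such a database could drive the test scripts is a pair of tables: one for content designation, one linking profile-defined searches to the fields they should reach. The schema, table names, and sample rows below are entirely hypothetical (the actual MARCdocs schema is not described here), and SQLite stands in for MySQL so the sketch runs on its own.

```python
import sqlite3
from typing import List


def build_demo_db() -> sqlite3.Connection:
    """Create an in-memory database with a hypothetical two-table layout."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE content_designation (
            tag TEXT NOT NULL,
            subfield TEXT NOT NULL,
            name TEXT NOT NULL,
            PRIMARY KEY (tag, subfield)
        );
        CREATE TABLE profile_search (
            search_name TEXT NOT NULL,
            use_attribute INTEGER NOT NULL,
            tag TEXT NOT NULL,
            subfield TEXT NOT NULL,
            FOREIGN KEY (tag, subfield)
                REFERENCES content_designation (tag, subfield)
        );
    """)
    conn.executemany(
        "INSERT INTO content_designation VALUES (?, ?, ?)",
        [("245", "a", "Title"),
         ("100", "a", "Personal name"),
         ("650", "a", "Topical term or geographic name entry element")],
    )
    # Illustrative mapping of searches (with Bib-1 use attributes) to fields.
    conn.executemany(
        "INSERT INTO profile_search VALUES (?, ?, ?, ?)",
        [("title keyword", 4, "245", "a"),
         ("author keyword", 1003, "100", "a"),
         ("subject keyword", 21, "650", "a")],
    )
    return conn


def searches_for(conn: sqlite3.Connection, tag: str, subfield: str) -> List[str]:
    """List the profile searches expected to retrieve a given field/subfield."""
    rows = conn.execute(
        "SELECT search_name FROM profile_search "
        "WHERE tag = ? AND subfield = ? ORDER BY search_name",
        (tag, subfield),
    )
    return [row[0] for row in rows]


conn = build_demo_db()
print(searches_for(conn, "245", "a"))
```

Given a token's tag and subfield, a lookup like this identifies which types of searches to issue for it.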

16
Question space for Z-Interop2
  • Profile conformance level: Addresses the
    interoperability between the Z-client and
    Z-server
  • Information retrieval (IR) system level:
    Addresses the capability of the IR system
    underlying the online catalog application
  • Metadata record level: Concerned with how the IR
    system indexes fields in the metadata record
  • Data content level: Addresses normalization of
    data, hyphenated words, special characters and
    diacritics, etc.

17
RadMARC record sets
  • What content designation should be populated in
    RadMARC records to support interoperability
    testing?
  • MARC 21 defines approximately 2,000 structures
    for holding data
  • Z-Interop2 approach
  • Develop multiple RadMARC record sets
  • Increasing amount of content designation
    populated
  • Informed by MARC content designation analysis

18
Z-Interop test dataset
  • Approximately 1% sample of MARC records from
    OCLC's WorldCat database
  • 419,657 total MARC records
  • 89% of records are full-level cataloging
  • Formats represented in test dataset:
  • Books: 91%
  • Cartographic Materials: < 1%
  • Electronic resources: < 1%
  • Archival/Mixed Materials: < 1%
  • Sound recordings: 4%
  • Visual Materials: 1%
  • Serials: 3%

19
MARC 21 content designation
MARC 21 Field Groups  Currently Defined  Obsolete  Total  MARC 1972 (Books Format Only)
00x                   6                  1         7      3
0xx                   238                7         245    28
1xx                   66                 1         67     40
2xx                   137                32        169    15
3xx                   109                32        141    4
4xx                   69                 0         69     37
5xx                   323                38        361    8
6xx                   184                5         189    66
7xx                   452                47        499    41
8xx                   141                20        161    36
TOTAL                 1725               183       1908   278
20
Fields populated in Z-Interop dataset
MARC 21 Field Groups  Currently Defined  Obsolete  Unlikely Used  Total
00x                   6                  0         0              6
0xx                   96                 1         33             130
1xx                   49                 0         2              51
2xx                   81                 0         19             100
3xx                   23                 6         0              29
4xx                   10                 0         30             40
5xx                   128                1         3              132
6xx                   104                1         7              112
7xx                   205                0         5              210
8xx                   105                3         8              116
TOTAL                 807                12        107            926
21
Occurrence summary
Total number of fields/subfields occurring in
dataset: 13,849,499
Frequency          # of Fields/Subfields  % of All Occurrences
> 600,000          1                      4.4
500,000 - 599,999  0                      0
400,000 - 499,999  13                     39.9
300,000 - 399,999  6                      14.3
200,000 - 299,999  6                      10.6
100,000 - 199,999  10                     10.3
TOTAL              36                     79.5
Only 4% of all fields/subfields account for 80%
of all occurrences; or, 96% of all fields/subfields
account for 20% of all occurrences
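
The rounded 4% and 80% figures can be rechecked from the frequency bands. This is a short arithmetic sketch: the band values are copied from the table above, and the denominator of 926 is an assumption taken from the total on the "Fields populated in Z-Interop dataset" slide.

```python
# (frequency band, number of fields/subfields, percent of all occurrences)
BANDS = [
    ("> 600,000", 1, 4.4),
    ("500,000 - 599,999", 0, 0.0),
    ("400,000 - 499,999", 13, 39.9),
    ("300,000 - 399,999", 6, 14.3),
    ("200,000 - 299,999", 6, 10.6),
    ("100,000 - 199,999", 10, 10.3),
]
FIELDS_IN_DATASET = 926  # assumed denominator: fields/subfields in the dataset

top_fields = sum(count for _, count, _ in BANDS)
top_share = round(sum(pct for _, _, pct in BANDS), 1)
field_fraction = round(100 * top_fields / FIELDS_IN_DATASET, 1)

print(f"{top_fields} fields/subfields = {field_fraction}% of those in the dataset")
print(f"They account for {top_share}% of all occurrences")
```

So 36 of 926 fields/subfields, about 3.9%, carry 79.5% of all occurrences, which the slide rounds to 4% and 80%.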
22
Characteristics of top 36
  • Most frequently occurring: 650 $a (subject data)
  • 2nd most frequently occurring: 040 $d (cataloging
    source)
  • 3rd & 4th most frequently occurring: 260 $a and $b
    (publication information)
  • 5th most frequently occurring: 245 $a (title)
  • Contain data useful to end users: 28
  • Contain control numbers, etc.: 5
  • Contain data useful to catalogers: 3

23
Indexing MARC
  • Indexing Guidelines to Support Z39.50 Profile
    Searches (available on Z-Interop website)
  • Identified all MARC 21 fields/subfields that can
    contain author, title, or subject data
  • Author-related fields/subfields: 119
  • Author/Title-related fields/subfields: 21
  • Title-related fields/subfields: 253
  • Subject-related fields/subfields: 144

24
Occurrences in test dataset
  • 537 fields/subfields can contain author, title,
    or subject data
  • 381 of these actually occur in Z-Interop dataset
  • Total occurrences of the 381: 4,397,712
  • 19 of the 381 (5%) account for 80% of all
    occurrences
  • 9 of 19 are subject-related
  • 5 of 19 are author-related
  • 5 of 19 are title-related
  • Preliminary testing using only 19 indexed fields:
  • 95% - 100% of correct records retrieved!
  • The 19 fields/subfields
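
Selecting the smallest group of fields/subfields that reaches a given share of occurrences can be sketched as a cumulative count over fields sorted by frequency. The function name and the sample counts below are invented for illustration; they are not figures from the Z-Interop dataset.

```python
from typing import Dict, List


def fields_for_coverage(counts: Dict[str, int], target: float = 0.80) -> List[str]:
    """Return the most frequent fields whose combined share reaches target."""
    total = sum(counts.values())
    chosen: List[str] = []
    covered = 0
    for field, n in sorted(counts.items(), key=lambda kv: kv[1], reverse=True):
        if covered >= target * total:
            break
        chosen.append(field)
        covered += n
    return chosen


# Made-up occurrence counts, for illustration only.
sample_counts = {
    "650$a": 500,
    "245$a": 300,
    "100$a": 150,
    "700$a": 40,
    "246$a": 10,
}
print(fields_for_coverage(sample_counts))
```

With these sample counts, two of five fields already cover 80% of occurrences, the same kind of concentration the slide reports for 19 of 381.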

25
MCDU Project
  • Systematically analyze WorldCat records
  • Provide empirical evidence of catalogers' use of
    MARC content designation
  • Contribute to community discussion about core
    elements in MARC bibliographic records based on
    empirical evidence of actual use
  • Inform future sets of RadMARC records

FOR MORE INFORMATION, VISIT THE PROJECT WEBSITE
http://www.mcdu.unt.edu
26
Initial RadMARC sets
  • Set 1
  • 10 records
  • Populate 19 most frequently occurring Author,
    Title, Subject fields
  • Distinguished by types of materials cataloged
  • Set 2
  • 4 records (100, 110, 111, 130 main entry fields)
  • Populate the Author, Title, Subject fields
    occurring 1000 or more times (approximately 50
    fields/subfields populated)
  • Set 3
  • Records populated based on:
  • The LC Network Development and MARC Standards
    Office recommendations for national level records
  • The Program for Cooperative Cataloging (2003)
    core record standards.

27
Extensibility of RadMARC
  • Records can be as simple or as complex as needed
  • Custom records for a library that wants specific
    assessment of indexing or other policies to
    interrogate system behavior
  • Assess normalization of characters
  • Testing transformation from one metadata scheme
    to another
  • MARC Record
  • MARCXML Transformation
  • MODS Transformation
  • DC Transformation
  • Other metadata environments?
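
Because each token names its source field, tokens can also be traced through a metadata transformation to see what a crosswalk drops. The sketch below is an assumption-laden illustration: the three-entry `CROSSWALK` is a deliberately tiny, hypothetical MARC-to-Dublin-Core mapping, and the helper names are invented.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List, Set, Tuple

MARC_NS = "http://www.loc.gov/MARC21/slim"
DC_NS = "http://purl.org/dc/elements/1.1/"

# Hypothetical, deliberately incomplete crosswalk: (tag, subfield) -> DC element
CROSSWALK: Dict[Tuple[str, str], str] = {
    ("245", "a"): "title",
    ("100", "a"): "creator",
    ("650", "a"): "subject",
}


def make_record(fields: List[Tuple[str, str, str]]) -> ET.Element:
    """Build a minimal MARCXML record from (tag, subfield code, text) triples."""
    record = ET.Element(f"{{{MARC_NS}}}record")
    for tag, code, text in fields:
        datafield = ET.SubElement(
            record, f"{{{MARC_NS}}}datafield",
            {"tag": tag, "ind1": " ", "ind2": " "},
        )
        ET.SubElement(datafield, f"{{{MARC_NS}}}subfield", {"code": code}).text = text
    return record


def marcxml_to_dc(record: ET.Element) -> ET.Element:
    """Apply the crosswalk, copying mapped subfields into DC elements."""
    dc = ET.Element("metadata")
    for field in record.findall(f"{{{MARC_NS}}}datafield"):
        for sub in field.findall(f"{{{MARC_NS}}}subfield"):
            target = CROSSWALK.get((field.get("tag"), sub.get("code")))
            if target:
                ET.SubElement(dc, f"{{{DC_NS}}}{target}").text = sub.text
    return dc


def lost_tokens(record: ET.Element, dc: ET.Element) -> Set[str]:
    """Tokens present in the MARCXML source but absent from the DC output."""
    source = {tok for sub in record.iter(f"{{{MARC_NS}}}subfield")
              for tok in (sub.text or "").split()}
    kept = {tok for element in dc for tok in (element.text or "").split()}
    return source - kept


record = make_record([
    ("245", "a", "ra2451a1r ra2451a2r"),
    ("260", "b", "ra2601b1r"),
])
dc = marcxml_to_dc(record)
print(sorted(lost_tokens(record, dc)))
```

Here the 260 $b token does not survive, which shows that this crosswalk carries no publisher data.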

28
Concluding thoughts
  • Exploring an innovative conceptual and technical
    approach for interoperability testing.
  • Conducting a proof-of-concept for a radioactive
    record approach for diagnosing interoperability
    factors in an identified question space
  • Extensible in terms of the current focus
  • Creating different sets of RadMARC records to
    diagnose general or specific system and
    interoperability issues
  • Extensible to other application environments,
    metadata schemes, and protocols.

29
References
  • Z39.50 Interoperability Testbed
  • http://www.unt.edu/zinterop/
  • MARC Content Designation Utilization Project
  • http://www.mcdu.unt.edu/
  • Assessing Metadata Utilization: An Analysis of
    MARC Content Designation Use
  • http://www.unt.edu/wmoen/publications/MARCPaper_Final2003pdf.pdf
  • Indexing Guidelines to Support Z39.50 Profile
    Searches
  • http://www.unt.edu/zinterop/Documents/IndexingGuidelines1Feb2002.pdf
  • MARCdocs Database (public interface)
  • http://meta.lis.unt.edu/MARCdocs2