From Data to Discovery

Learn more at: https://igelu.org

1
From Data to Discovery
  • Building Automated Cataloguing Tools with Perl

Huw Jones Cambridge University Library
3
Cambridge
Small city, big University: lots of libraries!
8
Lots of libraries: lots of books
9
Bibliographic records
  • University Library 3.85 M
  • Other libraries 2.5 M
  • 8 databases

10
Data problems
  • Quality
  • Duplication

11
Quality - fullness
Of 2.5 M records in our databases, 1 M are short records
12
Quality - coding
13
Duplication
14
Effects
  • Difficulty in resource discovery
  • Patchy retrieval
  • Lack of authority control
  • Difficulty with standard deduplication
  • Burden on staff time
  • Ties us to multiple database model

15
Aims
  • Better records
  • Fewer records

16
Existing Solutions?
  • Manual recataloguing
  • Commercial solutions
  • Universal catalogue
  • Discovery layer
  • Either don't solve the core problem, or are
    expensive and/or time-consuming

17
Our solution
  • Automated Cataloguing Tools!
  • Short record enrichment
  • Automated MARC correction
  • Deduplication
  • Order is important: full, well-coded records are
    easier to deduplicate

18
General principles
  • Retrieve some records from a Voyager database
  • Examine and/or manipulate them
  • If necessary, make changes in the database
  • N.B. Watch indexes and table space!

19
General tools
  • Perl holds everything together
  • Perl DBI connects to databases
  • SQL retrieves records from database
  • MARC::Record modules (from CPAN) to
    examine/manipulate records
  • Pbulkimport/Batchcat to make changes to the
    database
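
The retrieve, examine, change cycle described on the last two slides can be sketched roughly as follows. The talk's own tools are Perl, DBI and SQL against Voyager; this illustration uses Python with an in-memory SQLite table instead. The table name `bib`, its columns, the `fix_title` rule and the sample rows are all hypothetical stand-ins, not Voyager's schema. The direct `UPDATE` only stands in for the step the talk performs through Pbulkimport or Batchcat.

```python
import sqlite3

def fix_title(title):
    """Example manipulation (hypothetical): collapse runs of whitespace."""
    return " ".join(title.split())

def process(conn):
    """Retrieve records, examine them, and write back only those that changed."""
    rows = conn.execute("SELECT bib_id, title FROM bib ORDER BY bib_id").fetchall()
    changed = []
    for bib_id, title in rows:
        fixed = fix_title(title)
        if fixed != title:  # only touch the database when necessary
            conn.execute("UPDATE bib SET title = ? WHERE bib_id = ?", (fixed, bib_id))
            changed.append(bib_id)
    conn.commit()
    return changed

# Hypothetical stand-in for a bibliographic table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE bib (bib_id INTEGER PRIMARY KEY, title TEXT)")
conn.executemany(
    "INSERT INTO bib VALUES (?, ?)",
    [(1, "How to build  a habitable planet"), (2, "Learning Perl")],
)
changed_ids = process(conn)
```

Writing back only the records that actually changed keeps the number of database updates small, which matters given the slide's warning about indexes and table space.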

20
Batchcat vs Pbulkimport
  • Batchcat: installed on PC with Voyager
  • More versatile
  • Can't be used on server
  • Pbulkimport: limited functionality
  • Needs Bibliographic Detection Profile and Bulk
    Import Rule (SYSADMIN)
  • Can be used on server

21
Books
  • Learning Perl / Randal L. Schwartz and Tom
    Phoenix. 3rd ed. (Sebastopol, Calif.: O'Reilly,
    2001). ISBN 0596001320
  • Programming the Perl DBI / Alligator Descartes
    and Tim Bunce. (Sebastopol, Calif.: O'Reilly,
    2000). ISBN 1565926994

22
Enriching short records
  • How to get from this

23
  • to this

24
Basic mechanism
  • Take short record
  • Find a matching full record
  • Overlay short record with full record
  • Need a source of full records
  • In Cambridge - University Library - large
    database of full, authority controlled records

25
  • File of SHORT RECORD bib ids
  • Connects to LOCAL database and checks that each
    is a valid bib id
  • Retrieves SHORT RECORD info from local database
  • Connects to EXTERNAL source. Finds best FULL
    RECORD match and scores it
  • Compares match score to overlay threshold. If OK,
    retrieves MARC record for FULL RECORD
  • Corrects FULL MARC record. Removes inappropriate
    fields. Inserts fields to be retained from SHORT
    RECORD
  • In local database, overlays SHORT RECORD with
    FULL RECORD
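
The matching and overlay steps above might look something like the sketch below. The slides do not say how matches are scored, so everything here is an assumption made for illustration: the scoring weights, the `OVERLAY_THRESHOLD` value, the dictionary field names, and the `local_note` field retained from the short record. It is written in Python rather than the Perl used in the talk.

```python
from difflib import SequenceMatcher

OVERLAY_THRESHOLD = 0.8  # hypothetical cut-off

def normalise(text):
    """Lower-case, drop punctuation, collapse whitespace."""
    kept = "".join(c for c in text.lower() if c.isalnum() or c.isspace())
    return " ".join(kept.split())

def match_score(short, full):
    """Score a candidate full record in [0, 1]; the weights are illustrative."""
    title = SequenceMatcher(
        None, normalise(short["title"]), normalise(full["title"])
    ).ratio()
    isbn = 1.0 if short.get("isbn") and short.get("isbn") == full.get("isbn") else 0.0
    year = 1.0 if short.get("year") and short.get("year") == full.get("year") else 0.0
    return 0.5 * title + 0.3 * isbn + 0.2 * year

def best_match(short, candidates):
    """Return (score, candidate) for the highest-scoring candidate."""
    best = (0.0, None)
    for candidate in candidates:
        score = match_score(short, candidate)
        if score > best[0]:
            best = (score, candidate)
    return best

def overlay(short, candidates, keep=("local_note",)):
    """Overlay the short record if the best match clears the threshold."""
    score, full = best_match(short, candidates)
    if full is None or score < OVERLAY_THRESHOLD:
        return short  # no safe match: leave the short record untouched
    merged = dict(full)
    merged["bib_id"] = short["bib_id"]  # keep the local identifier
    for field in keep:  # fields to be retained from the short record
        if field in short:
            merged[field] = short[field]
    return merged

short = {"bib_id": 662002, "title": "How to build a habitable planet",
         "isbn": "9780961751111", "year": "1985", "local_note": "Reading room"}
candidates = [
    {"title": "How to build a habitable planet", "isbn": "9780961751111",
     "year": "1985", "subjects": ["Astronomy.", "Astrophysics."]},
    {"title": "Learning Perl", "isbn": "0596001320", "year": "2001", "subjects": []},
]
result = overlay(short, candidates)
unmatched = overlay({"bib_id": 1, "title": "Programming the Perl DBI"}, candidates)
```

With these weights a title match alone can never reach the threshold, so a record with no ISBN or year is left as it is rather than overlaid on a guess.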
26
Output
27
Interface
28
Results
  • Service has been running for 1 year (much of
    which was testing)
  • 18 libraries subscribed to use service
  • 90,000 short records upgraded

29
MARC checking and correction
  • Bibliographic standard: agreed minimum standard
    for cataloguing
  • Every week, libraries receive an automatically
    generated file of MARC coding errors for
    correction
  • Based on MARC::Lint module with many alterations

30
Output
31
Mechanism
  • Connects to database using Perl DBI
  • Retrieves MARC records created/edited in the
    last week
  • Runs them through MARC check
  • Prints errors to file
  • Emails file to library
  • Over 100,000 errors pointed out so far!
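
As a rough illustration of the check-and-report steps above, the sketch below runs a couple of rules over records and groups the resulting error lines by library. The talk's check is a modified MARC::Lint in Perl; this is Python, and the two rules, the record layout and the library names are hypothetical simplifications. Retrieving the week's records and emailing the file are left out.

```python
from collections import defaultdict

def check_record(record):
    """Return a list of error strings for one record (illustrative rules only)."""
    errors = []
    tags = [tag for tag, _ in record["fields"]]
    if "245" not in tags:
        errors.append("245: No title field")
    for tag, value in record["fields"]:
        if tag in ("245", "260", "300") and not value.endswith("."):
            errors.append(f"{tag}: Should end with a full stop")
    return errors

def weekly_reports(records):
    """Group error lines by owning library, one list of lines per library."""
    reports = defaultdict(list)
    for record in records:
        for error in check_record(record):
            reports[record["library"]].append(f"Bib id {record['bib_id']}: {error}")
    return dict(reports)

records = [
    {"bib_id": 662002, "library": "Library A", "fields": [
        ("245", "How to build a habitable planet By Wallace S. Broecker."),
        ("260", "New York Eldigio Press, c1985"),
        ("300", "291p ill 23cm"),
    ]},
    {"bib_id": 700001, "library": "Library B", "fields": [
        ("245", "Learning Perl / Randal L. Schwartz and Tom Phoenix."),
    ]},
]
reports = weekly_reports(records)
```

A library whose records pass every rule gets no entry at all, so only libraries with something to correct would receive a file.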

32
MARC Correction
  • How to get from this
  • LDR 00472nam\\2200157\a\4500
  • 001 662002
  • 005 20071205064734.0
  • 008 071129s1985\\\\nyua\\\\\\\\\\001\0\eng\d
  • 020 \\ $a9780961751111
  • 100 1\ $aBroecker, W.S.,$d1931-
  • 245 10 $aHow to build a habitable planet $cBy
    Wallace S. Broecker.
  • 260 \\ $aNew York $bEldigio Press,$cc1985
  • 300 \\ $a291p $bill $c23cm
  • 504 \\ $aIncludes index.
  • 650 \0 $aAstronomy.
  • 650 \0 $aAstrophysics.

33
  • to this!
  • LDR 00453nam 2200157 a 4500
  • 001 662002
  • 005 20071205064734.0
  • 008 071129s1985\\\\nyua\\\\\\\\\\001\0\eng\d
  • 020 \\ $a9780961751111
  • 100 1\ $aBroecker, W. S.,$d1931-
  • 245 10 $aHow to build a habitable planet /$cby
    Wallace S. Broecker.
  • 260 \\ $aNew York :$bEldigio Press,$cc1985.
  • 300 \\ $a291 p. :$bill. ;$c23 cm.
  • 504 \\ $aIncludes index.
  • 650 \0 $aAstronomy.
  • 650 \0 $aAstrophysics.

34
MARC Correction
  • Version of module which, where there is no
    ambiguity, corrects errors
  • Built into short record upgrade program
  • Also offered as a retrospective service to clean
    up legacy records
  • Possibility of building it into weekly check

35
Mechanism
  • Connects to database using Perl DBI
  • Retrieves full MARC record
  • Runs against correction module
  • Replaces corrected record in database

36
Output
  • Bib id 662002
  • How to build a habitable planet By Wallace S.
    Broecker.
  • 100 UPDATE Spaces inserted between initials in
    subfield _a
  • 245 UPDATE "By" uncapitalised at start of
    subfield _c
  • 245 UPDATE Space forward slash inserted before
    subfield _c
  • 260 UPDATE Full stop inserted at end of field
  • 260 UPDATE Space colon inserted before subfield
    _b
  • 300 UPDATE Full stop inserted after the "p" in
    pagination
  • 300 UPDATE Full stop inserted at end of field
  • 300 UPDATE Illustration abbreviation has been
    corrected
  • 300 UPDATE Space colon inserted before subfield
    _b
  • 300 UPDATE Space inserted between digits and "cm"
  • 300 UPDATE Space inserted between digits and "p"
    in pagination
  • 300 UPDATE Space semi-colon inserted before
    subfield _c
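
A few of the 300-field updates listed above can be reproduced with simple pattern rules, as in the sketch below. This is not the talk's correction module, which is Perl built on MARC::Lint. It is a small Python illustration. The rule table, the function name and the wording of two of the messages are hypothetical, and a field is represented as a plain list of (subfield code, value) pairs.

```python
import re

# Per-subfield rules: (pattern, replacement, log message). Illustrative only.
RULES = {
    "a": [(r"^(\d+)\s*p\.?$", r"\1 p.", "Pagination normalised to 'N p.'")],
    "b": [(r"^ill\.?$", "ill.", "Illustration abbreviation has been corrected")],
    "c": [(r"^(\d+)\s*cm\.?$", r"\1 cm.", "Size normalised to 'N cm.'")],
}
# Punctuation that must close the preceding subfield.
PRECEDING = {"b": (" :", "Space colon"), "c": (" ;", "Space semi-colon")}

def correct_300(subfields):
    """Apply unambiguous fixes to a 300 field; return (subfields, messages)."""
    fixed, messages = [], []
    for code, value in subfields:
        for pattern, replacement, message in RULES.get(code, []):
            new_value = re.sub(pattern, replacement, value)
            if new_value != value:
                messages.append(f"300 UPDATE {message}")
                value = new_value
        fixed.append((code, value))
    for i in range(1, len(fixed)):
        code = fixed[i][0]
        if code in PRECEDING:
            mark, name = PRECEDING[code]
            prev_code, prev_value = fixed[i - 1]
            if not prev_value.endswith(mark):
                fixed[i - 1] = (prev_code, prev_value + mark)
                messages.append(f"300 UPDATE {name} inserted before subfield _{code}")
    return fixed, messages

before = [("a", "291p"), ("b", "ill"), ("c", "23cm")]
after, log = correct_300(before)
again, log_again = correct_300(after)
```

Each pattern is anchored to the whole subfield value, so anything that does not fit the expected shape is left alone. That is one way to honour the principle of correcting only where there is no ambiguity, and it also means a second pass over an already corrected field changes nothing.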

37
Results
  • In testing 70,000 records processed
  • Corrected over 200,000 MARC coding errors
  • May run ALL our existing records through at some
    stage

38
Deduplication: in progress!
  • Three stages
  • Identification of groups of duplicates
  • Identification/construction of best record
  • Deletion of other records, relinking of
    holdings/items/Purchase Orders to best record

39
Identification of duplicates
  • Connect to a database with Perl DBI
  • Use SQL to retrieve records
  • For each record, retrieve all available data from
    tables
  • Use matching algorithm to identify groups of
    duplicates
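
The slides do not describe the matching algorithm itself. One simple possibility, shown here purely as an assumption, is to reduce each record to a normalised match key and group records that share it. The key used below (normalised title plus year) and the record layout are hypothetical, and the sketch is in Python rather than the talk's Perl.

```python
import re
from collections import defaultdict

def match_key(record):
    """Hypothetical match key: normalised title plus publication year."""
    title = re.sub(r"[^a-z0-9 ]", "", record["title"].lower())
    title = " ".join(title.split())
    return (title, record.get("year", ""))

def duplicate_groups(records):
    """Return lists of bib ids that share a match key (groups of 2 or more)."""
    groups = defaultdict(list)
    for record in records:
        groups[match_key(record)].append(record["bib_id"])
    return [sorted(ids) for ids in groups.values() if len(ids) > 1]

records = [
    {"bib_id": 101, "title": "How to build a habitable planet", "year": "1985"},
    {"bib_id": 205, "title": "How to Build a Habitable Planet.", "year": "1985"},
    {"bib_id": 309, "title": "Learning Perl", "year": "2001"},
]
groups = duplicate_groups(records)
```

Exact-key grouping like this only catches records that normalise to the same string, which is one reason full, well-coded records are easier to deduplicate than short ones.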

40
  • And you'll end up with something like this

41
Identification of best record
  • For each group of duplicates, MARC records are
    retrieved
  • Passed to scoring algorithm
  • Record with highest score forms basis of best
    record
  • Retains set fields (e.g. subject headings) from
    other records
  • Corrects any MARC coding errors
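
Choosing and building the best record could be sketched as below. The scoring algorithm is not given in the slides, so the fullness score here is invented for illustration, as are the `RETAINED_TAGS` setting and the record layout. The sketch is in Python rather than the talk's Perl, and the final MARC correction step is omitted.

```python
RETAINED_TAGS = ("650",)  # hypothetical: subject headings kept from other records

def record_score(record):
    """Hypothetical fullness score: one point per field plus one per distinct tag."""
    fields = record["fields"]
    return len(fields) + len({tag for tag, _ in fields})

def build_best_record(group):
    """Take the highest-scoring record and add retained fields from the others."""
    best = max(group, key=record_score)
    merged = list(best["fields"])
    for other in group:
        if other is best:
            continue
        for field in other["fields"]:
            if field[0] in RETAINED_TAGS and field not in merged:
                merged.append(field)
    # Stable sort keeps fields with the same tag in the order they were added.
    return {"bib_id": best["bib_id"], "fields": sorted(merged, key=lambda f: f[0])}

group = [
    {"bib_id": 205, "fields": [
        ("245", "How to build a habitable planet"),
        ("650", "Astrophysics."),
    ]},
    {"bib_id": 101, "fields": [
        ("245", "How to build a habitable planet / by Wallace S. Broecker."),
        ("260", "New York : Eldigio Press, c1985."),
        ("300", "291 p. : ill. ; 23 cm."),
        ("650", "Astronomy."),
    ]},
]
best = build_best_record(group)
subjects = [value for tag, value in best["fields"] if tag == "650"]
```

Here record 101 wins on fullness and picks up the extra subject heading from record 205, so nothing useful is lost when the weaker record is later deleted.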

46
But
  • No relinking functionality, even in BatchCat
  • No viable workaround for libraries using
    Acquisitions, or without losing circulation history

47
In conclusion
  • Tools for librarians, not replacements!
  • Do the stuff programs do well, allowing humans to
    concentrate on what humans do well
  • Won't do all the work, just make a solution to
    major data problems feasible

48
Questions?