Publisher Name Authority Project: An Attempt to Enhance Data Mining for Collection Analysis

1
Publisher Name Authority Project: An Attempt to
Enhance Data Mining for Collection Analysis &
Comparison
  • Lynn Silipigni Connaway
  • Consulting Research Scientist III
  • Akeisha Heard
  • Technical Intern
  • XXV Annual Charleston Conference
  • 04 November 2005

2
Introduction
3
Research Goals
  • Develop a service to support advanced collection
    intelligence
  • Cluster collected objects based on their issuing
    entity
  • As can be determined via metadata about the
    objects
  • Gain intelligence about the nature of individual
    publishers
  • Collection intelligence
  • Acquisition patterns
  • User behavior

4
Research Objectives
  • Resolve
  • ISBN prefixes to publisher name
  • Variant publisher names to a preferred form
  • Capture and make available for use various
    attributes of individual publishers
  • Location of publisher
  • Language(s) of materials published
  • Genre(s)/format(s) of materials published
  • Dominant subject domain(s) of the publisher's
    output
  • Parent company and subsidiaries

5
Theoretical Foundation: Authority Control
  • Adhere to authorized form
  • Personal names
  • Corporate entities
  • Why no authorized form for publishing entities?

6
Pragmatic Foundation: Collection Development
  • Identified publisher series
  • Retrospective conversion project (1984)
  • Family tree
  • Which publishers are related?
  • Approval plans
  • Which publishers publish which subjects?

7
Pragmatic Foundation: OCLC WorldCat Data Mining
  • Collection Analysis
  • Which libraries have the most items by a
    publisher in a particular subject area?
  • How do library holdings by publisher compare?
  • E-books for a particular STM publisher (2000)
  • Cataloged as reproductions
  • 2 publishers!

8
Pragmatic Foundation: Citation Analysis
  • Sweetland (1989)
  • Reader functions of citations
  • Information retrieval via citation databases
  • Document retrieval
  • Includes interlibrary loan verification
  • Bibliometrics
  • Faculty and researcher productivity measure
  • Other functions
  • Creation of references/bibliographies

9
Pragmatic Foundation: Education for Librarians
  • Collection development & acquisitions librarian
    education
  • Subject focuses of publishers
  • Parent and subsidiary relationships

10
Specialized Corporate Authority Files
  • ACOLIT (Ruggeri, 2004)
  • Names, uniform titles, Italian and international
    Catholic institutions, Catholic religious
    communities, and institutions
  • Related to the Catholic Church, Papal State, and
    Vatican City State
  • COPAR (Boddaert, 2004)
  • French official corporate bodies
  • Mainly national and preceding the French
    Revolution
  • CORELI (Boddaert, 2004)
  • Religious corporate bodies from 3 French ancient
    specialized catalogues

11
Specialized Corporate Authority Files
  • Chinese Modern Author Authority Database (Hu, Tam
    & Lo, 2004)
  • Chinese authors of expanded works and Chinese
    corporate bodies since 1912
  • Chinese Name Authority Database (Hu, Tam & Lo,
    2004)
  • Mainly Taiwanese personal names with some
    Taiwanese corporate bodies

12
Specialized Corporate Authority Files
  • Case study by Elias & Fair (1983)
  • Standard Oil Co.'s Media Query File
  • No authority control
  • 3 professionals in 6 months averaged 12 telephone
    calls/day from reporters
  • Decided against canonical list for media names
  • Noted 20 unique variants for Wall Street Journal
    including WSJ, Wall St. Jnl, Wall Street Jnl

13
Specialized Corporate Authority Files
  • Case study by French, Powell & Schulman (1997,
    2000)
  • Smithsonian Astrophysical Observatory's
    Astrophysics Data System database
  • Programmatically identify author affiliations and
    map variant names to canonical name
  • Investigated various techniques separately and
    iteratively to bring variants together including
  • Lexical cleanup
  • Data clustering algorithms
  • Approximate string-matching
  • Reduced number of unique strings by 55%
  • Required manual review of clusters
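
The lexical cleanup and approximate string-matching steps listed above can be illustrated with a minimal sketch. It is not the method used in the case study: the function names, the cleanup rules, and the 0.8 threshold are assumptions chosen for illustration, and the matching relies on Python's standard difflib module. The sample strings are the Wall Street Journal variants quoted on slide 12.

```python
import re
from difflib import SequenceMatcher


def lexical_cleanup(name):
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    name = name.lower()
    name = re.sub(r"[^\w\s]", " ", name)
    return re.sub(r"\s+", " ", name).strip()


def similarity(a, b):
    """Similarity ratio in [0, 1] between two cleaned-up names."""
    return SequenceMatcher(None, lexical_cleanup(a), lexical_cleanup(b)).ratio()


def find_variants(canonical, candidates, threshold=0.8):
    """Return candidates whose similarity to the canonical name meets the
    threshold (0.8 is an arbitrary illustrative value)."""
    return [c for c in candidates if similarity(canonical, c) >= threshold]


if __name__ == "__main__":
    candidates = ["WSJ", "Wall St. Jnl", "Wall Street Jnl"]
    for candidate in candidates:
        print(candidate, round(similarity("Wall Street Journal", candidate), 2))
    print(find_variants("Wall Street Journal", candidates))
```

Heavily abbreviated forms such as WSJ score low under this kind of matching, which is one reason clusters still need manual review.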

14
Database Quality
15
Literature: Database Quality
  • Review by O'Neill & Vizine-Goetz (1988)
  • Busch (1981)
  • < 35% of 141 OCLC libraries routinely reported
    errors
  • Pollock & Zamora (1983)
  • Noted that misspellings comprise 90-96% of
    errors; types include
  • Omission
  • Insertion
  • Substitution
  • Transposition
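
The four misspelling types above can be told apart mechanically when the intended spelling is known. The following is a minimal sketch, not a method from the cited literature: the function name is hypothetical, and it only handles a single one-character error per word. The misspellings in the usage lines are taken from slides 30 and 31; the correct forms are supplied here.

```python
def classify_error(correct, observed):
    """Classify a single-character misspelling as omission, insertion,
    substitution, or transposition; return 'other' for anything else."""
    if correct == observed:
        return "none"
    # Locate the first position where the two spellings differ.
    i = 0
    while i < min(len(correct), len(observed)) and correct[i] == observed[i]:
        i += 1
    # Omission: dropping one character from the correct form gives the observed form.
    if len(observed) == len(correct) - 1 and correct[:i] + correct[i + 1:] == observed:
        return "omission"
    # Insertion: dropping one character from the observed form gives the correct form.
    if len(observed) == len(correct) + 1 and observed[:i] + observed[i + 1:] == correct:
        return "insertion"
    if len(observed) == len(correct):
        # Substitution: exactly one character differs.
        if correct[i + 1:] == observed[i + 1:]:
            return "substitution"
        # Transposition: two adjacent characters are swapped.
        if (i + 1 < len(correct)
                and correct[i] == observed[i + 1]
                and correct[i + 1] == observed[i]
                and correct[i + 2:] == observed[i + 2:]):
            return "transposition"
    return "other"


if __name__ == "__main__":
    pairs = [("Tallahassee", "Tallahasee"), ("Worcester", "Worchester"),
             ("Institute", "Institite"), ("Methuen", "Metheun")]
    for correct, observed in pairs:
        print(observed, "->", classify_error(correct, observed))
```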

16
Literature: Database Quality
  • Intner (1989)
  • Reviewed 215 matching records in OCLC and RLIN
  • Errors relating to publishers

Error type                    OCLC count (total)  OCLC %  RLIN count (total)  RLIN %
Application of AACR2 & LCRI   64 (205)            31.2    52 (191)            27.2
MARC tagging in 260 field     4 (25)              16.0    3 (26)              11.5
Typographic errors            4 (32)              12.5    6 (45)              13.3
17
Literature: Database Quality
  • Romero (1994)
  • Evaluated cataloging of library science students
  • Noted 221 errors (28.22%) in the publisher
    description area

18
Issues: Historical Practices
  • Different rules for abbreviations
  • LC Rule Interpretation B.14
  • State postal (2-letter) abbreviation if it
    appears in the item along with the place
  • Anglo-American Cataloguing Rules, Revised (2002)
  • Abbreviations included in Appendix B.14

19
Issues: Historical Practices
  • ALA Catalog Rules (1941)
  • Multiple places of publication and publishers and
    neither or first is prominent
  • Include first listed first, indicate omission
  • Multiple places of publication and publishers and
    first is not prominent
  • Include prominent first
  • Include first listed second
  • Unknown place of publication: n.p.

20
Issues: Historical Practices
  • Anglo-American Cataloging Rules (1967)
  • Multiple places of publication and publishers and
    neither or first is prominent
  • Include first listed only, omit others
  • Multiple places of publication and publishers and
    first is not prominent
  • Include prominent only, omit others
  • Unknown place of publication: n. p.

21
Issues: Historical Practices
  • Anglo-American Cataloguing Rules, Revised (2002)
  • Multiple places of publication and publishers and
    neither or first is prominent
  • Include first listed only, omit others
  • Multiple places of publication and publishers and
    first is not prominent
  • Include first listed first
  • Include prominent second
  • Unknown place of publication: S.l.

22
Issues: Historical and Local Practices
  • u.a.
  • At least one German institution uses "u.a." as
    mark of omission
  • Means "et al."
  • Not an AACR2r rule
  • Local practice?
  • Is local practice/policy an error?

23
Issues: Historical and Local Practices
  • WorldCat enhanced records
  • Eliminate or lessen the probability of these
    issues

24
Examining Quality of WorldCat
25
WorldCat Publisher Name Selection Criteria
  • Fixed field Lang = eng

26
WorldCat ISBN Validation Errors
  • WorldCat records with ISBNs: 22.69%

27
WorldCat ISBN Validation Errors
                    Count         %
English Language
  Valid             7,561,445     99.90
  Invalid           7,600         0.10
All Languages
  Valid             13,147,325    99.88
  Invalid           15,654        0.12
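
The slides do not say how ISBNs were validated. One plausible check, assumed here, is the standard ISBN-10 check digit: weight the ten characters from 10 down to 1, treat a final X as 10, and require the weighted sum to be divisible by 11. The function name and the sample strings below are illustrative only and are not drawn from WorldCat.

```python
def is_valid_isbn10(isbn):
    """Return True if the string passes the ISBN-10 check-digit test."""
    chars = isbn.replace("-", "").replace(" ", "").upper()
    if len(chars) != 10:
        return False
    total = 0
    for position, ch in enumerate(chars):
        if ch == "X" and position == 9:
            value = 10  # X is only allowed as the check character
        elif ch in "0123456789":
            value = int(ch)
        else:
            return False
        # Weights run from 10 (first character) down to 1 (check character).
        total += (10 - position) * value
    return total % 11 == 0


if __name__ == "__main__":
    for candidate in ["0-19-852663-6", "0-19-852663-5", "0-8044-2957-X"]:
        print(candidate, is_valid_isbn10(candidate))
```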
28
WorldCat MARC Tagging Errors
  • Examined English language records based on some
    known issues and manual evaluation
  • Total MARC tagging errors found: 11,874 (0.03%)

29
WorldCat MARC Tagging Errors
  • MARC 260 vs 300 tagging
  • In 260 field, information from 300 field in $a,
    $b, $c and/or $e
  • Dates tagging
  • Date in $a or $b
  • Five digit year
  • "cm" follows year
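
Checks of this kind can be approximated with simple pattern matching. The sketch below is an assumption about how such screening might look, not the project's actual procedure: a 260 field is represented as a plain dictionary of subfield code to text (a hypothetical structure), the regular expressions are illustrative, and any hit would still need manual evaluation.

```python
import re

# Illustrative patterns; the year range 1500-2099 is an arbitrary assumption.
YEAR = re.compile(r"\b(1[5-9]\d{2}|20\d{2})\b")
FIVE_DIGIT_YEAR = re.compile(r"(?<!\d)\d{5}(?!\d)")
YEAR_THEN_CM = re.compile(r"\b\d{4}\b.*\bcm\b", re.IGNORECASE)


def tagging_flags(field_260):
    """Return a list of suspected tagging problems for one 260 field,
    given as a dict such as {"a": "London :", "b": "Methuen,", "c": "1975."}."""
    flags = []
    # A date appearing in place ($a) or publisher ($b) suggests a missing $c.
    for code in ("a", "b"):
        if YEAR.search(field_260.get(code, "")):
            flags.append(f"date in ${code}")
    date = field_260.get("c", "")
    if FIVE_DIGIT_YEAR.search(date):
        flags.append("five-digit year")
    # Dimensions such as "23 cm" belong in the 300 field, not after the date.
    if YEAR_THEN_CM.search(date):
        flags.append("cm follows year")
    return flags


if __name__ == "__main__":
    print(tagging_flags({"a": "London :", "b": "Methuen, 1975."}))
    print(tagging_flags({"a": "London :", "b": "Methuen,", "c": "1975. 23 cm."}))
```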

30
WorldCat Typographical Errors
  • Used "Typographical Errors in Library Databases"
    to identify and quantify English language
    WorldCat errors (Ballard, 2005)
  • Total errors: 26,599 (0.08%)
  • Require manual examination to determine if actual
    errors
  • Searching for "Institi"
  • Misspelled
  • American Institite of Physics
  • British Standards Institition
  • Spelled correctly
  • Institiúid Ard-Léinn Bhaile Átha Cliath (Dublin
    Institute for Advanced Studies)

31
WorldCat Typographical Errors
  • Top 10 words (10.4% of total errors)

Word                    Probability (Ballard)  Error Type     WorldCat Count
Worchester              Highest                Insertion      398
Metheun                 High                   Transposition  355
Universt                Highest                Omission       299
Unives                  Highest                Omission       275
Westminister and Press  Highest                Insertion      266
Niagr                   High                   Omission       260
Phildel                 High                   Omission       235
Tallahasee              High                   Omission       234
John Hopkins Press      Highest                Omission       227
Institi                 High                   Substitution   226
32
WorldCat Typographical Errors
  • Westminister
  • Only included on Ballard list in combination with
    other words
  • Total errors in WorldCat: 628 (2.36%)
  • Require manual review

33
Where are we now?
34
WorldCat MARC 260 Evaluation
  • Top 10 terms in 260 $b in WorldCat

Term         Count
press        2,094,111
co           1,664,005
university   1,550,435
dept         1,084,647
pub          984,234
research     853,954
service      710,314
institute    660,346
office       649,794
chu ban she  620,735
35
WorldCat MARC 260 Evaluation
  • University Press names in 260 $b in WorldCat

Term       Count
oxford     35,804
hopkins    22,564
cambridge  21,951
harvard    17,069
cornell    11,305
stanford   10,900
purdue     5,468
yale       5,076
princeton  4,746
rutgers    3,854
36
Clustering
  • Attempting programmatic clustering of publishers
    using ISBN prefixes
  • Data clustering (The Free Dictionary)
  • "The science of extracting useful information
    from large data sets or databases"
  • Classification of similar objects into different
    groups
  • Partitioning of a data set into
    subsets (clusters)
  • Data in each subset (ideally) share some common
    trait

37
WorldCat Clustering Example
  • Used ISBN prefix 019 (Oxford University Press)
  • Total WorldCat records: 58,004,317
  • Records with ISBN prefix 019: 84,276 (0.15%)
  • Non-unique publisher names from ISBN prefix
    records: 91,528

                                        One or more 019 ISBN  All 019 ISBNs
NACO normalized unique publisher names  1,550                 1,386
Number of clusters                      919                   799
Non-singleton clusters                  222 (24.16%)          205 (25.66%)
Largest cluster                         82 text strings       81 text strings
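
The slides do not describe the clustering algorithm, so the following is only a sketch of how figures like those in the table could be produced. It assumes a simplified stand-in for NACO normalization (uppercase, strip diacritics and punctuation, collapse spaces; the real rules are more detailed) and a greedy grouping by string similarity with an arbitrary 0.8 threshold. All function names and sample publisher strings are hypothetical.

```python
import re
import unicodedata
from difflib import SequenceMatcher


def simple_normalize(name):
    """Rough approximation of name normalization (not the full NACO rules)."""
    decomposed = unicodedata.normalize("NFKD", name)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    stripped = re.sub(r"[^\w\s]", " ", stripped.upper())
    return re.sub(r"\s+", " ", stripped).strip()


def cluster_names(names, threshold=0.8):
    """Greedily group unique normalized names: each name joins the first
    cluster whose first member is similar enough, else starts a new one."""
    clusters = []
    for key in sorted({simple_normalize(n) for n in names}):
        for cluster in clusters:
            if SequenceMatcher(None, key, cluster[0]).ratio() >= threshold:
                cluster.append(key)
                break
        else:
            clusters.append([key])
    return clusters


def summarize(clusters):
    """Report the same kinds of counts as the table above."""
    sizes = [len(c) for c in clusters]
    return {
        "clusters": len(clusters),
        "non_singleton": sum(1 for size in sizes if size > 1),
        "largest": max(sizes, default=0),
    }


if __name__ == "__main__":
    sample = ["Oxford University Press", "Oxford Univ. Press",
              "OXFORD UNIVERSITY PRESS,", "Clarendon Press"]
    clusters = cluster_names(sample)
    print(clusters)
    print(summarize(clusters))
```

Greedy grouping is order-dependent and was chosen only for brevity; the choice of threshold corresponds to the "level of acceptance for cluster" question raised on the next slide.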
38
Challenges: Publisher Name Authority File
  • Quality issue
  • Level of acceptance for cluster
  • What is acceptable?
  • Subsidiaries and Relationships
  • Oxford Auckland
  • Examined manually to determine relationship
  • Form of name
  • What is acceptable?
  • Likely to use the most prominent form of name

39
Questions and Discussion
  • Contact Information
  • connawal@oclc.org
  • hearda@oclc.org
  • Project Web Site
  • http://www.oclc.org/research/projects/publisherns/