Title: Publisher Name Authority Project: An Attempt to Enhance Data Mining for Collection Analysis
1 Publisher Name Authority Project: An Attempt to Enhance Data Mining for Collection Analysis & Comparison
- Lynn Silipigni Connaway
- Consulting Research Scientist III
- Akeisha Heard
- Technical Intern
- XXV Annual Charleston Conference
- 04 November 2005
2 Introduction
3 Research Goals
- Develop a service to support advanced collection intelligence
- Cluster collected objects based on their issuing entity
- As can be determined via metadata about the objects
- Gain intelligence about the nature of individual publishers
- Collection intelligence
- Acquisition patterns
- User behavior
4 Research Objectives
- Resolve
- ISBN prefixes to publisher name
- Variant publisher names to a preferred form
- Capture and make available for use various attributes of individual publishers
- Location of publisher
- Language(s) of materials published
- Genre(s)/format(s) of materials published
- Dominant subject domain(s) of the publisher's output
- Parent company and subsidiaries
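The objectives above imply some kind of per-publisher record that can be reached from either an ISBN prefix or a variant name. A minimal sketch of such a record follows. The `PublisherRecord` class, its field names, and the `build_indexes` helper are assumptions made for illustration and are not the project's actual schema; the sample variant names are made up.

```python
from dataclasses import dataclass, field
from typing import Optional, Set


@dataclass
class PublisherRecord:
    """Hypothetical authority record for one publisher (illustrative only)."""
    preferred_name: str
    variant_names: Set[str] = field(default_factory=set)
    isbn_prefixes: Set[str] = field(default_factory=set)
    location: Optional[str] = None
    languages: Set[str] = field(default_factory=set)
    genres: Set[str] = field(default_factory=set)
    subjects: Set[str] = field(default_factory=set)
    parent: Optional[str] = None
    subsidiaries: Set[str] = field(default_factory=set)


def build_indexes(records):
    """Index records by ISBN prefix and by lower-cased name variant."""
    by_prefix, by_name = {}, {}
    for rec in records:
        for prefix in rec.isbn_prefixes:
            by_prefix[prefix] = rec
        for name in rec.variant_names | {rec.preferred_name}:
            by_name[name.lower()] = rec
    return by_prefix, by_name


# The 019 prefix is the one the slides associate with Oxford University
# Press; the variant names here are invented sample data.
oup = PublisherRecord(
    preferred_name="Oxford University Press",
    variant_names={"Oxford Univ. Press", "OUP"},
    isbn_prefixes={"019"},
)
by_prefix, by_name = build_indexes([oup])
print(by_prefix["019"].preferred_name)
print(by_name["oup"].preferred_name)
```

Both lookups resolve to the same record, which is the behaviour the two "Resolve" objectives describe.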
5 Theoretical Foundation: Authority Control
- Adhere to authorized form
- Personal names
- Corporate entities
- Why no authorized form for publishing entities?
6 Pragmatic Foundation: Collection Development
- Identified publisher series
- Retrospective conversion project (1984)
- Family tree
- Which publishers are related?
- Approval plans
- Which publishers publish which subjects?
7 Pragmatic Foundation: OCLC WorldCat Data Mining
- Collection Analysis
- Which libraries have the most items by a publisher in a particular subject area?
- How do library holdings by publisher compare?
- E-books for a particular STM publisher (2000)
- Cataloged as reproductions
- 2 publishers!
8 Pragmatic Foundation: Citation Analysis
- Sweetland (1989)
- Reader functions of citations
- Information retrieval via citation databases
- Document retrieval
- Includes interlibrary loan verification
- Bibliometrics
- Faculty and researcher productivity measure
- Other functions
- Creation of references/bibliographies
9 Pragmatic Foundation: Education for Librarians
- Collection development & acquisitions librarian education
- Subject focuses of publishers
- Parent and subsidiary relationships
10 Specialized Corporate Authority Files
- ACOLIT (Ruggeri, 2004)
- Names, uniform titles, Italian and international Catholic institutions, Catholic religious communities, and institutions
- Related to the Catholic Church, Papal State, and Vatican City State
- COPAR (Boddaert, 2004)
- French official corporate bodies
- Mainly national and preceding the French Revolution
- CORELI (Boddaert, 2004)
- Religious corporate bodies from 3 French ancient specialized catalogues
11 Specialized Corporate Authority Files
- Chinese Modern Author Authority Database (Hu, Tam & Lo, 2004)
- Chinese authors of expanded works and Chinese corporate bodies since 1912
- Chinese Name Authority Database (Hu, Tam & Lo, 2004)
- Mainly Taiwanese personal names with some Taiwanese corporate bodies
12 Specialized Corporate Authority Files
- Case study by Elias & Fair (1983)
- Standard Oil Co.'s Media Query File
- No authority control
- 3 professionals in 6 months averaged 12 telephone calls/day from reporters
- Decided against canonical list for media names
- Noted 20 unique variants for Wall Street Journal, including WSJ, Wall St. Jnl, Wall Street Jnl
13 Specialized Corporate Authority Files
- Case study by French, Powell & Schulman (1997, 2000)
- Smithsonian Astrophysical Observatory's Astrophysics Data System database
- Programmatically identify author affiliations and map variant names to canonical name
- Investigated various techniques separately and iteratively to bring variants together, including:
- Lexical cleanup
- Data clustering algorithms
- Approximate string-matching
- Reduced number of unique strings by 55%
- Required manual review of clusters
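The combination of lexical cleanup, clustering, and approximate string-matching named above can be sketched roughly as follows. This is not the method used in the case study: the similarity measure (`difflib.SequenceMatcher`), the greedy clustering strategy, the 0.8 threshold, and the sample affiliation strings are all assumptions chosen for illustration.

```python
import re
from difflib import SequenceMatcher


def lexical_cleanup(name):
    """Lower-case, turn punctuation into spaces, and collapse whitespace."""
    name = re.sub(r"[^\w\s]", " ", name.lower())
    return " ".join(name.split())


def cluster_variants(names, threshold=0.8):
    """Greedy clustering: a string joins the first cluster whose
    representative is similar enough, otherwise it starts a new cluster."""
    clusters = []  # list of (cleaned representative, [original strings])
    for original in names:
        cleaned = lexical_cleanup(original)
        for representative, members in clusters:
            similarity = SequenceMatcher(None, cleaned, representative).ratio()
            if similarity >= threshold:
                members.append(original)
                break
        else:
            clusters.append((cleaned, [original]))
    return [members for _, members in clusters]


# Invented sample strings, not data from the study.
names = [
    "Harvard-Smithsonian Center for Astrophysics",
    "Harvard Smithsonian Center for Astrophysics",
    "Harvard-Smithsonian Ctr. for Astrophysics",
    "Space Telescope Science Institute",
]
clusters = cluster_variants(names)
for members in clusters:
    print(len(members), members)
```

A greedy pass like this depends on input order and on the threshold, which is one reason the resulting clusters would still need the manual review the slide mentions.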
14 Database Quality
15 Literature: Database Quality
- Review by O'Neill & Vizine-Goetz (1988)
- Busch (1981)
- < 35% of 141 OCLC libraries routinely reported errors
- Pollock & Zamora (1983)
- Noted that misspellings comprise 90-96% of errors and include:
- Omission
- Insertion
- Substitution
- Transposition
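The four misspelling types listed above can each be recognised as a single edit between the intended word and the observed one. The sketch below is only an illustration of that idea; the function name `classify_typo` is an assumption, and the word pairs are misspellings that appear elsewhere in this presentation.

```python
def classify_typo(correct, observed):
    """Return the single-edit error type that turns `correct` into
    `observed`, or None if no single edit explains the difference."""
    if correct == observed:
        return None
    if len(observed) == len(correct) - 1:
        # One character of the correct word is missing.
        for i in range(len(correct)):
            if correct[:i] + correct[i + 1:] == observed:
                return "omission"
    elif len(observed) == len(correct) + 1:
        # One extra character appears in the observed word.
        for i in range(len(observed)):
            if observed[:i] + observed[i + 1:] == correct:
                return "insertion"
    elif len(observed) == len(correct):
        diffs = [i for i, (a, b) in enumerate(zip(correct, observed)) if a != b]
        if len(diffs) == 1:
            return "substitution"
        if (len(diffs) == 2 and diffs[1] == diffs[0] + 1
                and correct[diffs[0]] == observed[diffs[1]]
                and correct[diffs[1]] == observed[diffs[0]]):
            return "transposition"
    return None


for correct, observed in [
    ("Tallahassee", "Tallahasee"),
    ("Worcester", "Worchester"),
    ("Institute", "Institite"),
    ("Methuen", "Metheun"),
]:
    print(observed, "->", classify_typo(correct, observed))
```

Words that differ by more than one edit return `None` here; handling them would need a fuller edit-distance alignment.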
16 Literature: Database Quality
- Intner (1989)
- Reviewed 215 matching records in OCLC and RLIN
- Errors relating to publishers

                                 OCLC                    RLIN
Error type                       Count (Total)   %       Count (Total)   %
Application of AACR2 & LCRI      64 (205)        31.2    52 (191)        27.2
MARC tagging in 260 field        4 (25)          16.0    3 (26)          11.5
Typographic errors               4 (32)          12.5    6 (45)          13.3
17 Literature: Database Quality
- Romero (1994)
- Evaluated cataloging of library science students
- Noted 221 errors (28.22%) in the publisher description area
18 Issues: Historical Practices
- Different rules for abbreviations
- LC Rule Interpretation B.14
- State postal (2-letter) abbreviation if it appears in the item along with the place
- Anglo-American Cataloguing Rules, Revised (2002)
- Abbreviations included in Appendix B.14
19 Issues: Historical Practices
- ALA Catalog Rules (1941)
- Multiple places of publication and publishers and neither or first is prominent
- Include first listed first, indicate omission
- Multiple places of publication and publishers and first is not prominent
- Include prominent first
- Include first listed second
- Unknown place of publication: n.p.
20 Issues: Historical Practices
- Anglo-American Cataloging Rules (1967)
- Multiple places of publication and publishers and neither or first is prominent
- Include first listed only, omit others
- Multiple places of publication and publishers and first is not prominent
- Include prominent only, omit others
- Unknown place of publication: n. p.
21 Issues: Historical Practices
- Anglo-American Cataloguing Rules, Revised (2002)
- Multiple places of publication and publishers and neither or first is prominent
- Include first listed only, omit others
- Multiple places of publication and publishers and first is not prominent
- Include first listed first
- Include prominent second
- Unknown place of publication: S.l.
22 Issues: Historical and Local Practices
- u.a.
- At least one German institution uses u.a. as mark of omission
- Means "et al."
- Not an AACR2r rule
- Local practice?
- Is local practice/policy an error?
23 Issues: Historical and Local Practices
- WorldCat enhanced records
- Eliminate or lessen the probability of these
issues
24 Examining Quality of WorldCat
25 WorldCat: Publisher Name Selection Criteria
26 WorldCat: ISBN Validation Errors
- WorldCat records with ISBNs: 22.69%
27 WorldCat: ISBN Validation Errors

                     Count         %
English Language
  Valid              7,561,445     99.90
  Invalid            7,600         0.10
All Languages
  Valid              13,147,325    99.88
  Invalid            15,654        0.12
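The slides do not say how the ISBNs were validated. One common approach is the standard ISBN-10 modulus-11 check digit test, sketched below as an assumption about what such validation could look like. The function names are hypothetical, and the sample ISBN strings are used only because they pass or fail the arithmetic. The second helper recomputes the invalid percentages from the counts in the table, on the assumption that each percentage is the share of valid plus invalid ISBNs.

```python
def is_valid_isbn10(isbn):
    """Check an ISBN-10 using the modulus-11 check digit."""
    chars = isbn.replace("-", "").replace(" ", "").upper()
    if len(chars) != 10:
        return False
    total = 0
    for position, char in enumerate(chars):
        if char in "0123456789":
            value = int(char)
        elif char == "X" and position == 9:
            value = 10  # "X" stands for 10 and is only valid as the check digit
        else:
            return False
        # Weights run from 10 (first digit) down to 1 (check digit).
        total += (10 - position) * value
    return total % 11 == 0


def invalid_share(valid, invalid):
    """Percentage of checked ISBNs that failed validation."""
    return 100 * invalid / (valid + invalid)


print(is_valid_isbn10("0-19-852663-6"))   # check digit agrees
print(is_valid_isbn10("0-19-852663-7"))   # check digit altered
print(round(invalid_share(7_561_445, 7_600), 2))
print(round(invalid_share(13_147_325, 15_654), 2))
```

A check-digit test only catches malformed numbers; a well-formed ISBN attached to the wrong record would still pass.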
28 WorldCat: MARC Tagging Errors
- Examined English language records based on some known issues and manual evaluation
- Total MARC tagging errors found: 11,874 (0.03%)
29 WorldCat: MARC Tagging Errors
- MARC 260 vs 300 tagging
- In 260 field, information from 300 field in $a, $b, $c and/or $e
- Dates tagging
- Date in $a or $b
- Five digit year
- cm follows year
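Checks of this kind lend themselves to simple pattern matching. The sketch below is a rough illustration under several assumptions: a 260 field is represented as a plain dictionary of subfield code to text, the regular expressions are guesses at what each error looks like, and the sample field values are invented. It is not the procedure used to produce the counts reported here.

```python
import re

FOUR_DIGIT_YEAR = re.compile(r"(?<!\d)\d{4}(?!\d)")
FIVE_DIGIT_YEAR = re.compile(r"(?<!\d)\d{5}(?!\d)")
PAGINATION = re.compile(r"\b\d+\s*p\.", re.IGNORECASE)      # e.g. "245 p."
CM_AFTER_YEAR = re.compile(r"\d{4}.*cm\b", re.IGNORECASE)   # e.g. "1990. 24 cm."


def check_260(subfields):
    """Return a list of suspected tagging problems for one 260 field."""
    problems = []
    # Pagination belongs in the 300 field, not in any 260 subfield.
    for code in ("a", "b", "c", "e"):
        if PAGINATION.search(subfields.get(code, "")):
            problems.append(f"300-style data in ${code}")
    # Dates belong in $c, so a year in $a or $b is suspicious.
    for code in ("a", "b"):
        if FOUR_DIGIT_YEAR.search(subfields.get(code, "")):
            problems.append(f"date in ${code}")
    date = subfields.get("c", "")
    if FIVE_DIGIT_YEAR.search(date):
        problems.append("five-digit year in $c")
    if CM_AFTER_YEAR.search(date):
        problems.append("cm follows year in $c")
    return problems


# Invented sample fields.
print(check_260({"a": "New York :", "b": "Wiley,", "c": "c1998."}))
print(check_260({"a": "London :", "b": "Methuen, 1975."}))
print(check_260({"a": "Boston :", "b": "Little, Brown,", "c": "19985."}))
print(check_260({"a": "Chicago :", "b": "ALA,", "c": "1990. 24 cm."}))
```

Heuristics like these produce false positives (a publisher name can legitimately contain a four-digit number), so flagged records are candidates for review rather than confirmed errors.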
30 WorldCat: Typographical Errors
- Used Typographical Errors in Library Databases to identify and quantify English language WorldCat errors (Ballard, 2005)
- Total errors: 26,599 (0.08%)
- Require manual examination to determine if actual errors
- Searching for Institi
- Misspelled
- American Institite of Physics
- British Standards Institition
- Spelled correctly
- Institiúid Ard-Léinn Bhaile Átha Cliath (Dublin Institute for Advanced Studies)
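The "Institi" example shows why a stem search only yields candidates. A small sketch of that two-step process follows, assuming a hypothetical reviewer-maintained set of strings already confirmed as correct; the helper name and the `known_correct` set are illustrative, and the sample names are the ones quoted above plus one correctly spelled control.

```python
def find_candidates(names, stem):
    """Return the names containing `stem`, ignoring case."""
    stem = stem.lower()
    return [name for name in names if stem in name.lower()]


names = [
    "American Institite of Physics",
    "British Standards Institition",
    "Institiúid Ard-Léinn Bhaile Átha Cliath",
    "American Institute of Physics",  # correct spelling, should not match
]

# Hypothetical list of matches a reviewer has confirmed as correct.
known_correct = {"Institiúid Ard-Léinn Bhaile Átha Cliath"}

candidates = find_candidates(names, "Institi")
needs_correction = [name for name in candidates if name not in known_correct]

print(len(candidates), "candidates")
print(needs_correction)
```

The Irish-language name matches the stem even though it is spelled correctly, so only the manual step separates it from the two genuine misspellings.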
31 WorldCat: Typographical Errors

Word                     Probability According to Ballard   Error Type      WorldCat Count
Worchester               Highest                            Insertion       398
Metheun                  High                               Transposition   355
Universt                 Highest                            Omission        299
Unives                   Highest                            Omission        275
Westminister and Press   Highest                            Insertion       266
Niagr                    High                               Omission        260
Phildel                  High                               Omission        235
Tallahasee               High                               Omission        234
John Hopkins Press       Highest                            Omission        227
Institi                  High                               Substitution    226
32 WorldCat: Typographical Errors
- Westminister
- Only included on Ballard list in combination with other words
- Total errors in WorldCat: 628 (2.36%)
- Require manual review
33 Where are we now?
34 WorldCat: MARC 260 Evaluation
- Top 10 terms in 260 $b in WorldCat
Term Count
press 2,094,111
co 1,664,005
university 1,550,435
dept 1,084,647
pub 984,234
research 853,954
service 710,314
institute 660,346
office 649,794
chu ban she 620,735
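A frequency list like this one can be produced by tokenising each publisher string and counting the tokens. The sketch below assumes single-word, lower-cased tokens and uses invented sample strings; the slides do not describe the actual tokenisation, and an entry such as "chu ban she" shows that multi-word terms were also counted, which this simple version does not attempt.

```python
import re
from collections import Counter


def term_counts(publisher_strings):
    """Count lower-cased alphabetic words across publisher strings."""
    counts = Counter()
    for text in publisher_strings:
        # [^\W\d_]+ matches runs of letters, including accented ones.
        counts.update(re.findall(r"[^\W\d_]+", text.lower()))
    return counts


# Invented sample 260 $b values.
sample = [
    "Oxford University Press,",
    "Harvard University Press,",
    "Macmillan Co.,",
    "University of Chicago Press,",
]
counts = term_counts(sample)
for term, count in counts.most_common(3):
    print(term, count)
```

Generic terms such as "press" and "university" dominate counts like these, which is why they say little about any individual publisher on their own.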
35 WorldCat: MARC 260 Evaluation
- University Press names in 260 $b in WorldCat
Term Count
oxford 35,804
hopkins 22,564
cambridge 21,951
harvard 17,069
cornell 11,305
stanford 10,900
purdue 5,468
yale 5,076
princeton 4,746
rutgers 3,854
36 Clustering
- Attempting programmatic clustering of publishers using ISBN prefixes
- Data clustering (The Free Dictionary)
- "The science of extracting useful information from large data sets or databases"
- Classification of similar objects into different groups
- Partitioning of a data set into subsets (clusters)
- Data in each subset (ideally) share some common trait
37 WorldCat: Clustering Example
- Used ISBN prefix 019 (Oxford University Press)
- Total WorldCat records: 58,004,317
- Records with ISBN prefix 019: 84,276 (0.15%)
- Non-unique publisher names from ISBN prefix records: 91,528

                                         One or more 019 ISBN   All 019 ISBNs
NACO normalized unique publisher names   1,550                  1,386
Number of clusters                       919                    799
Non-singleton clusters                   222 (24.16%)           205 (25.66%)
Largest cluster                          82 text strings        81 text strings
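The first steps of this example (select records by ISBN prefix, normalize the publisher strings, group them, and count non-singleton groups) can be sketched as follows. The `normalize` function is only a rough stand-in for NACO normalization and does not implement the full rules; the record layout, helper names, and sample records are assumptions. The slides do not describe how normalized names were then merged into clusters, so this sketch groups exact normalized matches only.

```python
import re
import unicodedata
from collections import defaultdict


def normalize(name):
    """Rough stand-in for NACO normalization (not the full rules):
    strip diacritics, replace punctuation with spaces, upper-case."""
    decomposed = unicodedata.normalize("NFKD", name)
    no_marks = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    no_punct = re.sub(r"[^\w\s]", " ", no_marks)
    return " ".join(no_punct.upper().split())


def group_by_normalized(records, prefix):
    """Group publisher strings from records whose ISBN starts with `prefix`."""
    groups = defaultdict(set)
    for isbn, publisher in records:
        if isbn.replace("-", "").startswith(prefix):
            groups[normalize(publisher)].add(publisher)
    return dict(groups)


def non_singleton_share(groups):
    """Fraction of groups that contain more than one distinct string."""
    return sum(1 for members in groups.values() if len(members) > 1) / len(groups)


# Invented (ISBN, publisher string) pairs for illustration.
records = [
    ("0-19-852663-6", "Oxford University Press,"),
    ("0-19-852663-6", "OXFORD UNIVERSITY PRESS"),
    ("0198526636", "Oxford Univ. Press"),
    ("0-19-852663-6", "Clarendon Press ;"),
    ("0-306-40615-2", "Plenum Press"),
]
groups = group_by_normalized(records, "019")
for key, members in groups.items():
    print(key, sorted(members))
print(round(non_singleton_share(groups), 2))
```

Exact matching on the normalized form leaves "OXFORD UNIV PRESS" in its own group, which illustrates why a further clustering step and a decision about the acceptable level of similarity are needed.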
38 Challenges: Publisher Name Authority File
- Quality issue
- Level of acceptance for cluster
- What is acceptable?
- Subsidiaries and Relationships
- Oxford Auckland
- Examined manually to determine relationship
- Form of name
- What is acceptable?
- Likely to use the most prominent form of name
39 Questions and Discussion
- Contact Information
- connawal_at_oclc.org
- hearda_at_oclc.org
- Project Web Site
- http://www.oclc.org/research/projects/publisherns/