Matching names in parallel

1
Matching names in parallel
  • T. Hickey
  • Access 2006
  • 2006 October

2
Virtual International Authority File
  • Link national authority records
  • Build on their authority work
  • Move towards universal bibliographic control
  • Allow national or regional variations in
    authorized forms to co-exist
  • Support needs for variations in preferred
    language, script, and spelling
  • 10 million WorldCat records in non-English
    metadata

3
Joint VIAF Project
4
Matching Variations
  • In the LCNAF and PND authority files:
    • Same name, same person
    • Same name, different people
    • Different names, same person
    • Missing person in one file

5
Two Different People, One Name
  • Adams, Mike
    • PND: a golfer
    • LCNAF: author of a Beatles collector's guide

6
One Person, Two Names
  • LCNAF: Morel, Pierre
  • PND: Morellus, Petrus

7
Enhancing the Authorities
8
Strong Matching Attributes
  • A work (title) in common
  • Common control numbers (ISBN, ISSN, or LCCN)
  • Exact birth and death year
  • Joint authors
  • Name as subject

9
Weaker Attributes
  • Only one of birth/death date(s) (allows some
    variation)
  • Subject area of works (two levels)
  • Format (books, films, musical scores, etc.)
  • Language
  • Publisher
  • Partial title match
  • Date of publication
  • Country
  • Role (author, illustrator, composer, etc.)
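
One way to combine strong and weaker attributes into a single score is a weighted sum. The sketch below is illustrative only: the weights, the record fields (`titles`, `birth`, `death`, `languages`) and the function name `match_score` are assumptions, not the values or code used in VIAF, and only a few of the listed attributes are covered.

```python
# Hypothetical weights; the presentation does not give the actual values.
STRONG_WEIGHT = 1.0
WEAK_WEIGHT = 0.25


def match_score(a, b):
    """Score two name records (dicts) by the attributes they share."""
    score = 0.0
    # Strong: at least one work title in common.
    if set(a.get("titles", ())) & set(b.get("titles", ())):
        score += STRONG_WEIGHT
    dates_a = (a.get("birth"), a.get("death"))
    dates_b = (b.get("birth"), b.get("death"))
    if None not in dates_a and dates_a == dates_b:
        # Strong: birth and death years both known and identical.
        score += STRONG_WEIGHT
    else:
        pairs = list(zip(dates_a, dates_b))
        agree = [x is not None and x == y for x, y in pairs]
        conflict = [
            x is not None and y is not None and x != y for x, y in pairs
        ]
        # Weaker: one year agrees and the other does not contradict it.
        if any(agree) and not any(conflict):
            score += WEAK_WEIGHT
    # Weaker: a language in common.
    if set(a.get("languages", ())) & set(b.get("languages", ())):
        score += WEAK_WEIGHT
    return score
```

A real matcher would presumably also need a decision threshold and handling for conflicting evidence; both are left out here.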

10
Computing it
  • Standard approach
    • Generate keys and data
    • Load information into a database
    • Index it
    • Extract fields needed
  • Map/Reduce approach
    • Split the database up
    • Run parallel jobs
    • Bring information together via map/reduce
    • Assemble information in stages

11
Map/Reduce
  • Two stages
  • Map
    • Read in source file (e.g. MARC-21)
    • Write out key: data
  • Reduce
    • Read in array of data for each unique key
    • Write out key: data
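
The two stages can be sketched in a few lines of single-process Python. This is a minimal illustration, not the implementation described in this talk: the grouping that a cluster would do by sorting and shipping keys between nodes is done here with an in-memory dictionary, and the record layout and the names `map_reduce`, `name_mapper` and `title_reducer` are assumptions.

```python
from collections import defaultdict


def map_reduce(records, mapper, reducer):
    """Run a single-process map/reduce over an iterable of records."""
    groups = defaultdict(list)
    # Map: each record yields zero or more (key, data) pairs.
    for record in records:
        for key, data in mapper(record):
            groups[key].append(data)
    # Reduce: each unique key sees the full list of its data.
    output = {}
    for key in sorted(groups):
        for out_key, out_data in reducer(key, groups[key]):
            output[out_key] = out_data
    return output


def name_mapper(record):
    # Hypothetical record layout: {"name": ..., "title": ...}.
    yield record["name"], record["title"]


def title_reducer(name, titles):
    # Collapse duplicate titles for one name into a sorted list.
    yield name, sorted(set(titles))
```

Calling `map_reduce(records, name_mapper, title_reducer)` returns one entry per name with its de-duplicated titles.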

12
Overview of MapReduce
Source: Dean & Ghemawat (Google)
13
Our Implementation
  • Written in Python
  • Uses ssh and XML-RPC for control and
    communication
  • Map/Reduce seems to add 10% overhead
  • Ran an earlier implementation on a 48-CPU cluster
  • Current VIAF cluster is a 12-CPU cluster on 4
    nodes
  • Running Linux and 64-bit Python

14
VIAF Matching Code
  • 17 modules
  • 1,100 lines of code
  • Plus
    • 600 lines of configuration
    • 2,755 lines of tables embedded in code

15
VIAF Data Flow
  • (Diagram) Inputs: PND Catalog, PND Authority, LC Catalog, LC Authority
  • Extract data from each input
  • Build buckets: surname, forename, date
  • Eliminate forename and date conflicts from buckets
  • Get changed IDs (changed authority IDs)
  • Identify and select compare data for potential pairs
  • Pair ID: bib/auth ID
  • Pair ID: compare data
  • Compare
  • Pair ID: scores
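
The bucket steps in the data flow can be sketched as below. This is only an illustration under assumptions: the field names (`id`, `source`, `surname`, `forename`, `date`), the lowercase-surname bucket key, and the exact-inequality conflict test are all hypothetical, and the diagram does not say how the real code normalizes names.

```python
from collections import defaultdict
from itertools import product


def build_buckets(records):
    """Group records by a normalized surname (assumed bucket key)."""
    buckets = defaultdict(list)
    for rec in records:
        buckets[rec["surname"].lower()].append(rec)
    return buckets


def conflicts(a, b):
    """True when forenames or dates are both present and disagree."""
    for field in ("forename", "date"):
        x, y = a.get(field), b.get(field)
        if x and y and x != y:
            return True
    return False


def potential_pairs(records):
    """Yield (lc_id, pnd_id) pairs that share a bucket without conflict."""
    for bucket in build_buckets(records).values():
        lc = [r for r in bucket if r["source"] == "LC"]
        pnd = [r for r in bucket if r["source"] == "PND"]
        for a, b in product(lc, pnd):
            if not conflicts(a, b):
                yield a["id"], b["id"]
```

Bucketing first keeps the number of comparisons proportional to bucket sizes rather than to the product of the two files.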
16
WorldCat Identities
  • Bring together all of WorldCat's information
    about people
    • Name(s)
    • Works by and about
    • Subjects
    • Dates
    • Fiction/non-fiction
    • Roles
    • Co-authors
  • Add links
    • Wikipedia
    • Authority files

17
Sample Identity
18
Statistics
  • Nearly 19 million different identities in
    WorldCat
  • 80 million (nominally) controlled headings
  • The WorldCat Identity code is 800 lines of
    Python in 4 modules (plus XSLT, CSS, etc.)

19
Identities Data Flow
  • (Diagram) Data labels: WorldCat, Authorities, Cover Art, FRBR, Audience,
    Wikipedia, NameInfo, Citations, Identities
  • (Diagram) Processing steps: Stage 1, Stage 2, Stage 3, Stage 4
20
Identities Stage 1: Extract Data From WorldCat
  • Input: WorldCat (MARC-21)
  • Map output
    • NameKey: <nameInfo>
    • WorkID: <citation>
  • Reduce output
    • WorkID: <best citation>
    • NameKey: <cumulative nameInfo>
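
A mapper and reducer with the shape of Stage 1 might look like the sketch below. The record fields (`work_id`, `title`, `holdings`, `names`), the tagged-tuple keys, and the rule that the "best" citation is the one with the most holdings are all assumptions made for illustration; the slide does not define them.

```python
def stage1_map(record):
    """Emit one citation per work and one nameInfo per name on a record."""
    yield ("work", record["work_id"]), {
        "title": record["title"],
        "holdings": record["holdings"],
    }
    for name in record["names"]:
        # Lowercasing stands in for whatever key normalization is used.
        yield ("name", name.lower()), {
            "records": 1,
            "holdings": record["holdings"],
        }


def stage1_reduce(key, values):
    """Keep the best citation per work; accumulate nameInfo per name."""
    kind, _ = key
    if kind == "work":
        # Assumption: "best" means the citation with the most holdings.
        yield key, max(values, key=lambda c: c["holdings"])
    else:
        yield key, {
            "records": sum(v["records"] for v in values),
            "holdings": sum(v["holdings"] for v in values),
        }
```

Tagging each key with its kind lets one reducer handle both output types, which is one way to emit two kinds of key from a single pass over the input.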

21
Identities Stage 2: Extract Data From Authorities
  • Input: NACO Authorities file (MARC-21)
  • Map output
    • NameKey: <authorityInfo>
    • XTos
    • XFroms
  • Reduce output
    • NameKey: <authorityInfo, symmetric xrefs>
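
Making cross-references symmetric fits the map/reduce shape naturally: the mapper emits each reference in both directions, and the reducer for a key then sees both what it points to and what points to it. The sketch below assumes hypothetical fields (`name_key`, `heading`, `see_also`) and is not taken from the actual code.

```python
def stage2_map(auth):
    """Emit authority info plus both directions of each cross-reference."""
    key = auth["name_key"]
    yield key, ("info", auth["heading"])
    for target in auth.get("see_also", ()):
        yield key, ("xto", target)    # this name points to target
        yield target, ("xfrom", key)  # target is pointed to by this name


def stage2_reduce(key, values):
    """Merge info with the union of outgoing and incoming references."""
    heading = None
    xrefs = set()
    for kind, value in values:
        if kind == "info":
            heading = value
        else:
            xrefs.add(value)
    yield key, {"heading": heading, "xrefs": sorted(xrefs)}
```

A key that is only ever referenced, with no authority record of its own, comes out with `heading` set to `None`.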

22
Identities Stage 3: Connect Citations with Names
  • Input
    • Stage 1 output
    • WorkID: <by/about citation>s
    • NameKey: <nameInfo>
  • Map output
    • NameKey: <nameInfo>
    • NameKey: <topCitations>

23
Identities Stage 4: Create Identities
  • Input
    • Authority info from stage 2
    • Merged name info from stage 3
    • Merged citations from stage 3
  • Map output
    • Pass through
  • Reduce output
    • Pnkey: <Identity Record>

24
Schedules
  • Identities
    • Up this year?
  • VIAF
    • Reload, rematch this year
    • Public service up early 2007

25
Conclusions
  • Our merged files (e.g. WorldCat) are really quite
    large
  • More processing power opens up new ways of
    manipulating and looking at our data
  • Parallel processing is the only way to obtain the
    cycles needed
  • Map-Reduce is an attractive way to do parallel
    processing
    • Forces decomposition
    • Scales well
    • Opens up new possibilities

26
Thank you
  • T. Hickey
  • VIAF.org
  • http://errol.oclc.org/laf/n82-54463.html
  • Access 2006
  • 2006 October