Title: Controlled Vocabularies: Name Authority Control
1Controlled Vocabularies: Name Authority Control
- University of California, Berkeley
- School of Information Management and Systems
- SIMS 202 Information Organization and Retrieval
Slide authors: Ray R. Larson, Marti Hearst
2Review
- Dublin Core
- Other Metadata Systems
- Cognitive basis of categorization
3Dublin Core Elements
- Title
- Creator
- Subject
- Description
- Publisher
- Other Contributors
- Date
- Resource Type
- Format
- Resource Identifier
- Source
- Language
- Relation
- Coverage
- Rights Management
4Issues in Dublin Core
- Lack of guidance on what to put into each element
- How to structure or organize content at the
element level?
- How to ensure consistency across descriptions for
the same persons, places, things, etc.?
5More Metadata Systems
- The following is a sample of metadata systems
for a variety of special types of
data/documents/objects.
6Types of Metadata systems and standards
- Naming and ID systems: URLs, ISBNs
- Bibliographic description: MARC, Dublin Core,
TEI, etc.
- Music: SMDL
- Images and objects: CIMI, VRA Core Categories
- Numeric Data: DDI, SDSM
- Geospatial Data: FGDC
- Collections: EAD
7Metadata Resources
- Check the Links section from the class home page
- The best site is the Digital Library Metadata
Resources page from IFLA at
http://www.ifla.org/II/metadata.htm
8Today
- More on Controlled vocabularies
- Choice of names
- Form of names
- Name Authority files
- Types of Controlled Vocabularies
- Faceted vs. Hierarchic organization of
vocabularies
9Controlled Vocabularies
- Vocabulary control is the attempt to provide a
standardized and consistent set of terms (such as
subject headings, names, classifications, etc.)
with the intent of aiding the searcher in finding
information.
10Controlled Vocabularies
- Names and name authorities and Other Types of
Controlled Vocabulary (Today)
- Design of controlled vocabularies for subject
access -- Thesaurus design (Thursday)
11Names
- Cutter's objectives of bibliographic description
- To enable a person to find a document of which
the author is known
- To show what the library has by a given author
- First serves access
- Second serves collocation
12Problems with Names
- How many names should be associated with a
document?
- Which of these should be the main entry?
- What form should each of the names take?
- What references should be made from other
possible forms of names that haven't been used?
13The problem
- Proliferation of the forms of names
- Different names for the same person
- Different people with the same names
- Examples
- from Books in Print (semi-controlled but not
consistent)
- ERIC author index (not controlled)
14Rules for description
- AACR II and other sets of descriptive cataloging
rules provide guidelines for:
- Determining the number of name entries
- Choosing a main entry
- Deciding on the form of name to be used
- Deciding when to make references
15Authority control
- Authority control is concerned with the creation
and maintenance of a set of terms that have been
chosen as the standard representatives (also
known as "established") based on some set of rules.
- If you have rules, why do you need to keep track
of all of the headings? Can't you just infer the
headings from the rules?
16Conditions of Authorship?
- Single person or single corporate entity
- Unknown or anonymous authors
- Fictitiously ascribed works
- Shared responsibility
- Collections or editorially assembled works
- Works of mixed responsibility (e.g. translations)
- Related Works
17Added Entries
- Personal names
- Collaborators
- Editors, compilers, writers
- Translators (in some cases)
- Illustrators (in some cases)
- Other persons associated with the work (such as
the honoree in a Festschrift).
- Corporate Names
- Any prominently named corporate body that has
involvement in the work beyond publication,
distribution, etc.
18Choice of Name
- AACR II says that the predominant form of the
name used in a particular author's writings
should be chosen as the form of name.
- References should be made from the other forms of
the name.
19Form of the Name
- When names appear in multiple forms, one form
needs to be chosen. Criteria for choice are:
- Fullness (e.g. Full names vs. initials only)
- Language of the name.
- Spelling (choose predominant form)
- Entry element
- John Smith or Smith, John?
- Mao Zedong or Zedong, Mao? (Mao Tse Tung?)
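The entry-element question can be made concrete with a small sketch. The helper below is hypothetical (the name `entry_form` and the `family_first` flag are assumptions, not part of AACR II); it only shows why a rule that always inverts on the last word gets "Mao Zedong" wrong.

```python
def entry_form(name: str, family_first: bool = False) -> str:
    """Turn a name in direct order into 'Entry element, Rest of name'.

    family_first=False assumes the family name is the LAST word ("John Smith").
    family_first=True assumes it is the FIRST word ("Mao Zedong").
    Compound surnames, prefixes, and titles are deliberately not handled.
    """
    parts = name.split()
    if len(parts) < 2:
        return name.strip()          # single-word names have no inversion
    if family_first:
        family, given = parts[0], " ".join(parts[1:])
    else:
        family, given = parts[-1], " ".join(parts[:-1])
    return f"{family}, {given}"


print(entry_form("John Smith"))                     # Smith, John
print(entry_form("Mao Zedong", family_first=True))  # Mao, Zedong
print(entry_form("Mao Zedong"))                     # Zedong, Mao (wrong entry element)
```

The flag has to be supplied by someone who knows the naming convention, which is one reason the chosen form is recorded in an authority file rather than recomputed each time.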
20Name Authority Files
ID:NAFL8057230 ST:p EL:n STH:a MS:c UIP:a TD:19910821174242
KRC:a NMU:a CRC:c UPN:a SBU:a SBC:a DID:n DF:05-14-80
RFE:a CSC: SRU:b SRT:n SRN:n TSS: TGA:? ROM:? MOD: VST:d 08-21-91
Other Versions: earlier
040 DLC $c DLC $d DLC $d OCoLC
053 PR6005.R517
100 10 Creasey, John
400 10 Cooke, M. E.
400 10 Cooke, Margaret, $d 1908-1973
400 10 Cooper, Henry St. John, $d 1908-1973
400 00 Credo, $d 1908-1973
400 10 Fecamps, Elise
400 10 Gill, Patrick, $d 1908-1973
400 10 Hope, Brian, $d 1908-1973
400 10 Hughes, Colin, $d 1908-1973
400 10 Marsden, James
400 10 Matheson, Rodney
400 10 Ranger, Ken
400 20 St. John, Henry, $d 1908-1973
400 10 Wilde, Jimmy
500 10 $w nnnc $a Ashe, Gordon, $d 1908-1973
Different names for the same person
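A minimal sketch of how such a record supports lookup: every variant form (the 400 fields) points to the one established heading (the 100 field). The class and function names here are assumptions made for illustration, and only a few of the variants from the Creasey record are included.

```python
from dataclasses import dataclass, field


@dataclass
class AuthorityRecord:
    heading: str                                   # established form (100)
    see_from: list = field(default_factory=list)   # variant forms (400)
    see_also: list = field(default_factory=list)   # related headings (500)


def normalize(name: str) -> str:
    """Fold case, punctuation, and spacing so 'Cooke, M.E.' matches 'Cooke, M. E.'."""
    cleaned = name.lower().replace(".", " ").replace(",", " ")
    return " ".join(cleaned.split())


def build_index(records):
    """Map every heading and every variant to the established heading."""
    index = {}
    for rec in records:
        for form in [rec.heading, *rec.see_from]:
            index[normalize(form)] = rec.heading
    return index


def lookup(index, name):
    """Return the established heading for a name, or None if it is unknown."""
    return index.get(normalize(name))


creasey = AuthorityRecord(
    heading="Creasey, John",
    see_from=["Cooke, M. E.", "Fecamps, Elise", "Marsden, James", "Wilde, Jimmy"],
    see_also=["Ashe, Gordon, 1908-1973"],
)
index = build_index([creasey])

print(lookup(index, "cooke, m.e."))   # Creasey, John
print(lookup(index, "Wilde, Jimmy"))  # Creasey, John
```

The "see also" heading is kept separate on purpose: it is an established heading in its own right, so it is offered to the searcher rather than silently replaced.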
21Name Authority Files
ID:NAFO9114111 ST:p EL:n STH:a MS:n UIP:a TD:19910817053048
KRC:a NMU:a CRC:c UPN:a SBU:a SBC:a DID:n DF:06-03-91
RFE:a CSC:c SRU:b SRT:n SRN:n TSS: TGA:? ROM:? MOD: VST:d 08-19-91
040 OCoLC $c OCoLC
100 10 Marric, J. J., $d 1908-1973
500 10 $w nnnc $a Creasey, John
663 Works by this author are entered under the name used in the item. For a listing of other names used by this author, search also under $b Creasey, John
670 OCLC 13441825: His Gideon's day, 1955 $b (hdg.: Creasey, John; usage: J.J. Marric)
670 LC data base, 6/10/91 $b (hdg.: Creasey, John; usage: J.J. Marric)
670 Pseuds. and nicknames dict., c1987 $b (Creasey, John, 1908-1973; British author; pseud.: Marric, J. J.)
22Name authority files
ID:NAFL8166762 ST:p EL:n STH:a MS:c UIP:a TD:19910604053124
KRC:a NMU:a CRC:c UPN:a SBU:a SBC:a DID:n DF:08-20-81
RFE:a CSC: SRU:b SRT:n SRN:n TSS: TGA:? ROM:? MOD: VST:d 06-06-91
Other Versions: earlier
040 DLC $c DLC $d DLC $d OCoLC
100 10 Butler, William Vivian, $d 1927-
400 10 Butler, W. V. $q (William Vivian), $d 1927-
400 10 Marric, J. J., $d 1927-
670 His The durable desperadoes, 1973.
670 His The young detective's handbook, c1981 $b t.p. (W.V. Butler)
670 His Gideon's way, 1986 $b CIP t.p. (William Vivian Butler writing as J.J. Marric)
Different people writing with the same name
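The J. J. Marric case shows why a name alone cannot be a unique key. The sketch below (helper names are assumptions) indexes the two Marric forms from the records above with and without their date qualifiers: without dates the name leads to two different people, with dates each form leads to exactly one heading.

```python
from collections import defaultdict

# (form of name found in the authority file, established heading it leads to)
references = [
    ("Marric, J. J., 1908-1973", "Creasey, John"),
    ("Marric, J. J., 1927-", "Butler, William Vivian, 1927-"),
]


def strip_dates(form: str) -> str:
    """Drop a trailing date qualifier such as ', 1908-1973' or ', 1927-'."""
    head, sep, tail = form.rpartition(", ")
    if sep and tail[:1].isdigit():
        return head
    return form


by_name = defaultdict(list)   # name without dates -> all headings it may mean
by_form = {}                  # full form with dates -> single heading
for form, heading in references:
    by_name[strip_dates(form)].append(heading)
    by_form[form] = heading

print(by_name["Marric, J. J."])          # two candidate headings
print(by_form["Marric, J. J., 1927-"])   # Butler, William Vivian, 1927-
```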
23Other Types of Controlled Vocabularies
- Gazetteers (Geographic Names)
- Code lists (e.g. LC Language Codes)
- Subject Heading Lists
- Classification Schemes
- Thesauri
24Structure of an IR System
[Diagram: Information Storage and Retrieval System, adapted from Soergel, p. 19]
- Rules of the game: rules for subject indexing; Thesaurus (which consists of Lead-In Vocabulary and Indexing Language)
- Search Line: Interest profiles and Queries -> Formulating query in terms of descriptors -> Storage of profiles -> Store 1: Profiles / Search requests
- Storage Line: Documents and data -> Indexing (Descriptive and Subject) -> Storage of Documents -> Store 2: Document representations
- Comparison / Matching of Store 1 against Store 2 -> Potentially Relevant Documents
25Uses of Controlled Vocabularies
- Library Subject Headings, Classification and
Authority Files.
- Commercial Journal Indexing Services and
databases
- Yahoo, and other Web classification schemes
- Online and Manual Systems within organizations
- SunSolve
- MacArthur
26Types of Indexing Languages
- Uncontrolled Keyword Indexing
- Indexing Languages
- Controlled, but not structured
- Thesauri
- Controlled and Structured
- Classification Systems
- Controlled, Structured, and Coded
- Faceted Classification Systems
27Indexing Languages
- An index is a systematic guide designed to
indicate topics or features of documents in order
to facilitate retrieval of documents or parts of
documents.
- An indexing language is the set of terms used in
an index to represent topics or features of
documents, and the rules for combining or using
those terms.
28Indexing Languages
- Library of Congress Subject Headings
- Yellow Pages Topics
- Wilson Indexes (Readers' Guide)
29Thesauri
- A Thesaurus is a collection of selected
vocabulary (preferred terms or descriptors) with
links among Synonymous, Equivalent, Broader,
Narrower and other Related Terms
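One way to picture this definition is as a small data structure. The sketch below is an assumption-laden illustration, not a standard implementation: the class names are invented, and the sample terms ("Cats", "Mammals", "Felines", "Pets") are made up for the example rather than taken from any published thesaurus.

```python
from dataclasses import dataclass, field


@dataclass
class Term:
    label: str
    broader: set = field(default_factory=set)    # BT
    narrower: set = field(default_factory=set)   # NT
    related: set = field(default_factory=set)    # RT


class Thesaurus:
    def __init__(self):
        self.terms = {}   # preferred term (descriptor) -> Term
        self.use = {}     # lead-in (non-preferred) term -> preferred term

    def add_term(self, label):
        return self.terms.setdefault(label, Term(label))

    def add_broader(self, narrow, broad):
        """Record BT and its reciprocal NT in one step."""
        self.add_term(narrow).broader.add(broad)
        self.add_term(broad).narrower.add(narrow)

    def add_related(self, a, b):
        """RT links are symmetric."""
        self.add_term(a).related.add(b)
        self.add_term(b).related.add(a)

    def add_use(self, lead_in, preferred):
        """Synonym control: 'lead_in USE preferred'."""
        self.add_term(preferred)
        self.use[lead_in] = preferred

    def preferred(self, word):
        """Return the descriptor for a word, or None if it is not in the vocabulary."""
        if word in self.terms:
            return word
        return self.use.get(word)


thesaurus = Thesaurus()
thesaurus.add_broader("Cats", "Mammals")
thesaurus.add_related("Cats", "Pets")
thesaurus.add_use("Felines", "Cats")

print(thesaurus.preferred("Felines"))     # Cats
print(thesaurus.terms["Mammals"].narrower)
```

Storing reciprocal links at insertion time keeps the broader/narrower structure consistent, which is the property the thesaurus standards ask for.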
30Thesauri (cont.)
- National and International Standards for Thesauri
- ANSI/NISO Z39.19-1994 -- American National
Standard Guidelines for the Construction, Format
and Management of Monolingual Thesauri
- ANSI/NISO Draft Standard Z39.4-199x -- American
National Standard Guidelines for Indexes in
Information Retrieval
- ISO 2788 -- Documentation -- Guidelines for the
establishment and development of monolingual
thesauri
- ISO 5964 -- Documentation -- Guidelines for the
establishment and development of multilingual
thesauri
31Thesauri (cont.)
- Examples
- The ERIC Thesaurus of Descriptors
- The Art and Architecture Thesaurus
- The Medical Subject Headings (MeSH) of the
National Library of Medicine
32Hierarchical vs. Faceted (Subject Heading vs.
Descriptor) Category Systems
Slide author Marti Hearst
33Controlled Vocabulary (The following slides
follow Bates 88)
- Start with the text of the document
- Attempt to control or regularize
- The concepts expressed within
- mutually exclusive
- exhaustive
- The language used to express those concepts
- limit the normal linguistic variations
- regulate word order and structure of phrases
- reduce the number of synonyms or near-synonyms
- Also, provide cross-references between concepts
and their expression.
Slide author Marti Hearst
34Classification Schemes
- Classify possible concepts.
- Goals
- Completely distinct conceptual categories
(mutually exclusive)
- Complete coverage of conceptual categories
(exhaustive)
Slide author Marti Hearst
35Assigning Headings vs. Descriptors
- Subject headings
- assign one (or a few) complex heading(s) to the
document
- Descriptors
- Mix and match
How would we describe recipes using each
technique?
Slide author Marti Hearst
36Subject Heading vs. Descriptor
- WILSONLINE
- Athletes
- Athletes--Health and Hygiene
- Athletes--Nutrition
- Athletes--Physical Exams
-
- Athletics
- Athletics -- Administration
- Athletics -- Equipment -- Catalogs
-
- Sports -- Accidents and injuries
- Sports -- Accidents and injuries -- prevention
- ERIC
- Athletes
- Athletic Coaches
- Athletic Equipment
- Athletic Fields
- Athletics
-
- Sports psychology
- Sportsmanship
Slide author Marti Hearst
37Subject Headings vs. Descriptors
- Descriptors
- Describe one concept within a document
- Designed to be used in Boolean searching
- Combine to describe the desired document
- Many (5-25) descriptors per document
- Subject Headings
- Describe the contents of an entire document
- Designed to be looked up in an alphabetical index
- Look up document under its heading
- Few (1-5) headings per document
Slide author Marti Hearst
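The contrast can be sketched with the recipe question from the earlier slide. In this hypothetical example (the recipes, descriptors, and function names are all invented), each document carries several single-concept descriptors, and the searcher combines them with Boolean AND at query time instead of looking up one pre-built heading.

```python
from collections import defaultdict

# Hypothetical recipe collection: document id -> assigned descriptors
recipes = {
    "r1": {"Chicken", "Grilling", "Summer"},
    "r2": {"Chicken", "Baking"},
    "r3": {"Vegetables", "Grilling"},
}


def build_inverted_index(docs):
    """Map each descriptor to the set of documents it was assigned to."""
    index = defaultdict(set)
    for doc_id, descriptors in docs.items():
        for descriptor in descriptors:
            index[descriptor].add(doc_id)
    return index


def search_all(index, *descriptors):
    """Boolean AND: documents that carry every requested descriptor."""
    postings = [index.get(d, set()) for d in descriptors]
    return set.intersection(*postings) if postings else set()


index = build_inverted_index(recipes)
print(search_all(index, "Chicken", "Grilling"))   # {'r1'}
print(sorted(search_all(index, "Grilling")))      # ['r1', 'r3']
```

A subject-heading approach would instead file r1 once under a single complex heading such as "Chicken -- Grilling", and the searcher would have to guess that heading's wording and word order.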
38Hierarchical Classification
- Each category is successively broken down into
smaller and smaller subdivisions
- No item occurs in more than one subdivision
- Each level is divided out by a characteristic of
division (also known as a feature).
- Example: distinguish Literature based on
- Language
- Genre
- Time Period
Slide author Marti Hearst
39Hierarchical Classification
Slide author Marti Hearst
40Labeled Categories for Hierarchical Classification
- LITERATURE
- 100 English Literature
- 110 English Prose
- English Prose 16th Century
- English Prose 17th Century
- English Prose 18th Century
- ...
- 111 English Poetry
- 121 English Poetry 16th Century
- 122 English Poetry 17th Century
- ...
- 112 English Drama
- 130 English Drama 16th Century
-
- 200 French Literature
Slide author Marti Hearst
41Faceted Classification
- Create a separate, free-standing list for each
characteristic of division (feature).
- Combine features to create a classification.
Slide author Marti Hearst
42Faceted Classification along with Labeled
Categories
- A Language
- a English
- b French
- c Spanish
- B Genre
- a Prose
- b Poetry
- c Drama
- C Period
- a 16th Century
- b 17th Century
- c 18th Century
- d 19th Century
- Aa English Literature
- AaBa English Prose
- AaBaCa English Prose 16th Century
- AbBbCd French Poetry 19th Century
- BcCd Drama 19th Century
Slide author Marti Hearst
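A short sketch of how such notations combine, using the three facet lists from the slide. The function names and the rule for filling in the word "Literature" when no genre is given are assumptions made for this example.

```python
import re

# Facet tables copied from the slide: facet letter -> (facet name, value codes)
FACETS = {
    "A": ("Language", {"a": "English", "b": "French", "c": "Spanish"}),
    "B": ("Genre", {"a": "Prose", "b": "Poetry", "c": "Drama"}),
    "C": ("Period", {"a": "16th Century", "b": "17th Century",
                     "c": "18th Century", "d": "19th Century"}),
}


def decode(notation: str) -> dict:
    """Split a notation such as 'AaBaCa' into {facet name: value}."""
    pairs = re.findall(r"([A-Z])([a-z])", notation)
    if "".join(f + v for f, v in pairs) != notation:
        raise ValueError(f"malformed notation: {notation!r}")
    # An unknown facet letter or value code raises KeyError.
    return {FACETS[f][0]: FACETS[f][1][v] for f, v in pairs}


def label(notation: str) -> str:
    """Build a readable category label from whichever facets are present."""
    values = decode(notation)
    words = [values[n] for n in ("Language", "Genre", "Period") if n in values]
    if "Genre" not in values:
        # Assumption: with no genre facet, fall back to the generic "Literature".
        words.insert(1 if "Language" in values else 0, "Literature")
    return " ".join(words)


print(label("Aa"))       # English Literature
print(label("AaBaCa"))   # English Prose 16th Century
print(label("AbBbCd"))   # French Poetry 19th Century
```

Because each facet is an independent list, any facet can be omitted, which a strict hierarchy cannot express without adding a new branch.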
43Important Question: How to use both types of
classification structures?
- How to look through them?
- How to use them in search?
Slide author Marti Hearst
44Classification Systems
- A classification system is an indexing language
often based on a broad ordering of topical areas.
Thesauri and classification systems both use this
broad ordering and maintain a structure of
broader, narrower, and related topics.
Classification schemes commonly use a coded
notation for representing a topic and its place
in relation to other terms.
45Classification Systems (cont.)
- Examples
- The Library of Congress Classification System
- The Dewey Decimal Classification System
- The ACM Computing Reviews Categories
- The American Mathematical Society Classification
System
46Automatic Indexing and Classification
- Automatic indexing is typically the simple
deriving of keywords from a document and
providing access to all of those words.
- More complex Automatic Indexing Systems attempt
to select controlled vocabulary terms based on
terms in the document.
- Automatic classification attempts to
automatically group similar documents using
either:
- A fully automatic clustering method.
- An established classification scheme and set of
documents already indexed by that scheme.
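The simplest case in the list above, deriving keywords and providing access to all of them, can be sketched in a few lines. The tokenizer, the tiny stopword list, and the sample texts are assumptions chosen for illustration; real systems use far larger stoplists and often stemming.

```python
import re
from collections import defaultdict

# Assumed, deliberately tiny stopword list
STOPWORDS = {"a", "an", "and", "for", "in", "is", "of", "the", "to"}


def keywords(text: str) -> list:
    """Lowercase the text, split it into words, and drop stopwords."""
    return [w for w in re.findall(r"[a-z0-9]+", text.lower())
            if w not in STOPWORDS]


def build_keyword_index(docs: dict) -> dict:
    """Uncontrolled keyword index: every derived word points to its documents."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in keywords(text):
            index[word].add(doc_id)
    return index


docs = {
    "d1": "The Design of a Thesaurus",
    "d2": "Thesaurus construction and indexing",
}
keyword_index = build_keyword_index(docs)

print(keywords(docs["d1"]))                  # ['design', 'thesaurus']
print(sorted(keyword_index["thesaurus"]))    # ['d1', 'd2']
```

Nothing here controls vocabulary: "thesauri" and "thesaurus" would be indexed as unrelated words, which is the gap the more complex systems in the second bullet try to close.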
47Clustering
Agglomerative methods: Polythetic, Exclusive or
Overlapping, Unordered; clusters are
order-dependent.
[Figure: document (Doc) nodes being clustered]
Rocchio's method
1. Select initial centers (i.e. seed the space)
2. Assign docs to highest matching centers and
compute centroids
3. Reassign all documents to centroid(s)
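The three steps can be sketched as follows. This is a simplified illustration in the spirit of the slide, not Rocchio's full algorithm: it assumes term-frequency vectors, cosine similarity as the matching function, exclusive assignment, and seeds chosen by the caller. All function names and sample documents are invented.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[t] * b.get(t, 0) for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def centroid(vectors):
    """Average of a non-empty list of term-frequency vectors."""
    total = Counter()
    for vec in vectors:
        total.update(vec)
    return {term: count / len(vectors) for term, count in total.items()}


def assign(docs, centers):
    """Put each document in the cluster of its best-matching center."""
    clusters = [[] for _ in centers]
    for doc_id, vec in docs.items():
        best = max(range(len(centers)), key=lambda i: cosine(vec, centers[i]))
        clusters[best].append(doc_id)
    return clusters


def cluster(docs, seed_ids):
    centers = [docs[s] for s in seed_ids]              # 1. seed the space
    first_pass = assign(docs, centers)                 # 2. assign to centers...
    centroids = [centroid([docs[d] for d in members]) if members else centers[i]
                 for i, members in enumerate(first_pass)]  # ...compute centroids
    return assign(docs, centroids)                     # 3. reassign to centroids


docs = {
    "d1": Counter("cat cat dog".split()),
    "d2": Counter("cat dog dog".split()),
    "d3": Counter("stock bond".split()),
    "d4": Counter("bond stock stock".split()),
}
print(cluster(docs, seed_ids=["d1", "d3"]))   # [['d1', 'd2'], ['d3', 'd4']]
```

The result depends on which documents are picked as seeds, which is what "order-dependent" refers to on the slide.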
48Automatic Class Assignment
Automatic Class Assignment: Polythetic, Exclusive
or Overlapping, usually ordered; clusters are
order-independent, usually based on an
intellectually derived scheme.
[Figure: Doc nodes and a Search Engine]
1. Create pseudo-documents representing
intellectually derived classes.
2. Search using document contents
3. Obtain ranked list
4. Assign document to N categories ranked over
threshold, OR assign to top-ranked category
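A minimal sketch of these four steps, under stated assumptions: the "search engine" is replaced by plain cosine similarity over word counts, and the class names, pseudo-document texts, and function names are invented for the example.

```python
import math
from collections import Counter


def vectorize(text):
    """Bag-of-words term frequencies (whitespace tokenization, lowercased)."""
    return Counter(text.lower().split())


def cosine(a, b):
    dot = sum(a[t] * b.get(t, 0) for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_classes(doc_text, pseudo_docs):
    """Steps 2-3: use the document as a query and rank the class pseudo-documents."""
    doc = vectorize(doc_text)
    scored = [(cosine(doc, vectorize(text)), name)
              for name, text in pseudo_docs.items()]
    return sorted(scored, reverse=True)


def assign_classes(doc_text, pseudo_docs, threshold=None, n=None):
    """Step 4: top-ranked class, or up to n classes scoring at least threshold."""
    ranked = rank_classes(doc_text, pseudo_docs)
    if threshold is None:
        return [ranked[0][1]] if ranked else []
    return [name for score, name in ranked if score >= threshold][:n]


# Step 1: hypothetical pseudo-documents, one per intellectually derived class
pseudo_docs = {
    "Sports": "athletes athletics coaches sports equipment",
    "Cooking": "recipes baking grilling chicken vegetables",
}
document = "grilling chicken recipes for athletes"

print(assign_classes(document, pseudo_docs))                  # ['Cooking']
print(assign_classes(document, pseudo_docs, threshold=0.1))   # ['Cooking', 'Sports']
```

Unlike clustering, the outcome does not depend on the order in which documents arrive, because the classes are fixed in advance.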