Organizing Information: Metadata and Controlled Vocabularies - PowerPoint PPT Presentation

Title:

Organizing Information: Metadata and Controlled Vocabularies

Description:

Design of controlled vocabularies for subject access -- Thesaurus design. 12/11/98 ... A Thesaurus is a collection of selected vocabulary (preferred terms or ... – PowerPoint PPT presentation

Provided by: rayrl

1
Organizing Information: Metadata and Controlled
Vocabularies
  • Ray R. Larson
  • University of California, Berkeley
  • School of Information Management and Systems

2
Overview: Metadata and Controlled Vocabularies
  • Definitions
  • Origins and Uses of Controlled Vocabularies for
    Information Retrieval
  • Metadata
  • Types of Indexing Languages, Thesauri and
    Classification Systems
  • Process of Design and Development of Thesauri

3
Information Organization and Retrieval
  • To organize is to (1) furnish with organs, make
    organic, make into living tissue, become organic;
    (2) form into an organic whole; give orderly
    structure to; frame and put into working order;
    make arrangements for.
  • Knowledge is knowing, familiarity gained by
    experience; person's range of information; a
    theoretical or practical understanding of; the
    sum of what is known.
  • To retrieve is to (1) recover by investigation or
    effort of memory, restore to knowledge or recall
    to mind; regain possession of; (2) rescue from a
    bad state, revive, repair, set right.
  • Information is (1) informing, telling; thing
    told, knowledge, items of knowledge, news.

The Oxford English Dictionary, cf. Rowley
4
Information Properties
  • Information can be communicated electronically
    • Broadcasting
    • Networking
  • Information can be easily duplicated and shared
    • Problems of Ownership
    • Problems of Control

Adapted from Silicon Dreams by Robert W. Lucky
5
Information Hierarchy
  • Data
    • The raw material of information
  • Information
    • Data organized and presented by someone
  • Knowledge
    • Information read, heard or seen and understood
  • Wisdom
    • Distilled and integrated knowledge and
      understanding

6
Information Hierarchy
Wisdom
Knowledge
Information
Data
7
Information Life Cycle
8
Information Life Cycle
  • Authoring/Modifying
  • Organizing/Indexing
  • Storing/Retrieving
  • Distribution/Networking
  • Accessing/Filtering
  • Using/Creating

9
Origins
  • Very early history of content representation
  • Sumerian tokens and envelopes
  • Alexandria - pinakes
  • Indices

10
Origins
  • Biblical Indexes and Concordances (Hugo de St.
    Caro, 500 monks, 1247 -- KWIC)
  • Journal Indexes
  • Information Explosion following WWII
  • Cranfield Studies of indexing languages and
    information retrieval
  • Development of bibliographic databases
  • Index Medicus -- production and Medlars searching

11
Origins
  • Communication theory revisited
  • Problems with transmission of meaning

Noise
12
Structure of an IR System
Search Line
Adapted from Soergel, p. 19
13
Metadata
  • Data about data
  • Information about Information
  • Description of information structure and contents
    for individual information items, or entire
    collections of information

14
Types of Metadata
  • Element names.
  • Element description.
  • Element representation.
  • Element coding.
  • Element semantics.
  • Element classification.

15
Metadata Systems
  • AACR II/MARC
  • Dublin Core
  • RDF (Resource Description Framework)
  • SGML/XML
  • DBMS Metadata
  • Controlled vocabularies

16
Goals of Descriptive Cataloging (AACR II/MARC)
  • 1. To enable a person to find a document of which
    • the author, or
    • the title, or
    • the subject is known
  • 2. To show what a library has
    • by a given author
    • on a given subject (and related subjects)
    • in a given kind (or form) of literature.
  • 3. To assist in the choice of a document
    • as to its edition (bibliographically)
    • as to its character (literary or topical)

Charles A. Cutter, 1876
17
Dublin Core Elements
  • Title
  • Creator
  • Subject
  • Description
  • Publisher
  • Other Contributors
  • Date
  • Resource Type
  • Format
  • Resource Identifier
  • Source
  • Language
  • Relation
  • Coverage
  • Rights Management
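
As a rough illustration of how the fifteen elements above might be used, the sketch below describes this presentation as a simple record and checks it against the element list. The `record` values marked as assumed, and the helper names `unknown_elements` and `missing_elements`, are illustrative and not part of the Dublin Core specification.

```python
# The fifteen element names exactly as listed on the slide above.
DUBLIN_CORE_ELEMENTS = [
    "Title", "Creator", "Subject", "Description", "Publisher",
    "Other Contributors", "Date", "Resource Type", "Format",
    "Resource Identifier", "Source", "Language", "Relation",
    "Coverage", "Rights Management",
]

# A hypothetical description of this slide deck.
record = {
    "Title": "Organizing Information: Metadata and Controlled Vocabularies",
    "Creator": "Larson, Ray R.",
    "Subject": ["Metadata", "Controlled vocabularies"],
    "Publisher": "University of California, Berkeley",  # assumed
    "Resource Type": "Presentation",                    # assumed
    "Language": "en",                                   # assumed
}


def unknown_elements(rec):
    """Return keys in the record that are not Dublin Core element names."""
    return sorted(set(rec) - set(DUBLIN_CORE_ELEMENTS))


def missing_elements(rec):
    """Return element names that the record leaves unfilled."""
    return [name for name in DUBLIN_CORE_ELEMENTS if name not in rec]


print(unknown_elements(record))
print(missing_elements(record))
```

All elements are treated as optional here, so a non-empty `missing_elements` result is informational rather than an error.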

18
RDF (W3C)
  • A model for representing named properties and
    property values
  • Resources (the things described)
  • Properties (aspects, attributes, characteristics
    of resources)
  • Statements (Resource + Property + Value of
    Property for the Resource)
  • Expressed in XML
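
The resource/property/value model above can be sketched as plain triples. This is only a minimal in-memory illustration: the `Statement` type, the `describe` helper and the example.org identifier are assumptions made for the example, and no RDF/XML serialization is attempted.

```python
from collections import namedtuple

# One statement = a resource, a named property, and that property's value.
Statement = namedtuple("Statement", ["resource", "property", "value"])

# Hypothetical identifier for the thing being described.
DECK = "http://example.org/slides/organizing-information"

statements = [
    Statement(DECK, "Creator", "Ray R. Larson"),
    Statement(DECK, "Title",
              "Organizing Information: Metadata and Controlled Vocabularies"),
    Statement(DECK, "Format", "slides"),  # assumed value
]


def describe(resource, stmts):
    """Collect every property/value pair stated about one resource."""
    return {s.property: s.value for s in stmts if s.resource == resource}


print(describe(DECK, statements))
```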

19
SGML/XML
  • What is SGML/XML?
  • Document Type Definitions
  • Document Markup
  • Sources and Resources

20
Databases Metadata
  • Particularly in the Relational Model metadata is
    part of the Database, providing information about
    the structure and contents of the database
  • What relations (tables) are in the DB
  • Relation (table) attributes (domains)
  • Attribute representation and storage
  • Other information (indexes, etc)
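
A small sketch of what "metadata is part of the database" can look like in practice, using SQLite through Python's standard `sqlite3` module. The `document` table and its index are hypothetical; `sqlite_master` and `PRAGMA table_info` are SQLite's own catalog facilities, and other relational systems expose similar information differently.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical schema, created only so there is something to inspect.
conn.execute(
    "CREATE TABLE document ("
    "id INTEGER PRIMARY KEY, title TEXT NOT NULL, author TEXT)"
)
conn.execute("CREATE INDEX idx_document_author ON document (author)")

# Which relations (tables) are in the database?
tables = [row[0] for row in conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")]

# Which attributes does a relation have, and how are they represented?
# Each table_info row is (cid, name, type, notnull, dflt_value, pk).
columns = [(row[1], row[2])
           for row in conn.execute("PRAGMA table_info(document)")]

# Other information: indexes defined on the table.
indexes = [row[0] for row in conn.execute(
    "SELECT name FROM sqlite_master "
    "WHERE type = 'index' AND tbl_name = 'document'")]

print(tables, columns, indexes)
```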

21
Controlled Vocabularies
  • Vocabulary control is the attempt to provide a
    standardized and consistent set of terms (such as
    subject headings, names, classifications, etc.)
    with the intent of aiding the searcher in finding
    information.

22
Controlled Vocabularies
  • Names and name authorities
  • Design of controlled vocabularies for subject
    access -- Thesaurus design

23
Names
  • Cutter's (1876) objectives of bibliographic
    description
    • To enable a person to find a document of which
      the author is known.
    • To show what the library has by a given author.
  • The first serves access.
  • The second serves collocation.

24
Problems with Names
  • How many names should be associated with a
    document?
  • Which of these should be the main entry?
  • What form should each of the names take?
  • What references should be made from other
    possible forms of names that haven't been used?

25
The problem
  • Proliferation of the forms of names
  • Different names for the same person
  • Different people with the same names
  • Examples
  • from Books in Print (semi-controlled but not
    consistent)
  • ERIC author index (not controlled)

26
Rules for description
  • AACR II and other sets of descriptive cataloging
    rules provide guidelines for
  • Determining the number of name entries
  • Choosing a main entry
  • Deciding on the form of name to be used
  • Deciding when to make references

27
Authority control
  • Authority control is concerned with creation and
    maintenance of a set of terms that have been
    chosen as the standard representatives (also known
    as established) based on some set of rules.
  • If you have rules, why do you need to keep track
    of all of the headings?

28
Conditions of Authorship?
  • Single person or single corporate entity
  • Unknown or anonymous authors
  • Shared responsibility
  • Collections or editorially assembled works
  • Works of mixed responsibility (e.g. translations)
  • Related Works

29
Added Entries
  • Personal names
  • Collaborators
  • Editors, compilers, writers
  • Translators (in some cases)
  • Illustrators (in some cases)
  • Other persons associated with the work (such as
    the honoree in a Festschrift).
  • Corporate Names
  • Any prominently named corporate body that has
    involvement in the work beyond publication,
    distribution, etc.

30
Choice of Name
  • AACR II says that the predominant form of the
    name used in a particular author's writings
    should be chosen as the form of name.
  • References should be made from the other forms of
    the name.

31
Form of the Name
  • When names appear in multiple forms, one form
    needs to be chosen. Criteria for choice are:
    • Fullness (e.g. full names vs. initials only)
    • Language of the name
    • Spelling (choose predominant form)
  • Entry element
    • John Smith or Smith, John?
    • Mao Zedong or Zedong, Mao? (Mao Tse Tung?)

32
Name Authority Files
ID:NAFL8057230  ST:p  EL:n  STH:a  MS:c  UIP:a  TD:19910821174242
KRC:a  NMU:a  CRC:c  UPN:a  SBU:a  SBC:a  DID:n  DF:05-14-80
RFE:a  CSC:  SRU:b  SRT:n  SRN:n  TSS:  TGA:?  ROM:?  MOD:  VST:d 08-21-91
Other Versions: earlier
040     DLC $c DLC $d DLC $d OCoLC
053     PR6005.R517
100 10  Creasey, John
400 10  Cooke, M. E.
400 10  Cooke, Margaret, $d 1908-1973
400 10  Cooper, Henry St. John, $d 1908-1973
400 00  Credo, $d 1908-1973
400 10  Fecamps, Elise
400 10  Gill, Patrick, $d 1908-1973
400 10  Hope, Brian, $d 1908-1973
400 10  Hughes, Colin, $d 1908-1973
400 10  Marsden, James
400 10  Matheson, Rodney
400 10  Ranger, Ken
400 20  St. John, Henry, $d 1908-1973
400 10  Wilde, Jimmy
500 10  $w nnnc $a Ashe, Gordon, $d 1908-1973
Different names for the same person
33
Name Authority Files
ID:NAFO9114111  ST:p  EL:n  STH:a  MS:n  UIP:a  TD:19910817053048
KRC:a  NMU:a  CRC:c  UPN:a  SBU:a  SBC:a  DID:n  DF:06-03-91
RFE:a  CSC:c  SRU:b  SRT:n  SRN:n  TSS:  TGA:?  ROM:?  MOD:  VST:d 08-19-91
040     OCoLC $c OCoLC
100 10  Marric, J. J., $d 1908-1973
500 10  $w nnnc $a Creasey, John
663     Works by this author are entered under the name used in the
        item. For a listing of other names used by this author,
        search also under $b Creasey, John
670     OCLC 13441825: His Gideon's day, 1955 $b (hdg.: Creasey, John;
        usage: J.J. Marric)
670     LC data base, 6/10/91 $b (hdg.: Creasey, John; usage: J.J. Marric)
670     Pseuds. and nicknames dict., c1987 $b (Creasey, John, 1908-1973;
        British author; pseud.: Marric, J. J.)
34
Name authority files
ID:NAFL8166762  ST:p  EL:n  STH:a  MS:c  UIP:a  TD:19910604053124
KRC:a  NMU:a  CRC:c  UPN:a  SBU:a  SBC:a  DID:n  DF:08-20-81
RFE:a  CSC:  SRU:b  SRT:n  SRN:n  TSS:  TGA:?  ROM:?  MOD:  VST:d 06-06-91
Other Versions: earlier
040     DLC $c DLC $d DLC $d OCoLC
100 10  Butler, William Vivian, $d 1927-
400 10  Butler, W. V. $q (William Vivian), $d 1927-
400 10  Marric, J. J., $d 1927-
670     His The durable desperadoes, 1973.
670     His The young detective's handbook, c1981: $b t.p. (W.V. Butler)
670     His Gideon's way, 1986: $b CIP t.p. (William Vivian Butler
        writing as J.J. Marric)
Different people writing with the same name
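
One way to picture how the three authority records above support searching is a lookup table from every form of a name to the established heading(s). The sketch below is a simplification under stated assumptions: the dictionary layout and the `build_index` and `lookup` names are invented for the example, only some of the see-from references are copied in, and dates are dropped even though real records use them to tell the two Marrics apart.

```python
from collections import defaultdict

# Established headings and see-from (4xx) references, abbreviated from the
# three records shown above. Dates are omitted for brevity.
RECORDS = [
    {"heading": "Creasey, John",
     "see_from": ["Cooke, M. E.", "Cooke, Margaret", "Gill, Patrick",
                  "Hope, Brian", "Wilde, Jimmy"]},
    {"heading": "Marric, J. J.",
     "see_from": []},
    {"heading": "Butler, William Vivian",
     "see_from": ["Butler, W. V.", "Marric, J. J."]},
]


def build_index(records):
    """Map each established or variant name to the headings it leads to."""
    index = defaultdict(set)
    for rec in records:
        index[rec["heading"]].add(rec["heading"])
        for variant in rec["see_from"]:
            index[variant].add(rec["heading"])
    return index


INDEX = build_index(RECORDS)


def lookup(name):
    """Return the established headings reachable from a name, sorted."""
    return sorted(INDEX.get(name, ()))


print(lookup("Gill, Patrick"))   # a variant name for one person
print(lookup("Marric, J. J."))   # one name shared by different people
```

A lookup that returns more than one heading is exactly the "different people with the same name" case, which rules alone cannot resolve without a maintained file of headings.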
35
Controlled Vocabularies for Information Access
  • "The greatest problem of today is how to teach
    people to ignore the irrelevant, how to refuse to
    know things, before they are suffocated. For too
    many facts are as bad as none at all." (W.H.
    Auden)
  • Similarly, there are too many ways of expressing
    or explaining the topic of a document.
  • Controlled vocabularies are sets of Rules for
    topic identification and indexing, and a
    THESAURUS, which consists of lead-in vocabulary
    and a limited and selective Indexing Language,
    sometimes with special coding or structures.

36
Structure of an IR System
Search Line
Adapted from Soergel, p. 19
37
Uses of Controlled Vocabularies
  • Library Subject Headings, Classification and
    Authority Files.
  • Commercial Journal Indexing Services and
    databases
  • Yahoo, and other Web classification schemes
  • Online and Manual Systems within organizations
  • SunSolve
  • MacArthur

38
Types of Indexing Languages
  • Uncontrolled Keyword Indexing
  • Indexing Languages
    • Controlled, but not structured
  • Thesauri
    • Controlled and Structured
  • Classification Systems
    • Controlled, Structured, and Coded
  • Faceted Classification Systems

39
Indexing Languages
  • An index is a systematic guide designed to
    indicate topics or features of documents in order
    to facilitate retrieval of documents or parts of
    documents.
  • An Indexing language is the set of terms used in
    an index to represent topics or features of
    documents, and the rules for combining or using
    those terms.

40
Indexing Languages
  • Library of Congress Subject Headings
  • Yellow Pages Topics
  • Wilson Indexes (Readers' Guide)

41
Thesauri
  • A Thesaurus is a collection of selected
    vocabulary (preferred terms or descriptors) with
    links among Synonymous, Equivalent, Broader,
    Narrower and other Related Terms
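
The definition above suggests a simple data structure: each preferred term carries links to its broader, narrower and related terms, plus the synonyms it replaces. The following is a minimal sketch; the class and method names and the vehicle terms are hypothetical and do not come from any published thesaurus or standard.

```python
from dataclasses import dataclass, field


@dataclass
class Descriptor:
    """A preferred term and its links to other terms."""
    label: str
    broader: set = field(default_factory=set)
    narrower: set = field(default_factory=set)
    related: set = field(default_factory=set)
    used_for: set = field(default_factory=set)  # non-preferred synonyms


class Thesaurus:
    def __init__(self):
        self.descriptors = {}

    def descriptor(self, label):
        """Fetch a descriptor, creating it on first use."""
        if label not in self.descriptors:
            self.descriptors[label] = Descriptor(label)
        return self.descriptors[label]

    def add_hierarchy(self, broader, narrower):
        # Broader/narrower links are reciprocal, so record both directions.
        self.descriptor(broader).narrower.add(narrower)
        self.descriptor(narrower).broader.add(broader)

    def add_related(self, a, b):
        # Related-term links are symmetric.
        self.descriptor(a).related.add(b)
        self.descriptor(b).related.add(a)

    def add_synonym(self, preferred, synonym):
        self.descriptor(preferred).used_for.add(synonym)


thesaurus = Thesaurus()
thesaurus.add_hierarchy("Vehicles", "Automobiles")
thesaurus.add_hierarchy("Vehicles", "Trucks")
thesaurus.add_related("Automobiles", "Highways")
thesaurus.add_synonym("Automobiles", "Cars")

print(thesaurus.descriptors["Automobiles"])
```

Keeping the reciprocal links in one place (`add_hierarchy`, `add_related`) avoids the common maintenance problem of a narrower term that points up to a broader term which does not point back.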

42
Thesauri (cont.)
  • National and International Standards for Thesauri
  • ANSI/NISO Z39.19-1994 -- American National
    Standard Guidelines for the Construction, Format
    and Management of Monolingual Thesauri
  • ANSI/NISO Draft Standard Z39.4-199x -- American
    National Standard Guidelines for Indexes in
    Information Retrieval
  • ISO 2788 -- Documentation -- Guidelines for the
    establishment and development of monolingual
    thesauri
  • ISO 5964-- Documentation -- Guidelines for the
    establishment and development of multilingual
    thesauri

43
Thesauri (cont.)
  • Examples
  • The ERIC Thesaurus of Descriptors
  • The Art and Architecture Thesaurus
  • The Medical Subject Headings (MeSH) of the
    National Library of Medicine

44
Why develop a thesaurus?
  • To provide a conceptual structure or space for
    a body of information
  • To make it possible to adequately describe the
    topical contents of informational objects at an
    appropriate level of generality or specificity
  • To provide enhanced search capabilities and to
    improve the effectiveness of searching (i.e., to
    retrieve most of the relevant material without
    too much irrelevant material).

45
Why develop a thesaurus?
  • To provide vocabulary (or terminological)
    control.
  • When there are several possible terms designating
    a single concept, the thesaurus should lead the
    indexer or searcher to the appropriate concept,
    regardless of the terms they start with.
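
Vocabulary control of this kind can be sketched as a lead-in (entry) vocabulary that maps whatever term the indexer or searcher starts with onto one preferred term. The entries and the `resolve` helper below are hypothetical, and the normalization (lower-casing and collapsing whitespace) is just one plausible choice.

```python
# Hypothetical lead-in vocabulary: every entry term points to the single
# preferred term chosen for the concept.
ENTRY_VOCABULARY = {
    "automobiles": "Automobiles",
    "autos": "Automobiles",
    "cars": "Automobiles",
    "motor cars": "Automobiles",
    "lorries": "Trucks",
    "trucks": "Trucks",
}


def resolve(term):
    """Return the preferred term for a starting term, or None if unknown."""
    key = " ".join(term.lower().split())
    return ENTRY_VOCABULARY.get(key)


for start in ["Cars", "  Motor   Cars ", "autos", "trains"]:
    print(repr(start), "->", resolve(start))
```

Returning `None` for an unknown term, rather than guessing, leaves room for a reviewer to decide whether the term should be added as a new lead-in entry.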

46
Preliminary considerations
  • What is used now?
  • Continue using an existing thesaurus?
  • Ad hoc modification of existing thesaurus?
  • Develop a new well-structured thesaurus?
  • What is the scope and complexity of the subject
    field?
  • What kind of retrieval objects or data will be
    dealt with?
  • How exhaustive and specific is the desired
    description of objects?

47
Preliminary Considerations
  • The scope and complexity of the field will
    provide some indication of the scope and
    complexity of the thesaurus.
  • It is better to plan for a larger and more
    comprehensive system than a smaller system that
    rapidly will become inadequate as the database
    grows.
  • Development of a good thesaurus requires a major
    intellectual effort as well as clerical
    operations like data entry and production of
    sorted lists.

48
Development of a Thesaurus
  • Term Selection.
  • Merging and Development of Concept Classes.
  • Definition of Broad Subject Fields and Subfields.
  • Development of Classificatory structure
  • Review, Testing, Application, Revision.

49
Flow of Work in Thesaurus Construction
50
The Indexing Process
  • Concept identification
  • term selection (via thesaurus)
  • term assignment

51
Application: The Indexing Process (Manual)
[Flowchart: "Is term suitable?" If NO, select alternative term to represent concept.]
Adapted from ISO 5963, p.5
52
Classification Systems
  • A classification system is an indexing language
    often based on a broad ordering of topical areas.
    Thesauri and classification systems both use this
    broad ordering and maintain a structure of
    broader, narrower, and related topics.
    Classification schemes commonly use a coded
    notation for representing a topic and its place
    in relation to other terms.

53
Classification Systems (cont.)
  • Examples
  • The Library of Congress Classification System
  • The Dewey Decimal Classification System
  • The ACM Computing Reviews Categories
  • The American Mathematical Society Classification
    System

54
Automatic Indexing and Classification
  • Automatic indexing is typically the simple
    deriving of keywords from a document and
    providing access to all of those words.
  • More complex Automatic Indexing Systems attempt
    to select controlled vocabulary terms based on
    terms in the document.
  • Automatic classification attempts to
    automatically group similar documents using
    either:
    • A fully automatic clustering method.
    • An established classification scheme and set of
      documents already indexed by that scheme.
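
The simplest case described above, deriving keywords from each document and providing access through all of them, amounts to building an inverted index. The sketch below assumes a tiny stopword list and three made-up documents; the function names are illustrative.

```python
import re
from collections import defaultdict

# Assumed stopword list; a real system would use a much longer one.
STOPWORDS = {"a", "and", "in", "of", "the", "to"}


def derive_keywords(text):
    """Lower-case the text, split on non-letters, and drop stopwords."""
    return {w for w in re.findall(r"[a-z]+", text.lower())
            if w not in STOPWORDS}


def build_inverted_index(docs):
    """Map every derived keyword to the set of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in derive_keywords(text):
            index[word].add(doc_id)
    return index


# Hypothetical documents.
docs = {
    "d1": "The design of thesauri",
    "d2": "Indexing and retrieval of documents",
    "d3": "Thesauri in retrieval",
}

index = build_inverted_index(docs)
print(sorted(index["thesauri"]))
print(sorted(index["retrieval"]))
```

Nothing here controls vocabulary: "thesauri" and "thesaurus" would be indexed as unrelated words, which is the gap the more complex systems on this slide try to close.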

55
Clustering
Agglomerative methods: polythetic, exclusive or
overlapping, unordered; clusters are
order-dependent.
[Figure: diagram of documents (Doc)]
Rocchio's method
  1. Select initial centers (i.e., seed the space)
  2. Assign docs to highest matching centers and
     compute centroids
  3. Reassign all documents to centroid(s)
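
The three steps listed above can be sketched as a single seed, assign and reassign pass. This is a deliberately simplified illustration rather than the full published method: it assumes term-frequency vectors, cosine similarity, exclusive assignment to one cluster, and four made-up documents, and all function names are hypothetical.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse term-weight mappings."""
    dot = sum(weight * b.get(term, 0) for term, weight in a.items())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def centroid(vectors):
    """Average the term weights of a group of document vectors."""
    total = Counter()
    for vec in vectors:
        total.update(vec)
    return {term: weight / len(vectors) for term, weight in total.items()}


def assign(docs, centers):
    """Put each document in the cluster of its best-matching center."""
    clusters = [[] for _ in centers]
    for doc_id, vec in docs.items():
        best = max(range(len(centers)),
                   key=lambda i: cosine(vec, centers[i]))
        clusters[best].append(doc_id)
    return clusters


def seed_cluster(docs, seed_ids):
    centers = [docs[s] for s in seed_ids]          # 1. seed the space
    clusters = assign(docs, centers)               # 2. assign documents...
    centers = [centroid([docs[d] for d in members]) if members else centers[i]
               for i, members in enumerate(clusters)]  # ...compute centroids
    return assign(docs, centers)                   # 3. reassign to centroids


# Hypothetical documents, represented as term-frequency vectors.
texts = {
    "d1": "thesaurus term selection",
    "d2": "thesaurus term structure",
    "d3": "document clustering methods",
    "d4": "clustering methods centroid",
}
docs = {doc_id: Counter(text.split()) for doc_id, text in texts.items()}

print(seed_cluster(docs, seed_ids=["d1", "d3"]))
```

Because the result depends on which documents are chosen as seeds and on the order of ties, a sketch like this also shows why the slide calls such clusters order-dependent.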
56
Automatic Class Assignment
Automatic Class Assignment: polythetic, exclusive
or overlapping, usually ordered; clusters are
order-independent, usually based on an
intellectually derived scheme
[Figure: diagram of documents (Doc) and a search engine]
  1. Create pseudo-documents representing
     intellectually derived classes.
  2. Search using document contents
  3. Obtain ranked list
  4. Assign document to N categories ranked over
     threshold, OR assign to top-ranked category
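
The four steps above can be sketched by treating each class as a small pseudo-document and ranking the classes against an incoming document's text. The class labels, pseudo-document wording, cosine ranking and the 0.2 threshold are all assumptions chosen for illustration; a real system would use a proper search engine and tuned weights.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse term-frequency mappings."""
    dot = sum(weight * b.get(term, 0) for term, weight in a.items())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


# 1. Hypothetical pseudo-documents standing in for intellectually
#    derived classes.
CLASS_PSEUDO_DOCS = {
    "Thesauri": "thesaurus descriptor broader narrower related term",
    "Cataloging": "catalog author title main entry name authority",
    "Clustering": "cluster centroid similarity document grouping",
}


def rank_classes(text, pseudo_docs):
    """2-3. Search with the document's contents and return a ranked list."""
    query = Counter(text.lower().split())
    scored = [(cosine(query, Counter(words.split())), label)
              for label, words in pseudo_docs.items()]
    return sorted(scored, reverse=True)


def assign_classes(text, pseudo_docs, threshold=0.2, top_only=False):
    """4. Keep every class over the threshold, or only the top-ranked one."""
    ranked = rank_classes(text, pseudo_docs)
    if top_only:
        return [ranked[0][1]]
    return [label for score, label in ranked if score >= threshold]


doc_text = "name authority records link each author name to one heading"
print(rank_classes(doc_text, CLASS_PSEUDO_DOCS))
print(assign_classes(doc_text, CLASS_PSEUDO_DOCS))
```

Note that `top_only=True` always returns a class, even when every score is zero, so the threshold variant is the safer default when some documents may fit no class.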
57
References
  • Soergel, D. Indexing Languages and Thesauri:
    Construction and Maintenance. Los Angeles:
    Melville Publishing Co., 1974.
  • Foskett, A.C. The Subject Approach to
    Information. London: Clive Bingley, 1982.
  • Standards
    • ISO 2788 -- Documentation -- Guidelines for the
      establishment and development of monolingual
      thesauri
    • ISO 5964 -- Documentation -- Guidelines for the
      establishment and development of multilingual
      thesauri
    • ANSI/NISO Z39.19-1994 -- American National
      Standard Guidelines for the Construction, Format
      and Management of Monolingual Thesauri