CS 430 / INFO 430 Information Retrieval - PowerPoint PPT Presentation

Slides: 37
Provided by: wya1
Transcript and Presenter's Notes

1
CS 430 / INFO 430 Information Retrieval
Lecture 19: Metadata 1
2
Course Administration
3
Descriptive Metadata
Some methods of information retrieval search and
browse descriptive metadata about the objects.
Descriptive metadata typically consists of a
catalog or indexing record, or an abstract, one
record for each object. The record acts as a
surrogate for the object.
Usually the metadata is stored separately
from the object that it describes, but
sometimes is embedded in the object. Usually
the metadata is a set of text fields.
Textual metadata can be used to describe
non-textual objects, e.g., software, images,
music.
4
Documents and Surrogates
Document
The sea is calm to-night.
The tide is full, the moon lies fair
Upon the straits -- on the French coast the light
Gleams and is gone; the cliffs of England stand,
Glimmering and vast, out in the tranquil bay.
Come to the window, sweet is the night-air!
Only, from the long line of spray
Where the sea meets the moon-blanch'd land,
Listen! you hear the grating roar
Of pebbles which the waves draw back, and fling,
At their return, up the high strand,
Begin, and cease, and then again begin,
With tremulous cadence slow, and bring
The eternal note of sadness in.
Surrogate (catalog record)
Author: Matthew Arnold
Title: Dover Beach
Genre: Poem
Date: 1851
Notes
1. The surrogate is also a document.
2. Every word is different!
5
Surrogates for Non-textual materials
Text-based methods of information retrieval can
search a surrogate for a photograph.
Document
Surrogate (catalog record)
See next page for a textual catalog record about
a non-textual item (photograph).
6
Library of Congress catalog record (part)
CREATED/PUBLISHED: between 1925 and 1930?
SUMMARY: U. S. President Calvin Coolidge sits at a desk and signs a
photograph, probably in Denver, Colorado. A group of unidentified men
look on.
NOTES: Title supplied by cataloger. Source: Morey Engle.
SUBJECTS:
  • Coolidge, Calvin,--1872-1933.
  • Presidents--United States--1920-1930.
  • Autographing--Colorado--Denver--1920-1930.
  • Denver (Colo.)--1920-1930.
  • Photographic prints.
MEDIUM: 1 photoprint 21 x 26 cm. (8 x 10 in.)
7
Categories of Descriptive Metadata
  • Catalog: metadata records that have a consistent structure,
    organized according to systematic rules. (Example: Library of
    Congress Catalog)
  • Abstract: a free text record that summarizes a longer document.
  • Indexing record: less formal than a catalog record, but more
    structured than a simple abstract. (Example: PubMed)
8
Metadata Format
A metadata format is a set of rules that describe
the content and format of a set of metadata
records, e.g.:
  • AACR (Anglo American Cataloging Rules) / MARC
  • Dublin Core
  • FGDC (Federal Geographic Data Committee's Content Standard
    for Digital Geospatial Metadata)
  • IEEE Standard for Learning Object Metadata
9
Uses of Metadata in Information Retrieval
Metadata is used in Information Retrieval systems
in conjunction with or instead of full text
indexing:
  • For physical objects, e.g., books
  • For non-textual materials, e.g., pictures, maps, datasets
  • For specialized areas where high recall is important (e.g.,
    medicine), or where features such as intended audience are
    hard to extract from the text (e.g., education)
  • When people are ignorant of the power of full text indexing
    (which is surprisingly common)
10
Uses of Metadata in Information Retrieval
  • Descriptive metadata provides capabilities that
    are not possible with full text indexing
  • Allows fielded searching
      • author = "Goethe"
  • Suitable for non-textual material
      • type = "picture" and subject = "Ithaca"
  • Can be used with controlled vocabulary
      • language = "en" (English)
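
Fielded searching of the kind listed above can be sketched in a few lines of Python. This is a minimal illustration rather than anything taken from the lecture: the record layout, the sample records, and the helper name `fielded_search` are all assumptions made for the example.

```python
# Hypothetical surrogate records: each one is a dict of text fields.
# The sample data is made up for illustration.
RECORDS = [
    {"author": "Goethe", "title": "Faust", "type": "text", "language": "de"},
    {"author": "Matthew Arnold", "title": "Dover Beach", "type": "text", "language": "en"},
    {"title": "Untitled photograph", "type": "picture", "subject": "Ithaca", "language": "en"},
]


def fielded_search(records, **criteria):
    """Return the records whose named fields all equal the given values.

    Matching is case-insensitive on the whole field value. A record that
    lacks one of the requested fields does not match.
    """
    hits = []
    for record in records:
        if all(
            field in record and record[field].lower() == value.lower()
            for field, value in criteria.items()
        ):
            hits.append(record)
    return hits


# The two queries from the slide, expressed as field/value pairs.
by_author = fielded_search(RECORDS, author="Goethe")
pictures_of_ithaca = fielded_search(RECORDS, type="picture", subject="Ithaca")
```

Because each criterion names a field, a search for author "Goethe" cannot be confused with a document that merely mentions Goethe, which is the distinction a full-text index does not make.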

11
Information Retrieval with High Recall
Full-text Indexing (automated)
  • Text only.
  • Most effective on medium-length documents on related topics.
  • High recall requires tuning system to the specific collection
    and skilled users.
Catalogs and Indexes (created manually)
  • Can be used for all formats of material
  • Requires close quality control of metadata creation
  • High recall requires tuning system to the specific collection
    and skilled users.
12
Using Metadata for Information Retrieval
The basic operation of information retrieval is
to match the way that a user describes an
information requirement (a query) against the
way that items are described (an index).
The success of conventional catalogs (e.g., MARC /
Anglo-American Cataloguing Rules) or indexing
services (e.g., Medline) comes from the
combination of:
  • precise language to describe items
  • trained and experienced users to formulate queries
13
Library Catalogs
Examples:
  • Cornell University Library catalog
    http://catalog.library.cornell.edu/
  • Library of Congress, Prints and Photographs
    http://www.loc.gov/rr/print/catalog.html
14
Origins of Library Catalogs
Bibliographic Objective
  • To bring together like items
  • To differentiate among similar ones
Sir Anthony Panizzi, Keeper of Books at the
British Museum (1856-67). His Ninety-One Rules
(1841) were the basis of modern catalog rules.
15
Origins of Library Catalogs
Information Discovery
  • to enable a person to find a book of which either the author,
    title or subject is known
  • to show what the library has by a given author, on a given
    subject, or in a given kind of literature
  • to assist in the choice of a book as to its edition
    (bibliographically) or to its character (literary or topical)
Charles Ammi Cutter, Librarian of the Boston
Athenaeum. Rules for a Dictionary Catalog, 1874
16
Origins of Library Catalogs
Classification: division of subject matter
into a hierarchy. Typically used in libraries
to provide a subject-based order for shelving
books.
Melvil Dewey, Acting Librarian of Amherst College
(1874). The Dewey Decimal system of book
classification uses the numbers 000 to 999 to
cover the general fields of knowledge and
decimals to fit special subjects.
17
Library Catalogs: Technology Changes over the
Years
Materials to be catalogued
  • Originally books
  • Extended to serials, maps, music, etc., but concepts still
    rely heavily on experience with books
Form of catalog
  • Entries in books (Panizzi)
  • Index cards (Cutter)
  • Online databases (Kilgour)
18
Shared Cataloguing: OCLC
  • OCLC -- Large centralized transaction processing
    database system (http://www.oclc.org/)
  • When a library catalogs a book it deposits a MARC
    record in OCLC
  • Other libraries can copy the record
  • This saves duplication of cataloguing
  • OCLC has a database of holdings from all
    libraries
  • OCLC database has 69 million records, serves
    42,000 libraries
  • When developed by Fred Kilgour in 1967, OCLC was
    a pioneering computer system (had to develop its own
    network, computer terminal, etc.)

19
Catalogs as Investments
Costs
  • Conventional catalog records are created by skilled
    librarians (cost estimate: $100 per record).
  • OCLC's catalog has 69 million records. Total investment is
    several billion dollars.
Cataloguing Standards
  • Enable libraries to share records
  • Combine records of the past with records created today
  • Allow readers and librarians to move between libraries
20
Layers of a Library Catalog
  • Encoding: Rules that define how catalog records are encoded
    in a computer system, e.g., XML mark-up.
  • Syntax: Rules that define the fields and subfields, whether
    repeated, optional, etc.
  • Semantics: Rules that define the values of the field and
    subfield, with instructions for cataloguers of what data to
    include and how to decide when choices have to be made.
21
Library Cataloging using the Anglo American
Cataloguing Rules
Anglo American Cataloguing Rules (AACR2)
  • Rules for each category of material, e.g., monographs (books).
  • Specify what fields should be used and what data to include
    in each field.
  • Text strings were originally intended for printed catalog cards.
MARC format
  • An exchange format for catalog records.
  • Includes encoding rules and syntax specification.
"MARC Catalog"
  • Catalog in MARC format, where content of each field
    follows AACR2.
22
Anglo American Cataloguing Rules
The Anglo American Cataloguing Rules (AACR)
provide detailed rules for:
  • the choice of fields
  • the content of the data that goes into each field
  • the syntax of the data that goes into each field
The rules are an excellent example of technical
writing: precise but clear.
For an example, see
http://www.cs.cornell.edu/Courses/cs430/2006fa/slides/AACR.pdf
23
Name authority files
  • An Authority File "brings together like items and
    differentiates among similar ones."
      • Caroline R. Arms or Caroline Ruth Arms?
      • Which William Phillips of Cardiff?
      • Mark Twain or Samuel Clemens?
  • Epithets
      • of Cardiff
      • doctor
  • Dates
      • 1832 - 1876
      • flourished 1860
      • circa 1832 - 1876

24
Name authority example
  • LC Control Number n 87870182
  • HEADING Arms, Caroline R. (Caroline
    Ruth)
  • 000 00907cz 2200205n 450
  • 001 4383796
  • 005 19890706143144.8
  • 008 70909nacannaab a aaa c
  • 010 __ a n 87870182
  • 035 __ a (DLC)n 87870182
  • 040 __ a InU c DLC d DLC
  • 100 10 a Arms, Caroline R. q
    (Caroline Ruth)
  • 400 10 w nna a Arms, Caroline
    Ruth
  • 400 10 a Arms, C. R. q
    (Caroline Ruth)
  • 670 __ a Arms, W.Y. Report on
    the performance problems of the
    RLIN computer system, 1982 b t.p. (Caroline R.
    Arms)
  • 670 __ a LC data base, 8/24/87
    b (hdg. Arms, Caroline Ruth usage Caroline
    R. Arms, C. R. Arms)
  • 670 __ a Campus networking
    strategies, 1988 b CIP t.p. (Caroline Arms)
  • 670 __ a Phone call to pub.,
    2/10/88 b (Caroline Ruth Arms studied at
    Oxford)
  • 670 __ a Campus strategies for
    libraries and electronic information, c1990 b
    CIP t.p. (Caroline Arms) data sheet (b.
    10-24-45)
  • 953 __ a bz46 b bd24
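
One way to see what an authority record does at search time is a lookup from every known name form to the authorized heading. The sketch below uses the heading (field 100) and the "see from" forms (fields 400) of the record above. The dictionary layout and the function names `normalize`, `build_lookup`, and `authorized_heading` are hypothetical and not part of any library system.

```python
# Authorized heading mapped to its variant forms, copied from the 100 and
# 400 fields of the example record. The data structure is an assumption.
AUTHORITY = {
    "Arms, Caroline R. (Caroline Ruth)": [
        "Arms, Caroline Ruth",
        "Arms, C. R. (Caroline Ruth)",
    ],
}


def normalize(name):
    """Lowercase and collapse whitespace so trivial differences are ignored."""
    return " ".join(name.lower().split())


def build_lookup(authority):
    """Map every normalized name form (heading or variant) to its heading."""
    lookup = {}
    for heading, variants in authority.items():
        for form in [heading, *variants]:
            lookup[normalize(form)] = heading
    return lookup


LOOKUP = build_lookup(AUTHORITY)


def authorized_heading(name):
    """Return the authorized heading for a name form, or None if unknown."""
    return LOOKUP.get(normalize(name))


heading = authorized_heading("Arms, Caroline Ruth")
```

A catalog that indexes records under the authorized heading can then bring together works entered under any of the variant forms.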

25
Subject information
Library of Congress Subject Headings
  • Academic libraries--United States--Automation
Hierarchical classification
  • Library of Congress call number: Z675.U5C16
  • Dewey Decimal Classification: 027.7
Creation and maintenance of lists of subject headings
and classifications is a never-ending task.
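
The subject heading shown on this slide is a main heading followed by subdivisions separated by a double hyphen. A minimal sketch of taking such a string apart follows. The function name `split_subject_heading` is hypothetical, and the sketch assumes that "--" never occurs inside a heading component.

```python
def split_subject_heading(heading):
    """Split a subject heading string into (main heading, subdivisions).

    A single trailing full stop, as printed in catalog records, is dropped
    before splitting on the "--" separator.
    """
    text = heading.strip()
    if text.endswith("."):
        text = text[:-1]
    parts = [part.strip() for part in text.split("--")]
    return parts[0], parts[1:]


main, subdivisions = split_subject_heading(
    "Academic libraries--United States--Automation"
)
```

Here `main` is the topical heading and `subdivisions` holds the geographic and topical subdivisions in order, which is enough to support browsing by the leading component.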
26
MARC Format
The MARC format was developed in the late 1960s
as a tagging scheme for exchanging catalog
records on magnetic tape. It remains the standard
way to represent such data. At present, MARC is
slowly being converted to modern
computing formats, e.g., Unicode, XML.
27
MARC Monograph catalog record
Citation: Caroline R. Arms, editor, Campus
strategies for libraries and electronic
information. Bedford, MA: Digital Press, 1990.
28
MARC fields
tag  value
001  89-16879 r93
050  Z675.U5C16 1990
082  027.7/0973 20
245  Campus strategies for libraries and electronic
     information/Caroline Arms, editor.  (title statement)
260  Bedford, Mass. Digital Press, c1990.  (publisher)
300  xi, 404 p. ill. 24 cm.  (collation)
440  EDUCOM strategies series on information
     technology  (series title)
504  Includes bibliographical references (p. 373-381).
020  ISBN 1-55558-036-X 34.95
29
MARC fields (continued)
650  Academic libraries--United States--Automation.  (subject heading)
650  Libraries and electronic publishing--United States.
650  Library information networks--United States.
650  Information technology--United States.
700  Arms, Caroline R. (Caroline Ruth)
040  DLC DLC DLC
043  n-us---
955  CIP ver. br02 to SL 02-26-90
985  APIF/MIG
30
MARC Encoding
tag 260
  • subfield a: Bedford, Mass.
  • subfield b: Digital Press,
  • subfield c: c1990.
MARC encoding: 2600abcBedford, Mass. Digital Press,c1990.
Definitely not a modern encoding!
Note that the content is designed to be part of a
printed catalog record and is not in a convenient
format for computer manipulation.
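
The encoding string on this slide has lost its control characters in transcription. In the MARC exchange format each subfield is introduced by a delimiter character (hex 1F) followed by a one-character code, and each variable field ends with a field terminator (hex 1E). The sketch below shows only that part of the encoding for field 260. It is an illustration under stated assumptions: the function names are hypothetical, the blank indicator values are assumed, and the tag itself is not emitted because in a full record it lives in a separate directory.

```python
SUBFIELD_DELIMITER = "\x1f"  # precedes every subfield code
FIELD_TERMINATOR = "\x1e"    # closes every variable field


def encode_field(indicators, subfields):
    """Encode one variable data field: two indicators, then the subfields."""
    if len(indicators) != 2:
        raise ValueError("a data field carries exactly two indicator characters")
    body = "".join(
        SUBFIELD_DELIMITER + code + value for code, value in subfields
    )
    return indicators + body + FIELD_TERMINATOR


def decode_field(encoded):
    """Invert encode_field and return (indicators, [(code, value), ...])."""
    indicators = encoded[:2]
    rest = encoded[2:].rstrip(FIELD_TERMINATOR)
    subfields = [
        (chunk[0], chunk[1:])
        for chunk in rest.split(SUBFIELD_DELIMITER)
        if chunk  # skip the empty piece before the first delimiter
    ]
    return indicators, subfields


field_260 = encode_field(
    "  ",  # blank indicators: an assumption for this example
    [("a", "Bedford, Mass."), ("b", "Digital Press,"), ("c", "c1990.")],
)
```

The round trip through `decode_field` recovers the three subfields, but the values still carry the punctuation meant for a printed card, which is the point made in the note above.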
31
Modernizing MARC
1. Keep the content of the catalog record
2. Convert to Unicode for representing scripts
3. Convert to XML for tagging cataloguing metadata.
MARCXML (MARC 21 XML)
http://www.loc.gov/standards/marcxml/
  • Direct conversion to XML tagging
Metadata Object Description Schema (MODS)
http://www.loc.gov/standards/mods/
  • Subset of MARC with data clean-up
32
MARC XML
  • Simple and Flexible MARC XML Schema: The schema retains the
    semantics of MARC. Fields are treated as elements with the tag
    as an attribute and indicators treated as attributes. Subfields
    are treated as subelements with the subfield code as an
    attribute.
  • Lossless Conversion of MARC to XML
  • Roundtripability from XML back to MARC
  • Data Presentation by writing an XML stylesheet
  • Validation of MARC data
  • Extensibility
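
The mapping described on this slide (field as an element, tag and indicators as attributes, subfields as subelements with a code attribute) can be sketched with Python's standard `xml.etree.ElementTree` module. The element and attribute names follow the MARCXML "slim" schema. The function name `datafield_to_xml` and the blank indicator values are assumptions for the example.

```python
import xml.etree.ElementTree as ET

# Namespace of the MARCXML "slim" schema.
MARCXML_NS = "http://www.loc.gov/MARC21/slim"


def datafield_to_xml(tag, ind1, ind2, subfields):
    """Build a MARCXML <datafield> element from a tag, indicators, subfields."""
    field = ET.Element(
        f"{{{MARCXML_NS}}}datafield",
        {"tag": tag, "ind1": ind1, "ind2": ind2},
    )
    for code, value in subfields:
        sub = ET.SubElement(field, f"{{{MARCXML_NS}}}subfield", {"code": code})
        sub.text = value
    return field


field_260 = datafield_to_xml(
    "260", " ", " ",  # blank indicators are an assumption
    [("a", "Bedford, Mass."), ("b", "Digital Press,"), ("c", "c1990.")],
)
xml_text = ET.tostring(field_260, encoding="unicode")
```

Nothing is interpreted or cleaned up along the way, which is why the conversion can be lossless and reversible.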
33
MODS Example (extracts)
<mods>
  <titleInfo>
    <title>Sound and fury</title>
    <subTitle>the making of the punditocracy /</subTitle>
  </titleInfo>
  <name type="personal">
    <namePart>Alterman, Eric</namePart>
    <role>
      <roleTerm type="text">creator</roleTerm>
    </role>
  </name>
34
MODS Example (extracts)
  <typeOfResource>text</typeOfResource>
  <originInfo>
    <place>
      <placeTerm type="text">Ithaca, N.Y</placeTerm>
    </place>
    <publisher>Cornell University Press</publisher>
    <dateIssued>c1999</dateIssued>
  </originInfo>
  <language>
    <languageTerm authority="iso639-2b" type="code">eng</languageTerm>
  </language>
</mods>
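
Because MODS uses named elements rather than numeric tags, a simple surrogate can be pulled out of a record with ordinary XML tools. The sketch below parses a cut-down copy of the extracts from these two slides with Python's standard `xml.etree.ElementTree`. The variable names are hypothetical. A complete MODS record declares the namespace http://www.loc.gov/mods/v3, in which case the paths would need namespace-qualified names. The extract here has no namespace, so plain paths are used.

```python
import xml.etree.ElementTree as ET

# Cut-down copy of the slide extracts, without a namespace declaration.
MODS_EXTRACT = """
<mods>
  <titleInfo>
    <title>Sound and fury</title>
  </titleInfo>
  <name type="personal">
    <namePart>Alterman, Eric</namePart>
    <role><roleTerm type="text">creator</roleTerm></role>
  </name>
  <originInfo>
    <publisher>Cornell University Press</publisher>
    <dateIssued>c1999</dateIssued>
  </originInfo>
  <language>
    <languageTerm authority="iso639-2b" type="code">eng</languageTerm>
  </language>
</mods>
"""

root = ET.fromstring(MODS_EXTRACT.strip())

# Collect a few fields into a flat surrogate suitable for fielded search.
surrogate = {
    "title": root.findtext("titleInfo/title"),
    "creator": root.findtext("name/namePart"),
    "publisher": root.findtext("originInfo/publisher"),
    "date": root.findtext("originInfo/dateIssued"),
    "language": root.findtext("language/languageTerm"),
}
```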
35
Notes on MARC
  • A great achievement
  • Developed in 1960s
  • Magnetic tape exchange format for printing
    catalog records
  • The dawn of computing
  • mixed upper and lower case
  • variable length fields,
  • repeated fields
  • non-Roman scripts
  • 100(?) million records with standard content
    and format
  • Thousands of trained librarians (millions?)

36
Notes on MARC
  • A great problem
  • Not designed for computer algorithms
  • One record per item (poor links between
    records)
  • Tied to traditional materials and
    traditional practices
  • Not Unicode
  • 100 million records at $100 each -- $10 billion
  • A classic legacy system!