CS 4300 INFO 4300 Information Retrieval - PowerPoint PPT Presentation

Slides: 38
Provided by: wya2
1
CS 4300 / INFO 4300 Information Retrieval
Lecture 21: Metadata 1: Library Catalogs
2
Course Administration
Assignment 4 has been posted. If you have not
received an automated email message giving a
password to the Center for Advanced Computing and
the Hadoop cluster, please send email to the
course team.
3
Course Administration
Assignment 4: The mechanics of running a Hadoop
program are unfortunately rather complex and we
strongly urge you to get started early on this
assignment. See
http://www.infosci.cornell.edu/courses/info4300/2008fa/HadoopHints.html
The materials that describe the Hadoop cluster
are new. Please send corrections to the course
team. After the discussion class on Wednesday
there will be a short workshop for people who are
new to Linux.
4
Descriptive Metadata
Some methods of information retrieval search and
browse descriptive metadata about the objects.
Descriptive metadata typically consists of a
catalog or indexing record, or an abstract, one
record for each object. The record acts as a
surrogate for the object.
Usually the metadata is stored separately
from the object that it describes, but
sometimes it is embedded in the object. Usually
the metadata is a set of text fields.
Textual metadata can be used to describe
non-textual objects, e.g., software, images,
music.
5
Documents and Surrogates
Document:
The sea is calm to-night.
The tide is full, the moon lies fair
Upon the straits -- on the French coast the light
Gleams and is gone; the cliffs of England stand,
Glimmering and vast, out in the tranquil bay.
Come to the window, sweet is the night-air!
Only, from the long line of spray
Where the sea meets the moon-blanch'd land,
Listen! you hear the grating roar
Of pebbles which the waves draw back, and fling,
At their return, up the high strand,
Begin, and cease, and then again begin,
With tremulous cadence slow, and bring
The eternal note of sadness in.
Surrogate (catalog record):
Author: Matthew Arnold
Title: Dover Beach
Genre: Poem
Date: 1851
Notes:
1. The surrogate is also a document.
2. Every word is different!
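Note 2 can be checked mechanically. The following is a minimal Python sketch, not part of the lecture: it tokenizes the extract and the values of the catalog record and intersects the two vocabularies. The tokenizer (lowercased runs of letters, digits, and apostrophes) is an assumption chosen for illustration.

```python
import re

DOCUMENT = """The sea is calm to-night.
The tide is full, the moon lies fair
Upon the straits -- on the French coast the light
Gleams and is gone; the cliffs of England stand,
Glimmering and vast, out in the tranquil bay.
Come to the window, sweet is the night-air!
Only, from the long line of spray
Where the sea meets the moon-blanch'd land,
Listen! you hear the grating roar
Of pebbles which the waves draw back, and fling,
At their return, up the high strand,
Begin, and cease, and then again begin,
With tremulous cadence slow, and bring
The eternal note of sadness in."""

SURROGATE = {
    "Author": "Matthew Arnold",
    "Title": "Dover Beach",
    "Genre": "Poem",
    "Date": "1851",
}


def tokens(text):
    """Return the set of lowercased word tokens in text (illustrative tokenizer)."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))


document_terms = tokens(DOCUMENT)
surrogate_terms = tokens(" ".join(SURROGATE.values()))

# Terms that a full-text index and a catalog index would have in common.
shared_terms = document_terms & surrogate_terms
print(sorted(shared_terms))
```

With this tokenizer the intersection is empty, so a full-text search of the extract for "Dover Beach" or "Matthew Arnold" finds nothing, while a search of the surrogate does.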
6
Surrogates for Non-textual materials
Text-based methods of information retrieval can
search a surrogate for a photograph.
Document
Surrogate (catalog record)
See next page for a textual catalog record about
a non-textual item (photograph).
7
Library of Congress catalog record (part)
CREATED/PUBLISHED: between 1925 and 1930?
SUMMARY: U. S. President Calvin Coolidge
sits at a desk and signs a photograph,
probably in Denver, Colorado. A group of
unidentified men look on.
NOTES: Title supplied by cataloger. Source: Morey Engle.
SUBJECTS:
  Coolidge, Calvin,--1872-1933.
  Presidents--United States--1920-1930.
  Autographing--Colorado--Denver--1920-1930.
  Denver (Colo.)--1920-1930.
  Photographic prints.
MEDIUM: 1 photoprint 21 x 26 cm. (8 x 10 in.)
8
Categories of Descriptive Metadata
Catalog: metadata records that have a consistent
structure, organized according to systematic
rules. (Example: Library of Congress Catalog)
Abstract: a free text record that
summarizes a longer document.
Indexing record: less formal than a catalog
record, but more structured than a simple
abstract. (Example: Medline)
9
Metadata Format
A metadata format is a set of rules that describe
the content and format of a set of metadata
records, e.g.
  • AACR (Anglo American Cataloging Rules) / MARC
  • Dublin Core
  • FGDC (Federal Geographic Data Committee's
    Content Standard for Digital Geospatial
    Metadata)
  • IEEE Standard for Learning Object Metadata
10
Uses of Metadata in Information Retrieval
Metadata is used in Information Retrieval systems
in conjunction with or instead of full text
indexing:
  • For physical objects, e.g., books
  • For non-textual materials, e.g., pictures,
    maps, datasets
  • For specialized areas where high recall is
    important (e.g., medicine), or where features
    such as intended audience are hard to extract
    from the text (e.g., educational level)
  • When people are ignorant of the power of full
    text indexing (which is surprisingly common)
11
Uses of Metadata in Information Retrieval
Descriptive metadata provides capabilities that
are not possible with full text indexing:
  • Allows fielded searching
      • author = "Bacon"
  • Suitable for non-textual material
      • type = "picture" and subject = "Ithaca"
  • Can be used with controlled vocabulary
      • language = "en" (English)
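The fielded queries on this slide can be sketched in a few lines of Python. This is an illustration only: the three records and the `fielded_search` helper are hypothetical, and matching is exact string equality on each field.

```python
# Hypothetical catalog records; every value here is invented for illustration.
RECORDS = [
    {"author": "Bacon", "type": "text", "subject": "Essays", "language": "en"},
    {"author": "Smith", "type": "picture", "subject": "Ithaca", "language": "en"},
    {"author": "Smith", "type": "text", "subject": "Bacon", "language": "fr"},
]


def fielded_search(records, **criteria):
    """Return the records whose fields equal every field=value pair given."""
    return [
        record
        for record in records
        if all(record.get(field) == value for field, value in criteria.items())
    ]


by_author = fielded_search(RECORDS, author="Bacon")
pictures_of_ithaca = fielded_search(RECORDS, type="picture", subject="Ithaca")
english_items = fielded_search(RECORDS, language="en")

print(by_author)
print(pictures_of_ithaca)
```

The author query returns only the first record. A full-text search for the word "Bacon" would also return the third record, where Bacon is the subject and not the author; the field label is what tells the two apart.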

12
Information Retrieval with High Recall
Full-text Indexing (automated)
  • Text only.
  • Most effective on medium-length documents on
    related topics.
  • High recall requires tuning system to the
    specific collection and skilled users.
Catalogs and Indexes (created manually)
  • Can be used for all formats of material
  • Requires close quality control of metadata
    creation
  • High recall requires tuning system to the
    specific collection and skilled users.
13
Using Metadata for Information Retrieval
The basic operation of information retrieval is
to match the way that a user describes an
information requirement (a query), against the
way that items are described (an index). The
success of conventional catalogs (e.g., MARC /
Anglo-American Cataloguing Rules) or indexing
services (e.g., Medline) comes from the
combination of:
  • precise language to describe items
  • trained and experienced users to formulate
    queries.
14
Library Catalogs
Examples:
Cornell University Library catalog
http://catalog.library.cornell.edu/
Library of Congress, Prints and Photographs
http://www.loc.gov/rr/print/catalog.html
15
Origins of Library Catalogs
Bibliographic Objective:
  • To bring together like items
  • To differentiate among similar ones
Sir Anthony Panizzi, Keeper of Books at the
British Museum (1856-67). His Ninety-One Rules
(1841) were the basis of modern catalog rules.
16
Origins of Library Catalogs
Information Discovery:
  • to enable a person to find a book of which
    either the author, title or subject is known
  • to show what the library has by a given
    author, on a given subject, or in a given kind
    of literature
  • to assist in the choice of a book as to its
    edition (bibliographically) or to its
    character (literary or topical).
Charles Ammi Cutter, Librarian of the Boston
Athenaeum, Rules for a Dictionary Catalog, 1874
17
Origins of Library Catalogs
Classification: Division of subject matter
into a hierarchy. Typically used in libraries
to provide a subject-based order for shelving
books.
Melvil Dewey, Acting Librarian of Amherst College
(1874). The Dewey Decimal system of book
classification uses the numbers 000 to 999 to
cover the general fields of knowledge and
decimals to fit special subjects.
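The hierarchy implied by a Dewey number can be read off its digits. The Python sketch below is an illustration, not part of the lecture: `dewey_hierarchy` and `main_class` are hypothetical helper names, and the ten main class captions are present-day Dewey captions in abbreviated form, not text from the slides.

```python
# Present-day captions for the ten main classes (abbreviated; an assumption,
# not taken from the slides).
MAIN_CLASSES = {
    0: "Computer science, information and general works",
    1: "Philosophy and psychology",
    2: "Religion",
    3: "Social sciences",
    4: "Language",
    5: "Science",
    6: "Technology",
    7: "Arts and recreation",
    8: "Literature",
    9: "History and geography",
}


def dewey_hierarchy(number):
    """List the broader classes of a Dewey number, from general to specific."""
    whole, _, decimals = number.partition(".")
    whole = whole.zfill(3)
    levels = [whole[0] + "00", whole[:2] + "0", whole]
    if decimals:
        levels.append(f"{whole}.{decimals}")
    # Drop repeats (e.g. "500" is its own hundred and ten) but keep the order.
    return list(dict.fromkeys(levels))


def main_class(number):
    """Return the caption of the main class (the hundreds digit)."""
    whole = number.partition(".")[0].zfill(3)
    return MAIN_CLASSES[int(whole[0])]


print(dewey_hierarchy("027.7"))
print(main_class("027.7"))
```

For 027.7 the shelving order falls out of the notation: 000, then 020, then 027, then the decimal extension 027.7.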
18
Library Catalogs: Technology Changes over the
Years
Materials to be catalogued:
  • Originally books
  • Extended to serials, maps, music, etc., but
    concepts still rely heavily on experience with
    books
Form of catalog:
  • Entries in books (Panizzi)
  • Index cards (Cutter)
  • Online databases (Kilgour)
19
Shared Cataloguing: OCLC
  • OCLC -- Large centralized transaction processing
    system, serves 60,000 libraries in 112 countries
    http://www.oclc.org/
  • When a library catalogs a book it deposits a MARC
    record in OCLC. Other libraries can copy the
    record.
      • saves duplication of cataloguing
      • database has 94 million catalog records
      • database of 1.2 billion holdings from member
        libraries
  • When developed by Fred Kilgour in 1967, OCLC was
    a pioneering computer system (had to develop
    multiple scripts, own network, computer terminal,
    etc.)

20
Catalogs as Investments
Costs:
  • Conventional catalog records are created by
    skilled librarians (cost estimate: $100 per
    record).
  • OCLC's catalog has 110 million records from
    69,000 libraries. Total investment is several
    billion dollars.
Cataloguing standards:
  • Enable libraries to share records
  • Combine records of the past with records
    created today
  • Allow readers and librarians to move between
    libraries
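As a back-of-the-envelope check of the investment figure, the Python lines below multiply the two numbers on this slide. They assume that the per-record estimate is in US dollars and that every record cost the full estimate, which overstates the total to the extent that records are copied between libraries instead of being created from scratch.

```python
RECORDS_IN_CATALOG = 110_000_000  # OCLC record count quoted on the slide
COST_PER_RECORD_USD = 100         # per-record estimate (assumed to be US dollars)

total_usd = RECORDS_IN_CATALOG * COST_PER_RECORD_USD
print(f"${total_usd / 1e9:.0f} billion")  # prints $11 billion
```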
21
Layers of a Library Catalog
Semantics: Rules that define the values of
the field and subfield, with instructions for
cataloguers of what data to include and how to
decide when choices have to be made.
Syntax: Rules that define the fields and
subfields, whether repeated, optional, etc.
Encoding: Rules that define how catalog records
are encoded in a computer system, e.g., XML
mark-up.
22
Library Cataloging using the Anglo American
Cataloguing Rules
Anglo American Cataloguing Rules (AACR2):
Rules for each category of material, e.g.,
monographs (books). Specify what fields should
be used and what data to include in each field.
Text strings were originally intended for printed
catalog cards.
MARC format: An exchange format for catalog
records. Includes encoding rules and syntax
specification.
"MARC Catalog": Catalog in MARC format, where
content of each field follows AACR2.
23
Anglo American Cataloguing Rules
The Anglo American Cataloguing Rules (AACR)
provide detailed rules for:
  • the choice of fields
  • the content of the data that goes into each
    field
  • the syntax of the data that goes into each
    field
The rules are an excellent example of technical
writing: precise but clear. For an example, see
http://www.infosci.cornell.edu/Courses/info4300/2008fa/slides/AACR.pdf
24
Name authority files
  • An Authority File "brings together like items and
    differentiates among similar ones."
      • Which William Phillips of Cardiff?
      • Mark Twain or Samuel Clemens?
      • Karen Sparck Jones or Karen Needham?
  • Epithets
      • of Cardiff
      • doctor
  • Dates
      • 1832 - 1876
      • flourished 1860
      • circa 1832 - 1876

25
Name authority example (part)
  • LC Control Number: n 87870182
  • HEADING: Arms, Caroline R. (Caroline Ruth)
  • 000 00907cz 2200205n 450
  • 100 10 |a Arms, Caroline R. |q (Caroline Ruth)
  • 400 10 |w nna |a Arms, Caroline Ruth
  • 400 10 |a Arms, C. R. |q (Caroline Ruth)
  • 670 __ |a LC data base, 8/24/87 |b (hdg.
    Arms, Caroline Ruth usage Caroline R. Arms, C.
    R. Arms)
  • 670 __ |a Campus networking strategies,
    1988 |b CIP t.p. (Caroline Arms)
  • 670 __ |a Phone call to pub., 2/10/88 |b
    (Caroline Ruth Arms studied at Oxford)
  • 670 __ |a Campus strategies for libraries
    and electronic information, c1990 |b CIP t.p.
    (Caroline Arms) data sheet (b. 10-24-45)
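One way to picture how such a record is used: the 100 field gives the authorized heading and the 400 fields give variant forms that should lead to it. The Python sketch below builds a lookup table on that basis. The dictionary layout, the `normalize` rule (lowercase, punctuation removed), and the function names are assumptions made for illustration, not how any particular library system stores authority data.

```python
import re

# Headings copied from the record above; the dictionary layout is hypothetical.
AUTHORITY_RECORD = {
    "control_number": "n 87870182",
    "heading": "Arms, Caroline R. (Caroline Ruth)",  # field 100
    "see_from": [                                    # fields 400
        "Arms, Caroline Ruth",
        "Arms, C. R. (Caroline Ruth)",
    ],
}


def normalize(name):
    """Lowercase a name and reduce punctuation to single spaces."""
    return " ".join(re.sub(r"[^a-z0-9 ]", " ", name.lower()).split())


def build_name_index(records):
    """Map every normalized form of a name to its authorized heading."""
    index = {}
    for record in records:
        for form in [record["heading"], *record["see_from"]]:
            index[normalize(form)] = record["heading"]
    return index


def resolve(name, index):
    """Return the authorized heading for a name, or None if it is unknown."""
    return index.get(normalize(name))


name_index = build_name_index([AUTHORITY_RECORD])
print(resolve("arms, caroline ruth", name_index))
print(resolve("Arms, C. R. (Caroline Ruth)", name_index))
```

Both lookups return the single heading "Arms, Caroline R. (Caroline Ruth)", which is the "bring together like items" half of the bibliographic objective.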

26
Subject information
Library of Congress Subject Headings:
    Academic libraries--United States--Automation
Hierarchical classification:
    Library of Congress call number: Z675.U5C16
    Dewey Decimal Classification: 027.7
Creation and maintenance of lists of subject
headings and classifications is a never-ending
task.
27
MARC Format
The MARC format was developed in the late 1960s
as a tagging scheme for exchanging catalog
records on magnetic tape. It remains the standard
way to represent such data. At present, MARC is
steadily (if slowly) being converted to modern
computing formats, e.g., Unicode, XML.
28
MARC Monograph catalog record
Citation: Caroline R. Arms, editor, Campus
strategies for libraries and electronic
information. Bedford, MA: Digital Press, 1990.
29
MARC fields (part)
tag  value                                            note
050  Z675.U5C16 1990
082  027.7/0973 20
245  Campus strategies for libraries and electronic   title statement
     information/Caroline Arms, editor.
260  Bedford, Mass. Digital Press, c1990.             publisher
300  xi, 404 p. ill. 24 cm.                           collation
440  EDUCOM strategies series on information          series title
     technology
504  Includes bibliographical references
     (p. 373-381).
020  ISBN 1-55558-036-X $34.95
30
MARC fields (continued)
tag  value                                            note
650  Academic libraries--United States--Automation.   subject heading
650  Libraries and electronic publishing--United
     States.
650  Library information networks--United States.
650  Information technology--United States.
700  Arms, Caroline R. (Caroline Ruth)
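Because a tag such as 650 can repeat, a record is more naturally held as an ordered list of (tag, value) pairs than as a dictionary keyed by tag. The Python sketch below uses values from the two slides above; the list layout and the `values_for` helper are illustrative assumptions, not a MARC library API.

```python
# (tag, value) pairs taken from the slides; a list keeps repeated tags and order.
FIELDS = [
    ("050", "Z675.U5C16 1990"),
    ("082", "027.7/0973 20"),
    ("245", "Campus strategies for libraries and electronic "
            "information/Caroline Arms, editor."),
    ("650", "Academic libraries--United States--Automation."),
    ("650", "Libraries and electronic publishing--United States."),
    ("650", "Library information networks--United States."),
    ("650", "Information technology--United States."),
    ("700", "Arms, Caroline R. (Caroline Ruth)"),
]


def values_for(fields, tag):
    """Return every value recorded under a tag, in record order."""
    return [value for field_tag, value in fields if field_tag == tag]


subject_headings = values_for(FIELDS, "650")

# A subject heading is a main term followed by "--"-separated subdivisions.
subdivided = [heading.rstrip(".").split("--") for heading in subject_headings]

print(len(subject_headings))
print(subdivided[0])
```

A dictionary keyed by tag would keep only the last of the four 650 fields, which is why the pairs are kept in a list.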
31
Notes on MARC
A great achievement
  • Developed in 1960s
  • Magnetic tape exchange format for printing
    catalog records
  • The dawn of computing
      • mixed upper and lower case
      • variable length fields
      • repeated fields
      • non-Roman scripts
  • 100(?) million records with standard content
    and format
  • Thousands of trained librarians (millions?)

32
Notes on MARC
A great problem
  • Not designed for computer algorithms
  • One record per item (poor links between
    records)
  • Tied to traditional materials and
    traditional practices
  • Not Unicode
  • 100s of millions of records at $100 each --
    $10 billion
  • A classic legacy system!

33
MARC Encoding
MARC encoding: 2600abcBedford, Mass.
Digital Press,c1990.
    tag: 260
    subfield a: Bedford, Mass.
    subfield b: Digital Press,
    subfield c: c1990.
Definitely not a modern encoding!
Note that the content is designed to be part of a
printed catalog record and is not in a convenient
format for computer manipulation.
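To make the encoding concrete, the Python sketch below splits one variable data field into its subfields. In the MARC 21 exchange format each subfield starts with a delimiter character (ASCII 0x1F) followed by a one-character code; the delimiter does not survive in the text above, so the raw field here is rebuilt from the three subfield values on the slide. The two blank indicators and the function name are assumptions for illustration.

```python
SUBFIELD_DELIMITER = "\x1f"  # ASCII unit separator, the MARC 21 subfield delimiter


def parse_data_field(raw):
    """Split a variable data field into (indicators, [(code, value), ...])."""
    indicators, *chunks = raw.split(SUBFIELD_DELIMITER)
    # Each chunk is a one-character subfield code followed by the value.
    subfields = [(chunk[0], chunk[1:]) for chunk in chunks if chunk]
    return indicators, subfields


# Field 260 rebuilt from the slide's subfields; blank indicators are assumed.
slide_subfields = [("a", "Bedford, Mass."), ("b", "Digital Press,"), ("c", "c1990.")]
raw_260 = "  " + "".join(
    SUBFIELD_DELIMITER + code + value for code, value in slide_subfields
)

indicators, subfields = parse_data_field(raw_260)
for code, value in subfields:
    print(f"subfield {code}: {value}")
```

Note that the punctuation printed on the catalog card ("Digital Press," with its trailing comma) is stored inside the data, which is the slide's point about content designed for printing.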
34
Modernizing MARC
1. Keep the content of the catalog record
2. Convert to Unicode for representing scripts
3. Convert to XML for tagging cataloguing
metadata.
MARCXML (MARC 21 XML)
http://www.loc.gov/standards/marcxml/
    Direct conversion to XML tagging
Metadata Object Description Schema (MODS)
http://www.loc.gov/standards/mods/
    Subset of MARC with data clean-up
35
MARC XML
Simple XML Schema: The schema retains the
semantics of MARC. Fields are treated as elements
with the tag as an attribute and indicators
treated as attributes. Subfields are treated as
subelements with the subfield code as an
attribute.
  • Lossless conversion of MARC to XML
  • Roundtripability from XML back to MARC
  • Data presentation by writing an XML stylesheet
  • Validation of MARC data
  • Extensibility
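The mapping described above can be sketched with Python's standard `xml.etree.ElementTree`. The element and attribute names (`datafield`, `subfield`, `tag`, `ind1`, `ind2`, `code`) follow MARCXML; the XML namespace is left out to keep the sketch short, and the blank indicators and function names are assumptions for illustration.

```python
import xml.etree.ElementTree as ET


def field_to_xml(tag, ind1, ind2, subfields):
    """Build a MARCXML-style datafield element (namespace omitted)."""
    datafield = ET.Element("datafield", {"tag": tag, "ind1": ind1, "ind2": ind2})
    for code, value in subfields:
        subfield = ET.SubElement(datafield, "subfield", {"code": code})
        subfield.text = value
    return datafield


def xml_to_field(datafield):
    """Recover (tag, ind1, ind2, subfields) from a datafield element."""
    subfields = [(s.get("code"), s.text) for s in datafield.findall("subfield")]
    return datafield.get("tag"), datafield.get("ind1"), datafield.get("ind2"), subfields


original = (
    "260", " ", " ",
    [("a", "Bedford, Mass."), ("b", "Digital Press,"), ("c", "c1990.")],
)

xml_text = ET.tostring(field_to_xml(*original), encoding="unicode")
round_tripped = xml_to_field(ET.fromstring(xml_text))

print(xml_text)
print(round_tripped == original)
```

Converting to XML and back returns the same tag, indicators, and subfields, which is what the slide means by lossless conversion and roundtripability.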
36
MODS Example (extracts)
<mods>
  <titleInfo>
    <title>Sound and fury</title>
    <subTitle>the making of the punditocracy /</subTitle>
  </titleInfo>
  <name type="personal">
    <namePart>Alterman, Eric</namePart>
    <role>
      <roleTerm type="text">creator</roleTerm>
    </role>
  </name>
37
MODS Example (extracts)
  <typeOfResource>text</typeOfResource>
  <originInfo>
    <place>
      <placeTerm type="text">Ithaca, N.Y</placeTerm>
    </place>
    <publisher>Cornell University Press</publisher>
    <dateIssued>c1999</dateIssued>
  </originInfo>
  <language>
    <languageTerm authority="iso639-2b" type="code">eng</languageTerm>
  </language>
</mods>
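Unlike the MARC encoding, a MODS record can be read with any XML parser. The Python sketch below joins the two extracts into one record and pulls out a few fields with `xml.etree.ElementTree`. It assumes the extracts belong to a single record and, like the slides, omits the MODS namespace; a real MODS file declares one, and the paths would need the namespace prefix.

```python
import xml.etree.ElementTree as ET

# The two slide extracts joined into one record (assumed to belong together).
MODS_EXTRACT = """<mods>
  <titleInfo>
    <title>Sound and fury</title>
    <subTitle>the making of the punditocracy /</subTitle>
  </titleInfo>
  <name type="personal">
    <namePart>Alterman, Eric</namePart>
    <role><roleTerm type="text">creator</roleTerm></role>
  </name>
  <typeOfResource>text</typeOfResource>
  <originInfo>
    <place><placeTerm type="text">Ithaca, N.Y</placeTerm></place>
    <publisher>Cornell University Press</publisher>
    <dateIssued>c1999</dateIssued>
  </originInfo>
  <language>
    <languageTerm authority="iso639-2b" type="code">eng</languageTerm>
  </language>
</mods>"""

root = ET.fromstring(MODS_EXTRACT)

# Pull selected elements into a flat record for fielded searching.
surrogate = {
    "title": root.findtext("titleInfo/title"),
    "creator": root.findtext("name[@type='personal']/namePart"),
    "publisher": root.findtext("originInfo/publisher"),
    "date": root.findtext("originInfo/dateIssued"),
    "language": root.findtext("language/languageTerm[@authority='iso639-2b']"),
}

for field, value in surrogate.items():
    print(f"{field}: {value}")
```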