CS 4300 INFO 4300 Information Retrieval - PowerPoint PPT Presentation

Slides: 37
Provided by: wya2
Transcript and Presenter's Notes

1
CS 4300 / INFO 4300 Information Retrieval
Lecture 27 Searching library collections 2
Building the all-digital library
2
Course Administration
Office hours: No office hours Tuesday, December 9
3
Effective Information Discovery: Before Digital
Information
  • Searching
  • (a) Resources separated into categories of
    related materials. Each category organized,
    indexed and searched separately.
  • (b) Catalogs and indexes built on tightly
    controlled metadata standards, e.g., MARC, MeSH
    headings, etc.
  • (c) Search engines used Boolean operators and
    fielded searching.
  • (d) Query languages and search interfaces
    assumed a trained user.
  • (e) Resources were physical items.

4
Effective Information Discovery: Homogeneous
Digital Information
  • Comprehensive metadata with Boolean retrieval.
    Can be excellent for well-understood categories
    of material, but requires standardized metadata
    and relatively homogeneous content (e.g., MARC
    catalog).
  • Full text indexing with ranked retrieval. Can be
    excellent, but methods developed and validated
    for relatively homogeneous textual material
    (e.g., TREC ad hoc track).
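As a rough illustration of Boolean retrieval over fielded metadata, the sketch below builds a small inverted index keyed by (field, term) and intersects posting sets. The records, field names, and function names are hypothetical and stand in for a real catalog; note that the result is an unranked set.

```python
from collections import defaultdict

# Hypothetical catalog records with two metadata fields (not real MARC data).
RECORDS = {
    1: {"title": "birds of north america", "subject": "ornithology"},
    2: {"title": "earthquake data sets", "subject": "geology"},
    3: {"title": "bird song recordings", "subject": "ornithology"},
}


def build_index(records):
    """Map each (field, term) pair to the set of record ids containing it."""
    index = defaultdict(set)
    for rec_id, fields in records.items():
        for field, text in fields.items():
            for term in text.split():
                index[(field, term)].add(rec_id)
    return index


def boolean_and(index, *field_terms):
    """Return ids matching every (field, term) pair; the result is unranked."""
    postings = [index.get(ft, set()) for ft in field_terms]
    return set.intersection(*postings) if postings else set()


INDEX = build_index(RECORDS)
hits = boolean_and(INDEX, ("subject", "ornithology"), ("title", "recordings"))
print(sorted(hits))
```

The query only works because every record uses the same fields and vocabulary, which is the standardization requirement the slide points to.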
5
Mixed Digital Content
Examples: NSDL-funded collections at Cornell
  • Atlas. Data sets of earthquakes, volcanoes,
    etc.
  • Reuleaux. Digitized kinematics models from the
    nineteenth century.
  • Laboratory of Ornithology. Sound recordings,
    images, videos of birds and other animals.
  • Nuprl. Logic-based tools to support programming
    and to implement formal computational
    mathematics.
6
Mixed Metadata: the Chimera of Standardization
  • Technical reasons
  • Characteristics of formats and genres
  • Differing user needs
  • Social and cultural reasons
  • Economic factors
  • Installed base

7
Standardization: Function versus cost of
acceptance
[Figure: axes "Cost of acceptance" and "Function";
regions labeled "Few adopters" and "Many
adopters"]
8
Example: security
[Figure: axes "Cost of acceptance" and "Function";
points labeled "Public key infrastructure",
"Login ID and password", and "IP address"]
9
Example: metadata standards
[Figure: axes "Cost of acceptance" and "Function";
points labeled "MARC", "Dublin Core", and "Free
text"]
10
Information Discovery in a Messy World
Building blocks
  • Brute force computation
  • The expertise of users: human in the loop
Methods
  • (a) Better understanding of how and why users
    seek information
  • (b) Relationships and context information
  • (c) Multi-modal information discovery
  • (d) User interfaces for exploring information
11
Understanding How and Why Users Seek Information
Homogeneous content
  • All documents are assumed equal
  • Criterion is relevance (binary measure)
  • Goal is to find all relevant documents (high
    recall)
  • Hits ranked in order of similarity to query
Mixed content
  • Some documents are more important than others
  • Goal is to find most useful documents on a
    topic and then browse
  • Hits ranked in order that combines importance
    and similarity to query
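One simple way to picture a ranking that combines importance and similarity is a weighted sum. The sketch below is an assumption for illustration only: the linear blend, the `weight` parameter, and the scores are hypothetical, and the slides do not say how the two signals are actually combined.

```python
def combined_score(similarity, importance, weight=0.5):
    """Blend query similarity with a query-independent importance score.

    Both inputs are assumed to lie in [0, 1]; weight is a hypothetical
    tuning parameter, not a value from the lecture.
    """
    return weight * similarity + (1.0 - weight) * importance


# Hypothetical hits: (document id, similarity to query, importance).
HITS = [
    ("obscure-note", 0.90, 0.10),
    ("major-portal", 0.70, 0.80),
    ("off-topic-hub", 0.20, 0.90),
]

# Homogeneous-content style: rank by similarity alone.
by_similarity = sorted(HITS, key=lambda h: h[1], reverse=True)
# Mixed-content style: rank by the blended score.
by_combined = sorted(HITS, key=lambda h: combined_score(h[1], h[2]), reverse=True)

print([h[0] for h in by_similarity])
print([h[0] for h in by_combined])
```

With these made-up numbers the most similar document no longer comes first once importance is taken into account.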
12
Case Study
Information discovery in the National Science
Foundation's National Science Digital Library
(NSDL).
The goal of the NSDL is to be a digital library
for all aspects of science education, where
science and education are very broadly defined.
http://nsdl.org
13
Why Technology in Education? Why a Digital
Library for Education?
Higher Education. U.S. higher education is the
best in the world, but it is very expensive. How
can we keep the quality while lowering the cost?
K-12. The best K-12 education in the U.S. is
excellent, but much is mediocre or worse. How
can the best be made available to all?
Technology-enhanced education offers a way to
increase the productivity of the skilled people
who teach in both higher and K-12 education.
14
Why a Digital Library for Science Education?
Excellent teaching materials have been
developed... but they are not being used
effectively.
The NSDL provides organization and access for
teachers and students:
  • Preservation and reuse.
  • Searching and browsing.
  • Links between teaching materials and their
    educational use.
15
The NSDL Architecture
Educational materials are scattered across the
Internet.
[Figure: example sources labeled "State
standards", "Math Forum", "NASA", "Ask a
Scientist", and "Scientific American"]
16
The All-Digital Library: One Library, Many
Portals
Different Groups of Users Need Different Views of
the Library
  • View for general users.
  • Views by discipline (e.g., mathematics) or
    expertise (e.g., middle school).
  • APIs for computer agents that act on behalf of
    users (eScience).
17
The All-Digital Library: A Spectrum of
Interoperability
The Problem
Conventional approaches require the suppliers of
information to support agreements (technical,
content, and business).
But the all-digital library will have thousands
of very different sources of information, most of
which are not directly part of the library.
18
Approaches to Interoperability
The conventional approach
  • Wise people develop standards: protocols,
    formats, etc.
  • Everybody implements the standards.
  • This creates an integrated, distributed system.
Unfortunately ...
  • Standards are expensive to adopt.
  • Concepts are continually changing.
  • Systems are continually changing.
  • Different people have different ideas.
19
Interoperability Is About Agreements
  • Technical agreements cover formats, protocols,
    security systems so that messages can be
    exchanged, etc.
  • Content agreements cover the data and metadata,
    and include semantic agreements on the
    interpretation of the messages.
  • Organizational agreements cover the ground
    rules for access, for changing collections and
    services, payment, authentication, etc.
The challenge is to create incentives for
independent digital libraries to adopt agreements.
20
The Spectrum of Interoperability
Level        Agreements                  Example
Federation   Strict use of standards     AACR, MARC,
             (syntax, semantic,          Z39.50
             and business)
Harvesting   Digital libraries expose    Open Archives
             metadata; simple            metadata
             protocol and registry       harvesting
Gathering    Digital libraries do not    Web crawlers
             cooperate; services must    and search
             seek out information        engines
21
Architectural Choices: Federated Searching
(Multicast)
[Figure labels: User, User interface service,
Collections]
The federated collections all support the same
protocol and formats. To search the collections,
a user interface service broadcasts a query to
all the collections and combines the results.
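The broadcast-and-combine step can be sketched in a few lines. Everything here is a hypothetical stand-in: the in-memory collections, the single-term matching, and the function names are assumptions, and a real federated service would send the query over a shared protocol instead.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical collections; each maps a document id to its text.
COLLECTIONS = {
    "atlas": {"a1": "earthquake data", "a2": "volcano data"},
    "birds": {"b1": "bird song recordings", "b2": "bird data"},
}


def search_collection(name, query):
    """Stand-in for one collection's search endpoint: match a single term."""
    docs = COLLECTIONS[name]
    return [(name, doc_id) for doc_id, text in docs.items() if query in text.split()]


def federated_search(query):
    """Broadcast the query to every collection and merge the result lists."""
    names = sorted(COLLECTIONS)
    with ThreadPoolExecutor(max_workers=len(names)) as pool:
        per_collection = pool.map(lambda n: search_collection(n, query), names)
        merged = [hit for hits in per_collection for hit in hits]
    return sorted(merged)


print(federated_search("data"))
```

The merge is trivial only because every collection returns results in the same shape, which is the "same protocol and formats" requirement in practice.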
22
Federated Searching: Z39.50
Z39.50 is a stateful (session) protocol that is
used for federated searching in conjunction with
content standards, such as MARC.
Example of a Z39.50 session:
  • Open connection
  • Begin session
  • Interactive session
  • End session
  • Close connection
Client and server remember the results of
previous transactions (e.g., authentication,
partial results) until the session is closed.
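To make "stateful" concrete, here is a toy server that remembers a result set between requests. It is a hypothetical illustration of a session protocol in general; the class, method names, and data are assumptions and do not reproduce Z39.50 messages or encodings.

```python
class SessionServer:
    """Toy stateful server: it keeps per-session state between requests.

    This is a hypothetical illustration of a session protocol, not an
    implementation of Z39.50 or its message formats.
    """

    def __init__(self, records):
        self.records = records
        self.sessions = {}  # session id -> state remembered for that client
        self.next_id = 1

    def begin_session(self, user):
        session_id = self.next_id
        self.next_id += 1
        self.sessions[session_id] = {"user": user, "result_set": []}
        return session_id

    def search(self, session_id, term):
        # The result set is stored on the server; only its size is returned.
        state = self.sessions[session_id]
        state["result_set"] = [r for r in self.records if term in r]
        return len(state["result_set"])

    def present(self, session_id, start, count):
        # A later request retrieves part of the remembered result set.
        return self.sessions[session_id]["result_set"][start:start + count]

    def end_session(self, session_id):
        # All remembered state is discarded when the session ends.
        del self.sessions[session_id]


server = SessionServer(["bird songs", "bird images", "earthquake maps"])
sid = server.begin_session("student")
print(server.search(sid, "bird"))
print(server.present(sid, 1, 1))
server.end_session(sid)
```

The second request makes sense only because the server kept the earlier result set, which is also why every participating server must hold state for every open session.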
23
Federated Searching: Difficulties
To be successful, federated searching needs:
  • All collections to implement a complex
    hierarchy of formats and protocols.
  • All collections to support shared business and
    security models.
  • Reliable servers (since the overall system
    performance depends on the least reliable
    component).
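A quick calculation shows why server reliability matters as the federation grows. The sketch assumes, purely for illustration, that a query needs every server to respond and that servers fail independently with the same availability; neither assumption comes from the slides.

```python
def all_up_probability(per_server_availability, num_servers):
    """Probability that every server responds, assuming independent failures."""
    return per_server_availability ** num_servers


# Hypothetical availability of 99% per server.
for n in (1, 10, 100):
    print(n, round(all_up_probability(0.99, n), 3))
```

Under these assumptions, a federation of a hundred servers that are each up 99% of the time answers a full broadcast query well under half the time.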
24
Architecture: Central Repository
[Figure label: Repository]
The repository holds information about every
collection and item in the library.
25
Search Service in the All-Digital Library
Full Text or Metadata?
  • Full text indexing is excellent, but is not
    possible for all materials (non-textual,
    supplier may not provide access for indexing).
  • Comprehensive metadata is available for very
    few of the materials.
What Architecture to Use?
  • Few collections support an established search
    protocol (e.g., Z39.50).
26
Mixed Search Service
[Figure labels: Repository, Search Service,
Collections]
The search service combines metadata from the
Repository and full text from the collections.
27
Standards Implemented in the NSDL Repository:
Phase 1
  • Object model
  • Metadata: Dublin Core with educational
    extensions
  • Ingest and redistribution: Open Archives
    Initiative Protocol for Metadata Harvesting
[Figure labels: Collection, URL, URL, Items]
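For a sense of what harvesting looks like on the wire, the sketch below builds an OAI-PMH `ListRecords` request for unqualified Dublin Core (`oai_dc`). The base URL and the helper name are hypothetical, and no request is actually sent; the NSDL's own ingest code is not shown in the slides.

```python
from urllib.parse import urlencode

# Hypothetical repository endpoint; substitute a real OAI-PMH base URL.
BASE_URL = "https://collection.example.org/oai"


def list_records_url(base_url, from_date=None, set_spec=None):
    """Build an OAI-PMH ListRecords request for unqualified Dublin Core."""
    params = {"verb": "ListRecords", "metadataPrefix": "oai_dc"}
    if from_date:
        params["from"] = from_date  # harvest only records changed since this date
    if set_spec:
        params["set"] = set_spec  # restrict the harvest to one set
    return base_url + "?" + urlencode(params)


print(list_records_url(BASE_URL, from_date="2008-12-01"))
```

The collection only has to answer simple HTTP requests like this one, which is why harvesting has a lower cost of acceptance than federation.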
28
NSDL Search Service: Phase 1
  • Approach
  • Collections map metadata to Dublin Core, provide
    via Open Archives protocol.
  • Search service augments Dublin Core metadata with
    indexing of full-text where available.
  • The search engine is Lucene (tf.idf weighting)
  • User interface returns snippets derived from the
    metadata, links to full content and to metadata.
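The approach can be sketched as a tiny index that merges metadata fields with full text where it exists and ranks with a simplified tf.idf weight. The items, field names, and scoring formula are assumptions for illustration; Lucene's actual scoring is more elaborate than this.

```python
import math
from collections import Counter

# Hypothetical items: Dublin Core style metadata plus optional full text.
ITEMS = {
    "item1": {"title": "bird song archive", "description": "recordings",
              "full_text": "song of the wood thrush"},
    "item2": {"title": "earthquake atlas", "description": "seismic data sets",
              "full_text": None},
    "item3": {"title": "thrush images", "description": "bird photographs",
              "full_text": None},
}


def indexed_terms(item):
    """Count metadata terms, augmented with full-text terms where available."""
    text = " ".join(value for value in item.values() if value)
    return Counter(text.split())


TERMS = {item_id: indexed_terms(item) for item_id, item in ITEMS.items()}


def idf(term):
    """Inverse document frequency: rarer terms get a higher weight."""
    df = sum(1 for counts in TERMS.values() if term in counts)
    return math.log(len(TERMS) / df) if df else 0.0


def score(query, item_id):
    """Sum of term frequency times idf over the query terms."""
    counts = TERMS[item_id]
    return sum(counts[term] * idf(term) for term in query.split())


def search(query):
    scored = [(score(query, item_id), item_id) for item_id in TERMS]
    return [item_id for s, item_id in sorted(scored, reverse=True) if s > 0]


print(search("thrush song"))
```

In this made-up data, "thrush" matches item1 only through its full text, so a snippet built from the metadata alone would not show the user why the item was returned.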

29
Mixed Search Service
  • Weaknesses
  • (a) Ranking by similarity to query not
    sufficient.
  • (b) Snippets do not indicate why item was
    returned (e.g., terms in full text but not in
    metadata).
  • (c) Dublin Core records provide limited
    information.
  • (d) Browsing environment difficult with mixed
    content.
  • (e) Many users begin their search with a Web
    search engine (e.g., Google or Yahoo).

30
The All-Digital Library and the Web
Many people will find materials through Web
search engines. Therefore the library must be
indexed by them.
[Figure labels: Repository, Collections]
31
Information Retrieval and Change
This is a period of rapid change in Information
Retrieval.
Users
Searches carried out by the end user, not by a
professionally trained intermediary
  -> Many quick searches, few comprehensive
     searches
  -> Emphasis on simple user interfaces
  -> Movement away from complex query languages
  -> Limited understanding of controlled
     vocabulary
  -> Dangers of the Boolean model
32
Information Retrieval and Change
This is a period of rapid change in Information
Retrieval.
Materials
Physical documents becoming less important
relative to digital materials
  -> Possible to browse documents interactively
     in conjunction with search
Very large heterogeneous collections, often
changing rapidly
  -> Impossible to provide human-generated
     metadata for everything
33
Information Retrieval and Change
This is a period of rapid change in Information
Retrieval.
Computer hardware
Very rapid improvements in all aspects of
hardware and widespread broadband networks
  -> Huge digital collections
  -> Comprehensive indexes generated
     automatically
  -> Powerful user interfaces
  -> Interactive retrieval of all document types
  -> Data mining of logs, etc.
34
Information Retrieval and Change
This is a period of rapid change in Information
Retrieval.
Information Science
  • Measures of precision-recall replaced by
    concepts of importance of documents.
  • Searching as part of an information discovery
    process with incrementally changing goals and
    complex interactions of searching, browsing.
  • Use of contextual information, such as
    hyperlinks, anchor text, etc.
35
Information Retrieval and Change
This is a period of rapid change in Information
Retrieval.
Computer Science
  • Very large-scale computer clusters with
    functional programming concepts, such as
    MapReduce.
  • Graphical algorithms, such as PageRank.
  • Machine learning methods to tune systems,
    replacing empirical heuristics.
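As a reminder of what a link-based importance measure computes, here is a minimal power-iteration sketch of PageRank on a four-page graph. The graph, the damping value of 0.85, and the fixed iteration count are assumptions for illustration, not parameters taken from the lecture.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power iteration over a dict mapping each page to its outgoing links."""
    pages = sorted(links)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page receives a small baseline share of rank.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page, outgoing in links.items():
            # A page with no outgoing links spreads its rank over all pages.
            targets = outgoing if outgoing else pages
            share = damping * rank[page] / len(targets)
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical link graph: page "c" is linked to by every other page.
LINKS = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(LINKS)
print(max(ranks, key=ranks.get))
```

The scores depend only on the link structure, not on any query, which is what makes them usable as the "importance" half of a combined ranking.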
36
Information Retrieval and People
The innovations in information retrieval are done
by people like you. Many of them have been from
Cornell. As you learned from the discussion
papers, some of them made mistakes or followed
unsuccessful paths, but research is like that.
Hopefully, future courses in information retrieval
will include innovations that some of you have
contributed to.
The end