Title: Evaluation of Evaluation in Information Retrieval - Tefko Saracevic
1. Evaluation of Evaluation in Information Retrieval - Tefko Saracevic
- Historical Approach to IR Evaluation.
2. Saracevic's Definition of Evaluation
- Evaluation is assessing performance or value of a
system, process, procedure, product, or policy.
3. Evaluation Requirements
- A system
  - A prototype
- A criterion or criteria
  - Objectives of the system
- Measures
  - Recall and precision
- A measuring instrument
  - Judgments by analysts/users
- Methodology
  - Procedures, e.g., as in TREC
4. Levels of Evaluation
- Engineering level
  - Hardware and software.
- Input level
  - Contents of system coverage.
- Processing level
  - Questions regarding the way inputs are processed; assessment of algorithms, techniques, and approaches.
5. Levels of Evaluation (cont.)
- Output level
  - Interactions with the system and the output obtained.
- Use and user level
  - Applications used for given tasks.
- Social level
  - Effects on research, productivity, and decision making.
- Economic level
  - Economic efficiency questions to be determined at each level of analysis.
6. Two More Classes of Evaluation
- End-user performance and use
  - Meyer & Ruiz, 1990; others summarized in Dalrymple & Roderer, 1994.
- Markets, products, and services from the information industry
  - Rapp et al., 1990. These evaluations appear regularly in trade magazines such as Online, Online Review, Searcher, etc.
7. Output and Use and User Level Evaluations
- Fenichel (1981)
- Borgman (1989)
- Saracevic, Kantor, Chamis & Trivison (1990)
- Haynes et al. (1990)
- Fidel (1991)
- Spink (1995)
8. Processing Level Approaches: Toy Collections
- Cranfield (Cleverdon, Mills & Keen, 1966)
- SMART (Salton, 1971, 1989)
- TREC (Harman, 1995)
9. Studies Conducted on the Social Level
Evaluating the impact of area-specific IR systems.
- Impact of MEDLINE on clinical decision making (Lindberg et al., 1993)
10. Criteria in IR Evaluation
- Relevance as the core criterion (Kent et al., 1955).
- Criteria such as utility and search length did not stick.
- Cranfield, SMART, and TREC all revolved around the phenomenon of relevance.
- Keeping evaluation out of the engineering level by implications of use.
- Relevance is a complex human process, not of a binary nature.
  - Dependent on circumstances.
11. Output and Use and User Level Evaluations
- Employ a multiplicity of criteria
  - Related to utility, success, completeness, worth, satisfaction, value, efficiency, cost, etc.
- More emphasis on interaction.
12. Market, Business, and Industry Evaluations
- Similar to the use and user level
- TQM (Total Quality Management)
- Cost-effectiveness
- The debate over relevance remains isolated within IR.
13. Isolation of Studies Within Their Levels of Origin
- Algorithms
- Users and Uses
- Market products/services
- Social Impacts
14. Processing Level Measures of Evaluation
- Precision
  - Ratio of relevant items retrieved to total items retrieved; or, the probability that a retrieved item is relevant.
- Recall
  - Ratio of relevant items retrieved to all relevant items available in a particular file; or, the probability that a relevant item is retrieved (both measures are sketched below).
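As a minimal sketch (not part of the original slides), both measures can be computed from sets of document IDs, assuming binary relevance judgments; the document IDs and counts below are hypothetical:

    def precision(retrieved, relevant):
        # Fraction of retrieved items that are relevant.
        retrieved, relevant = set(retrieved), set(relevant)
        return len(retrieved & relevant) / len(retrieved) if retrieved else 0.0

    def recall(retrieved, relevant):
        # Fraction of all available relevant items that were retrieved.
        retrieved, relevant = set(retrieved), set(relevant)
        return len(retrieved & relevant) / len(relevant) if relevant else 0.0

    # Hypothetical toy run: 4 items retrieved, 5 relevant in the file, 3 in common.
    retrieved = ["d1", "d2", "d3", "d7"]
    relevant = ["d1", "d2", "d3", "d5", "d9"]
    print(precision(retrieved, relevant))  # 3/4 = 0.75
    print(recall(retrieved, relevant))     # 3/5 = 0.6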
15. Measures at the Use and User Level
- Semantic differentials
- Likert scales
- Which measures to use? How do measures compare? How do they affect the results? (A toy comparison is sketched below.)
  - See Su, 1992.
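As an illustrative assumption (not from the slides), both instruments yield ordinal ratings, and one naive way to compare results across instruments with different ranges is to rescale each to [0, 1]; the scores below are hypothetical:

    from statistics import mean

    # Hypothetical judgments of one system by five users on two instruments.
    likert_satisfaction = [4, 5, 3, 4, 2]  # 5-point Likert: 1 = strongly disagree ... 5 = strongly agree
    semantic_useful = [6, 7, 5, 6, 4]      # 7-point semantic differential: 1 = useless ... 7 = useful

    def normalized_mean(scores, lo, hi):
        # Rescale an ordinal scale to [0, 1] so instruments with
        # different ranges can be placed on a common footing.
        return (mean(scores) - lo) / (hi - lo)

    print(normalized_mean(likert_satisfaction, 1, 5))  # 0.65
    print(normalized_mean(semantic_useful, 1, 7))      # ~0.77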
16. Measuring Instruments
- Mainly, people are the instruments that determine the relevance of retrieved items.
- Who are the judges? What affects their judgments? How do they affect the results?
17. Methodological Issues Surrounding Notions of Validity and Reliability
- Collection: How are items selected?
- Requests: How are they generated?
- Searching: How is it conducted?
- Results: How are they obtained?
- Analysis: What comparisons are made?
- Interpretation/Generalization: What are the conclusions? Are they warranted on the basis of the results? How generalizable are the findings?
18. Evaluation Outside of Traditional IR, i.e., Digital Libraries and the Internet
- Evaluation is limited to the software and engineering levels.
- Evaluated on their own level.
- Many applications are well received; however, at the output and use and user levels these applications are mostly found to be frustrating, unpredictable, wasteful, expensive, trivial, unreliable, and hard to use!
19. Don't Throw the Baby Out with the Bathwater!
- Dervin and Nilan, 1986
  - The article swung the pendulum to the other extreme and called for a paradigmatic shift.
  - From system-centered to user-centered evaluations.
- Both user- and system-centered approaches are needed.
20. Keep It Realistic!
- Possible solution
  - The integration of all levels of evaluation for a comprehensive, true-to-life analysis.