Title: Evaluation of Evaluation in Information Retrieval - Tefko Saracevic
1. Evaluation of Evaluation in Information Retrieval - Tefko Saracevic
- Historical Approach to IR Evaluation.
2. Saracevic's Definition of Evaluation
- Evaluation is assessing performance or value of a
system, process, procedure, product, or policy.
3. Evaluation Requirements
- A system
  - A prototype
- A criterion or criteria
  - Objectives of the system
- Measures
  - Recall and precision
- A measuring instrument
  - Judgments by analysts/users
- Methodology
  - Procedures, e.g., as in TREC
4. Levels of Evaluation
- Engineering level
  - Hardware and software.
- Input level
  - Contents of system coverage.
- Processing level
  - Questions regarding the way inputs are processed; assessment of algorithms, techniques, and approaches.
5. Levels of Evaluation (cont.)
- Output level
  - Interactions with the system and the output obtained.
- Use and user level
  - Applications used for given tasks.
- Social level
  - Effects on research, productivity, and decision making.
- Economic level
  - Economic efficiency questions to be determined at each level of analysis.
6. Two More Classes of Evaluation
- End-user performance and use
  - Meyer & Ruiz, 1990; others summarized in Dalrymple & Roderer, 1994.
- Markets, products, and services from the information industry
  - Rapp et al., 1990. These evaluations appear regularly in trade magazines such as Online, Online Review, Searcher, etc.
7. Output and Use and User Level Evaluations
- Fenichel (1981)
- Borgman (1989)
- Saracevic, Kantor, Chamis & Trivison (1990)
- Haynes et al. (1990)
- Fidel (1991)
- Spink (1995)
8. Processing Level Approaches: Toy Collections
- Cranfield (Cleverdon, Mills & Keen, 1966)
- SMART (Salton, 1971, 1989)
- TREC (Harman, 1995)
9. Studies Conducted on the Social Level
Evaluating the impact of area-specific IR systems.
- Impact of MEDLINE on clinical decision making (Lindberg et al., 1993)
10. Criteria in IR Evaluation
- Relevance as the core criterion (Kent et al., 1955).
- Criteria such as utility and search length did not stick.
- Cranfield, SMART, and TREC all revolved around the phenomenon of relevance.
- Keeping evaluation out of the engineering level by implications of use.
- Relevance is a complex human process, not of a binary nature.
  - Dependent on circumstances.
11. Output and Use and User Level Evaluations
- Employ a multiplicity of criteria
  - Related to utility, success, completeness, worth, satisfaction, value, efficiency, cost, etc.
- More emphasis on interaction.
12. Market, Business, and Industry Evaluations
- Similar to the use and user level
- TQM (Total Quality Management)
- Cost-effectiveness
- The debate over relevance remains isolated within IR.
13. Isolation of Studies Within Their Levels of Origin
- Algorithms
- Users and Uses
- Market products/services
- Social Impacts
14. Processing Level Measures of Evaluation
- Precision
  - Ratio of relevant items retrieved to total items retrieved; or, the probability that a retrieved item is relevant.
- Recall
  - Ratio of relevant items retrieved to all relevant items available in a particular file; or, the probability that a relevant item is retrieved (both measures are sketched below).
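As a minimal sketch (not part of the original slides), both measures can be computed from sets of document IDs, assuming binary relevance judgments; the document IDs and counts below are hypothetical:

    def precision(retrieved, relevant):
        # Fraction of retrieved items that are relevant.
        retrieved, relevant = set(retrieved), set(relevant)
        return len(retrieved & relevant) / len(retrieved) if retrieved else 0.0

    def recall(retrieved, relevant):
        # Fraction of all available relevant items that were retrieved.
        retrieved, relevant = set(retrieved), set(relevant)
        return len(retrieved & relevant) / len(relevant) if relevant else 0.0

    # Hypothetical toy run: 4 items retrieved, 5 relevant in the file, 3 in common.
    retrieved = ["d1", "d2", "d3", "d7"]
    relevant = ["d1", "d2", "d3", "d5", "d9"]
    print(precision(retrieved, relevant))  # 3/4 = 0.75
    print(recall(retrieved, relevant))     # 3/5 = 0.6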
15. Measures at the Use and User Level
- Semantic differentials
- Likert scales
- Which measures to use? How do measures compare? How do they affect the results? (A toy comparison is sketched below.)
  - See Su, 1992.
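As an illustrative assumption (not from the slides), both instruments yield ordinal ratings, and one naive way to compare results across instruments with different ranges is to rescale each to [0, 1]; the scores below are hypothetical:

    from statistics import mean

    # Hypothetical judgments of one system by five users on two instruments.
    likert_satisfaction = [4, 5, 3, 4, 2]  # 5-point Likert: 1 = strongly disagree ... 5 = strongly agree
    semantic_useful = [6, 7, 5, 6, 4]      # 7-point semantic differential: 1 = useless ... 7 = useful

    def normalized_mean(scores, lo, hi):
        # Rescale an ordinal scale to [0, 1] so instruments with
        # different ranges can be placed on a common footing.
        return (mean(scores) - lo) / (hi - lo)

    print(normalized_mean(likert_satisfaction, 1, 5))  # 0.65
    print(normalized_mean(semantic_useful, 1, 7))      # ~0.77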
16. Measuring Instruments
- Mainly, people are the instruments that determine the relevance of retrieved items.
- Who are the judges? What affects their judgments? How do they affect the results?
17. Methodological Issues Surrounding Notions of Validity and Reliability
- Collection: How are items selected?
- Requests: How are they generated?
- Searching: How is it conducted?
- Results: How are they obtained?
- Analysis: What comparisons are made?
- Interpretation/Generalization: What are the conclusions? Are they warranted on the basis of the results? How generalizable are the findings?
18. Evaluation Outside of Traditional IR, i.e., Digital Libraries and the Internet
- Evaluation is limited to the software and engineering levels.
- Evaluated on their own level.
- Many applications are well received; however, at the output and use and user levels these applications are mostly found to be frustrating, unpredictable, wasteful, expensive, trivial, unreliable, and hard to use!
19. Don't Throw the Baby Out with the Bathwater!
- Dervin and Nilan, 1986
  - The article swung the pendulum to the other extreme and called for a paradigmatic shift.
  - From system-centered to user-centered evaluations.
- Both user- and system-centered approaches are needed.
20. Keep It Realistic!
- Possible solution
  - The integration of all levels of evaluation for a comprehensive, true-to-life analysis.