Lecture 20: Evaluation
Transcript and Presenter's Notes


1
Lecture 20: Evaluation
SIMS 202 Information Organization and Retrieval
  • Prof. Ray Larson and Prof. Marc Davis
  • UC Berkeley SIMS
  • Tuesday and Thursday, 10:30 am - 12:00 pm
  • Fall 2002
  • http://www.sims.berkeley.edu/academics/courses/is202/f02/

2
Lecture Overview
  • Review
  • Lexical Relations
  • WordNet
  • Can Lexical and Semantic Relations be Exploited
    to Improve IR?
  • Evaluation of IR systems
  • Precision vs. Recall
  • Cutoff Points
  • Test Collections/TREC
  • Blair and Maron Study

Credit for some of the slides in this lecture
goes to Marti Hearst and Warren Sack
3
Syntax
  • The syntax of a language is to be understood as a
    set of rules which accounts for the distribution
    of word forms throughout the sentences of a
    language
  • These rules codify permissible combinations of
    classes of word forms

4
Semantics
  • Semantics is the study of linguistic meaning
  • Two standard approaches to lexical semantics
    (cf., sentential semantics and logical
    semantics)
  • (1) Compositional
  • (2) Relational

5
Pragmatics
  • Deals with the relation between signs or
    linguistic expressions and their users
  • Deixis (literally "pointing out")
  • E.g., "I'll be back in an hour" depends upon the
    time of the utterance
  • Conversational implicature
  • A: Can you tell me the time?
  • B: Well, the milkman has come. (I don't know
    exactly, but perhaps you can deduce it from some
    extra information I give you.)
  • Presupposition
  • Are you still such a bad driver?
  • Speech acts
  • Constatives vs. performatives
  • E.g., "I second the motion."
  • Conversational structure
  • E.g., turn-taking rules

6
Major Lexical Relations
  • Synonymy
  • Polysemy
  • Metonymy
  • Hyponymy/Hyperonymy
  • Meronymy
  • Antonymy

7
Thesauri and Lexical Relations
  • Polysemy: Same word, different senses of meaning
  • Slightly different concepts expressed similarly
  • Synonymy: Different words, related senses of
    meaning
  • Different ways to express similar concepts
  • Thesauri help draw all these together
  • Thesauri also commonly define a set of relations
    between terms that is similar to lexical
    relations
  • BT, NT, RT (Broader Term, Narrower Term, Related
    Term)

8
WordNet
  • Started in 1985 by George Miller, students, and
    colleagues at the Cognitive Science Laboratory,
    Princeton University
  • Can be downloaded for free
  • www.cogsci.princeton.edu/wn/
  • "In terms of coverage, WordNet's goals differ
    little from those of a good standard
    college-level dictionary, and the semantics of
    WordNet is based on the notion of word sense that
    lexicographers have traditionally used in writing
    dictionaries. It is in the organization of that
    information that WordNet aspires to innovation."
  • (Miller, 1998, Chapter 1)

9
WordNet Size
WordNet uses synsets: sets of synonymous terms
  POS          Unique Strings    Synsets
  Noun            107,930         74,488
  Verb             10,806         12,754
  Adjective        21,365         18,523
  Adverb            4,583          3,612
  Totals          144,684        109,377

10
Structure of WordNet
11
Structure of WordNet
12
Structure of WordNet
13
Lexical Relations and IR
  • Recall that most IR research has primarily looked
    at statistical approaches to inferring the
    topicality or meaning of documents
  • I.e., Statistics imply Semantics
  • Is this really true or correct?
  • How has WordNet been used (or how might it be
    used) to provide more functionality in searching?
  • What about other thesauri, classification
    schemes, and ontologies?

14
Using NLP
  • Strzalkowski

[Diagram (Strzalkowski's NLP-IR architecture): text
is run through a tagger and parser to build an NLP
representation; the resulting terms feed the
database search]
15
NLP and IR: Possible Approaches
  • Indexing
  • Use of NLP methods to identify phrases
  • Test weighting schemes for phrases
  • Use of more sophisticated morphological analysis
  • Searching
  • Use of two-stage retrieval
  • Statistical retrieval
  • Followed by more sophisticated NLP filtering
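A minimal sketch of the two-stage idea above; the statistical ranker and NLP scorer are hypothetical placeholders passed in as functions, not any particular system's components:

```python
def two_stage_retrieval(query, collection, statistical_rank, nlp_score, k=1000):
    """Stage 1: cheap statistical ranking of the whole collection.
    Stage 2: re-rank (or filter) only the top-k candidates with a
    more expensive NLP-based score.

    statistical_rank(query, collection) -> list of doc ids, best first
    nlp_score(query, doc_id)            -> float from the NLP analysis
    """
    candidates = statistical_rank(query, collection)[:k]   # stage 1
    return sorted(candidates,                              # stage 2
                  key=lambda doc: nlp_score(query, doc),
                  reverse=True)
```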

16
Can Statistics Approach Semantics?
  • One approach is the Entry Vocabulary Index (EVI)
    work being done here
  • (The following slides are from my presentation at
    JCDL 2002)

17
What is an Entry Vocabulary Index?
  • EVIs are a means of mapping from users'
    vocabulary to the controlled vocabulary of a
    collection of documents

18
Solution: Entry Level Vocabulary Indexes
[Diagram: an EVI maps the user's term "Automobile"
to the index term "pass mtr veh spark ign eng"]
19
[Diagram: digital library resources linked to user
vocabulary by statistical association]
20
Lecture Overview
  • Review
  • Lexical Relations
  • WordNet
  • Can Lexical and Semantic Relations be Exploited
    to Improve IR?
  • Evaluation of IR systems
  • Precision vs. Recall
  • Cutoff Points
  • Test Collections/TREC
  • Blair and Maron Study

Credit for some of the slides in this lecture
goes to Marti Hearst and Warren Sack
21
IR Evaluation
  • Why Evaluate?
  • What to Evaluate?
  • How to Evaluate?

22
Why Evaluate?
  • Determine if the system is desirable
  • Make comparative assessments
  • Is system X better than system Y?
  • Others?

23
What to Evaluate?
  • How much of the information need is satisfied
  • How much was learned about a topic
  • Incidental learning
  • How much was learned about the collection
  • How much was learned about other topics
  • How inviting the system is

24
Relevance
  • In what ways can a document be relevant to a
    query?
  • Answer precise question precisely
  • Partially answer question
  • Suggest a source for more information
  • Give background information
  • Remind the user of other knowledge
  • Others...

25
Relevance
  • How relevant is the document?
  • For this user, for this information need
  • Subjective, but
  • Measurable to some extent
  • How often do people agree a document is relevant
    to a query?
  • How well does it answer the question?
  • Complete answer? Partial?
  • Background Information?
  • Hints for further exploration?

26
What to Evaluate?
  • What can be measured that reflects users' ability
    to use the system? (Cleverdon, 1966)
  • Coverage of information
  • Form of presentation
  • Effort required/ease of use
  • Time and space efficiency
  • Recall
  • Proportion of relevant material actually
    retrieved
  • Precision
  • Proportion of retrieved material actually relevant

Recall and precision together measure effectiveness (see the sketch below)
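A minimal sketch of how the two measures are computed for a single query, assuming the sets of retrieved and relevant document ids are known (the example numbers are invented):

```python
def precision_recall(retrieved, relevant):
    """Precision and recall for one query.

    retrieved -- set of doc ids returned by the system
    relevant  -- set of doc ids judged relevant for the query
    """
    hits = len(retrieved & relevant)                      # relevant AND retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# 4 of the 10 retrieved docs are relevant; 20 docs are relevant overall
p, r = precision_recall(set(range(10)), set(range(6, 26)))
print(p, r)   # 0.4 0.2
```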
27
Relevant vs. Retrieved
[Venn diagram: within the set of all docs, the
retrieved set and the relevant set overlap]
28
Precision vs. Recall
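The formulas on the original slide were an image; the standard definitions, matching the wording on the previous slide, are:

$$
\text{Precision} = \frac{|\text{relevant} \cap \text{retrieved}|}{|\text{retrieved}|}
\qquad
\text{Recall} = \frac{|\text{relevant} \cap \text{retrieved}|}{|\text{relevant}|}
$$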
29
Why Precision and Recall?
  • Get as much good stuff while at the same time
    getting as little junk as possible

30
Retrieved vs. Relevant Documents
Very high precision, very low recall
31
Retrieved vs. Relevant Documents
Very low precision, very low recall (0 in fact)
32
Retrieved vs. Relevant Documents
High recall, but low precision
33
Retrieved vs. Relevant Documents
High precision, high recall (at last!)
34
Precision/Recall Curves
  • There is a tradeoff between Precision and Recall
  • So measure Precision at different levels of
    Recall
  • Note this is an AVERAGE over MANY queries
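One common way to build such an averaged curve, offered here as a sketch rather than the exact procedure behind the original figure, is 11-point interpolated precision per query, then a mean over queries:

```python
def eleven_point_curve(ranked, relevant):
    """Interpolated precision at recall = 0.0, 0.1, ..., 1.0 for one query.

    ranked   -- doc ids in the order the system returned them
    relevant -- non-empty set of doc ids judged relevant
    """
    points, hits = [], 0
    for k, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
        points.append((hits / len(relevant), hits / k))   # (recall, precision)
    # interpolated precision at level r = max precision at any recall >= r
    return [max((p for rc, p in points if rc >= level), default=0.0)
            for level in (i / 10 for i in range(11))]

# Averaging these 11 values position-by-position over many queries
# gives one precision/recall curve per system.
```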

35
Precision/Recall Curves
  • Difficult to determine which of these two
    hypothetical results is better

[Plot: precision (y-axis) against recall (x-axis),
with measured points for the two hypothetical
result sets]
36
TREC (Manual Queries)
37
Document Cutoff Levels
  • Another way to evaluate:
  • Fix the number of documents retrieved at several
    cutoff levels
  • Top 5
  • Top 10
  • Top 20
  • Top 50
  • Top 100
  • Top 500
  • Measure precision at each of these levels
  • Take (weighted) average over results
  • This is a way to focus on how well the system
    ranks the first k documents
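A small sketch of precision at a cutoff ("precision at k"); the result-list and judgment-set names in the usage comment are placeholders:

```python
def precision_at_k(ranked, relevant, k):
    """Precision over the top-k documents of a ranked result list."""
    return sum(1 for doc in ranked[:k] if doc in relevant) / k

# Typical report: P@5, P@10, ..., averaged over all queries
# for k in (5, 10, 20, 50, 100, 500):
#     print(k, precision_at_k(results, judged_relevant, k))
```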

38
Problems with Precision/Recall
  • Can't know the true recall value
  • Except in small collections
  • Precision/Recall are related
  • A combined measure sometimes more appropriate
  • Assumes batch mode
  • Interactive IR is important and has different
    criteria for successful searches
  • We will touch on this in the UI section
  • Assumes a strict rank ordering matters

39
Relation to Contingency Table
                         Doc is relevant    Doc is NOT relevant
  Doc is retrieved              a                    b
  Doc is NOT retrieved          c                    d

  • Accuracy = (a + d) / (a + b + c + d)
  • Precision = a / (a + b)
  • Recall = ?
  • Why don't we use Accuracy for IR evaluation?
    (Assuming a large collection)
  • Most docs aren't relevant
  • Most docs aren't retrieved
  • This inflates the accuracy value (see the example
    below)
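A quick numeric illustration of that inflation; the counts are invented for the example:

```python
# Hypothetical counts for one query over a 1,000,000-document collection
a, b = 80, 120                    # relevant retrieved, non-relevant retrieved
c = 920                           # relevant but not retrieved
d = 1_000_000 - a - b - c         # everything else: not relevant, not retrieved

accuracy  = (a + d) / (a + b + c + d)   # 0.99896 -- looks nearly perfect
precision = a / (a + b)                 # 0.40
recall    = a / (a + c)                 # 0.08 -- the search is actually poor
print(accuracy, precision, recall)
```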

40
The E-Measure
  • Combine Precision and Recall into one number (van
    Rijsbergen, 1979)

P = precision, R = recall, b = measure of the
relative importance of P or R. For example, b = 0.5
means the user is twice as interested in precision
as in recall.
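The formula itself was an image on the original slide; the standard van Rijsbergen form consistent with this notation is:

$$
E = 1 - \frac{(1 + b^2)\,P\,R}{b^2 P + R}
$$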
41
F Measure (Harmonic Mean)
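This slide's formula was also an image; the usual harmonic-mean form (the complement of the E-measure above) is:

$$
F = \frac{2\,P\,R}{P + R}
\qquad\text{or, weighted,}\qquad
F_b = \frac{(1 + b^2)\,P\,R}{b^2 P + R} = 1 - E
$$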
42
Test Collections
  • Cranfield 2: 1,400 Documents, 221 Queries
  • 200 Documents, 42 Queries
  • INSPEC: 542 Documents, 97 Queries
  • UKCIS: > 10,000 Documents, multiple sets, 193
    Queries
  • ADI: 82 Documents, 35 Queries
  • CACM: 3,204 Documents, 50 Queries
  • CISI: 1,460 Documents, 35 Queries
  • MEDLARS (Salton): 273 Documents, 18 Queries

43
TREC
  • Text REtrieval Conference/Competition
  • Run by NIST (National Institute of Standards and
    Technology)
  • 1999 was the 8th year; the 9th TREC in early
    November
  • Collection: > 6 gigabytes (5 CD-ROMs), > 1.5
    million docs
  • Newswire and full-text news (AP, WSJ, Ziff, FT)
  • Government documents (Federal Register,
    Congressional Record)
  • Radio transcripts (FBIS)
  • Web subsets (the "Large Web" collection is
    separate, with 18.5 million pages of Web data,
    100 Gbytes)
  • Patents

44
TREC (cont.)
  • Queries and Relevance Judgments
  • Queries devised and judged by information
    specialists
  • Relevance judgments done only for those documents
    retrieved, not the entire collection!
  • Competition
  • Various research and commercial groups compete
    (TREC 6 had 51, TREC 7 had 56, TREC 8 had 66)
  • Results judged on precision and recall, going up
    to a recall level of 1000 documents
  • The following slides are from TREC overviews by
    Ellen Voorhees of NIST

45
(No Transcript)
46
(No Transcript)
47
(No Transcript)
48
(No Transcript)
49
(No Transcript)
50
Sample TREC Query (Topic)
<num> Number: 168
<title> Topic: Financing AMTRAK
<desc> Description: A document will address the
role of the Federal Government in financing the
operation of the National Railroad Transportation
Corporation (AMTRAK).
<narr> Narrative: A relevant document must provide
information on the government's responsibility to
make AMTRAK an economically viable entity. It
could also discuss the privatization of AMTRAK as
an alternative to continuing government subsidies.
Documents comparing government subsidies given to
air and bus transportation with those provided to
AMTRAK would also be relevant.
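A rough sketch, assuming topics arrive as text in this SGML-like layout, of pulling the four fields out of a topic (the helper and its regexes are illustrative, not official TREC tooling):

```python
import re

def parse_topic(text):
    """Extract number, title, description, and narrative from a TREC topic."""
    fields = {}
    for tag, label in [("num", "Number:"), ("title", "Topic:"),
                       ("desc", "Description:"), ("narr", "Narrative:")]:
        m = re.search(rf"<{tag}>\s*{label}(.*?)(?=<|\Z)", text, re.S)
        if m:
            fields[tag] = " ".join(m.group(1).split())   # collapse line breaks
    return fields

# parse_topic(topic_text)["title"]  ->  "Financing AMTRAK"
```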
51
(No Transcript)
52
(No Transcript)
53
(No Transcript)
54
(No Transcript)
55
(No Transcript)
56
TREC
  • Benefits
  • Made research systems scale to large collections
    (pre-WWW)
  • Allows for somewhat controlled comparisons
  • Drawbacks
  • Emphasis on high recall, which may be unrealistic
    for what most users want
  • Very long queries, also unrealistic
  • Comparisons still difficult to make, because
    systems are quite different on many dimensions
  • Focus on batch ranking rather than interaction
  • There is an interactive track

57
TREC is Changing
  • Emphasis on specialized tracks
  • Interactive track
  • Natural Language Processing (NLP) track
  • Multilingual tracks (Chinese, Spanish)
  • Filtering track
  • High-Precision
  • High-Performance
  • http://trec.nist.gov/

58
Blair and Maron 1985
  • A classic study of retrieval effectiveness
  • Earlier studies were on unrealistically small
    collections
  • Studied an archive of documents for a legal suit
  • 350,000 pages of text
  • 40 queries
  • Focus on high recall
  • Used IBM's STAIRS full-text system
  • Main result:
  • The system retrieved less than 20% of the
    relevant documents for a particular information
    need
  • Lawyers thought they had 75%
  • But many queries had very high precision

59
Blair and Maron (cont.)
  • How they estimated recall (sketched below)
  • Generated partially random samples of unseen
    documents
  • Had users (unaware these were random) judge them
    for relevance
  • Other results
  • The two lawyers' searches had similar performance
  • The lawyers' recall was not much different from
    the paralegals'
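A sketch of the sampling idea (the study's actual procedure was more involved; function and parameter names here are illustrative): judge a random sample of the documents the searches never returned, then extrapolate.

```python
import random

def estimate_recall(retrieved_relevant, unretrieved_docs, judge, sample_size=500):
    """Estimate recall by judging a random sample of unretrieved documents.

    retrieved_relevant -- number of retrieved docs already judged relevant
    unretrieved_docs   -- list of doc ids the searches never returned
    judge(doc_id)      -- True if the user judges the document relevant
    """
    sample = random.sample(unretrieved_docs, min(sample_size, len(unretrieved_docs)))
    rate = sum(judge(d) for d in sample) / len(sample)   # relevance rate in the sample
    missed = rate * len(unretrieved_docs)                # estimated relevant docs missed
    return retrieved_relevant / (retrieved_relevant + missed)
```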

60
Blair and Maron (cont.)
  • Why recall was low
  • Users can't foresee the exact words and phrases
    that will indicate relevant documents
  • "accident" referred to by those responsible as
    "event", "incident", "situation", "problem", ...
  • Differing technical terminology
  • Slang, misspellings
  • Perhaps the value of higher recall decreases as
    the number of relevant documents grows, so more
    detailed queries were not attempted once the
    users were satisfied