1
IR Evaluation
  • Evaluates the performance of an IR system:
  • Retrieval accuracy
  • Retrieval efficiency
  • User satisfaction
  • May seem a secondary issue, but is crucial to progress
  • Rigorous evaluations were initiated by Cleverdon in the 1960s

2
IR Evaluation Concepts
A   retrieved documents
R   relevant documents

[Figure: Venn diagram of the retrieved set A overlapping the relevant set R]

FA  false alarms (retrieved but not relevant: A - R)
T   true positives (retrieved and relevant: A ∩ R)
M   missed (relevant but not retrieved: R - A)
3
Primary Evaluation Metrics
  • Recall = T/R: the fraction of the relevant documents that are retrieved
  • Precision = T/A: the fraction of the retrieved documents that are relevant
  • These definitions are not always practical:
  • R may not be known
  • We are often more interested in the distribution of precision over the
    ranking than in its final value (see the sketch below)
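
A minimal Python sketch of these two definitions, using hypothetical
document IDs (here the sets A and R are assumed to be fully known, which,
as noted above, is not always practical):

  # Recall and precision from the sets A (retrieved) and R (relevant)
  A = {"D123", "D084", "D056", "D006", "D008"}   # retrieved (hypothetical)
  R = {"D003", "D056", "D129"}                   # relevant (hypothetical)

  T = A & R                     # true positives: retrieved and relevant
  recall = len(T) / len(R)      # T/R: fraction of relevant docs retrieved
  precision = len(T) / len(A)   # T/A: fraction of retrieved docs relevant
  print(recall, precision)      # 0.333..., 0.2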

4
Recall-Precision Correlation
  • R = {D003, D056, D129}
  • A (in ranked retrieval order):
  • D123  P=0.00  R=0.00
  • D084  P=0.00  R=0.00
  • D056  P=0.33  R=0.33
  • D006  P=0.25  R=0.33
  • D008  P=0.20  R=0.33
  • D009  P=0.16  R=0.33
  • D511  P=0.14  R=0.33
  • D129  P=0.25  R=0.66
  • D187  P=0.22  R=0.66
  • ...
  • D003  P=0.20  R=1.00

5
Recall-Precision Curve
[Figure: recall-precision curve for Example 1, with the highest precision
points marked; precision axis ticks at .33, .30, .25, .22, .20, .16, .14;
recall axis ticks at .33, .66, 1.00]
6
Interpolated R-P Curves
  • Measure precision at selected recall levels
  • Reference intervals at 10%, 20%, etc.
  • Interpolation reflects maximum system capability
  • modulo the ordering within a reference interval
  • Takes the maximum precision within each interval
  • Also measures performance over a set of queries:
  • average over the queries at the reference intervals

7
Interpolating Precision
  • Select reference intervals
  • Usually 0, 10, 20, 30, ..., 90, 100 (%)
  • There are 11 reference points
  • Calculate the precision at each reference point:
  • int(P, I) = maximum actual P over recall levels in [I, 1.0]
  • Extend left from the first non-zero value (see the sketch below)
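
A short Python sketch of this rule, under the reading above (the
interpolated precision at reference level I is the maximum actual
precision at any recall level >= I); the (recall, precision) pairs are
taken from Example 1:

  def interpolate_11pt(points):
      # points: (recall, precision) pairs observed at relevant documents
      levels = [i / 10 for i in range(11)]   # 0.0, 0.1, ..., 1.0
      return [max((p for r, p in points if r >= level), default=0.0)
              for level in levels]

  points = [(0.33, 0.33), (0.66, 0.25), (1.00, 0.20)]   # Example 1
  print(interpolate_11pt(points))
  # [0.33, 0.33, 0.33, 0.33, 0.25, 0.25, 0.25, 0.2, 0.2, 0.2, 0.2]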

8
Interpolated Recall-Precision
  • D123  R=0.00
  • D084  R=0.00
  •   P=0.33 at R=0.00 (extended left from the first non-zero value)
  •   P=0.33 at R=0.10
  •   P=0.33 at R=0.20
  •   P=0.33 at R=0.30
  • D056  P=0.33  R=0.33  (max actual P for R ≥ 0.33)
  • D006  R=0.33
  • D008  R=0.33
  • D009  R=0.33
  • D511  R=0.33
  •   P=0.25 at R=0.40
  •   P=0.25 at R=0.50
  •   P=0.25 at R=0.60
  • D129  P=0.25  R=0.66  (max actual P for R ≥ 0.66)
  • D187  R=0.66
  •   P=0.20 at R=0.70
  •   P=0.20 at R=0.80
  •   P=0.20 at R=0.90
  • D003  P=0.20  R=1.00  (max actual P for R ≥ 1.00)

9
Interpolated Recall-Precision Curve
[Figure: interpolated recall-precision curve for Example 1, with the
highest precision points marked; precision axis ticks at .33, .30, .25,
.22, .20, .16, .14; recall axis ticks at 0, 10, 20, 30, 33, 40, 50, 60,
66, 70, 80, 90, 100 (%)]
10
Recall-Precision Example 2
  • R = {D003, D056, D129}
  • A (in ranked retrieval order):
  • D123  P=0.00  R=0.00
  • D084  P=0.00  R=0.00
  • D056  P=0.33  R=0.33
  • D006  P=0.25  R=0.33
  • D008  P=0.20  R=0.33
  • D009  P=0.16  R=0.33
  • D511  P=0.14  R=0.33
  • D129  P=0.25  R=0.66
  • D187  P=0.22  R=0.66
  • D003  P=0.30  R=1.00

11
Interpolated Recall-Precision Curve
[Figure: interpolated recall-precision curve for Example 2, with the
highest precision points marked; precision axis ticks at .33, .30, .25,
.22, .20, .16, .14; recall axis ticks at 0, 10, 20, 30, 33, 40, 50, 60,
66, 70, 80, 90, 100 (%)]
12
Interpolated R-P Average
Retrieved: 1000    Relevant: 125 (← many relevant docs)    Rel_ret: 77
Interpolated Recall - Precision (the 11-point average):
  at 0.00   1.0000 (← an easier query?)
  at 0.10   0.9375
  at 0.20   0.5102
  at 0.30   0.3125
  at 0.40   0.2167
  at 0.50   0.1280
  at 0.60   0.0787
  at 0.70   0.0000
  at 0.80   0.0000
  at 0.90   0.0000
  at 1.00   0.0000
Average precision (non-interpolated) over all rel docs: 0.2647
(see the sketch below)
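
The "average precision (non-interpolated) over all rel docs" figure is
the mean of the precision values at the ranks where relevant documents
appear, with missed relevant documents counting as zero. A minimal
Python sketch over the hypothetical ranking of Example 2:

  def average_precision(ranking, relevant):
      hits, total = 0, 0.0
      for rank, doc in enumerate(ranking, start=1):
          if doc in relevant:
              hits += 1
              total += hits / rank     # precision at this relevant doc
      return total / len(relevant)     # missed relevant docs add 0

  ranking = ["D123", "D084", "D056", "D006", "D008",
             "D009", "D511", "D129", "D187", "D003"]
  print(average_precision(ranking, {"D003", "D056", "D129"}))
  # (1/3 + 2/8 + 3/10) / 3 = 0.2944...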
13
Interpolated R-P Average
Total number of documents over all queries:
Retrieved: 1000    Relevant: 17 (← few relevant docs)    Rel_ret: 2
Interpolated Recall - Precision Averages:
  at 0.00   0.0556 (← a tough query?)
  at 0.10   0.0556
  at 0.20   0.0000
  at 0.30   0.0000
  at 0.40   0.0000
  at 0.50   0.0000
  at 0.60   0.0000
  at 0.70   0.0000
  at 0.80   0.0000
  at 0.90   0.0000
  at 1.00   0.0000
Average precision (non-interpolated) over all rel docs: 0.0052
14
Recall-Precision Averages
Total number of documents over all queries (50):
Retrieved: 50000    Relevant: 4674    Rel_ret: 1998
Interpolated Recall - Precision Averages (11 points of reference):
  at 0.00   0.8140
  at 0.10   0.5691
  at 0.20   0.3672
  at 0.30   0.2665
  at 0.40   0.1607
  at 0.50   0.0993
  at 0.60   0.0528
  at 0.70   0.0292
  at 0.80   0.0039
  at 0.90   0.0025
  at 1.00   0.0000
Average precision (non-interpolated) over all rel docs: 0.1898
(11-point average)
15
Recall-Precision Curve
[Figure: the averaged recall-precision curve for these runs]
16
Precision at document counts
  • Measures precision after the first r documents have been retrieved,
    for r = 5, 10, 15, etc. (see the sketch below)
  • D123  P=0.00  R=0.00
  • D084  P=0.00  R=0.00
  • D056  P=0.33  R=0.33  (R-precision = 0.33)
  • D006  P=0.25  R=0.33
  • D008  P=0.20  R=0.33
  • D009  P=0.16  R=0.33
  • D511  P=0.14  R=0.33
  • D129  P=0.25  R=0.66
  • D187  P=0.22  R=0.66
  • D038  P=0.20  R=0.66
  • ...
  • D003  P=0.20  R=1.00
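
A sketch of precision at fixed document cutoffs and of R-precision,
using the hypothetical ranking above:

  def precision_at(ranking, relevant, k):
      # precision after the first k documents have been retrieved
      return sum(1 for doc in ranking[:k] if doc in relevant) / k

  ranking = ["D123", "D084", "D056", "D006", "D008",
             "D009", "D511", "D129", "D187", "D038"]
  relevant = {"D003", "D056", "D129"}

  print(precision_at(ranking, relevant, 5))    # at 5 docs: 0.2
  print(precision_at(ranking, relevant, 10))   # at 10 docs: 0.2
  # R-precision: precision after exactly R = |relevant| documents
  print(precision_at(ranking, relevant, len(relevant)))   # 0.333...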

17
Non-interpolated R-Precision
Retrieved: 1000    Relevant: 125    Rel_ret: 77
Precision:
  At 5 docs:     1.0000
  At 10 docs:    1.0000
  At 15 docs:    0.9333
  At 20 docs:    0.8500
  At 30 docs:    0.7000
  At 100 docs:   0.3300
  At 200 docs:   0.2350
  At 500 docs:   0.1260
  At 1000 docs:  0.0770
R-Precision (precision after R = num_rel docs retrieved for a query):
  Exact: 0.2960
18
Further Evaluation Metrics
  • Fallout (also called false-alarm rate)
  • Measures the system's ability to filter out non-relevant documents
  • FR = FA / N, where N is the number of all non-relevant documents in
    the collection
  • More stable than precision: it does not depend on the size of R
  • Can be interpolated at the recall reference points
  • Since FR is usually small, log(FR) is plotted
  • F-score
  • A single-value score combining precision and recall:
  • F = 2PR / (P + R) (see the sketch below)
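
A sketch of both metrics; the counts below are illustrative assumptions,
not values from the slides:

  def fallout(false_alarms, num_nonrelevant):
      # FA divided by the number of all non-relevant documents
      return false_alarms / num_nonrelevant

  def f_score(precision, recall):
      # F = 2PR / (P + R), the harmonic mean of precision and recall
      if precision + recall == 0:
          return 0.0
      return 2 * precision * recall / (precision + recall)

  print(fallout(false_alarms=7, num_nonrelevant=3201))   # 0.00218...
  print(f_score(precision=0.30, recall=1.00))            # 0.4615...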

19
Additional Metrics
  • Miss Rate
  • Measures the probability that the system misses relevant documents
  • MR = (R - A) / R, i.e., M / R (see the sketch below)
  • Can be interpolated vs. recall or false-alarm rate
  • ROC Curves
  • Receiver Operating Characteristic curve
  • Plots the cost of running a system as a function of, e.g., false
    alarms and misses
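
A one-line sketch of the miss rate on the same hypothetical sets used
earlier:

  A = {"D123", "D084", "D056", "D006", "D008"}   # retrieved (hypothetical)
  R = {"D003", "D056", "D129"}                   # relevant (hypothetical)
  miss_rate = len(R - A) / len(R)                # MR = M / R
  print(miss_rate)                               # 2/3 = 0.666...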

20
ROC Curve of Signal Detection
[Figure: ROC curve from signal detection theory]
21
Utility
  • Combines rank information with the value of items:
  • for retrieved items, the value to the user
  • for missed and false-alarm items, the cost incurred
  • for fallout avoided, the cost savings
  • U = (R+ · A) + (N+ · B) + (R- · C) + (N- · D), where R+ / N+ are the
    relevant / non-relevant documents retrieved, R- / N- those not
    retrieved, and A, B, C, D the per-item values or costs (see the
    sketch below)
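
A sketch of this linear utility model under the reading above; the
counts and the weights A, B, C, D are illustrative assumptions:

  def utility(r_plus, n_plus, r_minus, n_minus, A, B, C, D):
      # r_plus:  relevant documents retrieved      (value A each)
      # n_plus:  non-relevant documents retrieved  (cost B each)
      # r_minus: relevant documents missed         (cost C each)
      # n_minus: non-relevant documents rejected   (savings D each)
      return r_plus * A + n_plus * B + r_minus * C + n_minus * D

  # e.g., +3 per relevant doc retrieved, -1 per false alarm or miss
  print(utility(r_plus=2, n_plus=8, r_minus=1, n_minus=989,
                A=3, B=-1, C=-1, D=0))   # 2*3 - 8 - 1 = -3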

22
Reference Collections
  • Classical collections
  • CACM, ISI, Cranfield
  • Small (1-10 MB): everything works
  • Complete manual relevance judgments
  • TREC (Text Retrieval Conference)
  • Created in 1992
  • Now 5 GB (2 million documents), 500 queries
  • Judgments obtained through the pooling method
  • TDT (Topic Detection and Tracking)
  • TDT2: 60K broadcast stories, 100 topics
  • High-quality judgments obtained in multiple iterations

23
CACM-3204 Collection
  • 3204 abstracts of Comm. ACM papers
  • Documents are short; some are just a title
  • Short queries (12 words on average)
  • On average 15 relevant documents per query
  • One of the standard pre-TREC collections
  • Reputation: fairly easy

24
Performance Comparisons
25
Significance of Comparison
  • 11-pt average precision is frequently used as a single-value metric
  • Sparck Jones chart:
  • < 5% difference: not noticeable (noise)
  • < 10%: noticeable, but not significant
  • > 10%: significant (material)
  • Depends upon collection characteristics
  • The chi-squared test (χ²) measures deviation from expectation
  • Measured at significance level 0.05 or 0.01 (2-tailed)

26
Significance Testing
  • Assume all systems' performance is identical
  • Compare observed precision values against expected ones
  • HA: 38, 39, 64 at recall level 40% (observed)
  • H0: 47, 47, 47 (the null hypothesis: the common mean)
  • χ² measures the deviation of the sample from H0:
  • χ² = Σ (vo - ve)² / ve, where vo is observed and ve expected
    (see the sketch below)
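
A sketch of the computation for the numbers on this slide:

  observed = [38, 39, 64]                    # HA, at recall level 40%
  expected = sum(observed) / len(observed)   # H0: the common mean, 47.0
  chi2 = sum((vo - expected) ** 2 / expected for vo in observed)
  print(chi2)   # (81 + 64 + 289) / 47 = 9.23...
  # 9.23 exceeds 5.99, the 0.05 critical value for 2 degrees of
  # freedom, so the deviation is significant at that level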

27
How to set up IR experiments
  • An IR system to be tested
  • A document collection
  • size, type of material
  • a training sub-collection (80%) and a testing sub-collection (20%)
  • A set of queries
  • Best if obtained from real users, e.g., from the Web
  • Relevance judgments
  • Must be done objectively
  • Use multiple assessors and compare their results

28
Pooling method for Qrels
  • Take the top K (e.g., 100) documents for each query from each system
  • Remove duplicates and note the overlap
  • Have the pool judged manually (see the sketch below)
  • Assume documents not in the pool are not relevant
  • Assumption: most relevant documents are in the pool
  • Small difference for N = 100
  • No significant difference for N = 200
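
A minimal sketch of pool construction; the run contents are hypothetical:

  def build_pool(runs, k=100):
      # runs: one ranked list of document IDs per participating system
      pool = set()
      for ranking in runs:
          pool.update(ranking[:k])   # keep the top K from each system
      return pool                    # the set removes duplicates

  run_a = ["D056", "D123", "D129"]   # hypothetical system outputs
  run_b = ["D129", "D003", "D777"]
  print(sorted(build_pool([run_a, run_b])))
  # documents outside the pool are assumed non-relevant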

29
Formal Evaluations
  • TREC: IR
  • TDT: filtering
  • MUC / DUC: information extraction
  • Measures recall and precision in filling data templates with
    information extracted from text
  • SUMMAC: automated summarization
  • Can a summary substitute for the original?
  • Would a human select the same sentences?
  • How much time is needed to comprehend a summary?