Title: A Framework for Human Evaluation of Search Engine Relevance
Slide 1: A Framework for Human Evaluation of Search Engine Relevance
- Kamal Ali
- Chi Chao Chang
- Yun-fang Juan
Slide 2: Outline
- The Problem
- Framework of Test Types
- Results
  - Expt. 1: Set-level versus Item-level
  - Expt. 2: Perceived Relevance versus Landing-Page Relevance
  - Expt. 3: Editorial Judges versus Panelist Judges
- Related Work
- Contributions
Slide 3: The Problem
- Classical IR: the Cranfield technique
- Search Engine Experience
  - Results UI, speed, advertising, spell suggestions, ...
- Search Engine Relevance
  - Users see summaries (abstracts), not documents
  - Heterogeneity of results: images, video, music, ...
  - User population more diverse
  - Information needs more diverse (commercial, entertainment)
- Human evaluation costs are high
- Holistic alternative: judge a set of results
Slide 4: Framework of Test Types
- Dimension 1: Query Category
  - Images, News, Product Research, Navigational, ...
- Dimension 2: Modality
  - Implicit (click behavior) versus Explicit (judgments)
- Dimension 3: Judge Class
  - Live users versus Panelists versus Domain experts
- Dimension 4: Granularity
  - Set-level versus Item-level
- Dimension 5: Depth
  - Perceived relevance versus Landing-Page (Actual) relevance
- Dimension 6: Relativity
  - Absolute judgments versus Relative judgments
Slide 5: Explored in This Work
- Dimension 1: Query Category
  - Random (mixed), Adult, How-to, Map, Weather, Biographical, Tourism, Sports, ...
- Dimension 2: Modality

| | Implicit/Behavioral | Explicit |
|---|---|---|
| Advantages | Direct user behavior | Can give reasons for behavior |
| Disadvantages | Click behavior etc. is ambiguous | Subject to self-reflection |
Slide 6: Explored in This Work
- Dimension 3: Judge Class

| | Live users | Panelists | Domain Experts |
|---|---|---|---|
| Advantages | End user; representative? | Participation; highly incented | Available; high quality |
| Disadvantages | Participation | Quality | Biased class? |

- Dimension 4: Granularity

| | Set-level | Item-level |
|---|---|---|
| Advantages | Rewards diversity; penalizes duplicates | Re-combinable |
| Disadvantages | Credit assignment; not recomposable | User sees a set, not a single item |
Slide 7: Explored in This Work
- Dimension 5: Depth

| | Perceived Relevance | Landing-Page Relevance |
|---|---|---|
| Advantages | User sees this first | User's final goal |
| Disadvantages | May be high while actual relevance is low | |

- Dimension 6: Relativity

| | Absolute Relevance | Relative Relevance |
|---|---|---|
| Advantages | Usable by search engine engineers to optimize | Easier for judges |
| Disadvantages | Harder for judges | Can't incorporate a 3rd engine easily; not re-combinable |
Slide 8: Experimental Methodology
- Random set of queries taken from Yahoo! Web logs
- Per-set:
  - Judge sees 10 items from a search engine
  - Gives one judgment for the entire set
  - Order is as supplied by the search engine
- Per-item:
  - Judge sees one item (document) at a time
  - Gives one judgment per item
  - Order is scrambled
- Side-by-side:
  - Judge sees 2 sets; sides are scrambled
  - Within each set, order is preserved
- (The three presentation modes are sketched below.)
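A minimal sketch of how the three presentation modes might be prepared, assuming results arrive as a ranked list per engine. The function names and data layout are illustrative, not from the paper.

```python
import random

def per_set(results):
    """Set-level: 10 items, engine order preserved; one judgment for the set."""
    return results[:10]

def per_item(results):
    """Item-level: items judged one at a time, in scrambled order."""
    items = results[:10]
    random.shuffle(items)  # hide rank position from the judge
    return items

def side_by_side(results_a, results_b):
    """Side-by-side: two sets with left/right assignment scrambled,
    but the within-set ranking of each engine preserved."""
    sides = [results_a[:10], results_b[:10]]
    random.shuffle(sides)  # judge cannot tell which engine is on which side
    return sides
```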
Slide 9: Expt. 1 Effect of Granularity: Set-level versus Item-level

Methodology
- Domain/expert editors
- Self-selection of queries
- Value 1: Per-set judgments, given on a 3-point scale (1 = best, 3 = worst), producing a discrete random variable
- Value 2: Item-level judgments; the 10 item-level judgments are rolled up to a single number (using the DCG roll-up function), and the DCG values are discretized into 3 bins/levels, producing the 2nd discrete random variable (the roll-up is sketched below)
- Look at the resulting 3×3 contingency matrix
- Compute the correlation between these variables
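A sketch of the roll-up step under common assumptions: standard DCG with the item grade as gain and log2 position discounting, followed by a 3-bin discretization. The paper does not specify its exact gain mapping or bin boundaries, so those below are placeholders.

```python
import math

def dcg(grades):
    """Roll 10 item-level grades up to one set-level number.
    Standard DCG: gain discounted by log2 of the rank position."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(grades))

def discretize(score, boundaries=(4.0, 8.0)):
    """Map a DCG score into 3 levels (3 = worst ... 1 = best).
    Bin boundaries here are illustrative, not from the paper."""
    if score < boundaries[0]:
        return 3
    if score < boundaries[1]:
        return 2
    return 1

# Example: 10 item grades (2 = relevant, 1 = partial, 0 = irrelevant)
grades = [2, 2, 1, 0, 2, 0, 0, 1, 0, 0]
level = discretize(dcg(grades))
```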
Slide 10: Expt. 1 Effect of Granularity: Set-level versus Item-level (Images)
- Domain 1: Search at Image site or Image tab
- 299 image queries; 2-3 judges per query; 6856 judgments
- 198 queries in common between the set-level and item-level tests
- 20 images (in a 4×5 matrix) shown per query

| | SET1 | SET2 | SET3 | Marginal |
|---|---|---|---|---|
| DCG1 | 130 | 18 | 1 | 149 (75%) |
| DCG2 | 16 | 13 | 7 | 36 (18%) |
| DCG3 | 3 | 5 | 5 | 13 (7%) |
| Marginal | 149 (75%) | 36 (18%) | 13 (7%) | 198; r = 0.54 |
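The correlation can be reproduced from the published 3×3 table by expanding the cell counts back into paired observations and applying Spearman's rank correlation, which handles the heavy ties via midranks. This is a reconstruction, not the authors' code; it lands near the reported 0.54.

```python
import numpy as np
from scipy.stats import spearmanr

# Rows: item-level DCG bin (1..3); columns: set-level grade (1..3).
table = np.array([[130, 18, 1],
                  [ 16, 13, 7],
                  [  3,  5, 5]])

# Expand counts into the 198 (dcg_bin, set_grade) pairs they summarize.
dcg, set_ = zip(*[(i + 1, j + 1)
                  for i in range(3) for j in range(3)
                  for _ in range(table[i, j])])

rho, p = spearmanr(dcg, set_)
print(f"Spearman rho = {rho:.2f}")  # ~0.52, close to the slide's 0.54
```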
Slide 11: Expt. 1 Effect of Granularity: Set-level versus Item-level (Images)
- Interpretation of results
  - Spearman correlation is a middling 0.54
- Look at outlier queries
  - "Hollow Man": high set-level score, low item-level score
    - Most items were irrelevant, explaining the low item-level DCG; recall that judges were seeing images one at a time, in scrambled order
    - At set level, the eye can quickly (in parallel) find the relevant image, so the judgment is less sensitive to irrelevant images
  - In other outliers, the ranking function was poor, leading to a low DCG score; unusual, since normally it is the set-level judgment that picks out poor ranking
Slide 12: Expt. 2 Effect of Depth: Perceived Relevance vs. Landing Page
- Fix granularity at item level
- Perceived relevance:
  - Title, abstract (summary), and URL shown (T.A.U.)
  - Judgment 1: relevance of title
  - Judgment 2: relevance of abstract
- Click on URL to reach the landing page
  - Judgment 3: relevance of landing page
Slide 13: Expt. 2 Effect of Depth: Perceived vs. Actual (Advertisements)
- Domain 1: Search at News site
- Created a compound random variable for perceived relevance: ANDed title relevance and abstract relevance (see the sketch below)
- Correlated it with landing-page (actual) relevance
- Higher correlation: news titles/abstracts are carefully constructed

| | landingPgNR | landingPgR | Marginal |
|---|---|---|---|
| perceivedNR | 511 | 62 | 573 (20%) |
| perceivedR | 22 | 2283 | 2305 (80%) |
| Marginal | 533 (19%) | 2345 (81%) | 2878; r = 0.91 |
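For a 2×2 table like this one, the natural correlation measure is the phi coefficient, which equals Pearson's r on the two binary variables. The sketch below builds the compound perceived variable as the AND of the two judgments and computes phi from the published counts; the variable names are mine, and the counts reproduce the reported 0.91.

```python
import math

# Compound perceived relevance: relevant only if BOTH the title and the
# abstract were judged relevant (the AND described on the slide).
def perceived(title_relevant: bool, abstract_relevant: bool) -> bool:
    return title_relevant and abstract_relevant

# 2x2 counts from the slide: rows = perceived (NR, R), cols = landing page (NR, R).
a, b = 511, 62     # perceivedNR
c, d = 22, 2283    # perceivedR

# Phi coefficient = Pearson's r for two binary variables.
phi = (a * d - b * c) / math.sqrt((a + b) * (c + d) * (a + c) * (b + d))
print(f"phi = {phi:.2f}")  # ~0.91, matching the slide
```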
Slide 14: Expt. 3 Effect of Judge Class: Editors versus Panelists

Methodology
- 1000 randomly selected queries (frequency-biased sampling)
- 40 judges; a few hundred panelists
- Panelists:
  - Recruitment: long-standing panel
  - Reward: gift certificate
  - Remove panelists that completed the test too quickly or missed sentinel questions (see the sketch after this list)
- Test configuration: Query = Random (mixed), Modality = Explicit, Granularity = Set-level, Depth = Mixed, Relativity = Relative
- Questions:
  - 1. Does judge class affect the overall conclusion on which search engine is better?
  - 2. Are there particular types of queries for which significant differences exist?
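A sketch of the kind of quality filter described above, assuming per-panelist completion times and sentinel-question outcomes are logged. The thresholds and field names are illustrative; the paper does not publish its exact filter.

```python
# Hypothetical record layout for panelist sessions.
panelists = [
    {"id": "p01", "minutes": 22.0, "sentinels_passed": 5, "sentinels_total": 5},
    {"id": "p02", "minutes": 3.5,  "sentinels_passed": 5, "sentinels_total": 5},
    {"id": "p03", "minutes": 25.0, "sentinels_passed": 2, "sentinels_total": 5},
]

MIN_MINUTES = 10.0       # finished too quickly -> likely clicking through
MIN_SENTINEL_RATE = 0.8  # missed sentinel questions -> not reading

def keep(p):
    """Keep a panelist only if they spent enough time and passed the sentinels."""
    return (p["minutes"] >= MIN_MINUTES
            and p["sentinels_passed"] / p["sentinels_total"] >= MIN_SENTINEL_RATE)

valid = [p for p in panelists if keep(p)]  # keeps only p01 here
```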
Slide 15: Expt. 3 Effect of Judge Class: Editors versus Panelists
- Column percentages give p(P | E): overall, p(P = eng1) = .28, but p(P = eng1 | E = eng1) = .35. Lift = .35 / .28 = 1.25, a modest 25% lift.

| (column %) | Row marginal | E engine1 | E neutral | E engine2 |
|---|---|---|---|---|
| P engine1 | 28 | 35 | 26 | 16 |
| P neutral | 47 | 47 | 49 | 48 |
| P engine2 | 25 | 17 | 25 | 35 |
Slide 16: Expt. 3 Effect of Judge Class: Editors versus Panelists
- Editor marginal distribution: roughly (.33, .33, .33)
- Panelists are less likely to discern a difference: (.25, .50, .25)
- Given that P favors Eng1, E is more likely to favor Eng1 than when P favors Eng2.

| (row %) | E engine1 | E neutral | E engine2 |
|---|---|---|---|
| Col. marginal | 37 | 32 | 30 |
| P engine1 | 49 | 32 | 19 |
| P neutral | 36 | 34 | 30 |
| P engine2 | 25 | 33 | 42 |
Slide 17: Expt. 3 Effect of Judge Class: Correlation, Association
- Linear model: r² = 0.03
- 3×3 categorical model: φ = 0.29
- Test for non-zero association: χ² = 16.3, significant at the 99.5% level (see the sketch below)
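These statistics can be reproduced from a raw 3×3 count table with scipy. The counts below are placeholders (the slides report only percentages), so this illustrates only the computation of χ² and a φ-style (Cramér's) association measure, not the actual reported values.

```python
import numpy as np
from scipy.stats import chi2_contingency

# HYPOTHETICAL counts: rows = panelist verdict, cols = editor verdict.
# The slides give only percentages, so these numbers are illustrative.
counts = np.array([[40, 20, 10],
                   [35, 45, 30],
                   [15, 25, 30]])

chi2, p, dof, expected = chi2_contingency(counts)

# Phi-style association for an r x c table (Cramer's phi):
n = counts.sum()
phi = np.sqrt(chi2 / (n * (min(counts.shape) - 1)))
print(f"chi2 = {chi2:.1f}, dof = {dof}, p = {p:.4f}, phi = {phi:.2f}")
```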
Slide 18: Expt. 3 Effect of Judge Class: Qualitative Analysis
- Editors' top feedback:
  - Ranking not as good (precision)
  - Both equally good (precision)
  - Relevant sites missing (recall)
  - Perfect site missing (recall)
- Panelists' top feedback:
  - Both equally good (precision)
  - Too general (precision)
  - Both equally bad (precision)
  - Ranking not as good (precision)
- Panelists need to see the other SE to penalize poor recall.
Slide 19: Related Work
- Mizzaro
  - Framework with 3 key dimensions of evaluation:
    - Information needs (expression level)
    - Information resources (T.A.U., documents, ...)
    - Information context
  - High level of disagreement among judges
- Amento et al.
  - Correlation analysis
  - Expert judges
  - Automated metrics: in-degree, PageRank, page size, ...
Slide 20: Contributions
- Framework of test types
- Set-level to item-level correlation
  - Middling: the two granularities measure different aspects of relevance
  - Set-level measures aspects missed by per-item judging: poor ranking, duplicates, missed senses
- Perceived relevance to actual relevance
  - Higher correlation, perhaps because the domain was News
- Editorial judges versus panelists
  - Panelists sit on the fence more
  - Panelists focus more on precision; they need to see the other SE to judge recall
  - Panelist methodology and reward structure are crucial