A Framework for Human Evaluation of Search Engine Relevance

1
A Framework for Human Evaluation of Search Engine Relevance
  • Kamal Ali
  • Chi Chao Chang
  • Yun-fang Juan

2
Outline
  • The Problem
  • Framework of Test Types
  • Results
  • Expt. 1: Set-level versus Item-level
  • Expt. 2: Perceived Relevance versus Landing-Page Relevance
  • Expt. 3: Editorial Judges versus Panelist Judges
  • Related Work
  • Contributions

3
The Problem
  • Classical IR: Cranfield technique
  • Search Engine Experience
  • Results UI, speed, advertising, spell suggestions, ...
  • Search Engine Relevance
  • Users see summaries (abstracts), not documents
  • Heterogeneity of results: images, video, music, ...
  • User population more diverse
  • Information needs more diverse (commercial,
    entertainment)
  • Human Evaluation costs are high
  • Holistic alternative: judge a set of results

4
Framework of test types
  • Dimension 1: Query Category
  • Images, News, Product Research, Navigational, ...
  • Dimension 2: Modality
  • Implicit (click behavior) versus Explicit (judgments)
  • Dimension 3: Judge Class
  • Live users versus Panelists versus Domain experts
  • Dimension 4: Granularity
  • Set-level versus item-level
  • Dimension 5: Depth
  • Perceived relevance versus Landing-Page (Actual) relevance
  • Dimension 6: Relativity
  • Absolute judgments versus Relative judgments

5
Explored in this work
Dimension 1: Query Category
  Random (mixed), Adult, How-to, Map, Weather, Biographical, Tourism, Sports, ...

Dimension 2: Modality

                 Implicit/Behavioral                  Explicit
  Advantages     Direct user behavior                 Can give reason for behavior
  Disadvantages  Click behavior etc. is ambiguous     Subject to self-reflection

6
Explored in this work
Dimension 3: Judge Class

                 Live                        Panelists                        Domain Experts
  Advantages     End user; representative?   Participation; highly incented   Available; high quality
  Disadvantages  Participation               Quality                          Biased class?

Dimension 4: Granularity

                 Set-level                                   Item-level
  Advantages     Rewards diversity; penalizes duplicates     Re-combinable
  Disadvantages  Credit assignment; not recomposable         User sees set, not single item

7
Explored in this work
Dimension 5: Depth

                 Perceived Relevance                                   Landing-Page Relevance
  Advantages     User sees this first                                  User's final goal
  Disadvantages  Perceived relevance may be high, actual may be low

Dimension 6: Relativity

                 Absolute Relevance                                Relative Relevance
  Advantages     Usable by search engine engineers to optimize     Easier for judges to judge
  Disadvantages  Harder for judges                                 Can't incorporate a 3rd engine easily; not re-combinable

8
Experimental Methodology
  • Random set of queries taken from Yahoo! Web logs
  • Per-set
  • Judge sees 10 items from a search engine
  • Gives one judgment for entire set
  • Order is as supplied by search engine
  • Per-item
  • Judge sees one item (document) at a time
  • Gives one judgment per item
  • Order is scrambled
  • Side-by-side
  • Judge sees 2 sets; sides are scrambled
  • Within each set, order is preserved
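
A minimal Python sketch of how these three presentation modes could be generated from ranked result lists; the data structures and function names are illustrative, not the actual judging tool.

import random

def per_set_task(results):
    # Set-level: judge sees the top 10 items in engine order, gives one judgment.
    return {"mode": "per-set", "items": list(results[:10])}

def per_item_tasks(results):
    # Item-level: judge sees one item at a time; presentation order is scrambled.
    items = list(results[:10])
    random.shuffle(items)
    return [{"mode": "per-item", "item": item} for item in items]

def side_by_side_task(results_a, results_b):
    # Side-by-side: which engine appears on which side is scrambled,
    # but the ranking order within each set is preserved.
    sides = [("engine_a", list(results_a[:10])), ("engine_b", list(results_b[:10]))]
    random.shuffle(sides)
    (left_name, left_items), (right_name, right_items) = sides
    return {"mode": "side-by-side",
            "left":  {"engine": left_name,  "items": left_items},
            "right": {"engine": right_name, "items": right_items}}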

9
Expt. 1: Effect of Granularity (Set-level versus Item-level)
Methodology
  • Domain/expert editors
  • Self-selection of queries
  • Value 1: Per-set judgments, given on a 3-point scale
    (1 = best, 3 = worst), thus producing a discrete random
    variable
  • Value 2: Item-level judgments; the 10 item-level judgments
    are rolled up to a single number (using a DCG roll-up
    function)
  • DCG values are discretized into 3 bins/levels, thus
    producing the 2nd discrete random variable
  • Look at the resulting 3×3 contingency matrix
  • Compute the correlation between these variables (sketched
    below)
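
A minimal sketch of this roll-up and correlation step, assuming the standard logarithmic-discount form of DCG and illustrative bin boundaries (the exact discount and thresholds are not given on the slides):

import math
from scipy.stats import spearmanr

def dcg(item_judgments):
    # Roll the 10 item-level relevance judgments (in engine rank order)
    # up into one number with a position-discounted sum.
    return sum(rel / math.log2(rank + 2)          # rank 0 -> log2(2) = 1
               for rank, rel in enumerate(item_judgments))

def discretize(value, hi, lo):
    # Map a DCG value into 3 ordinal levels (1 = best, 3 = worst),
    # mirroring the 3-point set-level scale; thresholds are illustrative.
    return 1 if value >= hi else (2 if value >= lo else 3)

def granularity_correlation(set_scores, item_judgments_per_query, hi, lo):
    # set_scores: one 1-3 set-level judgment per query
    # item_judgments_per_query: the 10 item-level judgments for each query
    dcg_levels = [discretize(dcg(j), hi, lo) for j in item_judgments_per_query]
    rho, _ = spearmanr(set_scores, dcg_levels)
    return rho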

10
Expt. 1: Effect of Granularity (Set-level versus Item-level): Images
  • Domain 1: Search at Image site or Image tab
  • 299 image queries; 2-3 judges per query; 6,856 judgments
  • 198 queries in common between set-level and item-level
    tests
  • 20 images (in a 4×5 matrix) shown per query

            SET1        SET2        SET3        Marginal
  DCG1      130         18          1           149 (75%)
  DCG2      16          13          7           36  (18%)
  DCG3      3           5           5           13  (7%)
  Marginal  149 (75%)   36 (18%)    13 (7%)     198         r = 0.54

11
Expt. 1: Effect of Granularity (Set-level versus Item-level): Images
  • Interpretation of Results
  • Spearman Correlation is a middling 0.54
  • Look at outlier queries
  • "Hollow Man": high set-level score, low item-level score
  • Most items were irrelevant, explaining the low item-level
    DCG; recall that judges were seeing images one at a time,
    in scrambled order
  • Set-level: the eye can quickly (in parallel) spot a
    relevant image, so the judgment is less sensitive to
    irrelevant images
  • Set-level: the ranking function was poor, leading to a low
    DCG score. Unusual, since normally the set-level judgment
    picks out poor ranking.

12
Expt. 2: Effect of Depth (Perceived Relevance vs. Landing Page)
  • Fix granularity at Item level
  • Perceived Relevance
  • Title, abstract (summary) and URL shown (T.A.U.)
  • Judgment 1: Relevance of title
  • Judgment 2: Relevance of abstract
  • Click on URL to reach Landing Page
  • Judgment 3: Relevance of Landing Page

13
Expt. 2: Effect of Depth (Perceived vs. Actual): Advertisements
  • Domain 1: Search at News site
  • Created a compound random variable for Perceived
    Relevance: AND'd Title Relevance and Abstract Relevance
  • Correlated it with Landing-Page (Actual) Relevance
    (reproduced in the sketch after the table)
  • Higher correlation: News titles/abstracts are carefully
    constructed

               landingPgNR    landingPgR    Marginal
  perceivedNR  511            62            573  (20%)
  perceivedR   22             2283          2305 (80%)
  Marginal     533 (19%)      2345 (81%)    2878        r = 0.91
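
Assuming the reported r is the Pearson correlation of the two binary variables (equivalently, the phi coefficient of the 2×2 table), it can be reproduced directly from the counts above:

import math

# Counts from the table above
a, b = 511, 62      # perceived NR: landing page NR / R
c, d = 22, 2283     # perceived R:  landing page NR / R

# Phi coefficient = Pearson correlation for two binary variables.
phi = (a * d - b * c) / math.sqrt((a + b) * (c + d) * (a + c) * (b + d))
print(round(phi, 2))   # 0.91, matching the r reported above
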
14
Expt. 3: Effect of Judge Class (Editors versus Panelists)
Methodology
  • 1000 randomly selected queries (frequency-biased
    sampling)
  • 40 judges, a few hundred panelists
  • Panelists
  • Recruitment: long-standing panel
  • Reward: gift certificate
  • Remove panelists who completed the test too quickly
    or missed sentinel questions
  • Test setup: Query = Random, Modality = Explicit,
    Granularity = Set-level, Depth = Mixed,
    Relativity = Relative
  • Questions
  • 1. Does judge class affect overall conclusion on
    which Search engine is better?
  • 2. Are there particular types of queries for
    which significant differences exist?

15
Expt. 3: Effect of Judge Class (Editors versus Panelists)
  • Column percentages p(P | E): p(P = eng1) = .28, but
    p(P = eng1 | E = eng1) = .35. Lift = .35 / .28 = 1.25,
    a modest 25% lift (computation sketched after the table)

               Row marginal   E=engine1   E=neutral   E=engine2
  P=engine1    28%            35%         26%         16%
  P=neutral    47%            47%         49%         48%
  P=engine2    25%            17%         25%         35%
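
A small sketch of the lift computation; the slides give only percentages, not raw counts, so it works directly from the two probabilities quoted above:

# p(P = eng1): marginal probability that a panelist prefers engine 1
p_p_eng1 = 0.28
# p(P = eng1 | E = eng1): same event, conditioned on the editor preferring engine 1
p_p_eng1_given_e_eng1 = 0.35

# Lift > 1 means editor and panelist preferences are positively associated.
lift = p_p_eng1_given_e_eng1 / p_p_eng1
print(round(lift, 2))   # 1.25, i.e. the modest 25% lift reported above
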
16
Expt. 3: Effect of Judge Class (Editors versus Panelists)
  • Editor marginal distribution: roughly (.33, .33, .33)
  • Panelists are less likely to discern a difference:
    roughly (.25, .50, .25)
  • Given P favors Eng1, E are more likely to favor
    Eng1 than if P favors Eng2.

Row percentages p(E | P):

                 E=engine1   E=neutral   E=engine2
  Col. marginal  37%         32%         30%
  P=engine1      49%         32%         19%
  P=neutral      36%         34%         30%
  P=engine2      25%         33%         42%

17
Expt. 3: Effect of Judge Class: Correlation, Association
  • Linear model: r² = 0.03
  • 3×3 categorical model: φ = 0.29
  • Test for non-zero association: χ² = 16.3, significant at
    the 99.5% level (checked in the sketch below)
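
A quick check of the reported statistic, assuming the usual chi-squared test of independence on the 3×3 editor-by-panelist table (so df = (3-1)(3-1) = 4):

from scipy.stats import chi2

chi2_stat = 16.3            # reported test statistic
df = (3 - 1) * (3 - 1)      # 3x3 contingency table -> 4 degrees of freedom

p_value = chi2.sf(chi2_stat, df)   # P(X >= 16.3) under the null hypothesis
print(round(p_value, 4))           # ~0.0026 < 0.005, consistent with 99.5% significance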

18
Expt. 3: Effect of Judge Class: Qualitative Analysis
  • Editors' top feedback
  • Ranking not as good (precision)
  • Both equally good (precision)
  • Relevant sites missing (recall)
  • Perfect site missing (recall)
  • Panelists' top feedback
  • Both equally good (precision)
  • Too general (precision)
  • Both equally bad (precision)
  • Ranking not as good (precision)
  • Panelists need to see other SE to penalize poor
    recall.

19
Related Work
  • Mizzaro
  • Framework with 3 key dimensions of evaluation
  • Information needs (expression level)
  • Information resources (T.A.U., documents, ...)
  • Information context
  • High level of disagreement among judges
  • Amento et al.
  • Correlation Analysis
  • Expert Judges
  • Automated metrics: in-degree, PageRank, page size, ...

20
Contributions
  • Framework
  • Set-level to Item-level correlation
  • Middling: the two measure different aspects of relevance
  • Set-level measures aspects missed by per-item judgments:
    poor ranking, duplicates, missed senses
  • Perceived Relevance to Actual relevance
  • Higher correlation, maybe because the domain was News
  • Editorial judges versus Panelists
  • Panelists sit on the fence more
  • Panelists focus on precision more; they need the other
    SE to judge recall
  • Panelist methodology and reward structure are crucial