A Ranking Scheme for XML Information Retrieval Based on Benefit and Reading Effort

1
A Ranking Scheme for XML Information Retrieval
Based on Benefit and Reading Effort

Toshiyuki Shimizu (Kyoto University)
Masatoshi Yoshikawa (Kyoto University)

ICADL 2007, 12th December
2
XML-IR systems

Growing demand for XML Information Retrieval (XML-IR) systems
We can identify meaningful document fragments by encoding documents in XML
ex) Sections, subsections and paragraphs in scholarly articles
Browsing only the document fragments relevant to a certain topic
The simplest form of query for XML-IR is just a set of keywords
Simple, intuitively understandable, yet useful form of query, especially for unskilled end-users
Active research area, as in INEX

INitiative for the Evaluation of XML Retrieval
(http://inex.is.informatik.uni-duisburg.de/)
3
Results of XML-IR Systems
<?xml version="1.0"?>
<article>
  <sec>
    <p>XML labeling</p>
    <p>The structure of XML is a tree, and each node in the XML is labeled.</p>
    <p>We can get tag name of each XML element.</p>
  </sec>
  <sec>
    <p>Tree index</p>
    <p>XML index is constructed using the labels</p>
  </sec>
</article>

Document fragment (element)
With relevance degree (Score)
ex) Query term was XML

[Figure: element tree with scores: e0 article (0.56); e1 sec (0.64) with children e2 p (0.4), e3 p (0.9), e4 p (0.33); e5 sec (0.35) with children e6 p (0), e7 p (0.8)]
4
Naïve XML-IR System

Thorough strategy of INEX 2005

Simply retrieves relevant elements from all
elements and ranks them in order of relevance

e3 (0.9), e7 (0.8), e1 (0.64), e0 (0.56), e2 (0.4), e5 (0.35), e4 (0.33)

[Figure: the element tree annotated with the scores above; e6 has score 0]

The Thorough strategy is intended for system evaluation
The user's behavior when browsing search results must also be considered
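
The Thorough ranking on this slide can be sketched in a few lines. This is only an illustration: the scores are the ones quoted in the running example, while the function name `thorough` and the choice to drop zero-score elements (e6 does not appear in the slide's list) are assumptions.

```python
# Relevance scores of the running example (query term "XML").
SCORES = {
    "e0": 0.56, "e1": 0.64, "e2": 0.4, "e3": 0.9,
    "e4": 0.33, "e5": 0.35, "e6": 0.0, "e7": 0.8,
}

def thorough(scores):
    """Rank every element with a non-zero score by descending relevance.

    Nesting between elements is ignored, which is exactly what makes
    the result list overlap.
    """
    relevant = [e for e, s in scores.items() if s > 0]
    return sorted(relevant, key=lambda e: scores[e], reverse=True)

ranking = thorough(SCORES)
print(ranking)
```

The resulting order matches the list on the slide: both e3 and its ancestors e1 and e0 appear, so a user reading from the top meets the same paragraph several times.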

5
Problems of Thorough Retrieval for XML-IR

Nesting elements
Browsing both of two nested elements is redundant
Ancestor element e_a → Descendant element e_d: e_d has already been fully seen
Descendant element e_d → Ancestor element e_a: e_a has been partially seen before
Element size
Elements retrieved by XML-IR systems vary widely in size
Large element, such as article (whole document)
Small element, such as p (paragraph)
Total output size of top-k elements cannot be controlled simply by giving an integer k
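
The nesting problem can be made concrete with a small check over a ranked list. The sketch below assumes the tree of the running example (e1 contains e2, e3, e4; e5 contains e6, e7); the helper names `is_ancestor` and `nested_pairs` are hypothetical.

```python
# Element tree of the running example: child -> parent.
PARENT = {
    "e1": "e0", "e5": "e0",
    "e2": "e1", "e3": "e1", "e4": "e1",
    "e6": "e5", "e7": "e5",
}

def is_ancestor(a, d):
    """True if element a is a proper ancestor of element d."""
    while d in PARENT:
        d = PARENT[d]
        if d == a:
            return True
    return False

def nested_pairs(ranked):
    """All (ancestor, descendant) pairs that occur together in a result list."""
    return [(a, d) for a in ranked for d in ranked if is_ancestor(a, d)]

# Top-3 of the Thorough ranking: e3 is shown first, then shown again inside e1.
print(nested_pairs(["e3", "e7", "e1"]))
```

Every pair reported here is content the user would be shown twice.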

6
Overview of our Approach

Introduction of the concepts of benefit and
reading effort
Users can control the total output size
Systems can retrieve non-overlapping elements

7
Properties of Benefit and Reading Effort (1/2)

Benefit
The benefit of an element is the amount of information about the query gained by reading the element
Assumption 1: The benefit of an element is greater than or equal to the sum of the benefits of its child elements
Information complementation among sibling elements
ex) For two query terms A and B, e6 contains topics about A and e7 contains topics about B
→ The benefit of e5 seems to be greater than the sum of the benefits of e6 and e7

[Figure: e5 (sec) with child elements e6 (p) and e7 (p)]
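
Assumption 1 can be stated as an inequality. The notation is an assumption made here for illustration and is not taken from the slides: b(e) denotes the benefit of element e, and children(e) its child elements.

```latex
b(e) \;\ge\; \sum_{c \,\in\, \mathrm{children}(e)} b(c)
\qquad \text{e.g.} \qquad
b(e_5) \;\ge\; b(e_6) + b(e_7)
```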
8
Properties of Benefit and Reading Effort (2/2)

Reading Effort
The reading effort of an element is the cost of reading the content of the element
Assumption 2: The reading effort of an element is less than or equal to the sum of the reading efforts of its child elements
Readability of continuous reading
ex) Users can read the same content more easily by reading e5 than by reading e6 and e7 separately

[Figure: e5 (sec) read as a whole, compared with e6 (p) and e7 (p) read separately]
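
Assumption 2 mirrors Assumption 1 with the inequality reversed. As before, the notation is an assumption for illustration only: r(e) denotes the reading effort of element e.

```latex
r(e) \;\le\; \sum_{c \,\in\, \mathrm{children}(e)} r(c)
\qquad \text{e.g.} \qquad
r(e_5) \;\le\; r(e_6) + r(e_7)
```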
9
Overview of our Approach

Introduction of the concepts of benefit and
reading effort
Users can control the total output size
Systems can retrieve non-overlapping elements
Flexible retrieval
Users specify a threshold for the total amount of
reading effort
The systems return relevant elements that provide
larger benefit and that can be read within
specified reading effort

10
Flexible Retrieval

Systems calculate benefit and reading effort
A variant of knapsack problems
ex) Threshold of reading effort: 15
→ Retrieve e2, e3 (Total benefit: 11)

[Figure: element tree (e0 article; e1, e5 sec; e2, e3, e4, e6, e7 p) annotated with benefit and reading effort]
11
Flexible Retrieval

Systems calculate benefit and reading effort
A variant of knapsack problems
ex) Threshold of reading effort: 20
→ Retrieve e3, e7 (Total benefit: 17)

[Figure: element tree (e0 article; e1, e5 sec; e2, e3, e4, e6, e7 p) annotated with benefit and reading effort]
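
The two examples can be reproduced with an exhaustive search over non-overlapping element sets. This is a minimal sketch, not the authors' implementation. The per-element (benefit, reading effort) values in the figure did not survive extraction, so the pairs below are assumptions chosen so that benefit / effort matches the scores shown earlier and the totals quoted on the slides (11 at threshold 15, 17 at threshold 20). The function names are hypothetical.

```python
from itertools import combinations

# Element tree of the running example: child -> parent.
PARENT = {
    "e1": "e0", "e5": "e0",
    "e2": "e1", "e3": "e1", "e4": "e1",
    "e6": "e5", "e7": "e5",
}
# ASSUMED (benefit, reading effort) per element; not taken from the slides.
VALUES = {
    "e0": (28, 50), "e1": (18, 28), "e2": (2, 5), "e3": (9, 10),
    "e4": (5, 15), "e5": (8, 23), "e6": (0, 13), "e7": (8, 10),
}

def is_ancestor(a, d):
    """True if element a is a proper ancestor of element d."""
    while d in PARENT:
        d = PARENT[d]
        if d == a:
            return True
    return False

def overlaps(elements):
    """True if any element of the set contains another one."""
    return any(is_ancestor(a, d) for a in elements for d in elements)

def best_selection(threshold):
    """Brute force: the non-overlapping set with maximal total benefit
    whose total reading effort stays within the threshold."""
    best, best_benefit = (), 0
    names = sorted(VALUES)
    for k in range(1, len(names) + 1):
        for subset in combinations(names, k):
            if overlaps(subset):
                continue
            effort = sum(VALUES[e][1] for e in subset)
            benefit = sum(VALUES[e][0] for e in subset)
            if effort <= threshold and benefit > best_benefit:
                best, best_benefit = subset, benefit
    return set(best), best_benefit

print(best_selection(15))  # e2 and e3, total benefit 11
print(best_selection(20))  # e3 and e7, total benefit 17
```

Enumerating all subsets is only feasible for a toy tree like this one; it is used here to make the knapsack formulation explicit.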
12
Search Result Continuity
ex) Reading effort 15 → Retrieve e2, e3 (benefit 11)
ex) Reading effort 20 → Retrieve e3, e7 (benefit 17)

The running example violates search result continuity
The content of the element set for reading effort r must be contained in the content of the element set for reading effort r' if r < r'
The optimal solution
is NP-hard to compute (a variant of knapsack problems)
may violate search result continuity
Greedy retrieval algorithm
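
The continuity condition can be checked mechanically by comparing the content covered by two result sets. The sketch below assumes that all text lives in the leaf paragraphs of the running example's tree, so the content of an element is the set of paragraphs below it; `content` and `is_continuous` are hypothetical helper names.

```python
# Element tree of the running example: parent -> children.
CHILDREN = {
    "e0": ["e1", "e5"],
    "e1": ["e2", "e3", "e4"],
    "e5": ["e6", "e7"],
}

def content(elements):
    """Set of leaf paragraphs covered by a set of elements."""
    covered = set()
    stack = list(elements)
    while stack:
        e = stack.pop()
        kids = CHILDREN.get(e, [])
        if kids:
            stack.extend(kids)
        else:
            covered.add(e)
    return covered

def is_continuous(smaller, larger):
    """True if everything shown for the smaller threshold is still shown
    for the larger one."""
    return content(smaller) <= content(larger)

# Optimal results of the running example: e2 disappears at threshold 20.
print(is_continuous(["e2", "e3"], ["e3", "e7"]))  # False
```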

13
Retrieval Algorithm

Based on the result of the Thorough strategy
Adjust benefit and reading effort for the elements nested with a retrieved element, and rerank
Remove content that overlaps because of nesting

Result of Thorough (simply retrieves relevant elements from all elements and ranks them in order of relevance):
e3 (0.9), e7 (0.8), e1 (0.64), e0 (0.56), e2 (0.4), e5 (0.35), e4 (0.33)

[Figure: the element tree annotated with the scores above; e6 has score 0]
14
Retrieval Algorithm
Threshold of reading effort: 40
Result of Thorough: e3 (0.9), e7 (0.8), e1 (0.64), e0 (0.56), e2 (0.4), e5 (0.35), e4 (0.33)
Our result: e3 (0.9)
Amount of benefit: 9
Amount of reading effort: 10
Adjust e1 (0.64 → 0.5) and e0 (0.56 → 0.48)

[Figure: element tree with the adjusted scores of e1 and e0]
15
Retrieval Algorithm
Threshold of reading effort: 40
Remaining in the Thorough list: e1 (0.5), e0 (0.48), e2 (0.4), e5 (0.35), e4 (0.33)
Our result: e3 (0.9), e7 (0.8)
Amount of benefit: 9 → 17
Amount of reading effort: 10 → 20
Adjust and rerank e5 (0.35 → 0) and e0 (0.48 → 0.37)

[Figure: element tree with the adjusted scores of e5 and e0]
16
Retrieval Algorithm
Threshold of reading effort: 40
Remaining in the Thorough list: e2 (0.4), e0 (0.37), e4 (0.33), e5 (0)
Our result: e3 (0.9), e7 (0.8), e1 (0.5)
Amount of benefit: 17 → 26
Amount of reading effort: 20 → 38
Adjust and rerank e0 (0.37 → 0.17)

[Figure: element tree with the adjusted score of e0]
17
Retrieval Algorithm
Threshold of reading effort: 40
Remaining in the Thorough list: e2 (0.4), e4 (0.33), e0 (0.17), e5 (0)
Our result: e7 (0.8), e1 (0.5) (e3 is contained in e1)
Amount of benefit: 26
Amount of reading effort: 38

[Figure: element tree with the final scores]
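
The walk-through above can be approximated by the following greedy sketch. It is one reading of the slides, not the authors' algorithm. Three things are assumptions: the (benefit, reading effort) pairs, which were chosen so that benefit / effort reproduces every score and total quoted on the slides; the rule that an ancestor's adjusted score is its remaining benefit divided by its remaining effort once retrieved descendants are discounted; and the rule that retrieval stops at the first element that no longer fits.

```python
# Element tree of the running example: child -> parent.
PARENT = {
    "e1": "e0", "e5": "e0",
    "e2": "e1", "e3": "e1", "e4": "e1",
    "e6": "e5", "e7": "e5",
}
# ASSUMED (benefit, reading effort) per element; not taken from the slides.
VALUES = {
    "e0": (28, 50), "e1": (18, 28), "e2": (2, 5), "e3": (9, 10),
    "e4": (5, 15), "e5": (8, 23), "e6": (0, 13), "e7": (8, 10),
}

def ancestors(e):
    """All proper ancestors of e, nearest first."""
    out = []
    while e in PARENT:
        e = PARENT[e]
        out.append(e)
    return out

def greedy_retrieve(threshold):
    result, candidates = [], set(VALUES)
    total_benefit = total_effort = 0

    def marginal(e):
        # Benefit and effort that e still adds once its already
        # retrieved descendants are discounted.
        covered = [r for r in result if e in ancestors(r)]
        b = VALUES[e][0] - sum(VALUES[r][0] for r in covered)
        c = VALUES[e][1] - sum(VALUES[r][1] for r in covered)
        return b, c

    def score(e):
        b, c = marginal(e)
        return b / c if c > 0 else 0.0

    while candidates:
        best = max(sorted(candidates), key=score)
        b, c = marginal(best)
        if b <= 0 or total_effort + c > threshold:
            break  # assumed stopping rule: first element that does not fit
        # Retrieved descendants of `best` are absorbed into it.
        result = [r for r in result if best not in ancestors(r)] + [best]
        total_benefit += b
        total_effort += c
        # `best` and everything below it are no longer candidates.
        candidates -= {best} | {d for d in VALUES if best in ancestors(d)}
    return result, total_benefit, total_effort

print(greedy_retrieve(40))  # (['e7', 'e1'], 26, 38)
```

With these assumed values the sketch retrieves e3 alone at threshold 15 and e3, e7 at threshold 20, so a larger threshold only ever adds content. The price is that at threshold 15 the benefit is 9 rather than the optimal 11.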
20
Evaluation Metrics

Based on benefit and reading effort
b/e graph (benefit/effort graph)
Comparison with BTIL (Best Thorough Input List)
The BTIL system is the system which uses actual benefit and reading effort
Actual benefit is calculated using manually constructed assessments (e.g. INEX)
We can observe the relative effectiveness in terms of benefit as the specified threshold of reading effort changes
Use the same values for reading effort in the implemented system and the BTIL system

21
[Figure: two copies of the element tree (e0 article; e1, e5 sec; e2, e3, e4, e6, e7 p), one annotated with calculated benefit / reading effort, the other with actual benefit / reading effort]

For the threshold value 30 of reading effort:
BTIL system retrieves e3, e6. Obtained actual benefit is 23
Implemented system retrieves e3, e7. Obtained actual benefit is 10
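
One way to turn such comparisons into points of a b/e graph is to divide the actual benefit obtained by the implemented system by the actual benefit obtained by BTIL at each threshold. The slides do not define the exact quantity plotted, so this ratio and the function name `relative_benefit` are assumptions made for illustration.

```python
def relative_benefit(points):
    """points: iterable of (threshold, system_actual_benefit, btil_actual_benefit).

    Returns (threshold, ratio) pairs, skipping thresholds where BTIL
    itself obtains no benefit.
    """
    return [(t, s / b) for t, s, b in points if b > 0]

# The single data point given on the slide: threshold 30, 10 versus 23.
for threshold, ratio in relative_benefit([(30, 10, 23)]):
    print(threshold, round(ratio, 2))
```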
22
Examples of b/e Graph using INEX 2005 Test
Collection (1/2)

XML document set, Topics, Assessments
Calculate actual benefit and reading effort from the Assessments
ex (Exhaustivity): Highly exhaustive (HE) → 1, Partially exhaustive (PE) → 0.5, Not exhaustive (NE) → 0
rsize: relevant text length (in number of characters)
size: element length (in number of characters)
We implemented a system using tf-ief
ief stands for inverse element frequency
Satisfies the Assumptions for benefit and reading effort