Title: A Ranking Scheme for XML Information Retrieval Based on Benefit and Reading Effort
1 A Ranking Scheme for XML Information Retrieval Based on Benefit and Reading Effort
- Toshiyuki Shimizu (Kyoto University)
- Masatoshi Yoshikawa (Kyoto University)
ICADL 2007 12th December
2 XML-IR systems
- Growing demand for XML Information Retrieval (XML-IR) systems
  - We can identify meaningful document fragments by encoding documents in XML
    - ex) Sections, subsections and paragraphs in scholarly articles
  - Browsing only document fragments relevant to a certain topic
- The simplest form of query for XML-IR is just a set of keywords
  - Simple, intuitively understandable, yet useful form of query, especially for unskilled end-users
- Active research area, as in INEX
  - INitiative for the Evaluation of XML Retrieval (http://inex.is.informatik.uni-duisburg.de/)
3 Results of XML-IR Systems
<?xml version="1.0"?>
<article>
  <sec>
    <p>XML labeling</p>
    <p>The structure of XML is a tree, and each node in the XML is labeled.</p>
    <p>We can get tag name of each XML element.</p>
  </sec>
  <sec>
    <p>Tree index</p>
    <p>XML index is constructed using the labels</p>
  </sec>
</article>
- Document fragment (element)
- With relevance degree (Score)
- ex) Query term was XML
[Figure: element tree annotated with scores. e0 article (0.56); e1 sec (0.64) with children e2 p (0.4), e3 p (0.9), e4 p (0.33); e5 sec (0.35) with children e6 p (0), e7 p (0.8)]
4 Naïve XML-IR System
- Thorough strategy of INEX 2005
  - Simply retrieves relevant elements from all elements and ranks them in order of relevance
  - e3 (0.9), e7 (0.8), e1 (0.64), e0 (0.56), e2 (0.4), e5 (0.35), e4 (0.33)
[Figure: element tree annotated with scores. e0 article (0.56); e1 sec (0.64) with children e2 p (0.4), e3 p (0.9), e4 p (0.33); e5 sec (0.35) with children e6 p (0), e7 p (0.8)]
- Thorough is considered for system evaluation
- User behavior of browsing search results must be
considered
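As a minimal sketch of the Thorough strategy described on this slide, the ranking can be reproduced by sorting all elements by score. The scores are the ones from the running example; the function name `thorough` and the choice to drop zero-score elements (e6 does not appear in the slide's list) are assumptions made for illustration.

```python
# Scores from the running example (query term "XML").
scores = {
    "e0": 0.56, "e1": 0.64, "e2": 0.4, "e3": 0.9,
    "e4": 0.33, "e5": 0.35, "e6": 0.0, "e7": 0.8,
}

def thorough(scores):
    """Rank every element with a positive score, ignoring nesting."""
    relevant = [(e, s) for e, s in scores.items() if s > 0]
    return sorted(relevant, key=lambda pair: pair[1], reverse=True)

ranking = thorough(scores)
print(" ".join(f"{e} ({s})" for e, s in ranking))
# e3 (0.9) e7 (0.8) e1 (0.64) e0 (0.56) e2 (0.4) e5 (0.35) e4 (0.33)
```

Note that the list contains both e1 and its child e3, and both e0 and everything below it, which is the nesting problem discussed next.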
5 Problems of Thorough Retrieval for XML-IR
- Nesting elements
  - Browsing both of two nested elements is useless
  - Ancestor element ea → Descendant element ed
    - ed has been fully seen
  - Descendant element ed → Ancestor element ea
    - ea has been partially seen before
- Element size
  - Elements retrieved by XML-IR systems vary widely in size
    - Large element, such as article (whole document)
    - Small element, such as p (paragraph)
  - Total output size of top-k elements is uncontrollable by simply giving an integer k
6 Overview of our Approach
- Introduction of the concepts of benefit and reading effort
  - Users can control the total output size
  - Systems can retrieve non-overlapping elements
7 Properties of Benefit and Reading Effort (1/2)
- Benefit
  - The benefit of an element is the amount of information about the query gained by reading the element
  - Assumption 1: The benefit of an element is greater than or equal to the sum of the benefit of the child elements
    - Information complementation among sibling elements
    - ex) For two query terms A and B, e6 contains topics about A and e7 contains topics about B → The benefit of e5 seems to be greater than the sum of the benefit of e6 and e7
[Figure: e5 (sec) with child elements e6 (p) and e7 (p)]
8 Properties of Benefit and Reading Effort (2/2)
- Reading Effort
  - The reading effort of an element is the cost of reading the content of the element
  - Assumption 2: The reading effort of an element is less than or equal to the sum of the reading effort of the child elements
    - Readability of continuous reading
    - ex) Users can read the same content more easily by reading e5 rather than reading e6 and e7 separately
[Figure: e5 (sec) read as one continuous element, compared with its child elements e6 (p) and e7 (p) read separately]
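Assumptions 1 and 2 can be written compactly as inequalities. The notation is an assumption introduced here for illustration: b(e) is the benefit of element e, r(e) its reading effort, and children(e) its child elements.

```latex
\begin{align}
  b(e) &\ge \sum_{c \in \mathrm{children}(e)} b(c)
       && \text{(Assumption 1)} \\
  r(e) &\le \sum_{c \in \mathrm{children}(e)} r(c)
       && \text{(Assumption 2)}
\end{align}
```

For the example element e5 with children e6 and e7, these read b(e5) ≥ b(e6) + b(e7) and r(e5) ≤ r(e6) + r(e7).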
9 Overview of our Approach
- Introduction of the concepts of benefit and reading effort
  - Users can control the total output size
  - Systems can retrieve non-overlapping elements
- Flexible retrieval
  - Users specify a threshold for the total amount of reading effort
  - The system returns relevant elements that provide larger benefit and that can be read within the specified reading effort
10 Flexible Retrieval
- Systems calculate benefit and reading effort
- A variant of the knapsack problem
- ex) Threshold of reading effort: 15 → Retrieve e2, e3 (Total benefit 11)
[Figure: element tree. e0 (article); e1 (sec) with children e2, e3, e4 (p); e5 (sec) with children e6, e7 (p)]
11 Flexible Retrieval
- Systems calculate benefit and reading effort
- A variant of the knapsack problem
- ex) Threshold of reading effort: 20 → Retrieve e3, e7 (Total benefit 17)
[Figure: element tree. e0 (article); e1 (sec) with children e2, e3, e4 (p); e5 (sec) with children e6, e7 (p)]
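The knapsack view of the two examples can be sketched with a brute-force search over non-overlapping element sets. The per-element (benefit, reading effort) values below are assumptions: the figure values did not survive in this transcript, so they were chosen only to be consistent with the scores and totals quoted on the slides (the values for e4 and e6 in particular are guesses). The function names are likewise hypothetical.

```python
from itertools import combinations

# Tree of the running example: child -> parent.
PARENT = {"e1": "e0", "e5": "e0", "e2": "e1", "e3": "e1",
          "e4": "e1", "e6": "e5", "e7": "e5"}

# ASSUMED (benefit, reading effort) per element; not taken from the slides.
VALUES = {
    "e0": (28, 50), "e1": (18, 28), "e2": (2, 5), "e3": (9, 10),
    "e4": (5, 15), "e5": (8, 23), "e6": (0, 13), "e7": (8, 10),
}

def ancestors(e):
    """Return the set of all ancestors of element e."""
    found = set()
    while e in PARENT:
        e = PARENT[e]
        found.add(e)
    return found

def overlaps(a, b):
    """Two elements overlap when one is an ancestor of the other."""
    return a in ancestors(b) or b in ancestors(a)

def best_selection(threshold):
    """Exhaustively find the non-overlapping set with maximum benefit."""
    best, best_benefit = (), 0
    names = sorted(VALUES)
    for k in range(1, len(names) + 1):
        for subset in combinations(names, k):
            if any(overlaps(a, b) for a, b in combinations(subset, 2)):
                continue
            effort = sum(VALUES[e][1] for e in subset)
            benefit = sum(VALUES[e][0] for e in subset)
            if effort <= threshold and benefit > best_benefit:
                best, best_benefit = subset, benefit
    return sorted(best), best_benefit

print(best_selection(15))  # (['e2', 'e3'], 11)
print(best_selection(20))  # (['e3', 'e7'], 17)
```

Exhaustive search is only feasible for a toy tree like this one, since the number of subsets grows exponentially with the number of elements.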
12 Search Result Continuity
ex) reading effort 15 → Retrieve e2, e3 (benefit 11)
    reading effort 20 → Retrieve e3, e7 (benefit 17)
- The running example violates search result continuity
  - The content of the element set for reading effort r must be contained in the content of the element set for reading effort r' if r < r'
- The optimal solution
  - is NP-hard (a variant of the knapsack problem)
  - may violate search result continuity
- Greedy retrieval algorithm
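The continuity condition can be checked mechanically. The sketch below makes a simplifying assumption: the content of an element is modelled as the set of leaf paragraphs it covers. The `greedy_like` sets follow the order in which elements enter "Our result" in the retrieval-algorithm walkthrough (e3, then e7, then e1 replacing e3); the thresholds 10 and 20 attached to them are my reading of the amounts shown there, not values stated as thresholds on the slides.

```python
# Leaf paragraphs covered by each element of the running example tree.
CONTENT = {
    "e0": {"e2", "e3", "e4", "e6", "e7"},
    "e1": {"e2", "e3", "e4"},
    "e5": {"e6", "e7"},
    "e2": {"e2"}, "e3": {"e3"}, "e4": {"e4"}, "e6": {"e6"}, "e7": {"e7"},
}

def content(elements):
    """Union of the leaf paragraphs covered by a set of elements."""
    covered = set()
    for e in elements:
        covered |= CONTENT[e]
    return covered

def is_continuous(results_by_effort):
    """results_by_effort maps a reading-effort threshold to the retrieved set.

    Checking consecutive thresholds is enough, because set containment
    is transitive.
    """
    thresholds = sorted(results_by_effort)
    return all(
        content(results_by_effort[r]) <= content(results_by_effort[r2])
        for r, r2 in zip(thresholds, thresholds[1:])
    )

optimal = {15: {"e2", "e3"}, 20: {"e3", "e7"}}
greedy_like = {10: {"e3"}, 20: {"e3", "e7"}, 40: {"e1", "e7"}}

print(is_continuous(optimal))      # False: e2 disappears at effort 20
print(is_continuous(greedy_like))  # True
```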
13 Retrieval Algorithm
- Based on the result of the Thorough strategy, which simply retrieves relevant elements from all elements and ranks them in order of relevance
  - Adjust the benefit and reading effort of the elements nested with a retrieved element, and rerank
  - Remove contents that overlap because of nesting
Result of Thorough: e3 (0.9), e7 (0.8), e1 (0.64), e0 (0.56), e2 (0.4), e5 (0.35), e4 (0.33)
[Figure: element tree annotated with scores. e0 article (0.56); e1 sec (0.64) with children e2 p (0.4), e3 p (0.9), e4 p (0.33); e5 sec (0.35) with children e6 p (0), e7 p (0.8)]
14 Retrieval Algorithm
Threshold of reading effort: 40
Result of Thorough: e3 (0.9), e7 (0.8), e1 (0.64 → 0.5), e0 (0.56 → 0.48), e2 (0.4), e5 (0.35), e4 (0.33)
Our result: e3 (0.9)
Amount of benefit: 9
Amount of reading effort: 10
Adjust e1, e0
[Figure: element tree annotated with scores; e1 changes from 0.64 to 0.5 and e0 from 0.56 to 0.48]
15 Retrieval Algorithm
Threshold of reading effort: 40
Result of Thorough: e3 (0.9), e7 (0.8), e1 (0.5), e0 (0.48 → 0.37), e2 (0.4), e5 (0.35 → 0), e4 (0.33)
Our result: e3 (0.9), e7 (0.8)
Amount of benefit: 9 → 17
Amount of reading effort: 10 → 20
Adjust and rerank e5, e0
[Figure: element tree annotated with scores; e5 changes from 0.35 to 0 and e0 from 0.48 to 0.37]
16 Retrieval Algorithm
Threshold of reading effort: 40
Result of Thorough: e3 (0.9), e7 (0.8), e1 (0.5), e2 (0.4), e0 (0.37 → 0.17), e4 (0.33), e5 (0)
Our result: e3 (0.9), e7 (0.8), e1 (0.5)
Amount of benefit: 17 → 26
Amount of reading effort: 20 → 38
Adjust and rerank e0
[Figure: element tree annotated with scores; e0 changes from 0.37 to 0.17]
17 Retrieval Algorithm
Threshold of reading effort: 40
Result of Thorough: e3 (0.9), e7 (0.8), e1 (0.5), e2 (0.4), e4 (0.33), e0 (0.17), e5 (0)
Our result: e7 (0.8), e1 (0.5)
Amount of benefit: 26
Amount of reading effort: 38
[Figure: element tree annotated with scores. e0 article (0.17); e1 sec (0.5) with children e2 p (0.4), e3 p (0.9), e4 p (0.33); e5 sec (0) with children e6 p (0), e7 p (0.8)]
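The walkthrough with threshold 40 can be sketched as a greedy loop. This is one reading of the slides, not necessarily the authors' exact algorithm: the score is taken to be benefit divided by reading effort, retrieving an element subtracts its benefit and effort from its ancestors, retrieving an ancestor removes its descendants from both the candidates and the result, and the loop stops at the first top-ranked element that no longer fits (an assumption made so that results for smaller thresholds stay contained in results for larger ones). The (benefit, reading effort) values are assumed, chosen to match the scores and amounts on the slides.

```python
# Tree of the running example: child -> parent.
PARENT = {"e1": "e0", "e5": "e0", "e2": "e1", "e3": "e1",
          "e4": "e1", "e6": "e5", "e7": "e5"}

# ASSUMED (benefit, reading effort) per element; not taken from the slides.
VALUES = {
    "e0": (28, 50), "e1": (18, 28), "e2": (2, 5), "e3": (9, 10),
    "e4": (5, 15), "e5": (8, 23), "e6": (0, 13), "e7": (8, 10),
}

def ancestors(e):
    """All ancestors of e, nearest first."""
    chain = []
    while e in PARENT:
        e = PARENT[e]
        chain.append(e)
    return chain

def descendants(e):
    """All elements that have e as an ancestor."""
    return {d for d in VALUES if e in ancestors(d)}

def greedy(threshold):
    benefit = {e: b for e, (b, _) in VALUES.items()}
    effort = {e: r for e, (_, r) in VALUES.items()}
    candidates = set(VALUES)
    result, total_benefit, total_effort = [], 0, 0

    def score(e):
        return benefit[e] / effort[e] if effort[e] > 0 else 0.0

    while candidates:
        # Rerank: pick the candidate with the highest adjusted score.
        top = max(candidates, key=lambda e: (score(e), e))
        if score(top) <= 0 or total_effort + effort[top] > threshold:
            break  # ASSUMPTION: stop at the first element that does not fit
        total_benefit += benefit[top]
        total_effort += effort[top]
        # Remove content already covered by the newly retrieved element.
        inside = descendants(top)
        result = [e for e in result if e not in inside] + [top]
        candidates -= inside | {top}
        # Adjust the remaining benefit and effort of the ancestors.
        for a in ancestors(top):
            benefit[a] -= benefit[top]
            effort[a] -= effort[top]
    return result, total_benefit, total_effort

for t in (15, 20, 40):
    print(t, greedy(t))
```

With threshold 40 this returns e7 and e1 with benefit 26 and reading effort 38, matching the final state of the walkthrough; e0 is not added because its remaining effort of 12 would exceed the threshold.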
20 Evaluation Metrics
- Based on benefit and reading effort
- b/e graph (benefit/effort graph)
- Comparison with BTIL (Best Thorough Input List)
  - The BTIL system is the system which uses actual benefit and reading effort
  - Actual benefit is calculated using manually constructed assessments (e.g. INEX)
  - We can observe the relative effectiveness in terms of benefit as the specified threshold of reading effort changes
  - Use the same values for reading effort in the implemented system and the BTIL system
21
[Figure: two copies of the element tree (e0 article; e1 sec with children e2, e3, e4 p; e5 sec with children e6, e7 p), one annotated with calculated benefit / reading effort and one with actual benefit / reading effort]
For the threshold value 30 of reading effort:
- BTIL system retrieves e3, e6. Obtained actual benefit is 23
- Implemented system retrieves e3, e7. Obtained actual benefit is 10
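One point of a b/e graph can be computed as follows. The per-element actual benefits are hypothetical: the slide only gives the totals 23 and 10, so the split below was chosen solely to reproduce those totals. Expressing the comparison as a ratio is also an assumption made for illustration.

```python
# HYPOTHETICAL actual benefits, chosen only so that
# e3 + e6 = 23 and e3 + e7 = 10 as stated on the slide.
ACTUAL_BENEFIT = {"e3": 10, "e6": 13, "e7": 0}

def obtained_actual_benefit(retrieved):
    """Sum the actual benefit over a non-overlapping result set."""
    return sum(ACTUAL_BENEFIT[e] for e in retrieved)

# Results at reading-effort threshold 30.
btil = obtained_actual_benefit(["e3", "e6"])
system = obtained_actual_benefit(["e3", "e7"])

print(btil, system, round(system / btil, 2))  # 23 10 0.43
```

Repeating this for a range of thresholds gives one curve per system, which is what the b/e graph plots.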
22 Examples of b/e Graph using INEX 2005 Test Collection (1/2)
- XML document set, Topics, Assessments
  - Calculate actual benefit and reading effort from Assessments
    - ex (Exhaustivity): Highly exhaustive (HE) → 1, Partially exhaustive (PE) → 0.5, Not exhaustive (NE) → 0
    - rsize: relevant text length (in number of characters)
    - size: element length (in number of characters)
- We implemented a system using tf-ief
  - ief stands for inverse element frequency
  - satisfies the Assumptions for benefit and reading effort
23 Examples of b/e Graph using INEX 2005 Test Collection (2/2)
[Figure: b/e graphs for Topic 207 and Topic 206]
- We can observe the relative effectiveness of implemented systems against the BTIL system
24 Conclusions and Future Works
- Conclusions
  - Introduction of benefit and reading effort
    - Handling nesting elements
    - Variety of element size
  - Algorithm for flexible retrieval
    - Result elements change depending on the specified reading effort
  - System evaluation
- Future Works
  - Introduction of switching effort
    - Cost of switching a result item in the results list
    - Retrieving numerous results increases the cost of browsing
  - Integration with user interface