Title: The Index-based XXL Search Engine for Querying XML Data with Relevance Ranking
1The Index-based XXL Search Engine for Querying
XML Data with Relevance Ranking
- Anja Theobald and Gerhard Weikum
Presented by Jianbin Wei CSC 8710 Nov. 4, 2003
2Motivation
- Proposed XML query languages, such as XQuery, are of limited value for XML documents from different sources
- Traditional Web search engines do little about the structure of XML documents
- XXL (Flexible XML Search Language) addresses both issues.
3Outline
- Similarity query
- Index support for XXL query
- Query processing
- Architecture of XXL
- Conclusion
4XXL Query
- Exact-match condition
- Animal-specimen cat
- Similarity condition
- Animal-species lion
5Ontology-based Similarity
lowest common parent
sim(lion, brown bear) = 1 / (1 + dist(lion, brown bear)) = 1/3.5
dist(lion, brown bear) = siblingdist(big cat, bear) + length(lion, big cat) + length(brown bear, bear) = 2.5
siblingdist(big cat, bear) = |1 - 2| / 2 = 0.5
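The distance computation on this slide can be sketched in Python. This is a minimal sketch, not the XXL implementation: the tree is the predator example from the slides, every parent-child edge is assumed to have length 1, and siblingdist is assumed to be the difference in sibling positions divided by the number of siblings (one reading that reproduces the slide's value of 0.5). The names CHILDREN, path_to_root, dist and sim are hypothetical.

```python
# Toy ontology from the slides: each node maps to its ordered children.
CHILDREN = {
    "predator": ["big cat", "bear"],
    "big cat": ["lion", "tiger"],
    "bear": ["brown bear", "polar bear"],
}
PARENT = {child: parent for parent, kids in CHILDREN.items() for child in kids}


def path_to_root(term):
    """Return [term, parent, grandparent, ..., root]."""
    path = [term]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path


def dist(a, b):
    """Distance via the lowest common parent (assumed rule, see lead-in)."""
    if a == b:
        return 0.0
    pa, pb = path_to_root(a), path_to_root(b)
    common = next(node for node in pa if node in pb)  # lowest common parent
    ia, ib = pa.index(common), pb.index(common)
    if ia == 0 or ib == 0:
        # One term is an ancestor of the other: plain path length.
        return float(ia + ib)
    siblings = CHILDREN[common]
    sibling_dist = abs(siblings.index(pa[ia - 1]) - siblings.index(pb[ib - 1])) / len(siblings)
    # Add the path lengths from each term up to its ancestor below the common parent.
    return sibling_dist + (ia - 1) + (ib - 1)


def sim(a, b):
    return 1.0 / (1.0 + dist(a, b))


print(dist("lion", "brown bear"))  # 2.5
print(sim("lion", "brown bear"))   # 1/3.5
```

Under these assumptions the sketch reproduces the slide's numbers: a distance of 2.5 and a similarity of 1/3.5 for lion and brown bear.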
6Index Support for XXL Query
- Element path index (EPI)
- Element content index (ECI)
- Ontology index (OI)
7Element Path Index
- Element name
- List of its occurrences
- Parents and children of each occurrence (depth is 2)
- Attributes are treated as children
8Element Path Index (cont)
predator
  big cat
    lion
    tiger
  bear
    brown bear
    polar bear
EPI entry for big cat: (zoo.xml, predator, lion, tiger), i.e. document, parent, children
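An element path index of this shape can be sketched with Python's standard XML parser. This is an illustration under assumptions, not the XXL data structure: element names use underscores because XML names cannot contain spaces, attributes are recorded as children with an "@" prefix, and build_epi and ZOO_XML are hypothetical names.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical document mirroring the tree on the slide.
ZOO_XML = """<predator>
  <big_cat><lion/><tiger/></big_cat>
  <bear><brown_bear/><polar_bear/></bear>
</predator>"""


def build_epi(doc_name, xml_text):
    """Map element name -> list of (document, parent name, child names)."""
    index = defaultdict(list)

    def visit(elem, parent_name):
        # Child elements first, then attributes treated as children.
        children = [child.tag for child in elem] + ["@" + a for a in elem.attrib]
        index[elem.tag].append((doc_name, parent_name, children))
        for child in elem:
            visit(child, elem.tag)

    visit(ET.fromstring(xml_text), None)
    return index


epi = build_epi("zoo.xml", ZOO_XML)
print(epi["big_cat"])  # [('zoo.xml', 'predator', ['lion', 'tiger'])]
```

Each entry stores only the direct parent and the direct children of an occurrence, which corresponds to the depth of 2 mentioned on the previous slide.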
9Element Content Index
- Index of every word
- Inverse document frequency: (# of elements containing this word) / (# of total elements)
- Occurrences (the elements in which the word appears)
- Frequency for each occurrence
10Element Content Index (cont)
zoo.xml
  big cat: Africa
  bear: Africa
  bird
Africa: 2/3; big cat, 10/100; bear, 1/100
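The structure of such an entry can be sketched as follows. The element texts are invented for illustration, so the per-occurrence frequencies differ from the 10/100 and 1/100 shown on the slide; only the 2/3 element ratio for "Africa" is reproduced. The names ELEMENTS and build_eci are hypothetical.

```python
from collections import Counter, defaultdict
from fractions import Fraction

# Hypothetical element contents; only the presence of "Africa" follows the slide.
ELEMENTS = {
    "big cat": "lions roam Africa",
    "bear": "few bears remain in Africa",
    "bird": "parrots like fruit",
}


def build_eci(elements):
    """Map word -> (element ratio, {element name: frequency in that element})."""
    postings = defaultdict(dict)
    for name, text in elements.items():
        words = text.lower().split()
        for word, count in Counter(words).items():
            postings[word][name] = Fraction(count, len(words))
    total = len(elements)
    return {word: (Fraction(len(occ), total), occ) for word, occ in postings.items()}


eci = build_eci(ELEMENTS)
ratio, occurrences = eci["africa"]
print(ratio)        # 2/3
print(occurrences)  # frequency of "africa" per element
```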
11Ontology Index
- Used for similarity search and result ranking
ele -> term_1, ..., term_k, where term_1, ..., term_k are the most similar elements to element ele
lion -> tiger, bear, brown bear
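A top-k lookup of this kind can be sketched as below. The similarity scores are assumptions: 1/3.5 for lion and brown bear comes from the earlier slide, and the other values were derived with the same assumed distance rule. SIMILARITY and most_similar are hypothetical names.

```python
# Assumed pairwise similarities for "lion" (see lead-in for their origin).
SIMILARITY = {
    "lion": {
        "tiger": 1 / 1.5,
        "bear": 1 / 2.5,
        "brown bear": 1 / 3.5,
        "polar bear": 1 / 3.5,
    },
}


def most_similar(term, k):
    """Return the k terms with the highest similarity to `term`."""
    scores = SIMILARITY.get(term, {})
    # sorted() is stable, so ties keep their insertion order.
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [name for name, _ in ranked[:k]]


ontology_index = {term: most_similar(term, 3) for term in SIMILARITY}
print(ontology_index["lion"])  # ['tiger', 'bear', 'brown bear']
```

Precomputing the k most similar terms per element name means a similarity condition needs only one dictionary lookup at query time.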
12Query Processing
- Query decomposition
- Evaluation order
- Index-based sub-query evaluation
- Result composition
13Query Decomposition
- Decompose query into sub-queries
Select Z From zoo.xml where zoo.name
detroit as Z and Z.species lion
14Evaluation Order
- Sub-queries are evaluated in the order in which they appear in the original query
- Inside a sub-query, either top-down or bottom-up matching can be used.
15Sub-query Evaluation
- Sub-query is an elementary condition without similarity: the Element Path Index returns exactly matched results
- Sub-query is an elementary condition with similarity: the Ontology Index returns results with similar terms, and then the Element Path Index returns results for these terms
- lion -> tiger, bear, ...
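The two cases on this slide can be sketched as one lookup function. Both indexes are hand-filled toy dictionaries, the occurrence identifiers such as "zoo.xml#1" and the similarity scores are invented, and evaluate_elementary is a hypothetical name rather than part of XXL.

```python
# Hypothetical, hand-filled indexes for illustration only.
ONTOLOGY_INDEX = {"lion": [("tiger", 0.67), ("bear", 0.4)]}
ELEMENT_PATH_INDEX = {
    "lion": ["zoo.xml#1"],
    "tiger": ["zoo.xml#2"],
    "bear": ["zoo.xml#3"],
}


def evaluate_elementary(name, similar=False):
    """Return (occurrence, score) pairs for an element-name condition."""
    # Exact matches come straight from the element path index with score 1.0.
    results = [(occ, 1.0) for occ in ELEMENT_PATH_INDEX.get(name, [])]
    if similar:
        # Expand the name through the ontology index, then look up each term.
        for term, score in ONTOLOGY_INDEX.get(name, []):
            results += [(occ, score) for occ in ELEMENT_PATH_INDEX.get(term, [])]
    return results


print(evaluate_elementary("lion"))
print(evaluate_elementary("lion", similar=True))
```

The same pattern would apply to content conditions, with the element content index taking the place of the element path index.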
16Sub-query Evaluation (cont)
- Sub-query is a content condition without similarity: the Element Content Index returns exactly matched results
- Sub-query is a content condition with similarity: the Ontology Index returns results with similar terms, and then the Element Content Index returns results for these terms
- Africa -> Asia, Europe, ...
17Sub-query Evaluation (cont)
- Evaluate concatenated elementary conditions c1.c2
- Evaluate existence path condition
- Evaluate existence content condition.
18Result Composition
- Compose results of sub-queries into a global result
- Calculate the relevance score of this global result.
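One way to picture the composition step is the sketch below. The slides do not state how sub-query scores are combined, so multiplying them is an assumption made here for illustration; the bindings (zoo1, zoo2, zoo3), their scores, and the function name compose are all hypothetical.

```python
from math import prod


def compose(sub_results):
    """Combine per-sub-query {binding: score} maps into a ranked global result."""
    # Keep only bindings that satisfy every sub-query.
    common = set(sub_results[0])
    for result in sub_results[1:]:
        common &= set(result)
    # Assumed combination rule: multiply the local scores.
    ranked = {binding: prod(r[binding] for r in sub_results) for binding in common}
    return sorted(ranked.items(), key=lambda item: item[1], reverse=True)


# Hypothetical sub-query results keyed by the binding of variable Z.
zoo_name = {"zoo1": 1.0, "zoo2": 1.0}
species = {"zoo1": 0.4, "zoo2": 1.0, "zoo3": 0.67}
ranking = compose([zoo_name, species])
print(ranking)  # [('zoo2', 1.0), ('zoo1', 0.4)]
```

Bindings missing from any sub-query drop out of the global result, and the remaining ones are ordered by their combined score.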
19XXL Architecture
20XXL Architecture (cont)
- Service components: crawler, query processor, XXL applet
- Algorithmic components: parsing, indexing, word stemming
- Data components: file structure, EPI, ECI
21Experimental Evaluation
- Collection of religious books
- Collection of Shakespeare plays
- Bibliographic data of ACM SIGMOD
- Synthetic bibliographic data
- 45 XML documents
- 64 different element names
- 208,409 elements
22Ontology-based Similarity Search
- No clear preference for top-down or bottom-up
- Low relevance score has little effect on the
relative score.
23Queries with Path Expressions
- Bottom-up is better than top-down (the path end
is more selective)
24Conclusions
- Ontology-based similarity query
- Index-based search