1
The Index-based XXL Search Engine for Querying
XML Data with Relevance Ranking
  • Anja Theobald and Gerhard Weikum

Presented by Jianbin Wei CSC 8710 Nov. 4, 2003
2
Motivation
  • Proposed XML query languages, such as XQuery, are
    of limited value for XML documents from different
    sources
  • Traditional Web search engines make little use of
    the structure of XML documents
  • XXL (Flexible XML Search Language) addresses both:
    structured XML queries and relevance ranking.

3
Outline
  • Similarity query
  • Index support for XXL query
  • Query processing
  • Architecture of XXL
  • Conclusion

4
XXL Query
  • Exact-match condition: Animal-specimen = cat
  • Similarity condition: Animal-species ~ lion

5
Ontology-based Similarity
lowest common parent
sim(lion, brown bear) = 1 / (1 + dist(lion, brown bear)) = 1/3.5
dist(lion, brown bear) = siblingdist(big cat, bear) + length(lion, big cat) + length(brown bear, bear) = 2.5
siblingdist(big cat, bear) = 0.5
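The arithmetic on this slide can be traced with a small sketch. It assumes that each parent-child edge has length 1 and that the sibling distance between big cat and bear is given as 0.5, as on the slide; the names `PARENT`, `SIBLING_DIST`, `ancestors`, `dist`, and `sim`, and the default sibling distance of 1.0 for unlisted pairs, are illustrative assumptions rather than part of XXL.

```python
# Hypothetical ontology fragment taken from the slide's example tree.
PARENT = {
    "lion": "big cat", "tiger": "big cat",
    "brown bear": "bear", "polar bear": "bear",
    "big cat": "predator", "bear": "predator",
}
# Sibling distance from the slide; other pairs fall back to an assumed 1.0.
SIBLING_DIST = {frozenset({"big cat", "bear"}): 0.5}


def ancestors(term):
    """Return [term, parent, grandparent, ...] up to the root."""
    chain = [term]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain


def dist(a, b):
    """Distance via the lowest common parent, with edge length 1 (assumed)."""
    chain_a, chain_b = ancestors(a), ancestors(b)
    common = next(node for node in chain_a if node in chain_b)
    ia, ib = chain_a.index(common), chain_b.index(common)
    if ia == 0 or ib == 0:
        # One term is an ancestor of the other (or they are equal).
        return float(ia + ib)
    # The two children of the common parent that lead down to a and b.
    sib_a, sib_b = chain_a[ia - 1], chain_b[ib - 1]
    sibling = SIBLING_DIST.get(frozenset({sib_a, sib_b}), 1.0)
    return sibling + (ia - 1) + (ib - 1)


def sim(a, b):
    """Similarity as 1 / (1 + dist), following the slide."""
    return 1.0 / (1.0 + dist(a, b))


d = dist("lion", "brown bear")   # 0.5 + 1 + 1
s = sim("lion", "brown bear")    # 1 / 3.5
```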
6
Index Support for XXL Query
  • Element path index (EPI)
  • Element content index (ECI)
  • Ontology index (OI)

7
Element Path Index
  • Element name
  • List of its occurrences
  • Parents and children of each occurrence (depth is
    2)
  • Attributes are indexed as children

8
Element Path Index (cont)
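Example element tree (figure): predator has the children big cat and bear; big cat has the children lion and tiger; bear has the children brown bear and polar bear.
Index entry: big cat -> (zoo.xml, predator, lion, tiger), i.e. document, parent, children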
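A minimal sketch of how such an index could be built, assuming one entry per element occurrence that records the document, the parent name, and the names of attributes and child elements. The function `build_epi`, the sample document, and the underscore spelling of element names (XML names cannot contain spaces) are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical stand-in for zoo.xml, mirroring the slide's tree.
ZOO_XML = (
    "<predator>"
    "<big_cat><lion/><tiger/></big_cat>"
    "<bear><brown_bear/><polar_bear/></bear>"
    "</predator>"
)


def build_epi(doc_name, xml_text):
    """Map each element name to its occurrences: (document, parent, children)."""
    index = defaultdict(list)

    def visit(node, parent_name):
        # Attributes are listed alongside child elements, as on the slide.
        children = list(node.attrib) + [child.tag for child in node]
        index[node.tag].append((doc_name, parent_name, children))
        for child in node:
            visit(child, node.tag)

    visit(ET.fromstring(xml_text), None)
    return dict(index)


epi = build_epi("zoo.xml", ZOO_XML)
big_cat_entries = epi["big_cat"]
```

Each entry stores only the immediate parent and children, which matches the depth of 2 mentioned on the previous slide.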
9
Element Content Index
  • Index of every word
  • Inverse document frequency: (number of elements
    containing this word) / (number of total elements)
  • Occurrences (the elements in which the word
    appears), with the word's frequency for each
    occurrence

10
Element Content Index (cont)
Zoo.xml contains three elements: big cat (text mentions Africa), bear (text mentions Africa), and bird.
Index entry: Africa -> 2/3; (big cat, 10/100), (bear, 1/100)
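The index entry above can be sketched as follows, assuming the per-word ratio is (elements containing the word) / (total elements) and the per-occurrence frequency is the word count divided by the element's total word count. The element texts and the name `build_eci` are hypothetical, so only the 2/3 ratio matches the slide's numbers.

```python
from collections import Counter, defaultdict
from fractions import Fraction

# Hypothetical element contents; only "big cat" and "bear" mention Africa.
ELEMENTS = {
    "big cat": "africa africa savanna",
    "bear": "africa forest",
    "bird": "sky",
}


def build_eci(elements):
    """Map each word to (element ratio, {element: frequency within element})."""
    total = len(elements)
    postings = defaultdict(dict)
    for name, text in elements.items():
        words = text.lower().split()
        for word, count in Counter(words).items():
            postings[word][name] = Fraction(count, len(words))
    return {
        word: (Fraction(len(occurrences), total), occurrences)
        for word, occurrences in postings.items()
    }


eci = build_eci(ELEMENTS)
ratio, occurrences = eci["africa"]
```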
11
Ontology Index
  • Used for similarity search and result ranking

ele -> term_1, ..., term_k, where term_1, ..., term_k are the most similar elements to element ele
Example: lion -> tiger, bear, brown bear
12
Query Processing
  • Query decomposition
  • Evaluation order
  • Index-based sub-query evaluation
  • Result composition

13
Query Decomposition
  • Decompose query into sub-queries

Select Z From zoo.xml where zoo.name = detroit as Z and Z.species ~ lion
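As a rough illustration of decomposition, the sketch below splits the where clause of a query string into one sub-query per condition. It assumes that conditions are joined by the keyword "and", that "=" marks an exact match and "~" a similarity condition, and that string literals are quoted; the function `decompose` is hypothetical and is not the XXL parser.

```python
import re


def decompose(query):
    """Split the where clause into sub-queries, one per 'and'-joined condition."""
    where_clause = re.split(r"\bwhere\b", query, maxsplit=1, flags=re.IGNORECASE)[1]
    parts = re.split(r"\band\b", where_clause, flags=re.IGNORECASE)
    return [part.strip() for part in parts]


QUERY = 'Select Z From zoo.xml where zoo.name = "detroit" as Z and Z.species ~ "lion"'
subqueries = decompose(QUERY)
```

This naive split would break on a literal that itself contains the word "and"; a real implementation would tokenize the query first.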
14
Evaluation Order
  • Sub-queries are evaluated in the order in which
    they appear in the original query
  • Inside a sub-query, either top-down or bottom-up
    matching can be used.

15
Sub-query Evaluation
  • Elementary condition without ~: the Element Path
    Index returns the exactly matching results
  • Elementary condition with ~: the Ontology Index
    returns similar terms, and then the Element Path
    Index returns results for these terms
  • lion -> tiger, bear, ...
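The two cases on this slide can be sketched as a single lookup routine. The index contents, the similarity scores, and the function `evaluate_name_condition` are hypothetical; the sketch only assumes that an exact match gets score 1.0 and that terms taken from the Ontology Index carry their similarity score.

```python
# Hypothetical Ontology Index: term -> similar terms with assumed scores.
ONTOLOGY_INDEX = {"lion": [("tiger", 0.5), ("bear", 0.3)]}

# Hypothetical Element Path Index: element name -> documents where it occurs.
ELEMENT_PATH_INDEX = {
    "lion": ["zoo.xml"],
    "tiger": ["zoo.xml"],
    "bear": ["zoo.xml"],
}


def evaluate_name_condition(name, similar):
    """Return (term, document, score) triples for one elementary condition."""
    terms = [(name, 1.0)]  # the exact term always counts as a full match
    if similar:
        # Similarity condition: expand the term via the Ontology Index first.
        terms += ONTOLOGY_INDEX.get(name, [])
    results = []
    for term, score in terms:
        for doc in ELEMENT_PATH_INDEX.get(term, []):
            results.append((term, doc, score))
    return results


exact = evaluate_name_condition("lion", similar=False)
expanded = evaluate_name_condition("lion", similar=True)
```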

16
Sub-query Evaluation (cont)
  • Content condition without ~: the Element Content
    Index returns the exactly matching results
  • Content condition with ~: the Ontology Index
    returns similar terms, and then the Element
    Content Index returns results for these terms
  • Africa -> Asia, Europe, ...

17
Sub-query Evaluation (cont)
  • Evaluate concatenated elementary conditions
    c1.c2
  • Evaluate existence path condition
  • Evaluate existence content condition.

18
Result Composition
  • Compose results of sub-queries into a global
    result
  • Calculate the relevance score of this global
    result.
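The slide does not say how the global score is computed. The sketch below assumes, purely for illustration, that each sub-query result carries variable bindings and a score in [0, 1], and that the global score is the product of the sub-query scores; the function `compose` and the result layout are hypothetical.

```python
from math import prod


def compose(subquery_results):
    """Merge one match per sub-query into a global result with a single score."""
    bindings = {}
    for result in subquery_results:
        bindings.update(result["bindings"])
    # Assumed scoring rule: multiply the sub-query scores.
    score = prod(result["score"] for result in subquery_results)
    return {"bindings": bindings, "score": score}


global_result = compose([
    {"bindings": {"Z": "zoo[1]"}, "score": 1.0},              # exact match
    {"bindings": {"Z.species": "tiger"}, "score": 0.5},       # similar term
])
```

With a product, any sub-query that only matches through a weakly similar term lowers the score of the whole result.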

19
XXL Architecture
20
XXL Architecture (cont)
  • Service components: crawler, query processor, XXL
    applet
  • Algorithmic components: parsing, indexing, word
    stemming
  • Data components: file structure, EPI, ECI

21
Experimental Evaluation
  • Collection of religious books
  • Collection of Shakespeare plays
  • Bibliographic data of ACM SIGMOD
  • Synthetic bibliographic data
  • 45 XML documents
  • 64 different element names
  • 208,409 elements

22
Ontology-based Similarity Search
  • No clear preference for top-down or bottom-up
  • Low relevance score has little effect on the
    relative score.

23
Queries with Path Expressions
  • Bottom-up is better than top-down (the path end
    is more selective)

24
Conclusions
  • Ontology-based similarity query
  • Index-based search