The Index-based XXL Search Engine for Querying XML Data with Relevance Ranking


1
The Index-based XXL Search Engine for Querying XML Data with Relevance Ranking
  • Anja Theobald and Gerhard Weikum
  • University of the Saarland
  • Saarbrücken, Germany

weikum@cs.uni-sb.de, http://www-dbs.cs.uni-sb.de
2
Conclusion
  • Problem: diversity of Web / Intranet data
    • despite XML, a global schema is a myth
    • users are swamped with results or are looking for needles in haystacks

Our contribution
  • combine XML querying with relevance ranking
  • demonstrate efficiency and search result quality with the XXL search engine prototype

3
Outline

  • Adding relevance to XML
  • The XXL search engine: index-based query processing
  • Experiments
4
XML Data Graph
...
5
XML Querying
[Figure: XML data graph for www.allunis.de/unis.xml. Node labels shown (element: content), in transcript order:]
  • Book; Title: Stochastic ...; Author: R. Nelson; Review: ... Chapter on Markov chains
  • Uni: Uni Stuttgart; School: CS
  • Uni: Uni Saarland; Course: Mobile comm.; School: ...; School: ...; Prerequisites: ... Markov processes; Dept: ... CS; Teaching
  • Uni: Uni Augsburg; GradStudies; Curriculum: E Commerce; Course: Speech processing; Course: Performance analysis; Weekend: Data Mining
  • Outline: ... statistical methods for classification ...; Content: ... Queueing models; Lit; Lit; Content: ... Markov chains

Select U, C
From www.allunis.de/unis.xml
Where Uni As U
  And U..School?..(Inst | Dept) As D
  And D Like CS
  And D..Course As C
  And C. Like Markov chain
6
XML Querying
[Figure: the XML data graph from slide 5 (www.allunis.de/unis.xml), with the overlay labels Uni, School, CS, Course, Dept, and Markov chains added]
Select U, C
From www.allunis.de/unis.xml
Where Uni As U
  And U..School?..(Inst | Dept) As D
  And D Like CS
  And D..Course As C
  And C. Like Markov chain

Query components:
  • U, C
  • Uni As U
  • U..School?..(Inst | Dept) As D
  • D Like CS
  • D..Course As C
  • C. Like Markov chain
7
Boolean vs. Ranked Retrieval
8
Ranked Retrieval with XXL
[Figure: the XML data graph from slide 5 (www.allunis.de/unis.xml)]

Select U, C
From www.allunis.de/unis.xml
Where Uni As U
  And U. As D
  And D CS
  And D..Course As C
  And C. Markov chain
9
Ranked Retrieval with XXL
[Figure: the XML data graph from slide 5 (www.allunis.de/unis.xml)]

Select U, C
From www.allunis.de/unis.xml
Where Uni As U
  And U. As D
  And D Computer Science
  And D..Course As C
  And C. Markov chain
10
Outline
  • Adding relevance to XML ✓
  • The XXL search engine: index-based query processing
  • Experiments
11
XXL: Flexible XML Search Language
  • Extensible, simple core language
  • Where clause: conjunction of regular path expressions with binding of variables
  • Elementary conditions on element/attribute names and contents

Select F, D, S
From www.allunis.de/unis.xml
Where Uni..School?..(Inst | Dept) As F
  And F..Lecturer As D
  And F..Student As S
  And D.Name S.Name
  And D.Area Like XML
12
XXL Result Ranking
Query:
Where Uni..School?..(Inst | Dept) As D
  And D..Lecturer As D
  And D.Area XML

[Figure: data graph and result graph. Node labels in the data graph: Uni: UniSaarland; Dept: CS; Dept: Math; Prof: GW; Teaching; Project: IR for semistruct. data; Project: Digital libraries; Course: IR; Seminar: XML. The result graph repeats Uni: UniSaarland, Dept: CS, Dept: Math, Prof: GW, and Project: IR for semistruct. data, annotated with the scores 1.0, 1.0, 0.9, 0.8, and 0.6.]

Relevance score: 0.432 = 1.0 × 1.0 × 0.9 × 0.8 × 0.6
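The relevance score on this slide is consistent with multiplying the individual scores of the result graph (1.0, 1.0, 0.9, 0.8, 0.6). The short Python sketch below only reproduces that arithmetic. The function name `path_relevance` is hypothetical, and the product rule is inferred from the numbers on the slide rather than taken from the XXL implementation.

```python
import math

def path_relevance(scores):
    """Combine the scores of one result by multiplication (assumed rule)."""
    return math.prod(scores)

# Scores as listed on the slide.
slide_scores = [1.0, 1.0, 0.9, 0.8, 0.6]
print(round(path_relevance(slide_scores), 3))  # 0.432
```

With a product, a single weak match (here 0.6) lowers the overall score more than it would under averaging, which fits the conjunctive reading of the Where clause.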
13
XXL Search Engine
[Figure: system architecture. Components: WWW, XXL servlets, XXL applet, Query processor, Path indexer, Content indexer, Ontology]

Select ... Where Uni..(Inst | Dept) As F
  And F Computer Science
  And F..Course. Markov Chains

Subqueries shown on the slide:
  • Uni..(Inst | Dept) As F
  • F Computer Science
  • F..Course. Markov Chains
  • F..Seminar. Markov Chains
14
Index Structures
Element Path Index
  • materializes all (parent, child) element name pairs and dynamically checks transitive connectivity
  • Uni, id1, <School, id13, id14> <Prof, id111, id117, id119>, id2, <Prof>, id15>
  • School, id13, <Dean, id27>, <Dept, id31, id32, id33>, id14, ...

Element Content Index
  • precomputes all term occurrences in element contents, with frequency statistics
  • Engineering, idf..., <id79, tf...>, <id85, tf...>
  • XML, idf..., <id46, tf...>, <id49, tf...>, <id53, tf...>

Element Ontology Index
  • contains synonyms, hypernyms, and hyponyms of element names, and semantic distances
  • Course, <Seminar, 0.9>, <Project, 0.7>, <Teaching, 0.9> <Telecourse, 0.9>, <Video lecture, 0.7>, <Meditation, 0.1>
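The three index structures can be pictured as plain dictionaries. The following is a minimal Python sketch under stated assumptions, not the XXL data layout: the tree shape, the sample texts, and all function names (`is_descendant`, `build_content_index`, `name_similarity`) are hypothetical, and only the element ids and the Course similarities are borrowed from the slide.

```python
from collections import defaultdict

# Element path index (toy): element id -> ids of its child elements.
# Ids loosely follow the slide; the exact tree shape is an assumption.
children = {
    "id1": ["id13", "id14"],   # Uni -> two School elements
    "id13": ["id27", "id31"],  # School -> Dean, Dept
    "id14": [],
    "id27": [],
    "id31": [],
}

def is_descendant(ancestor, node):
    """Check transitive connectivity at query time by walking child links."""
    stack = list(children.get(ancestor, []))
    seen = set()
    while stack:
        current = stack.pop()
        if current == node:
            return True
        if current not in seen:
            seen.add(current)
            stack.extend(children.get(current, []))
    return False

def build_content_index(contents):
    """Element content index (toy): term -> {element id: term frequency}."""
    index = defaultdict(dict)
    for element_id, text in contents.items():
        for term in text.lower().split():
            index[term][element_id] = index[term].get(element_id, 0) + 1
    return index

# Element ontology index (toy): element name -> {similar name: similarity}.
# The Course entry copies values shown on the slide.
ontology = {"Course": {"Seminar": 0.9, "Project": 0.7, "Teaching": 0.9}}

def name_similarity(query_name, element_name):
    """Exact name match scores 1.0; otherwise look up the ontology entry."""
    if query_name == element_name:
        return 1.0
    return ontology.get(query_name, {}).get(element_name, 0.0)

# Hypothetical element contents for the content index.
content_index = build_content_index(
    {"id46": "XML query languages", "id49": "XML and IR"}
)
print(is_descendant("id1", "id31"))          # True
print(sorted(content_index["xml"]))          # ['id46', 'id49']
print(name_similarity("Course", "Seminar"))  # 0.9
```

Storing only direct (parent, child) links keeps the path index small, at the cost of a graph walk whenever a query needs to test whether two elements are transitively connected.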
15
Query Decomposition & Evaluation
  • decompose query into subqueries
  • choose global evaluation order of subqueries
  • represent subquery as NFSA
  • for each subquery choose local evaluation strategy (top-down or bottom-up)
  • evaluate subexpressions using indexes
  • compute subquery result paths with relevance scores
  • combine result paths into result graph

Example query:
Uni..(Inst | Dept) As F And F Computer Science And F..Course. Markov Chains

Example of subquery NFSA:
Uni..(Inst | Dept)
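To make the NFSA step concrete, here is a minimal Python sketch of top-down evaluation for the subquery Uni..(Inst | Dept), reading NFSA as a nondeterministic finite-state automaton over element names. It rests on several assumptions: the toy tree is invented, the ".." step is read as "any number of intermediate elements", and relevance scores and the bottom-up strategy are left out. None of the names below come from the XXL prototype.

```python
# Toy XML tree: element id -> (element name, child ids). Hypothetical data.
tree = {
    "e1": ("Uni", ["e2", "e5"]),
    "e2": ("School", ["e3", "e4"]),
    "e3": ("Dept", []),
    "e4": ("Dean", []),
    "e5": ("Inst", []),
}

ANY = "*"  # label that matches every element name

# NFSA for Uni..(Inst | Dept): state -> list of (label, next state).
# Assumption: ".." allows any number of intermediate elements.
transitions = {
    0: [("Uni", 1)],
    1: [(ANY, 1), ("Inst", 2), ("Dept", 2)],
}
ACCEPTING = {2}

def step(states, name):
    """Advance every active state over one element name."""
    return {
        nxt
        for state in states
        for label, nxt in transitions.get(state, [])
        if label == ANY or label == name
    }

def evaluate_top_down(root):
    """Walk the tree from the root, tracking active NFSA states per path."""
    results = []
    stack = [(root, {0}, [])]
    while stack:
        node, states, path = stack.pop()
        name, kids = tree[node]
        active = step(states, name)
        if not active:
            continue  # no state survives, so prune this subtree
        new_path = path + [node]
        if active & ACCEPTING:
            results.append(new_path)
        for kid in kids:
            stack.append((kid, active, new_path))
    return sorted(results)

print(evaluate_top_down("e1"))  # [['e1', 'e2', 'e3'], ['e1', 'e5']]
```

A bottom-up strategy would instead start from the elements named Inst or Dept and follow parent links upward, which can be cheaper when the final step of the path is the most selective one.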
16
The Role of Ontologies
17
Outline
  • Adding relevance to XML ✓
  • The XXL search engine: index-based query processing ✓
  • Experiments
18
Example Data
19
Example Query
SELECT FROM INDEX
WHERE drama..scene AS C
  AND C.speech AS S
  AND (S.speaker "Woman")
  AND S.line AS L
  AND (L.CONTENT "leader")
  AND C.speech AS M
  AND (M.speaker "MACBETH")
20
Example Ontology
thane (a feudal lord or baron in Scotland)
  > lord, noble, nobleman (a titled peer of the realm)
    > male aristocrat (a man who is an aristocrat)
      > leader (a person who rules or guides or inspires others)
21
Example Ontology
woman, adult female (an adult female person)
  > amazon, virago (a large strong and aggressive woman)
  > donna -- (an Italian woman of rank)
  > geisha, geisha girl -- (...)
  > lady (a polite name for any woman)
  ...
  > wife (a married woman, a man's partner in marriage)
  > witch (a being, usually female, imagined to have special powers derived from the devil)
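The two ontology excerpts show how a query term such as "leader" can be related to a document term such as "thane" through a chain of more general terms. A minimal Python sketch of one way to turn such a chain into a similarity value follows. The per-step decay of 0.5 and the function names are hypothetical, and the sketch does not attempt to reproduce the relevance values computed by XXL.

```python
# Hypernym links copied from the slide: term -> more general term.
hypernym = {
    "thane": "lord",
    "lord": "male aristocrat",
    "male aristocrat": "leader",
}

DECAY = 0.5  # hypothetical per-step penalty, not a value from the talk

def hops_to(term, target):
    """Number of hypernym steps from term up to target, or None if unreachable."""
    hops = 0
    seen = set()
    while term != target:
        if term in seen or term not in hypernym:
            return None
        seen.add(term)
        term = hypernym[term]
        hops += 1
    return hops

def similarity(term, target):
    """Similarity shrinks by DECAY for every hypernym step (assumed rule)."""
    hops = hops_to(term, target)
    return 0.0 if hops is None else DECAY ** hops

print(similarity("thane", "leader"))   # 0.125
print(similarity("leader", "leader"))  # 1.0
print(similarity("leader", "thane"))   # 0.0
```

The lookup is deliberately one-directional: a specific term counts as a weaker match for a more general query term, but not the other way round.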
22
Example Results
Relevance: 0.0070400005
<scene>
  <speech>
    <speaker> Second Witch </speaker>
    <line> All hail, Macbeth, hail to thee, thane of Cawdor! </line>
  </speech>
  <speech>
    <speaker> MACBETH </speaker>
    <line> ... </line>
  </speech>
</scene>
23
XXL Runtime Measurements
Test data: 100 XML documents with a total of 240 000 elements (ot.xml, nt.xml, ..., hamlet.xml, macbeth.xml, ..., SigmodRecord.xml)

Q1: Select From Index Where .publication AS A And A.headline XML And A.author AS B
Q2: Select From Index Where .play AS A And A..personae AS B And B.figure King And B.title AS C
Subquery numbers: 1 2 3 4

Query  Results  Top-down  Bottom-up  W/ optimization            Plan
Q1     131      14.3 sec  694 sec    2.68 sec (incl. 0.37 sec)  2bu 1bu 3td
Q2     58       8.5 sec   3.7 sec    4.64 sec (incl. 0.33 sec)  1bu 2td 3td 4td
24
Conclusion
Research avenue:
explore and leverage synergies between XML (querying), (relevance-ranking) IR, (domain-specific or personal) ontologies, and machine learning (for classification, annotation, etc.)
Goal:
should be able to find, for every search, in one day (computer time) and with < 1 min of intellectual effort, the results that the best human experts can find with infinite time
  • pursued in the CLASSIX project (joint DFG project with Norbert Fuhr's group in Dortmund)