Ranked Information Retrieval on XML Data

1
Ranked Information Retrieval on XML Data
  • Seminar Informationsorganisation und -suche mit
    XML
  • Dr. Ralf Schenkel
  • SS 2003
  • Saarland University
  • 8. Juli 2003
  • Bernadette Blum, Christian Nicolaus, Markus Uhl

2
Outline
  • 1. Introduction to Information Retrieval
  • 2. Information Retrieval on XML Data
  • 3. Approaches
    • ELIXIR
      • The ELIXIR language
      • The ELIXIR query processing algorithm
      • Experiments, Conclusion
    • XRANK
      • Data model
      • Ranking function
      • Data structures and algorithms
      • Experiments
  • 4. Conclusion

3
1. Introduction to Information Retrieval
  • Definition
  • Information Retrieval (IR) is the technology for
    searching in collections (corpora, intranets,
    Web) of weakly structured documents: text, HTML,
    XML, ...
  • search engines, digital libraries, similarity
    search on scientific data
  • Vector space model (text analysis)
  • based on word occurrence frequency
  • documents and queries are vectors
  • result ranking based on similarity metric in
    vector space
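
A minimal sketch of the vector space model, assuming raw term frequencies as weights and whitespace tokenisation (both simplifications). The names term_vector, cosine and rank are illustrative and not taken from the slides. Documents and the query become sparse term vectors, and results are ordered by cosine similarity.

```python
import math
from collections import Counter

def term_vector(text):
    """Map a text to a sparse vector of raw term frequencies."""
    return Counter(text.lower().split())

def cosine(u, v):
    """Cosine of the angle between two sparse term vectors."""
    dot = sum(weight * v[term] for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def rank(query, documents):
    """Return (score, document) pairs, best match first."""
    q = term_vector(query)
    scored = [(cosine(q, term_vector(d)), d) for d in documents]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)

docs = [
    "ranked retrieval on xml data",
    "keyword search over xml documents",
    "cooking with xml",
]
ranking = rank("xml keyword search", docs)
```

The document sharing the most query terms relative to its length comes first; a real system would replace the raw counts with TF-IDF weights.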

4
1. Introduction to Information Retrieval (II)
  • Link analysis (structure analysis)
  • weighting documents
  • improve result ranking
  • PageRank approach (I)
  • web as directed graph G
  • random walk of a web surfer
  • follow hyperlinks with probability (1-ε)
  • random jump with probability ε

5
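1. Introduction to Information Retrieval (III)
  • PageRank approach (II)

[Figure: web graph of documents connected by
hyperlinks around a document q. Probability of a
random jump: ε, i.e. ε/5 for each of the 5
documents. Probability of following a hyperlink:
(1-ε), i.e. (1-ε)/3 for each of 3 outgoing
hyperlinks. The PageRank p(q) is composed of a
random-jump part and a hyperlink part.]
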
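
A minimal sketch of the random-surfer computation, assuming a power iteration with a fixed number of rounds. The function name pagerank, the parameter eps for the random-jump probability, its value 0.15, the example graph and the uniform treatment of documents without outgoing links are all assumptions of this sketch.

```python
def pagerank(links, eps=0.15, iterations=50):
    """links: dict mapping each document to the list of documents it links to.

    eps is the probability of a random jump; with probability (1 - eps)
    the surfer follows one of the outgoing hyperlinks, chosen uniformly.
    """
    nodes = list(links)
    n = len(nodes)
    p = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        nxt = {v: eps / n for v in nodes}          # random jump part
        for u in nodes:
            out = links[u]
            if out:
                share = (1 - eps) * p[u] / len(out)  # hyperlink part
                for v in out:
                    nxt[v] += share
            else:
                # Assumption: a document without links jumps uniformly.
                for v in nodes:
                    nxt[v] += (1 - eps) * p[u] / n
        p = nxt
    return p

# Five documents; "a" has three outgoing hyperlinks.
web = {"a": ["b", "c", "d"], "b": ["d"], "c": ["d"], "d": ["a"], "e": ["a"]}
ranks = pagerank(web)
```

Document "e" has no incoming hyperlinks, so it is reached by random jumps only and keeps the minimal score eps/5.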
6
2. Information Retrieval on XML Data
  • XML: standard for exchange of structured data and
    documents
  • existing query languages (e.g. XML-QL, Quilt,
    XQL, ..., XQuery)
  • no ranked or weighted results based on textual
    similarity
  • but extensions exist (XXL, XIRQL)

2 approaches
  • ELIXIR: SQL-like approach
  • XRANK: keyword-based approach

7
3.1 ELIXIR
  • ELIXIR: an expressive and efficient language for
    XML information retrieval
  • extension to XML-QL: similarity operator
  • computed by WHIRL
  • returns best r answers

8
ELIXIR: The ELIXIR language
  • Syntax
  • XML-QL syntax (SQL-like)

    CONSTRUCT <item>$b</>
    WHERE <items.book year=$yb>$b</> in "db.xml",
          <items.cd>$c</> in "db.xml",
          $yb > 1990, $b ~ $c.

    CONSTRUCT clause: output format
    WHERE clause: pattern statements and predicates
    $yb > 1990: boolean operator
    $b ~ $c: ELIXIR's similarity operator

  • similarity calculation even between 2 variables
    (→ expressiveness)
  • no nested queries

9
ELIXIR: The ELIXIR language (II)
  • WHIRL (I)
  • Word-based Heterogeneous Information Retrieval
    Logic
  • extends DATALOG with a similarity operator (~)
  • only relational data
  • efficiently supports ranked IR
  • Syntax (Horn clause)

    output(y, a, t) :- book(y, a, t), y > 1950, t ~ a.

    output(y, a, t): output relation
    book(y, a, t): input relation
    clause body: conjunction of relational predicates
    y > 1950: boolean operator
    t ~ a: similarity operator

10
ELIXIR: The ELIXIR language (III)
  • WHIRL (II)
  • Similarity computation
  • standard IR term vector techniques
  • weighting terms (TF-IDF values)
  • cosine measure

(V: vocabulary of distinct terms; terms t ∈ V;
documents d, d' ∈ R^|V|)
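
The formula on this slide did not survive extraction. As a reminder of the technique named in the bullets, one common TF-IDF variant and the cosine measure can be written as below. The symbols w, TF, N and n_t are assumptions of this note (TF: occurrences of term t in document d, N: number of documents in the collection, n_t: number of documents containing t); the exact weighting used by WHIRL may differ in detail.

```latex
w_{d,t} = \log\bigl(\mathrm{TF}_{d,t} + 1\bigr) \cdot \log\frac{N}{n_t},
\qquad
\mathrm{sim}(d, d') =
  \frac{\sum_{t \in V} w_{d,t}\, w_{d',t}}
       {\sqrt{\sum_{t \in V} w_{d,t}^{2}}\;\sqrt{\sum_{t \in V} w_{d',t}^{2}}}
```

Rare terms get a high weight, and the similarity is 0 when two documents share no term.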
11
ELIXIR: The ELIXIR query processing algorithm
  • Example (naïve approach)

XML-QL query Q2:
    <q2> CONSTRUCT <tuple><b>$b</><c>$c</></>
    WHERE <items.book>$b</> in "db.xml",
          <items.cd>$c</> in "db.xml" </>

Full cross product!
Similarity computation for every tuple (b, c)
12
ELIXIR: The ELIXIR query processing algorithm (II)
  • Problem

Full cross product!
13
ELIXIR: The ELIXIR query processing algorithm (III)
  • Solution
  • do not simply map the full XML data into the
    relational model
  • invoke WHIRL as a subroutine (→ efficiency)

Avoid generating the full cross product!
14
ELIXIR: The ELIXIR query processing algorithm (IV)
Start: query Q1
3 stages: intermediate queries Q2, Q3, Q4
  • 1. Partition into a set, Q21 ... Q2N, of XML-QL
    queries
  • avoid generating the full cross product
  • ordinary predicates
  • 2 pattern statements with variables that are
    compared with a similarity predicate ⇒ distinct
    Q2j queries
  • 2. WHIRL query Q3
  • similarity predicates
  • ordered table of the r best answers
  • 3. XML-QL query Q4
  • transformation of Q3's output
  • XML structure as specified by Q1

15
ELIXIR: The ELIXIR query processing algorithm (V)
  • Example (Step I: partition into Q2n queries)

XML-QL query Q21:
    <q21> CONSTRUCT <tuple><b>$b</></>
    WHERE <items.book>$b</> in "db.xml" </>

Output of Q21:
    <q21><tuple><b>Traditional Ukrainian cookery</></>
         <tuple><b>Being and nothingness</></>
         <tuple><b>Shooting Elvis</></></>

XML-QL query Q22:
    <q22> CONSTRUCT <tuple><c>$c</></>
    WHERE <items.cd>$c</> in "db.xml" </>

Output of Q22:
    <q22><tuple><c>Ukrainian folk music</></>
         <tuple><c>Being there</></>
         <tuple><c>Milk cow blues</></></>

Avoid generating the full cross product!
16
ELIXIR: The ELIXIR query processing algorithm (VI)
  • Example (Step II: WHIRL query Q3)

Inputs:
    <q21><tuple><b>Traditional Ukrainian cookery</></>
         <tuple><b>Being and nothingness</></>
         <tuple><b>Shooting Elvis</></></>
    <q22><tuple><c>Ukrainian folk music</></>
         <tuple><c>Being there</></>
         <tuple><c>Milk cow blues</></></>

WHIRL query Q3:
    q3(b) :- q21(b), q22(c), b ~ c.

Output of Q3:
    <q3><tuple><b>Traditional Ukrainian cookery</></>
        <tuple><b>Being and nothingness</></></>
17
ELIXIR: The ELIXIR query processing algorithm (VII)
  • Example (Step III: XML-QL query Q4)

Input:
    <q3><tuple><b>Traditional Ukrainian cookery</></>
        <tuple><b>Being and nothingness</></></>

XML-QL query Q4:
    <results> CONSTRUCT <item>$b</>
    WHERE <q3.tuple><b>$b</></> in "q3.xml" </>

Final XML output:
    <results><item>Traditional Ukrainian cookery</>
             <item>Being and nothingness</></>
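
The three stages can be imitated in a few lines of Python. This is only a sketch under stated assumptions: the document db.xml is replaced by an inline string with an invented layout, WHIRL is replaced by a plain cosine on raw term counts that still compares every pair (WHIRL itself avoids this with its own index-based search), and the function names are illustrative.

```python
import math
import xml.etree.ElementTree as ET
from collections import Counter

# Assumed stand-in for db.xml, using the titles from the slides.
DB_XML = """
<items>
  <book>Traditional Ukrainian cookery</book>
  <book>Being and nothingness</book>
  <book>Shooting Elvis</book>
  <cd>Ukrainian folk music</cd>
  <cd>Being there</cd>
  <cd>Milk cow blues</cd>
</items>
"""

def cosine(a, b):
    """Cosine similarity on raw term counts (stand-in for WHIRL's TF-IDF)."""
    u, v = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(w * v[t] for t, w in u.items())
    norm = (math.sqrt(sum(w * w for w in u.values()))
            * math.sqrt(sum(w * w for w in v.values())))
    return dot / norm if norm else 0.0

def stage1_partition(root):
    """Q21 and Q22: bind b and c in separate queries, no joined tuples yet."""
    q21 = [e.text for e in root.iter("book")]
    q22 = [e.text for e in root.iter("cd")]
    return q21, q22

def stage2_similarity_join(q21, q22, r):
    """Q3: keep the r books that have the most similar CD title."""
    scored = [(max(cosine(b, c) for c in q22), b) for b in q21]
    scored = [(s, b) for s, b in scored if s > 0]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [b for _, b in scored[:r]]

def stage3_construct(q3):
    """Q4: wrap the ranked bindings in the output structure asked for by Q1."""
    results = ET.Element("results")
    for b in q3:
        ET.SubElement(results, "item").text = b
    return ET.tostring(results, encoding="unicode")

root = ET.fromstring(DB_XML)
q21, q22 = stage1_partition(root)
q3 = stage2_similarity_join(q21, q22, r=2)
output = stage3_construct(q3)
```

With r = 2 the sketch keeps the same two books as the slides; their relative order depends on the weighting scheme, which is simplified here.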
18
ELIXIR: Experiments, Conclusion
  • Experiments
  • Total processing time
  • depends on details of each query and input data
  • increases only marginally with the number of
    answers r
  • increases linearly with the number of similarity
    join predicates
  • is dominated by partitioning the initial query
    (Step 1: expensive parsing and traversing)

19
ELIXIR: Experiments, Conclusion (II)
  • Conclusion
  • ELIXIR extends XML-QL by supporting
    IR-similarity-features for ranking
  • similarity joins even between 2 variables
    (expressiveness)
  • Algorithm
  • rewrites the original ELIXIR query into a series
    of intermediate XML-QL and WHIRL queries
  • no full cross product, only filtered tuples of
    variable bindings (efficiency)
  • But
  • only non-nested queries
  • strict three-stage approach may be suboptimal in
    some cases (partition)

20
XRANK: Ranked Keyword Search over XML Documents
21
Introduction
  • XRANK - Keyword Search over XML documents
  • results
  • XML elements that contain all searched keywords
  • ranking
  • at granularity of XML elements
  • based on hyperlink structure
  • advantages
  • user does not have to learn a query language
  • no knowledge about the structure of XML
    documents is needed
  • generalized keyword search engine
  • (both HTML and XML are possible)

22
Data Model
  • G = (V, CE, HE): collection of XML documents
  • V: set of XML elements (tags and attributes)
  • CE: set of containment edges
  • HE: set of hyperlink edges
  • (u,v) ∈ CE ⇔ v is a sub-element of u
  • (u,v) ∈ HE ⇔ u contains a hyperlink to v
  • contains(v,k) ⇔ v (in)directly contains
    the keyword k
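
A minimal sketch of this data model, assuming elements are identified by plain strings and keywords are lower-cased whitespace tokens. The class XMLGraph and its method names are illustrative. The point is that contains(v, k) follows containment edges only, never hyperlink edges.

```python
from collections import defaultdict

class XMLGraph:
    """G = (V, CE, HE) with the keywords directly contained in each element."""

    def __init__(self):
        self.children = defaultdict(list)    # CE: u -> sub-elements v
        self.hyperlinks = defaultdict(list)  # HE: u -> link targets v
        self.keywords = defaultdict(set)     # v -> keywords directly in v

    def add_containment(self, u, v):
        self.children[u].append(v)

    def add_hyperlink(self, u, v):
        self.hyperlinks[u].append(v)

    def add_text(self, v, text):
        self.keywords[v].update(text.lower().split())

    def contains(self, v, k):
        """True if v directly or indirectly (via CE) contains keyword k."""
        if k.lower() in self.keywords.get(v, ()):
            return True
        return any(self.contains(c, k) for c in self.children.get(v, ()))

g = XMLGraph()
g.add_containment("CDs", "CD1")
g.add_containment("CDs", "CD2")
g.add_containment("CD1", "title1")
g.add_containment("CD1", "song2")
g.add_containment("song2", "title2")
g.add_hyperlink("CD2", "CD1")          # a hyperlink does not imply containment
g.add_text("title1", "R.E.M. Out Of Time")
g.add_text("title2", "Losing My Religion")
```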

23
Example XML Graph
[Figure: example XML graph; legend: XML element,
value]
24
Keyword Query Results (1)
How to define results of keyword search queries
over XML documents?
  • elements with at least one sub-element
    containing all keywords and at least one
    sub-element containing some keywords?
  • elements that contain all keywords, while no
    sub-element contains all keywords!
25
Ranking Elements
How to rank XML elements?
ElemRank
  • extension of PageRank at the granularity of
    elements
  • objective importance of XML elements
  • based on hyperlinked and nested structure of XML
    elements

26
ElemRank (1)
  • n: number of XML elements
  • nc(u): number of sub-elements of u
  • nh(u): number of outgoing hyperlinks from u
  • CE^-1 = {(v,u) | (u,v) ∈ CE}: reverse
    containment edges
  • E = HE ∪ CE ∪ CE^-1

[Figure: element u with nc(u) = 3 and nh(u) = 3;
legend: containment edge, reverse containment
edge, hyperlink edge]
27
ElemRank (2)
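  • α: prob. for following a hyperlink
  • ε = 1-α-β-γ: prob. for a random jump
  • β: prob. for using a containment edge
  • γ: prob. for using a reverse containment edge

[Figure: example graph with 10 elements. Every
element is reached by a random jump with
probability ε/10; an edge adds the probability of
its edge type divided by the number of such edges
leaving the element (labels of the form ·/3 and
·/1). Legend: containment edge, reverse
containment edge, hyperlink edge]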
28
ElemRank (3)
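ElemRank e(v):

\[
e(v) = \frac{1-\alpha-\beta-\gamma}{n}
     + \alpha \sum_{(u,v) \in HE} \frac{e(u)}{n_h(u)}
     + \beta \sum_{(u,v) \in CE} \frac{e(u)}{n_c(u)}
     + \gamma \sum_{(u,v) \in CE^{-1}} \frac{e(u)}{1},
\qquad 0 \le \alpha, \beta, \gamma \le 1
\]

Terms, in order: random navigation, navigation via
hyperlinks, via forward containment edges, via
reverse containment edges.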
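
A minimal sketch of how such an ElemRank equation can be evaluated by fixed-point iteration. The function name elem_rank, the parameter values, the iteration count and the tiny example graph are assumptions chosen for illustration; the scores are not renormalised.

```python
def elem_rank(elements, ce, he, alpha=0.35, beta=0.25, gamma=0.25, iterations=100):
    """Iterate the ElemRank equation.

    elements: list of element ids
    ce: containment edges (u, v), v is a sub-element of u
    he: hyperlink edges (u, v), u contains a hyperlink to v
    """
    n = len(elements)
    nc = {u: 0 for u in elements}  # number of sub-elements of u
    nh = {u: 0 for u in elements}  # number of outgoing hyperlinks of u
    for u, _ in ce:
        nc[u] += 1
    for u, _ in he:
        nh[u] += 1
    e = {v: 1.0 / n for v in elements}
    for _ in range(iterations):
        # random navigation
        nxt = {v: (1 - alpha - beta - gamma) / n for v in elements}
        for u, v in he:
            nxt[v] += alpha * e[u] / nh[u]   # via hyperlinks
        for u, v in ce:
            nxt[v] += beta * e[u] / nc[u]    # forward containment: parent to child
            nxt[u] += gamma * e[v]           # reverse containment: child to parent
        e = nxt
    return e

elements = ["cds", "cd1", "cd2", "t1", "s1"]
ce = [("cds", "cd1"), ("cds", "cd2"), ("cd1", "t1"), ("cd1", "s1")]
he = [("cd2", "cd1")]
scores = elem_rank(elements, ce, he)
```

In this example cd1 ends up above cd2: both receive the same share from their parent, but cd1 additionally gains from its sub-elements and from the hyperlink.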
29
Ranking Function (1)
  • ranking functions should take into account
  • result specificity
  • hyperlinks
  • keyword proximity
  • contains(v,k)
  • ⇒ there is a sequence (v1,v2), ..., (vn-1,vn) of
    containment edges with v1 = v s.t. vn
    directly contains k
  • r(v,k) = ElemRank(vn) · decay^(n-1) (0 ≤ decay ≤ 1)
  • ElemRank(vn): based on hyperlinked structure
  • decay^(n-1): result specificity

30
Ranking Function (2)
  • m occurrences of keyword k
  • computation of r1, ..., rm
  • r(v,k) = f(r1, ..., rm)

(with accumulation function f, e.g. max or sum)
  • query q consists of keywords k1, ..., kn
  • R(v,q) = (Σ_i r(v,ki)) · p(v,k1, ..., kn)
  • p: proximity measure (keyword proximity)
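
A minimal sketch of the two ranking formulas, assuming the ElemRank of each occurrence and the proximity value p are already known. The function names, the default decay of 0.5 and the proximity value 1.0 are illustrative; the ElemRanks 75 and 88 are the values shown for "R.E.M." and "Religion" in the inverted-list example of this deck.

```python
def keyword_rank(elem_rank, path_length, decay=0.5):
    """r(v, k) for one occurrence: ElemRank(v_n) * decay^(n-1).

    path_length is n, the number of elements on the containment path
    v = v_1, ..., v_n down to the element v_n that directly contains k.
    """
    return elem_rank * decay ** (path_length - 1)

def overall_rank(occurrences, proximity, decay=0.5, f=max):
    """R(v, q) = (sum over keywords of r(v, k_i)) * p(v, k_1, ..., k_n).

    occurrences: keyword -> list of (ElemRank(v_n), n), one pair per occurrence
    proximity:   value of the proximity measure p for this element (assumed given)
    f:           accumulation function over the occurrences of one keyword
    """
    total = 0.0
    for occ in occurrences.values():
        total += f(keyword_rank(e, n, decay) for e, n in occ)
    return total * proximity

# First CD element: "R.E.M." sits in its title (n = 2),
# "Religion" in the title of its second song (n = 3).
score = overall_rank({"R.E.M.": [(75, 2)], "Religion": [(88, 3)]}, proximity=1.0)
```

The deeper an occurrence lies below the result element, the more its ElemRank is damped, which favours specific results.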

31
<CDs>
  <CD id="1">
    <title> R.E.M. Out Of Time </title>
    <song>
      <title> Radio Song </title>
      <time> 4:12 </time>
    </song>
    <song>
      <title> Losing My Religion </title>
      <time> 4:26 </time>
    </song>
    ...
  </CD>
  <CD id="2">
    <title> R.E.M. Automatic For... </title>
    ...
  </CD>
  ...
</CDs>
32
XRANK Architecture
[Figure: XRANK architecture with the components
keyword search query, Query Evaluator, ranked
result list, data access, index structures and
algorithms, ElemRank computation, XML documents,
XML elements with ElemRanks]
33
Naïve Approach
  • naïve inverted list
  • contains all XML elements that contain the
    keyword

    key1 → elem11, elem12, ...
    key2 → elem21, elem22, ...
    etc.

  • space overhead
  • spurious results
  • inaccurate ranking


34
Dewey IDs
    0          <CDs>
    0.0          <CD>
    0.0.0          <title>   R.E.M. Out Of Time
    0.0.1          <song>
    0.0.1.0          <title>   Radio Song
    0.0.1.1          <time>    4:12
    0.0.2          <song>
    0.0.2.0          <title>   Losing My Religion
    0.0.2.1          <time>    4:26
    ...
    0.1          <CD>
    0.1.0          <title>   R.E.M. Automatic For The People
    ...
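
A minimal sketch of how such Dewey IDs can be assigned, assuming the document is parsed with Python's xml.etree.ElementTree and that only element nodes are numbered (attributes are ignored here). The function name assign_dewey_ids and the inline sample document are illustrative.

```python
import xml.etree.ElementTree as ET

CDS_XML = """
<CDs>
  <CD id="1">
    <title>R.E.M. Out Of Time</title>
    <song><title>Radio Song</title><time>4:12</time></song>
    <song><title>Losing My Religion</title><time>4:26</time></song>
  </CD>
  <CD id="2">
    <title>R.E.M. Automatic For The People</title>
  </CD>
</CDs>
"""

def assign_dewey_ids(element, prefix=(0,)):
    """Yield (dewey_id, element) pairs in document order.

    The ID of a child is the ID of its parent extended by the child's
    position among its siblings, so every ID encodes the path from the root.
    """
    yield prefix, element
    for position, child in enumerate(element):
        yield from assign_dewey_ids(child, prefix + (position,))

root = ET.fromstring(CDS_XML)
dewey = {".".join(map(str, d)): e for d, e in assign_dewey_ids(root)}
```

Because an ID contains the IDs of all ancestors as prefixes, ancestor tests reduce to prefix tests on the IDs.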
35
DIL Data Structure
  • Dewey inverted list
  • contains the Dewey IDs of all XML elements that
    directly contain the keyword
  • sorted by Dewey ID (ascending)

    Keyword   | Dewey ID | ElemRank | position list
    R.E.M.    | 0.0.0    | 75       | 0
    R.E.M.    | 0.1.0    | 80       | 0
    Religion  | 0.0.2.0  | 88       | 2


36
DIL Query Processing (1)
  • key idea: computation of the longest common
    prefix (lcp) of Dewey IDs

[Figure: step 1 of the DIL merge; table with the
columns DeweyID, rank 1, rank 2, posList 1,
posList 2, pot_result]
37
DIL Query Processing (2)
[Figure: steps 1 and 2 of the DIL merge, with the
lcp of the processed Dewey IDs marked; table
columns DeweyID, rank 1, rank 2, posList 1,
posList 2, pot_result]
38
DIL Query Processing (3)
[Figure: steps 1 to 3 of the DIL merge, with the
lcp marked after each step; table columns
DeweyID, rank 1, rank 2, posList 1, posList 2,
pot_result]
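
A minimal sketch of the lcp idea behind DIL, assuming Dewey IDs are stored as tuples of integers. It only combines one pair of entries and is not the single-pass merge of the slides; the function names and the decay value 0.5 are illustrative. The entries use the Dewey IDs and ElemRanks from the DIL example.

```python
def lcp(a, b):
    """Longest common prefix of two Dewey IDs given as tuples of ints."""
    prefix = []
    for x, y in zip(a, b):
        if x != y:
            break
        prefix.append(x)
    return tuple(prefix)

def combine(entry1, entry2, decay=0.5):
    """Deepest element containing both occurrences and its decayed rank sum."""
    (id1, rank1), (id2, rank2) = entry1, entry2
    ancestor = lcp(id1, id2)
    score = (rank1 * decay ** (len(id1) - len(ancestor))
             + rank2 * decay ** (len(id2) - len(ancestor)))
    return ancestor, score

rem = [((0, 0, 0), 75), ((0, 1, 0), 80)]   # inverted list for "R.E.M."
religion = [((0, 0, 2, 0), 88)]            # inverted list for "Religion"

# Ascending Dewey order is document order, so entries below the same
# ancestor are neighbours when the lists are merged.
merged = sorted(rem + religion)
ancestor, score = combine(rem[0], religion[0])
```

The common prefix 0.0 identifies the first CD element as the most specific element that contains both keywords.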
39
RDIL Data Structure
  • ranked Dewey inverted list
  • each Dewey ID in the list has a position in the
    B-tree
  • B-tree sorted by Dewey ID (ascending)
  • inverted list sorted by ElemRank (descending)

B-tree on Dewey IDs: 0.0.0, 0.1.0

    Keyword | Dewey ID | ElemRank
    R.E.M.  | 0.1.0    | 80
    R.E.M.  | 0.0.0    | 75
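
A minimal sketch of the two access paths of such a list, assuming a sorted Python list with binary search as a stand-in for the B-tree. The class name RankedDeweyList and its method are illustrative. It relies on the fact that, in a list sorted by Dewey ID, the entry sharing the longest prefix with a given ID is a neighbour of that ID's insertion point.

```python
import bisect

def lcp(a, b):
    """Longest common prefix of two Dewey IDs (tuples of ints)."""
    n = 0
    while n < min(len(a), len(b)) and a[n] == b[n]:
        n += 1
    return a[:n]

class RankedDeweyList:
    """Inverted list of one keyword, kept in two orders (illustrative)."""

    def __init__(self, entries):
        entries = list(entries)  # (dewey_id, elem_rank) pairs, non-empty
        # Scan order for query processing: highest ElemRank first.
        self.by_rank = sorted(entries, key=lambda e: e[1], reverse=True)
        # Sorted Dewey IDs; a plain sorted list stands in for the B-tree.
        self.by_id = sorted(dewey_id for dewey_id, _ in entries)

    def longest_prefix_with(self, target):
        """Longest prefix that target shares with any Dewey ID in this list."""
        i = bisect.bisect_left(self.by_id, target)
        neighbours = self.by_id[max(i - 1, 0):i + 1]
        return max((lcp(d, target) for d in neighbours), key=len)

rem = RankedDeweyList([((0, 0, 0), 75), ((0, 1, 0), 80)])
prefix = rem.longest_prefix_with((0, 0, 2, 0))
```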

40
RDIL Query Processing (1)
[Figure: inverted lists for key1, key2 and key3,
each sorted by ElemRank (entry11, entry12,
entry13, ...; entry21, entry22, entry23, ...;
entry31, entry32, entry33, ...) and each with a
B-tree on Dewey IDs. lcp with Dewey ID11 → result
heap]
41
RDIL Query Processing (2)
[Figure: the same three inverted lists with their
B-trees on Dewey IDs. lcp with Dewey ID21 →
result heap, etc.]
42
RDIL Query Processing (3)
[Figure: the same three inverted lists with their
B-trees on Dewey IDs. The rankings of the current
entries bound the max. reachable ranking; this
bound is the threshold]
43
RDIL Query Processing (4)
RDIL algorithm stops if threshold < lowest
ElemRank in result heap, because
max. reachable ranking ≤ threshold < lowest
ElemRank in result heap
⇒ max. reachable ranking < lowest ElemRank in
result heap!
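
A minimal sketch of this stopping rule in isolation, assuming the result heap keeps the m best scores in a min-heap and the threshold is the sum of the ElemRanks of the entries read last in each keyword list. That choice of threshold, the function names and the numbers in the example are assumptions of this sketch.

```python
import heapq

def push_result(heap, score, m):
    """Keep only the m best scores in a min-heap (heap[0] is the lowest)."""
    if len(heap) < m:
        heapq.heappush(heap, score)
    elif score > heap[0]:
        heapq.heapreplace(heap, score)

def can_stop(heap, current_ranks, m):
    """True if no unread entry can still enter the top m.

    current_ranks holds, per keyword list, the ElemRank of the entry read
    last. Since the lists are sorted by ElemRank (descending) and decay
    is at most 1, a result built only from unread entries cannot score
    higher than the sum of these values.
    """
    threshold = sum(current_ranks)
    return len(heap) == m and threshold < heap[0]

heap = []
for found_score in (120, 95, 150):
    push_result(heap, found_score, m=2)

stop_early = can_stop(heap, current_ranks=[70, 60], m=2)   # threshold 130
stop_late = can_stop(heap, current_ranks=[60, 50], m=2)    # threshold 110
```

As long as the threshold is not below the lowest score in the heap, scanning has to continue; once it drops below, the remaining list entries can be skipped.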

44
XRANK Architecture
[Figure: XRANK architecture with the components
keyword search query, Query Evaluator, ranked
result list, data access, DIL / RDIL, ElemRank
computation, XML documents, XML elements with
ElemRanks]
45
Experimental Results (1)
46
Experimental Results (2)
47
Comparison DIL - RDIL
DIL
  • inverted lists sorted by Dewey ID
  • compute longest common prefix on Dewey IDs
  • extracts the minimum of all remaining Dewey IDs
  • all lists are completely scanned
  • outperforms RDIL if keyword correlation is low
RDIL
  • inverted lists sorted by ElemRank
  • chooses next list sequentially
  • stops if a certain threshold is reached
  • outperforms DIL if keyword correlation is high

48
Conclusion
2 Approaches
  • ELIXIR
  • SQL-like structure based search
  • extends XML-QL by supporting IR-similarity-features
    for ranking
  • ranked results based only on textual similarity
    (even between 2 variables)
  • XRANK
  • keyword based search à la Google
  • ranked results based on textual similarity
  • hierarchical and hyperlinked structure