Ranking Approximate Answers to Semantic Web Queries Carlos Hurtado1, Alex Poulovassilis2, Peter Wood - PowerPoint PPT Presentation


Provided by: Poulova

1
Ranking Approximate Answers to Semantic Web
Queries
Carlos Hurtado (1), Alex Poulovassilis (2),
Peter Wood (2)
(1) University Adolfo Ibanez, Chile
(2) Birkbeck, University of London
2
Outline of the talk

Motivation
Overview of our approach
Single-conjunct queries: exact semantics
Approximate semantics
Multi-conjunct queries
Conclusions and future work

3
1. Motivation

Volumes of semi-structured data available on the
web
In particular, increase in the amount of RDF data
e.g. in the form of linked data
Volumes and heterogeneity of such data
necessitate support for users' querying through
approximate answering techniques
users' queries do not have to match exactly the
data structures being queried
answers to queries are returned in ranked order,
in increasing distance from the original query

4
2. Overview of our approach

We consider general semi-structured data,
modelled as a graph structure e.g. RDF linked
data is one kind of data that can be represented
this way
Our model is a directed graph G = (V, E) where
each node in V is labelled with a constant (so
blank nodes cannot be represented)
each edge e in E is labelled with a label l(e)
from a finite alphabet Σ
Our query language is that of conjunctive regular
path queries
Z1, ..., Zm ← (X1, R1, Y1), ..., (Xn, Rn, Yn)
where the Xi, Yi are variables or constants,
the Ri are regular expressions over Σ and the Zi
are drawn from the Xi and Yi

5
Example 1 RDF graph of a transport network
6
Find cities from which we can travel to city u5
using only airplanes as well as to city u6 using
only trains or buses
?X ← (?X, (airplane)+, u5), (?X, (train|bus)+, u6)
7

Answer
First conjunct generates bindings u1, u4 for ?X
Second conjunct generates bindings u1, u2, u4 for
?X
Hence answer is u1, u4
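
The exact evaluation of Example 1 can be illustrated by evaluating each conjunct separately and intersecting the bindings. The Python sketch below is only illustrative: the edge list is a hypothetical stand-in for the figure (which is not part of this transcript), the conjuncts are read as "one or more airplane edges" and "one or more train or bus edges", and the helper name sources_reaching is an assumption, not something from the talk.

```python
# Hypothetical transport network as (source, label, target) triples.
EDGES = [
    ("u4", "airplane", "u1"),
    ("u1", "airplane", "u5"),
    ("u1", "train", "u6"),
    ("u2", "bus", "u6"),
    ("u4", "bus", "u2"),
]

def sources_reaching(edges, target, labels):
    """Nodes with a path of one or more edges, all labelled from `labels`, to `target`."""
    found = set()
    stack = [target]
    while stack:
        node = stack.pop()
        for src, label, dst in edges:
            # Walk edges backwards from the target, staying within `labels`.
            if dst == node and label in labels and src not in found:
                found.add(src)
                stack.append(src)
    return found

first = sources_reaching(EDGES, "u5", {"airplane"})        # bindings of conjunct 1
second = sources_reaching(EDGES, "u6", {"train", "bus"})   # bindings of conjunct 2
answer = first & second                                    # join on ?X
print(sorted(first), sorted(second), sorted(answer))
```

On this made-up graph the two conjuncts bind ?X to {u1, u4} and {u1, u2, u4}, so the intersection is {u1, u4}, mirroring the shape of the answer on the slide.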

8
Approximate answers

We are interested in using weighted regular
transducers to capture query approximations
since, from results by Grahne and Thomo 2001, we
know that single-conjunct queries with a weighted
regular transducer applied can be evaluated
incrementally in polynomial time
Incremental evaluation allows answers to be
returned to the user in ranked order
In this paper, we extend this approach to
include also symbol inversion and we show that
multiple conjunct queries can also be evaluated
in polynomial time, using an algorithm from
Ilyas, Aref, Elmagarmid 2004 for computing top-k
join queries

9
Weighted regular transducers

A weighted regular transducer is a Finite State
Automaton in which the transitions are labelled
with triples rather than single symbols
a transition from state s to state t labelled
(a,i,b) means that if the transducer is in state
s then it can move to state t on input a with
cost i while outputting b
in our context, such a transition is interpreted
as stating that symbol a in a query can match
label b of an edge in the graph with cost i
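
A transducer of this kind can be sketched as a table of weighted transitions. In the Python sketch below the single state "q", the example labels and the helpers step and match_cost are illustrative assumptions; the sketch also has no insertion or deletion transitions, so it only compares label sequences of equal length, and it treats every state as final.

```python
# Each transition is (state, input_symbol, cost, output_symbol, next_state).
# Hypothetical transducer: 'airplane' in the query matches an 'airplane'
# edge for free, or a 'train' edge at cost 1.
TRANSITIONS = [
    ("q", "airplane", 0, "airplane", "q"),
    ("q", "airplane", 1, "train", "q"),
    ("q", "train", 0, "train", "q"),
]

def step(state, symbol):
    """Return {(output, next_state): cheapest cost} for reading `symbol` in `state`."""
    result = {}
    for s, a, cost, b, t in TRANSITIONS:
        if s == state and a == symbol:
            key = (b, t)
            result[key] = min(cost, result.get(key, cost))
    return result

def match_cost(query_labels, edge_labels, start="q"):
    """Cheapest cost at which the query labels can match the edge labels, or None."""
    if len(query_labels) != len(edge_labels):
        return None  # this sketch has no insertion/deletion transitions
    frontier = {start: 0}  # state -> cheapest cost so far
    for a, b in zip(query_labels, edge_labels):
        nxt = {}
        for state, cost in frontier.items():
            for (out, t), c in step(state, a).items():
                if out == b:
                    nxt[t] = min(cost + c, nxt.get(t, cost + c))
        if not nxt:
            return None
        frontier = nxt
    return min(frontier.values())

print(match_cost(["airplane", "airplane"], ["airplane", "train"]))
```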

10
Approximate regular expression matching

In the paper, for simplicity we mainly focus on
approximate regular expression matching, which
can be specified using weighted regular
transducers (Grahne, Thomo 2001)
The edit operations we allow are
insertions, deletions and substitutions of
symbols
inversion of symbols (i.e. edge reversal)
transposition of adjacent symbols
We envisage the user being able to specify which
edit operations should be undertaken by the
system when answering a particular query, or in a
particular application
The user could also specify the cost associated
with applying each edit operation (in the paper
we assume a cost of 1 for all of them)

11
Example 2 transport network data
12
Find cities reachable from Santiago by non-stop
flights, posed by a user who has little knowledge
of the structure of the data
?X ← (Santiago, airplane, ?X)
13

The query as posed returns no answers
?X ← (Santiago, airplane, ?X)
However, the query can be relaxed, by an
insertion of name, to
?X ← (Santiago, airplane . name, ?X)
And further relaxed, by an insertion of name-, to
?X ← (Santiago, name- . airplane . name, ?X)
This generates bindings of Temuco, Chillan for ?X
These answers can be regarded as having distance
2 from the original query
two insertions to the original query
each at an assumed cost of 1

14
3. Single-conjunct queries

A single-conjunct query, Q, is of the form
Z1, Z2 ← (X, R, Y)
A semipath p in graph G is a sequence of the
form
v1, l1, v2, l2, ..., vn, ln, vn+1
where for each vi, vi+1 there is an edge vi
→ vi+1 labelled li or an edge vi+1 → vi labelled
li- in G
Semipath p conforms to regular expression R if l1
... ln is in the language denoted by R

15
Exact Semantics

Given a single-conjunct query Q,
Z1, Z2 ← (X, R, Y)
Let θ be a matching from {X, Y} to the nodes of
graph G, that maps each constant to itself
The exact answer of Q on G is the set of tuples
θ(Z1, Z2) such that there is a semipath from
θ(X) to θ(Y) which conforms to R

16
4. Approximate Semantics

The edit distance from a semipath p to a
semipath q is the minimum cost of any sequence of
edit operations which transforms the sequence of
edge labels of p to the sequence of edge labels
of q
We recall that the edit operations we allow are
insertions, deletions, substitutions and
inversions of symbols, and transposition of
adjacent symbols
We envisage the user being able to specify which
edit operations should be applied by the system
when answering a particular query, or in a
particular application
The user could also specify the cost associated
with applying each edit operation (in the paper
we assume a cost of 1 for all of them)
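
The edit distance between two sequences of edge labels can be sketched with a standard dynamic programme. The Python sketch below assumes unit costs and uses the "optimal string alignment" variant, which handles insertions, deletions, substitutions and transposition of adjacent symbols but may overestimate when a transposed pair is edited again; with unit costs an inversion (a to a-) is counted like a substitution. The function name edit_distance is illustrative, not from the paper.

```python
def edit_distance(p, q):
    """Unit-cost edit distance between two sequences of edge labels."""
    m, n = len(p), len(q)
    # dist[i][j] = cost of turning the first i labels of p into the first j of q
    dist = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dist[i][0] = i                      # delete all i labels
    for j in range(n + 1):
        dist[0][j] = j                      # insert all j labels
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            same = p[i - 1] == q[j - 1]
            dist[i][j] = min(
                dist[i - 1][j] + 1,                       # deletion
                dist[i][j - 1] + 1,                       # insertion
                dist[i - 1][j - 1] + (0 if same else 1),  # substitution / inversion
            )
            if i > 1 and j > 1 and p[i - 1] == q[j - 2] and p[i - 2] == q[j - 1]:
                dist[i][j] = min(dist[i][j], dist[i - 2][j - 2] + 1)  # transposition
    return dist[m][n]

# The relaxation of Example 2: two insertions around 'airplane'.
print(edit_distance(["airplane"], ["name-", "airplane", "name"]))
```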

17
Approximate Semantics

The distance of a semipath p to a regular
expression R, dist(p,R), is the minimum edit
distance from p to any semipath that conforms to
R
Given graph G, query Q and matching θ, the tuple
θ(Z1, Z2) has distance dist(p,R) to Q, where p is
a semipath from θ(X) to θ(Y) which has the
minimum distance to R of any semipath from θ(X)
to θ(Y) in G
note, if p conforms to R, then θ(Z1, Z2) has
distance 0 to Q
The approximate top-k answer of Q on G is a list
containing the k tuples θ(Z1, Z2) with minimum
distance to Q, ranked in order of increasing
distance to Q
The approximate answer of Q on G is a list
containing all the tuples at any distance to Q,
ranked in order of increasing distance to Q (a
maximum of O(E)2 tuples).

18
Evaluation naive

Construct approximate automaton M at distance d
RE using a standard construction from
approximate string matching
note, RE is the maximum distance required to
obtain all tuples in the approximate answer
(Lemma 1)
M consists of d copies of MR , the NFA that
recognises L(R)
Each copy MRj, where 0 ≤ j ≤ d, represents
states at distance j from MR
The only initial state in M is the initial state
of MR0
The final state of each MRj becomes a final state
in M
Each sub-automaton MRj is connected to MRj+1 by
transitions representing the selected edit
operations, and their costs (assumed 1 for
simplicity in the paper)
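
The layered construction can be sketched as follows. This Python sketch is a simplification under stated assumptions: it uses levels 0 to d, covers only insertion, deletion and substitution at cost 1 (no inversion or transposition), and the names approximate_automaton and min_cost are illustrative rather than taken from the paper.

```python
import heapq

def approximate_automaton(nfa, alphabet, d):
    """Weighted transitions ((s, j), symbol, cost, (t, k)); symbol None reads nothing."""
    states = {s for s, _, _ in nfa} | {t for _, _, t in nfa}
    out = []
    for j in range(d + 1):
        for s, a, t in nfa:
            out.append(((s, j), a, 0, (t, j)))             # exact match inside copy j
        if j == d:
            continue                                       # last copy has no way up
        for s, a, t in nfa:
            out.append(((s, j), None, 1, (t, j + 1)))      # deletion of a
            out.extend(((s, j), b, 1, (t, j + 1))          # substitution of a by b
                       for b in alphabet if b != a)
        for s in states:
            out.extend(((s, j), b, 1, (s, j + 1)) for b in alphabet)  # insertion of b
    return out

def min_cost(transitions, word, initial, finals):
    """Cheapest cost at which the weighted automaton accepts `word`, or None."""
    seen = set()
    heap = [(0, 0, initial)]                               # (cost, position, state)
    while heap:
        cost, pos, state = heapq.heappop(heap)
        if (pos, state) in seen:
            continue
        seen.add((pos, state))
        if pos == len(word) and state in finals:
            return cost
        for src, sym, w, dst in transitions:
            if src != state:
                continue
            if sym is None:
                heapq.heappush(heap, (cost + w, pos, dst))
            elif pos < len(word) and word[pos] == sym:
                heapq.heappush(heap, (cost + w, pos + 1, dst))
    return None

NFA = [("s0", "airplane", "sf")]                           # NFA for R = airplane
ALPHABET = {"airplane", "name", "name-"}
M2 = approximate_automaton(NFA, ALPHABET, 2)
FINALS2 = {("sf", j) for j in range(3)}
print(min_cost(M2, ["name-", "airplane", "name"], ("s0", 0), FINALS2))
```

The final state of every level is final in the combined automaton, so the level at which a word is first accepted gives its distance from R.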

19
Evaluation naive

Form the product automaton H = M x G
viewing each node in the input graph G = (V,E) as
both an initial
and a final state
3a. If Q is of the form (n,Y) ← (n,R,Y) for some
node n of G, then perform a uniform cost
traversal of graph H, starting from node (s00,n)
where s00 is the initial state of MR0
We keep a list of visited nodes of H, so no node
is visited twice.
Whenever a node (sfj,m) is encountered (where
sfj is the final state of some MRj ), we output
m.
The distance of m to Q is given by the total
cost of the path from (s00,n) to (sfj,m) in the
traversal tree.

20
Evaluation naive

3b. If Q is of the form (X,Y) ← (X,R,Y)
it can be evaluated by answering the query
(n,Y) ← (n,R,Y)
for each node n of G
Lemma 2 of the paper states that the time to
compute the approximate answer is polynomial in
|V|, |E| and |R|

21
Evaluation incremental

The edges of graph H = M x G can be computed
incrementally, avoiding pre-computation and
materialisation of the entire H
For any state si and node n of G, succ(si, n)
outputs the set of transitions which would be the
successors of (si, n) in H
succ calls nextStates(MR,s,c) to return the set
of states in MR reachable from state si on
reading input c; this input is obtained
from the edges in G adjacent to n for normal
traversal, edge reversal and symbol insertion,
from symbols in Σ for symbol deletion, and
from edges in G adjacent to n, plus a further hop
of edge traversals in G, for transpositions

22
Evaluation incremental

Incremental evaluation proceeds by
Constructing the NFA MR for R
Initialising to empty the set visitedR of triples
(v,n,s) stating that node n in G was visited in
state s starting from node v
Initialising a priority queue QR with quadruples
of the form (v,v,s0,0) for each node v in G
(unless X = n in the query, in which case only
(n,n,s0,0) is enqueued)
the fourth argument is the current distance, d
initially, d = 0
subsequently, quadruples are added to QR in order
of increasing d
Repeatedly calling the function getNext (X,R,Y)
to return the next answer tuple for the conjunct
(X,R,Y), in ranked order

23
Evaluation incremental

getNext (X,R,Y)
while QR is non-empty, this
de-queues a tuple (v,n,s,d) from QR where d is
the distance associated with visiting node n in
state s of MR having started from node v
adds (v,n,s) to visitedR
if s is a final state then getNext returns
triple (v,n,d)
otherwise, succ(s,n) is called, returning the set
of transitions (c,w) and states (s',m) which are
the successors of (s,n) in H
those states (s',m) such that (v,m,s') is already
in visitedR are ignored
for all other states, (v,m,s',d+w) is added to QR
24
Example 4 transport network data
Suppose that the only query edits allowable are
insertion of name or name-, and inversion of
airplane. Find cities reachable from Santiago by
plane
?Y ← (Santiago, (airplane)+, ?Y)
25
?Y ← (Santiago, (airplane)+, ?Y)

Enqueue (Santiago,Santiago, s0,0)
This is de-queued, and succ(s0,Santiago) is
called, which returns transition (name-,1) and
state (s01,u1)
(Santiago,u1,s01,1) is enqueued
(Santiago,u1,s01,1) is de-queued, and
succ(s01,u1) is called; this returns transition
(airplane,0) and state (sf1,u4), and
transition (airplane,0) and state (sf1,u7)
(Santiago,u4,sf1,1) and (Santiago,u7,sf1,1)
are enqueued
These are successively de-queued, resulting in
(Santiago,u4, 1) and (Santiago,u7, 1) being
successively returned by getNext
Computation continues in this way, until all
answer tuples have been returned
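
The incremental evaluation traced above can be sketched as a Python generator that yields answers in ranked order. Several things here are assumptions made for illustration: the three-edge graph is a guess at the relevant fragment of the transport network, the query is read as one or more airplane edges, the distance is kept in the queue entry rather than in the state name, final states are also expanded, and the names inverse, succ and ranked_answers are not from the paper. Only insertion and inversion are modelled, as in Example 4.

```python
import heapq

def inverse(label):
    """'a' <-> 'a-': the label seen when an edge is traversed backwards."""
    return label[:-1] if label.endswith("-") else label + "-"

def succ(nfa, edges, insertable, invertible, s, n):
    """Successors (cost, state, node) of (s, n) in H = M x G, computed on demand."""
    steps = [(l, dst) for src, l, dst in edges if src == n]
    steps += [(inverse(l), src) for src, l, dst in edges if dst == n]
    for label, m in steps:
        for t in nfa.get((s, label), ()):            # normal traversal, cost 0
            yield 0, t, m
        if label in insertable:                      # insertion of label, cost 1
            yield 1, s, m
        if inverse(label) in invertible:             # inversion, cost 1
            for t in nfa.get((s, inverse(label)), ()):
                yield 1, t, m

def ranked_answers(nfa, s0, finals, edges, start, insertable=(), invertible=()):
    """Yield (start, node, distance) in order of increasing distance."""
    visited = set()
    queue = [(0, start, start, s0)]                  # (d, v, n, s), d first for the heap
    while queue:
        d, v, n, s = heapq.heappop(queue)
        if (v, n, s) in visited:
            continue
        visited.add((v, n, s))
        if s in finals:
            yield v, n, d
        for w, t, m in succ(nfa, edges, insertable, invertible, s, n):
            if (v, m, t) not in visited:
                heapq.heappush(queue, (d + w, v, m, t))

# Hypothetical fragment of the transport network.
EDGES = [("u1", "name", "Santiago"), ("u1", "airplane", "u4"), ("u1", "airplane", "u7")]
# NFA for (airplane)+ as {(state, symbol): next states}.
NFA = {("s0", "airplane"): {"sf"}, ("sf", "airplane"): {"sf"}}

answers = list(ranked_answers(NFA, "s0", {"sf"}, EDGES, "Santiago",
                              insertable={"name", "name-"}, invertible={"airplane"}))
print(answers)
```

On this toy graph u4 and u7 come out first at distance 1, and u1 follows at distance 2 through an inverted airplane edge. Because the function is a generator, a top-k answer only needs the first k items.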

31
5. Multi-conjunct queries

For a general conjunctive regular path query
Z1, ..., Zm ← (X1, R1, Y1), ..., (Xn, Rn, Yn)
Given a matching θ from variables to the nodes of
graph G, the tuple θ(Z1, ..., Zm) has distance
dist(p1,R1) + ... + dist(pn,Rn)
to Q, where each pi is a semipath from θ(Xi) to
θ(Yi) which has the minimum distance to Ri of any
semipath from θ(Xi) to θ(Yi)
The approximate top-k answer of Q on G is a list
containing the k tuples θ(Z1, ..., Zm) with
minimum distance to Q, ranked in order of
increasing distance to Q
The approximate answer of Q on G is a list
containing all the tuples at any distance to Q,
ranked in order of increasing distance to Q

32
Multi-conjunct queries

To ensure polynomial time evaluation, we require
that the conjuncts of Q are acyclic
This implies the existence of a join tree induced
by the conjuncts of Q
We use the hash ripple join algorithm of Ilyas,
Aref, Elmagarmid 2004 to incrementally evaluate Q
For each conjunct (Xi ,Ri ,Yi) of Q, we use our
incremental evaluation algorithm for
single-conjunct queries to compute a relation ri
containing triples (n,m,d) where d is the minimum
distance to Ri of any semipath from node n to
node m in G

33
Multi-conjunct query evaluation

Construct the evaluation tree E of Q
Initialise data structures calling recursively
the procedure open starting at root of E
for each node of E that is a join operator, hash
tables are built for its left and right subtree
(LN and RN), its threshold value is set to 0,
and an (initially empty) priority queue is
allocated for the node
for each node of E that is a conjunct (X,R,Y),
the same initialisations as earlier are performed
construct the NFA MR for R
set visitedR to empty and d to 0
initialise the priority queue QR

34
Multi-conjunct query evaluation

Incremental evaluation proceeds by calling a
function getNext with the root of E
If its argument is a conjunct, getNext is as
discussed earlier for single-conjunct queries
If its argument is a join operator, getNext
chooses (by some heuristic) one of the two join
operands, I, from which to retrieve a tuple, by
recursively invoking getNext
Itop is set to the distance value of the first
retrieved tuple from I, and Ibottom is updated
with the distance value of the most recently
retrieved tuple from I
The threshold value of the current node is
min(LNtop + RNbottom, RNtop + LNbottom)
which is the lowest possible distance for any
join tuple yet to be computed

35
Multi-conjunct query evaluation join operator

The current tuple, t, retrieved from I is
inserted into I's hash table, and the other hash
table is probed with t to find possible join
combinations with t
For each such tuple s and join tuple u, the
distance of u from Q is set to the sum of the
distances of t and s from Q, and u is added to
the node's priority queue
This process of generating and enqueueing join
tuples repeats while the priority queue remains
empty, or the distance value of the first item on
the priority queue is greater than the current
threshold value of the node
Finally, getNext returns the first item on the
priority queue
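
The behaviour of a join operator can be sketched for two ranked inputs as below. This is a simplified Python illustration, not the algorithm of Ilyas, Aref and Elmagarmid as published: inputs are iterators of (key, distance) pairs in increasing distance, the join is on equal keys, the "heuristic" for choosing an operand is plain alternation, and the name rank_join is an assumption.

```python
import heapq
from collections import defaultdict

INF = float("inf")

def rank_join(left, right):
    """Yield (key, distance) join results in order of increasing summed distance."""
    inputs = [iter(left), iter(right)]
    tables = [defaultdict(list), defaultdict(list)]   # one hash table per operand
    top, bottom = [None, None], [None, None]
    done = [False, False]
    queue, turn = [], 0
    while True:
        if top[0] is None or top[1] is None:
            threshold = 0                             # an operand is still unread
        else:
            threshold = min(top[0] + bottom[1], top[1] + bottom[0])
        if queue and queue[0][0] <= threshold:
            d, key = heapq.heappop(queue)             # no cheaper join tuple can appear
            yield key, d
            continue
        if done[0] and done[1]:
            return
        i = turn if not done[turn] else 1 - turn      # alternate between operands
        turn = 1 - i
        try:
            key, d = next(inputs[i])
        except StopIteration:
            done[i], bottom[i] = True, INF            # nothing more from operand i
            if top[i] is None:
                top[i] = INF
            continue
        if top[i] is None:
            top[i] = d
        bottom[i] = d
        tables[i][key].append(d)
        for other in tables[1 - i].get(key, ()):      # probe the other hash table
            heapq.heappush(queue, (d + other, key))

# Hypothetical ranked bindings of a shared variable from two conjuncts.
left = [("u1", 0), ("u4", 1), ("u2", 3)]
right = [("u4", 0), ("u1", 2), ("u2", 2)]
joined = list(rank_join(left, right))
print(joined)
```

A join tuple is released only once its distance is no greater than the threshold, which is what allows results to be returned in ranked order before either input has been fully read.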

36
6. Conclusions and future work

The paper has explored the use of weighted
regular transducers and conjunctive regular path
queries in a framework for approximate querying
of graph-structured data
For single-conjunct queries we have shown how
approximate answers can be computed in polynomial
time in the size of the query and the graph
We have also shown how answers can be computed
incrementally and returned in ranked order
We have generalised the treatment to
multi-conjunct queries, showing that incremental
computation can still be achieved in polynomial
time provided the queries are acyclic

37
Conclusions and future work

There are several directions of future work
Implementation of our algorithms (ongoing),
determination of their practical utility and
efficiency, development and empirical evaluation
of optimisations
Application in case studies e.g. RDF linked data
arising in a variety of domains
Design of end-user tools for approximate querying
of semi-structured data so that users can
specify their query approximation requirements
Extending the expressiveness of our query
language, to allow path variables and predicates
on paths

38
Acknowledgements

Many thanks go to Petra Selmer for her
implementation of the incremental evaluation
algorithm, and the screenshots.

39
Corrections

Section 2.3 should state that there are O(|R|)
transitions between successive sub-automata for
transpositions (because only adjacent symbols can
be transposed)
Lemma 1(i) should therefore state that M has
size
O(d (R ? R R))
Examples 3 and 4 return one more answer at
distance 2 than shown, namely (sf2 ,u1) which is
reachable from (sf1 ,u4) by a transition
(airplane- ,1) (and also from (sf1,u7) by a
similar transition)

40
Corrections (contd)

There is also a mistake in our calculations in
Lemma 2 of the paper and the correct expression
is O(|V| |E|^3 |R|)
If we assume that Σ contains only labels
appearing on edges in G, then the size of the
approximation automaton M of R at distance
RE is O(|E|^2 |R|), from Lemma 1.
The size of H = M x G is O(|E|^3 |R|), since we
can discard disconnected nodes from H.
Computing the approximate answer in the worst
case requires |V| traversals of H, each at cost
equal to the size of H, i.e. a cost of
O(|V| |E|^3 |R|).