Ranking Approximate Answers to Semantic Web Queries
Carlos Hurtado, Alex Poulovassilis, Peter Wood

1
Ranking Approximate Answers to Semantic Web Queries
Carlos Hurtado (1), Alex Poulovassilis (2), Peter Wood (2)
(1) University Adolfo Ibanez, Chile
(2) Birkbeck, University of London
2
Outline of the talk
  • Motivation
  • Overview of our approach
  • Single-conjunct queries: exact semantics
  • Approximate semantics
  • Multi-conjunct queries
  • Conclusions and future work

3
1. Motivation
  • Volumes of semi-structured data available on the
    web
  • In particular, increase in the amount of RDF data
    e.g. in the form of linked data
  • The volume and heterogeneity of such data
    necessitate support for approximate query
    answering, so that
  • users' queries do not have to match exactly the
    data structures being queried
  • answers to queries are returned in ranked order,
    in increasing distance from the original query

4
2. Overview of our approach
  • We consider general semi-structured data,
    modelled as a graph structure e.g. RDF linked
    data is one kind of data that can be represented
    this way
  • Our model is a directed graph G = (V, E) where
  • each node in V is labelled with a constant (so
    blank nodes cannot be represented)
  • each edge e in E is labelled with a label l(e)
    from a finite alphabet Σ
  • Our query language is that of conjunctive regular
    path queries
  • Z1, ..., Zm ← (X1, R1, Y1), ..., (Xn, Rn, Yn)
  • where the Xi, Yi are variables or constants,
    the Ri are regular expressions over Σ and the Zi
    are drawn from the Xi and Yi

5
Example 1: RDF graph of a transport network
6
Find cities from which we can travel to city u5
using only airplanes, as well as to city u6 using
only trains or buses:
?X ← (?X, (airplane)+, u5), (?X, (train|bus)+, u6)
7
  • Answer
  • First conjunct generates bindings u1, u4 for ?X
  • Second conjunct generates bindings u1, u2, u4 for
    ?X
  • Hence answer is u1, u4
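
Since both conjuncts share the variable ?X, the exact answer is the set of bindings common to both. A minimal Python sketch of that step follows; the variable names are illustrative assumptions, and the bindings are the ones listed on this slide.

```python
# Bindings for ?X produced by each conjunct (taken from the slide).
airplane_bindings = {"u1", "u4"}
train_or_bus_bindings = {"u1", "u2", "u4"}

# Both conjuncts constrain the same variable, so the join is an intersection.
answer = airplane_bindings & train_or_bus_bindings
```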

8
Approximate answers
  • We are interested in using weighted regular
    transducers to capture query approximations
    since, from results by Grahne and Thomo 2001, we
    know that single-conjunct queries with a weighted
    regular transducer applied can be evaluated
    incrementally in polynomial time
  • Incremental evaluation allows answers to be
    returned to the user in ranked order
  • In this paper, we extend this approach to
    also include symbol inversion, and we show that
    multi-conjunct queries can also be evaluated
    in polynomial time, using an algorithm from
    Ilyas, Aref, Elmagarmid 2004 for computing top-k
    join queries

9
Weighted regular transducers
  • A weighted regular transducer is a Finite State
    Automaton in which the transitions are labelled
    with triples rather than single symbols
  • a transition from state s to state t labelled
    (a,i,b) means that if the transducer is in state
    s then it can move to state t on input a with
    cost i while outputting b
  • in our context, such a transition is interpreted
    as stating that symbol a in a query can match
    label b of an edge in the graph with cost i
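
One way to picture such transitions is as plain records, as in the Python sketch below. It is only an illustration: the names Transition, moves and TRANSITIONS are assumptions, and the two sample transitions (an exact match of airplane at cost 0, and an inversion of airplane at cost 1) are chosen to mirror the edit operations discussed in this talk.

```python
from typing import NamedTuple


class Transition(NamedTuple):
    """A transition from state s to state t labelled (a, i, b)."""
    source: str        # state s
    query_symbol: str  # input symbol a (appears in the query)
    cost: int          # cost i
    edge_label: str    # output symbol b (appears on a graph edge)
    target: str        # state t


def moves(transitions, state, query_symbol, edge_label):
    """Return (target, cost) for each transition that lets the query
    symbol match the edge label when the transducer is in `state`."""
    return [
        (t.target, t.cost)
        for t in transitions
        if (t.source, t.query_symbol, t.edge_label)
        == (state, query_symbol, edge_label)
    ]


# Hypothetical one-state transducer: airplane matches itself for free,
# and matches a reversed airplane edge (airplane-) at cost 1.
TRANSITIONS = [
    Transition("s", "airplane", 0, "airplane", "s"),
    Transition("s", "airplane", 1, "airplane-", "s"),
]
```

For instance, moves(TRANSITIONS, "s", "airplane", "airplane-") reports that the match is possible at cost 1, while an edge label with no matching transition yields an empty list.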

10
Approximate regular expression matching
  • In the paper, for simplicity we mainly focus on
    approximate regular expression matching, which
    can be specified using weighted regular
    transducers (Grahne, Thomo 2001)
  • The edit operations we allow are
  • insertions, deletions and substitutions of
    symbols
  • inversion of symbols (i.e. edge reversal)
  • transposition of adjacent symbols
  • We envisage the user being able to specify which
    edit operations should be undertaken by the
    system when answering a particular query, or in a
    particular application
  • The user could also specify the cost associated
    with applying each edit operation (in the paper
    we assume a cost of 1 for all of them)

11
Example 2: transport network data
12
Find cities reachable from Santiago by non-stop
flights, posed by a user who has little knowledge
of the structure of the data:
?X ← (Santiago, airplane, ?X)
13
  • The query as posed returns no answers
  • ?X ← (Santiago, airplane, ?X)
  • However, the query can be relaxed, by an
    insertion of name, to
  • ?X ← (Santiago, airplane . name, ?X)
  • And further relaxed, by an insertion of name-, to
  • ?X ← (Santiago, name- . airplane . name, ?X)
  • This generates bindings of Temuco, Chillan for ?X
  • These answers can be regarded as having distance
    2 from the original query
  • two insertions to the original query
  • each at an assumed cost of 1

14
3. Single-conjunct queries
  • A single-conjunct query, Q, is of the form
  • Z1, Z2 ← (X, R, Y)
  • A semipath p in graph G is a sequence of the
    form
  • v1, l1, v2, l2, ..., vn, ln, v(n+1)
  • where for each vi, v(i+1) there is an edge vi
    → v(i+1) labelled li or an edge v(i+1) → vi
    labelled li- in G
  • Semipath p conforms to regular expression R if
    l1 ... ln is in the language denoted by R
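
The two definitions above can be sketched in Python as follows. This is a rough illustration under stated assumptions: the graph is a set of (source, label, target) triples, a semipath is the flat list [v1, l1, v2, ..., ln, v(n+1)], a label ending in "-" denotes a reversed edge, and the regular expression is written as a Python pattern in which every label is followed by a single space. The function names and the tiny graph are hypothetical.

```python
import re


def is_semipath(edges, path):
    """Check that each step follows an edge of the graph, either
    forwards (label a) or backwards (label a-)."""
    nodes, labels = path[0::2], path[1::2]
    if len(nodes) != len(labels) + 1:
        return False
    for v, label, w in zip(nodes, labels, nodes[1:]):
        forward = (v, label, w) in edges
        backward = label.endswith("-") and (w, label[:-1], v) in edges
        if not (forward or backward):
            return False
    return True


def conforms(path, pattern):
    """Check that the label sequence l1 ... ln is in the language of
    `pattern` (a Python regex over space-terminated labels)."""
    word = "".join(label + " " for label in path[1::2])
    return re.fullmatch(pattern, word) is not None


# Hypothetical fragment of a transport graph.
EDGES = {
    ("u1", "name", "Santiago"),
    ("u1", "airplane", "u4"),
    ("u4", "name", "Temuco"),
}
PATH = ["Santiago", "name-", "u1", "airplane", "u4", "name", "Temuco"]
```

With this encoding, the relaxed expression name- . airplane . name becomes the pattern "name- airplane name ", and a repetition such as one or more airplane edges becomes "(airplane )+".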

15
Exact Semantics
  • Given a single-conjunct query Q,
  • Z1, Z2 ← (X, R, Y)
  • Let θ be a matching from {X, Y} to the nodes of
    graph G, that maps each constant to itself
  • The exact answer of Q on G is the set of tuples
    θ(Z1, Z2) such that there is a semipath from
    θ(X) to θ(Y) which conforms to R

16
4. Approximate Semantics
  • The edit distance from a semipath p to a
    semipath q is the minimum cost of any sequence of
    edit operations which transforms the sequence of
    edge labels of p to the sequence of edge labels
    of q
  • We recall that the edit operations we allow are
    insertions, deletions, substitutions and
    inversions of symbols, and transposition of
    adjacent symbols
  • We envisage the user being able to specify which
    edit operations should be applied by the system
    when answering a particular query, or in a
    particular application
  • The user could also specify the cost associated
    with applying each edit operation (in the paper
    we assume a cost of 1 for all of them)
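
As an illustration of the edit distance between two label sequences, here is a minimal Python sketch of the classic dynamic programme. It is a simplification, not the paper's construction: it covers only insertions, deletions and substitutions at the assumed unit cost, and leaves out inversions and transpositions. The function name is an assumption.

```python
def edit_distance(source, target):
    """Minimum number of unit-cost insertions, deletions and
    substitutions turning the label sequence `source` into `target`."""
    # previous[j] = distance between the labels of source processed
    # so far and the first j labels of target.
    previous = list(range(len(target) + 1))
    for i, a in enumerate(source, start=1):
        current = [i]
        for j, b in enumerate(target, start=1):
            current.append(min(
                previous[j] + 1,             # delete a
                current[j - 1] + 1,          # insert b
                previous[j - 1] + (a != b),  # substitute a by b, or keep
            ))
        previous = current
    return previous[-1]
```

For the relaxation in Example 2, edit_distance(["airplane"], ["name-", "airplane", "name"]) evaluates to 2, which corresponds to the two insertions.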

17
Approximate Semantics
  • The distance of a semipath p to a regular
    expression R, dist(p,R), is the minimum edit
    distance from p to any semipath that conforms to
    R
  • Given graph G, query Q and matching θ, the tuple
    θ(Z1, Z2) has distance dist(p,R) to Q, where p is
    a semipath from θ(X) to θ(Y) which has the
    minimum distance to R of any semipath from θ(X)
    to θ(Y) in G
  • note, if p conforms to R, then θ(Z1, Z2) has
    distance 0 to Q
  • The approximate top-k answer of Q on G is a list
    containing the k tuples θ(Z1, Z2) with minimum
    distance to Q, ranked in order of increasing
    distance to Q
  • The approximate answer of Q on G is a list
    containing all the tuples at any distance to Q,
    ranked in order of increasing distance to Q (a
    maximum of O(E^2) tuples).
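
Once each tuple has a distance, the top-k answer is simply the k smallest entries in ranked order. The Python sketch below shows that selection step only; it assumes the distances are already known, whereas the talk computes them incrementally. The function name and the sample distances are hypothetical.

```python
import heapq


def top_k(distances, k):
    """Return the k (tuple, distance) pairs with the smallest distance,
    in order of increasing distance."""
    return heapq.nsmallest(k, distances.items(), key=lambda item: item[1])


# Hypothetical distances of answer tuples to a query.
DISTANCES = {
    ("Santiago", "Temuco"): 2,
    ("Santiago", "u4"): 1,
    ("Santiago", "Chillan"): 2,
    ("Santiago", "u7"): 1,
}
```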

18
Evaluation: naive
  • Construct approximate automaton M at distance d
    RE using a standard construction from
    approximate string matching
  • note, RE is the maximum distance required to
    obtain all tuples in the approximate answer
    (Lemma 1)
  • M consists of d copies of MR, the NFA that
    recognises L(R)
  • Each copy MR^j, where 0 ≤ j ≤ d, represents
    states at distance j from MR
  • The only initial state in M is the initial state
    of MR^0
  • The final state of each MR^j becomes a final state
    in M
  • Each sub-automaton MR^j is connected to MR^(j+1) by
    transitions representing the selected edit
    operations, and their costs (assumed 1 for
    simplicity in the paper)

19
Evaluation: naive
  • Form the product automaton H = M × G,
    viewing each node in the input graph G = (V, E)
    as both an initial and a final state
  • 3a. If Q is of the form (n, Y) ← (n, R, Y) for
    some node n of G, then perform a uniform cost
    traversal of graph H, starting from node
    (s0^0, n), where s0^0 is the initial state of MR^0
  • We keep a list of visited nodes of H, so no node
    is visited twice.
  • Whenever a node (sf^j, m) is encountered (where
    sf^j is the final state of some MR^j), we output
    m.
  • The distance of m to Q is given by the total
    cost of the path from (s0^0, n) to (sf^j, m) in
    the traversal tree.

20
Evaluation: naive
  • 3b. If Q is of the form (X, Y) ← (X, R, Y),
    it can be evaluated by answering the query
    (n, Y) ← (n, R, Y)
    for each node n of G
  • Lemma 2 of the paper states that the time to
    compute the approximate answer is polynomial in
    V, E and R

21
Evaluation: incremental
  • The edges of graph H = M × G can be computed
    incrementally, avoiding pre-computation and
    materialisation of the entire H
  • For any state si and node n of G, succ(si, n)
    outputs the set of transitions which would be the
    successors of (si, n) in H
  • succ calls nextStates(MR,s,c) to return the set
    of states in MR reachable from state si on
    reading input c; this input is obtained from
  • the edges in G adjacent to n, for normal
    traversal, edge reversal and symbol insertion,
  • symbols in Σ, for symbol deletion, and
  • edges in G adjacent to n, plus a further hop
    of edge traversals in G, for transpositions

22
Evaluation: incremental
  • Incremental evaluation proceeds by
  • Constructing the NFA MR for R
  • Initialising to empty the set visitedR of triples
    (v, n, s) stating that node n in G was visited in
    state s starting from node v
  • Initialising a priority queue QR with quadruples
    of the form (v, v, s0, 0) for each node v in G
    (unless X = n in the query, in which case only
    (n, n, s0, 0) is enqueued)
  • the fourth argument is the current distance, d
  • initially, d = 0
  • subsequently, quadruples are added to QR in order
    of increasing d
  • Repeatedly calling the function getNext(X, R, Y)
    to return the next answer tuple for the conjunct
    (X, R, Y), in ranked order

23
Evaluation: incremental
  • getNext(X, R, Y)
  • while QR is non-empty, this
  • de-queues a tuple (v, n, s, d) from QR, where d is
    the distance associated with visiting node n in
    state s of MR having started from node v
  • adds (v, n, s) to visitedR
  • if s is a final state then getNext returns
    triple (v, n, d)
  • otherwise, succ(s, n) is called, returning the set
    of transitions (c, w) and states (s', m) which are
    the successors of (s, n) in H
  • those states (s', m) such that (v, m, s') is
    already in visitedR are ignored
  • for all other states, (v, m, s', d + w) is added
    to QR
24
Example 4: transport network data
Suppose that the only query edits allowable are
insertion of name or name-, and inversion of
airplane. Find cities reachable from Santiago by
plane:
?Y ← (Santiago, (airplane)+, ?Y)
25
?Y ← (Santiago, (airplane)+, ?Y)
  • Enqueue (Santiago,Santiago, s0,0)
  • This is de-queued, and succ(s0, Santiago) is
    called, which returns transition (name-, 1) and
    state (s0^1, u1)
  • (Santiago, u1, s0^1, 1) is enqueued
  • (Santiago, u1, s0^1, 1) is de-queued, and
    succ(s0^1, u1) is called; this returns
    transition (airplane, 0) and state (sf^1, u4), and
    transition (airplane, 0) and state (sf^1, u7)
  • (Santiago, u4, sf^1, 1) and (Santiago, u7, sf^1, 1)
    are enqueued
  • These are successively de-queued, resulting in
    (Santiago, u4, 1) and (Santiago, u7, 1) being
    successively returned by getNext
  • Computation continues in this way, until all
    answer tuples have been returned

31
5. Multi-conjunct queries
  • For a general conjunctive regular path query
  • Z1, ..., Zm ← (X1, R1, Y1), ..., (Xn, Rn, Yn)
  • Given a matching θ from variables to the nodes of
    graph G, the tuple θ(Z1, ..., Zm) has distance
  • dist(p1, R1) + ... + dist(pn, Rn)
  • to Q, where each pi is a semipath from θ(Xi) to
    θ(Yi) which has the minimum distance to Ri of any
    semipath from θ(Xi) to θ(Yi)
  • The approximate top-k answer of Q on G is a list
    containing the k tuples ?(Z1, ...,Zm) with
    minimum distance to Q, ranked in order of
    increasing distance to Q
  • The approximate answer of Q on G is a list
    containing all the tuples at any distance to Q,
    ranked in order of increasing distance to Q

32
Multi-conjunct queries
  • To ensure polynomial time evaluation, we require
    that the conjuncts of Q are acyclic
  • This implies the existence of a join tree induced
    by the conjuncts of Q
  • We use the hash ripple join algorithm of Ilyas,
    Aref, Elmagarmid 2004 to incrementally evaluate Q
  • For each conjunct (Xi ,Ri ,Yi) of Q, we use our
    incremental evaluation algorithm for
    single-conjunct queries to compute a relation ri
    containing triples (n,m,d) where d is the minimum
    distance to Ri of any semipath from node n to
    node m in G

33
Multi-conjunct query evaluation
  • Construct the evaluation tree E of Q
  • Initialise data structures by recursively calling
    the procedure open, starting at the root of E
  • for each node of E that is a join operator, hash
    tables are built for its left and right subtree
    (LN and RN), its threshold value is set to 0,
    and an (initially empty) priority queue is
    allocated for the node
  • for each node of E that is a conjunct (X,R,Y),
    the same initialisations as earlier are performed
  • construct the NFA MR for R
  • set visitedR to empty and d to 0
  • initialise the priority queue QR

34
Multi-conjunct query evaluation
  • Incremental evaluation proceeds by calling a
    function getNext with the root of E
  • If its argument is a conjunct, getNext is as
    discussed earlier for single-conjunct queries
  • If its argument is a join operator, getNext
    chooses (by some heuristic) one of the two join
    operands, I, from which to retrieve a tuple, by
    recursively invoking getNext
  • Itop is set to the distance value of the first
    retrieved tuple from I, and Ibottom is updated
    with the distance value of the most recently
    retrieved tuple from I
  • The threshold value of the current node is
  • min(LNtop + RNbottom, RNtop + LNbottom)
  • which is the lowest possible distance for any
    join tuple yet to be computed
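
The threshold is a small piece of arithmetic, sketched below in Python. The reasoning is that a join tuple not yet produced must combine either an unseen left tuple (distance at least LNbottom) with some right tuple (at least RNtop), or an unseen right tuple with some left tuple. The function and parameter names are assumptions.

```python
def threshold(ln_top, ln_bottom, rn_top, rn_bottom):
    """Lower bound on the distance of any join tuple not yet computed.

    top    = distance of the first tuple retrieved from an input
    bottom = distance of the most recently retrieved tuple
    """
    return min(ln_top + rn_bottom, rn_top + ln_bottom)
```

For example, with LNtop = 1, LNbottom = 3, RNtop = 2 and RNbottom = 2 the threshold is min(1 + 2, 2 + 3) = 3, so any queued join tuple with distance at most 3 can safely be returned.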

35
Multi-conjunct query evaluation: join operator
  • The current tuple, t, retrieved from I is
    inserted into I's hash table, and the other hash
    table is probed with t to find possible join
    combinations with t
  • For each such tuple s and join tuple u, the
    distance of u from Q is set to the sum of the
    distances of t and s from Q, and u is added to
    the node's priority queue
  • This process of generating and enqueueing join
    tuples repeats while the priority queue remains
    empty, or the distance value of the first item on
    the priority queue is greater than the current
    threshold value of the node
  • Finally, getNext returns the first item on the
    priority queue
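
The behaviour of a join operator can be sketched in Python as below. This is a simplified illustration rather than the algorithm of Ilyas, Aref and Elmagarmid as published: each input is assumed to deliver (key, distance) pairs in non-decreasing distance, tuples join when their keys are equal, and the inputs are read alternately instead of by a tuned heuristic. The names and sample inputs are hypothetical.

```python
import heapq
from itertools import count

INF = float("inf")


def rank_join(left, right):
    """Yield (key, total_distance) join results in non-decreasing order."""
    inputs = [iter(left), iter(right)]
    tables = [{}, {}]             # one hash table per input
    top, bottom = [0, 0], [0, 0]  # 0 is a safe lower bound before any read
    seen, done = [False, False], [False, False]
    queue, tie, turn = [], count(), 0
    while True:
        # Lowest possible distance of a join tuple not yet generated.
        limit = min(top[0] + bottom[1], top[1] + bottom[0])
        if queue and queue[0][0] <= limit:
            dist, _, key = heapq.heappop(queue)
            yield key, dist
            continue
        if all(done):
            return
        i = turn if not done[turn] else 1 - turn  # pick an open input
        turn = 1 - i
        try:
            key, dist = next(inputs[i])
        except StopIteration:
            done[i] = True
            bottom[i] = INF       # no unseen tuples remain on this side
            if not seen[i]:
                top[i] = INF      # this side was empty
            continue
        if not seen[i]:
            seen[i], top[i] = True, dist
        bottom[i] = dist
        tables[i].setdefault(key, []).append(dist)
        for other in tables[1 - i].get(key, ()):  # probe the other table
            heapq.heappush(queue, (dist + other, next(tie), key))


# Hypothetical ranked bindings for a shared variable from two conjuncts.
LEFT = [("u1", 0), ("u4", 0), ("u2", 1)]
RIGHT = [("u1", 0), ("u2", 0), ("u4", 0), ("u3", 2)]
JOINED = list(rank_join(LEFT, RIGHT))
```

The first results can be emitted before either input is fully read, which is what allows multi-conjunct answers to be returned incrementally.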

36
6. Conclusions and future work
  • The paper has explored the use of weighted
    regular transducers and conjunctive regular path
    queries in a framework for approximate querying
    of graph-structured data
  • For single-conjunct queries we have shown how
    approximate answers can be computed in polynomial
    time in the size of the query and the graph
  • We have also shown how answers can be computed
    incrementally and returned in ranked order
  • We have generalised the treatment to
    multi-conjunct queries, showing that incremental
    computation can still be achieved in polynomial
    time provided the queries are acyclic

37
Conclusions and future work
  • There are several directions of future work
  • Implementation of our algorithms (ongoing),
    determination of their practical utility and
    efficiency, development and empirical evaluation
    of optimisations
  • Application in case studies e.g. RDF linked data
    arising in a variety of domains
  • Design of end-user tools for approximate querying
    of semi-structured data so that users can
    specify their query approximation requirements
  • Extending the expressiveness of our query
    language, to allow path variables and predicates
    on paths

38
Acknowledgements
  • Many thanks go to Petra Selmer for her
    implementation of the incremental evaluation
    algorithm, and the screenshots.

39
Corrections
  • Section 2.3 should state that there are O(R)
    transitions between successive sub-automata for
    transpositions (because only adjacent symbols can
    be transposed)
  • Lemma 1(i) should therefore state that M has
    size
  • O(d (R ? R R))
  • Examples 3 and 4 return one more answer at
    distance 2 than shown, namely (sf^2, u1), which is
    reachable from (sf^1, u4) by a transition
    (airplane-, 1) (and also from (sf^1, u7) by a
    similar transition)

40
Corrections (contd)
  • There is also a mistake in our calculations in
    Lemma 2 of the paper and the correct expression
    is O(V E^3 R)
  • If we assume that Σ contains only labels
    appearing on edges in G, then the size of the
    approximation automaton M of R at distance
    RE is O(E^2 R), from Lemma 1.
  • The size of H = M × G is O(E^3 R), since we
    can discard disconnected nodes from H.
  • Computing the approximate answer in the worst
    case requires V traversals of H, each at cost
    equal to the size of H, i.e. a cost of O(V E^3
    R).