1
Finding and Approximating Top-k Answers in
Keyword Proximity Search
Benny Kimelfeld and Yehoshua Sagiv
The Selim and Rachel Benin School of Engineering
and Computer Science
The Hebrew University of Jerusalem
2
Keyword Proximity Search (KPS)
  • A paradigm for data extraction
  • Data have varying degrees of structure
  • Relational databases, XML, Web sites
  • Queries are sets of keywords
  • No structural constraints

3
Querying Structure Content by Keywords
  • Keywords appear in different parts of the data
  • Answers show occurrences of keywords, as well as the
    associations among these occurrences
  • Proximity of the keywords in the answer indicates
    a close (strong) semantic association among them


4
Past Work on KPS (Keyword Proximity Search)
  • DataSpot (SIGMOD 1998)
  • Information Units (WWW 2001)
  • BANKS (ICDE 2002, VLDB 2005)
  • DISCOVER (VLDB 2002)
  • DBXplorer (ICDE 2002)
  • XKeyword (ICDE 2003)

5
The Goal of this Paper
  • Devise efficient algorithms for finding
    high-quality answers in keyword proximity search

6
Contents
  • Introduction
  • Formal Setting
  • The Main Results
  • Enumerating in the Exact Order
  • Enumerating in an Approximate Order
  • Conclusion and Future Work

7
Contents
  • Introduction
  • Formal Setting
  • The Main Results
  • Enumerating in the Exact Order
  • Enumerating in an Approximate Order
  • Conclusion and Future Work

8
Data Graphs
  • Structural and keyword nodes
  • Edges may have weights
  • Weak relationships are penalized by high weights

9
Queries
Queries are sets of keywords from the data graph
Q = {Summers, Cohen, coffee}
10
Query Answers
11
Query Answers
  • An answer is a directed subtree of the data graph
  • Contains all keywords of the query
  • Has no redundant edges (and nodes)

The keywords of the query are the leaves
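The three conditions above can be sketched as a small Python check. This is only an illustration under stated assumptions: "no redundant edges" is read here as "every leaf is a query keyword and the root has at least two children" (a reading of the slides, not a definition they state), the query is assumed to have at least two keywords, and the name `is_answer` and the edge-list representation are hypothetical.

```python
def is_answer(tree_edges, keywords):
    """Check whether a set of directed edges (u, v) forms an answer.

    Assumed reading of the slide: the edges form a rooted directed tree,
    its leaves are exactly the query keywords, and the root has at least
    two children (otherwise the root edge would be redundant).
    """
    parent, children = {}, {}
    for u, v in tree_edges:
        if v in parent:              # a node with two incoming edges: not a tree
            return False
        parent[v] = u
        children.setdefault(u, []).append(v)

    nodes = set(parent) | set(children)
    roots = [n for n in nodes if n not in parent]
    if len(roots) != 1:              # a directed tree has exactly one root
        return False
    root = roots[0]

    # Every node must be reachable from the root.
    seen, stack = {root}, [root]
    while stack:
        for child in children.get(stack.pop(), []):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    if seen != nodes:
        return False

    leaves = {n for n in nodes if n not in children}
    return leaves == set(keywords) and len(children[root]) >= 2
```

For example, a tree whose root has a single child fails the check, which matches the later remark that such a root would have to be removed before the tree is an answer.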
12
Ranking Inversely Proportional to Weight
$\mathrm{rank}(A) \propto (\mathrm{weight}(A))^{-1}$
Smaller subtrees represent closer associations
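As a small illustration of this ranking, the sketch below assumes that the weight of an answer is the sum of its edge weights and takes the proportionality constant to be 1. Both choices are assumptions made for the example, and the function names are hypothetical.

```python
def weight(tree_edges, edge_weight):
    """Weight of an answer, assumed here to be the sum of its edge weights."""
    return sum(edge_weight[e] for e in tree_edges)


def rank(tree_edges, edge_weight):
    """Rank as the inverse of the weight (proportionality constant 1)."""
    return 1.0 / weight(tree_edges, edge_weight)
```

Under these assumptions a lighter tree always receives a higher rank, so sorting by decreasing rank is the same as sorting by increasing weight.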
13
Enumerating in Exact (Ranked) Order

Then
Top-k Answers
14
Enumerating in a C-Approximate Order
C may be a function of G and Q
Then the first k answers are a
C-approximation of the top-k answers (Fagin et
al., PODS 2001)
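The slide's figure is lost, so the following sketch assumes the usual reading of a C-approximate order: whenever one answer is printed before another, the earlier one is at most C times heavier than the later one. Under that assumption, a sequence of answer weights can be checked in one pass. The function name is hypothetical.

```python
def is_c_approximate_order(weights, c):
    """Return True if the weights appear in a C-approximate order.

    Assumed definition: for every pair i < j, weights[i] <= c * weights[j].
    It suffices to compare each weight with the heaviest weight seen so far.
    """
    heaviest_so_far = 0.0
    for w in weights:
        if heaviest_so_far > c * w:
            return False
        heaviest_so_far = max(heaviest_so_far, w)
    return True
```

With `c = 1` the check reduces to the exact ranked order (weights never decrease).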
15
Polynomial Delay
Yardstick of efficiency: polynomial delay
(polynomial time between generating successive
answers)
There can be exponentially many answers, even for
2 keywords (so it is inefficient to generate all
answers and then sort them)
16
Contents
  • Introduction
  • Formal Setting
  • The Main Results
  • Enumerating in the Exact Order
  • Enumerating in an Approximate Order
  • Conclusion and Future Work

17
Top Answers are Steiner Trees
  • Finding the top answer in KPS (a.k.a. the
    Steiner-tree problem) is intractable
  • Therefore, one cannot enumerate all answers in
    ranked order with polynomial delay
  • However, the top answer can be found efficiently
    under data complexity
  • That is, the number of keywords is fixed
  • Approximations can be found efficiently under
    query-and-data complexity
  • There is a lot of work on Steiner-tree
    approximations

18
So What Can Be Done?
Can answers of KPS be enumerated in the exact
order with polynomial delay, under data
complexity?
Can approximations of Steiner trees be used for
efficiently enumerating in an approximate order
(while preserving the approximation ratio)?
19
Our Results
  • Theorem 1
  • Under data complexity, answers of KPS can be
    enumerated in the exact order with polynomial
    delay

20
Our Results (contd)
  • Theorem 2
  • Under query-and-data complexity, given an
    efficient C-approximation for finding Steiner
    trees, one can enumerate with polynomial delay in
    a (C+1)-approximate order

21
The Meaning of the Results
KPS is tractable under data complexity
All results on Steiner trees can be applied to KPS
Under query-and-data complexity, an efficient
enumeration in an approximate order can be done
with almost the same ratios as Steiner trees
From a theoretical point of view, using
heuristics is not the only option
  • Existing approaches to KPS are heuristics
  • Exponential delay in the worst case
  • No provable nontrivial approximation ratios

22
Contents
  • Introduction
  • Formal Setting
  • The Main Results
  • Enumerating in the Exact Order
  • Enumerating in an Approximate Order
  • Conclusion and Future Work

23
Lawler's Method
  • We use the technique of Lawler (1972), which is
    an iterative method for finding the top-k answers
  • Each iteration generates the next answer by
    finding the top answer under constraints
  • Lawler's method is designed for general
    (discrete) optimization problems
  • When applying it to a specific problem, one needs
    to deal with the following two issues

24
Two Problems to Solve
1. What exactly are the constraints? (That is,
how can we apply Lawler's method so that the
constraints make it possible to find top answers
efficiently?)
2. How can we efficiently find the top answer
under constraints?
25
Solving the First Problem
  • Constraints are subtrees of the graph
  • Pairwise node disjoint
  • Their leaves are exactly the keywords of the
    query

An answer satisfies the constraints if
it contains all the subtrees (i.e., a supertree)
26
Two Problems to Solve (One Left)
1. What exactly are the constraints? (That is,
how can we apply Lawler's method in a way that the
constraints enable finding the top answer
efficiently?)
2. How can we efficiently find the top answer
under constraints?
27
Formulation of the Second Problem
Input: constraints (node-disjoint subtrees,
keywords as leaves)
Objective: a minimal answer satisfying the
constraints (i.e., containing all the subtrees)
Next, an algorithm that solves almost this
problem, namely:
(Almost the same) Objective: a minimal supertree
satisfying the constraints
28
Finding a Minimal Supertree
  • Input: G and T (the constraints, i.e., a set of subtrees)
  • 1. Collapse each of the subtrees of T into a
    node
  • 2. Find a Steiner tree T' of the collapsed
    subtrees
  • 3. Restore the collapsed subtrees in T'

(more details in the proceedings)
29
This is not Enough!
Input: constraints (node-disjoint subtrees,
keywords as leaves)
Objective: a minimal answer satisfying the
constraints (i.e., containing all the subtrees)
(Almost the same) Objective: a minimal supertree
satisfying the constraints
30
Query Answers Revisited
  • An answer is a directed subtree of the data graph
  • Contains all keywords of the query
  • Has no redundant edges (and nodes)

Keywords are the leaves
31
An Example
32
An Example
The minimal supertree satisfying the constraints
The minimal answer satisfying the constraints
The minimal answer can be completely different
from the minimal supertree. Furthermore, there can
be no answer even if there is a supertree.
33
What if We Remove Edges of Constraints?
  • What if we first generate a minimal supertree and
    if the root has only one child, then we just
    remove it (until an answer is obtained)?
  • The constraints are violated, leading to a
    failure of Lawler's method!
  • That is,
  • Some answers will be duplicated
  • While other answers will not be generated at all

34
Our Approach
35
This Process is Repeated
Up to $2^{\#\mathrm{keywords}}$ times (fixed; usually fewer)
36
About the Transformation
  • The details of the exact transformation and the
    proof of correctness are intricate
  • All can be found in the proceedings

This concludes the algorithm for enumerating in
the exact order
37
A Different View: Chain of Reductions
Enumerating answers in ranked order
Finding the top answer under constraints
Finding minimal supertrees
Finding Steiner trees
38
Contents
  • Introduction
  • Formal Setting
  • The Main Results
  • Enumerating in the Exact Order
  • Enumerating in an Approximate Order
  • Conclusion and Future Work

39
Modifying the Chain of Reductions
Enumeration in an approximate order
Finding approximate answers under constraints
Finding approximations of minimal supertrees
Finding approximations of Steiner trees
40
Exact Order Revisited
We cannot allow it under query-and-data
complexity!
Up to $2^{\#\mathrm{keywords}}$ times
41
The Algorithm
Constraints
C times the optimum
1 times the optimum
A C-approximation of the minimal supertree
(collapse and restore)
A minimal answer for 3 or fewer constraints (the
algorithm for the exact order)
42
Combine the Subtrees
The combined subgraph contains an answer
(C+1) times the optimum
C times the optimum
1 times the optimum
A C-approximation of the minimal supertree
(collapse and restore)
A minimal answer for 3 or fewer constraints (the
algorithm for the exact order)
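The (C+1) bound on this slide can be spelled out as a short derivation. The notation is introduced here for illustration only: $\mathrm{opt}$ is the weight of a minimal answer satisfying all the constraints, $T_C$ is the C-approximation of the minimal supertree $T_{\min}$, and $A'$ is the minimal answer for the 3 or fewer constraints. The justifications are a reading of the slide (every answer satisfying the constraints is also a supertree, and the 3 or fewer constraints are assumed to be a subset of all the constraints), not statements taken from the proceedings.

```latex
\begin{align*}
w(T_C) &\le C \cdot w(T_{\min}) \le C \cdot \mathrm{opt}
  && \text{(every answer satisfying the constraints is a supertree)} \\
w(A')  &\le 1 \cdot \mathrm{opt}
  && \text{(fewer constraints can only lower the minimum)} \\
w(T_C \cup A') &\le w(T_C) + w(A') \le (C+1) \cdot \mathrm{opt}
\end{align*}
```

Since the combined subgraph contains an answer, that answer weighs at most $(C+1) \cdot \mathrm{opt}$.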
43
Contents
  • Introduction
  • Formal Setting
  • The Main Results
  • Enumerating in the Exact Order
  • Enumerating in an Approximate Order
  • Conclusion and Future Work

44
Keyword Proximity Search
  • A common paradigm for keyword search over
    structured databases
  • In the formal model
  • Data are directed and weighted graphs
  • Queries are sets of keywords (i.e., nodes) from
    the data graph
  • Query answers are non-redundant subtrees
    containing the keywords of the query
  • The goal is to find the top-k answers, where the
    rank is inversely proportional to the weight
  • A stronger goal: enumeration with polynomial delay

45
Our Results
  • Under data complexity, answers can be enumerated
    in the exact ranked order with polynomial delay
  • Under query-and-data complexity, every efficient
    C-approximation to the Steiner-tree problem
    yields an algorithm for enumerating answers with
    polynomial delay in a (C+1)-approximate order

46
Our Chain of Reductions
Enumerating answers in sorted order
  [Lawler's approach]
Finding the top answer under constraints
  [the intricate part]
Finding minimal supertrees
  [subtree collapse/restore]
Finding Steiner trees
47
Other Variants of KPS
Our algorithms can be adapted to other popular
variants of KPS
48
Undirected Variant
Answers are undirected trees
49
Strong Variant
Answers are undirected trees and keywords are
leaves
50
Open Problems
  • Can we improve the space efficiency of our
    algorithms?
  • Some ranking functions (e.g., height) are easier
    than weight when looking for the top answer (no
    constraints), but
  • The chain of reductions doesn't work
  • The complexity of finding the top answer under
    constraints is unknown
  • Can our results hold for richer queries that also
    have structural constraints?

51
Implementation Considerations
  • Bottlenecks: Steiner-tree algorithms and
    approximations
  • Thin graphs allow in-memory execution of our
    algorithms, even for large XML documents (e.g.,
    DBLP)
  • New and intuitive ranking functions that are
    easier to implement efficiently

52
Related Work: Order vs. Efficiency
(Queries have a fixed size)
This work: exact order, approximate order
Past work: heuristic order (no approximation
guaranteed), no order
53
Thank you.
  • Questions?

54
Illustration of Lawler's Method
55
Lawler's Method (1972)
56
1. Find the Top Answer
In principle, at this point we should find the
second-best answer
But Instead
57
2. Partition the Remaining Answers
58
2. Partition the Remaining Answers
Each partition is defined by a distinct set of
constraints
59
3. Find the Top of each Set
60
4. Find the Second Answer
The second answer is the best among all the top
answers in the partitions
61
5. Further Divide the Chosen Partition
62
And so on
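The five steps above can be sketched on a toy optimization problem. The sketch below is not the KPS algorithm of the talk: it applies Lawler's partitioning idea to a much simpler problem (pick one entry from each row of a cost table so that the total is minimal), where the "top answer under constraints" is trivial to compute. All names are hypothetical, and each row is assumed to be non-empty.

```python
import heapq
from itertools import count


def best_under_constraints(costs, prefix, excluded):
    """Cheapest choice that starts with `prefix` and avoids the indices in
    `excluded` at position len(prefix). Returns (total, choice) or None."""
    p = len(prefix)
    choice = list(prefix)
    if p < len(costs):
        allowed = [i for i in range(len(costs[p])) if i not in excluded]
        if not allowed:
            return None                      # this partition is empty
        choice.append(min(allowed, key=lambda i: costs[p][i]))
        for row in costs[p + 1:]:            # later positions are unconstrained
            choice.append(min(range(len(row)), key=lambda i: row[i]))
    total = sum(costs[j][i] for j, i in enumerate(choice))
    return total, tuple(choice)


def lawler_enumerate(costs):
    """Yield (total, choice) pairs in non-decreasing order of total."""
    tie = count()                            # tie-breaker for the heap
    heap = []

    def push(prefix, excluded):
        best = best_under_constraints(costs, prefix, excluded)
        if best is not None:
            total, choice = best
            heapq.heappush(heap, (total, next(tie), choice, prefix, excluded))

    push((), frozenset())                    # step 1: the top answer overall
    while heap:
        # Step 4: the next answer is the best among the tops of all partitions.
        total, _, choice, prefix, excluded = heapq.heappop(heap)
        yield total, choice
        # Steps 2, 3 and 5: split the chosen partition (minus `choice`) into
        # disjoint sub-partitions and find the top answer of each one.
        p = len(prefix)
        for j in range(p, len(choice)):
            banned = (excluded if j == p else frozenset()) | {choice[j]}
            push(choice[:j], banned)
```

Because the generator is lazy, the top-k answers can be taken with `itertools.islice(lawler_enumerate(costs), k)` without generating the remaining answers.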
63
Adapting Lawler's Method
64
Our Constraints
Inclusion constraints
  • Node-disjoint subtrees of the data graph
  • All the leaves are keywords
  • An answer must contain all the subtrees

Exclusion constraints
  • Edges of the data graph
  • An answer must not contain any of the edges

65
Partitioning a Partition (cont)
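$\mathrm{edges}(A) \setminus I = \{e_1, \ldots, e_k\}$
[Figure: the top answer A with the inclusion constraints I and the exclusion constraints E]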

66
Generating Constraints (intuition)
Constraints (subtrees/edges) are obtained from
existing constraints of the current partition
and the top answer
67
Collapsing Subtrees
68
Collapsing a Subtree
69
1. Remove All Edges and Internal Nodes
Only the root is left
70
2. Remove Incoming Edges of Internal Nodes
71
3. Add Outgoing Edges to the Root
An edge that emanates from an internal node
becomes an outgoing edge of the root
72
More Details
  • When adding an outgoing edge (r,u) to the root,
    the weight of (r,u) is the minimal weight among
    all the edges from the collapsed subtree to u
  • When restoring a subtree, each outgoing edge
    (r,u) of the root is replaced with an (arbitrary)
    original edge from the restored subtree to u,
    with the same weight
  • Incoming edges of internal nodes of the subtree
    are never restored
  • Such edges cannot participate in G-supertrees
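The collapse step described on the last few slides can be sketched as follows. This is a minimal illustration under assumptions: the data graph is a dictionary from directed edges `(u, v)` to weights, "internal nodes" is read as "all nodes of the subtree except its root", and edges from the subtree back to its own root are dropped. The function name and the returned `origin` map (which remembers an original edge for later restoration) are hypothetical.

```python
def collapse_subtree(edges, root, subtree_nodes):
    """Collapse a subtree of a weighted directed graph into its root.

    edges: dict mapping (u, v) to a weight.
    Returns (collapsed_edges, origin), where origin maps each new root edge
    (root, u) to an original edge of minimal weight that it stands for.
    """
    inside = set(subtree_nodes)
    collapsed, origin = {}, {}
    for (u, v), w in edges.items():
        if v in inside and v != root:
            continue                  # incoming edge of a removed node: dropped
        if u in inside:
            if v == root:
                continue              # would become a self-loop on the root
            key = (root, v)           # outgoing edge is re-attached to the root
            if key not in collapsed or w < collapsed[key]:
                collapsed[key] = w    # keep the minimal weight towards v
                origin[key] = (u, v)
        else:
            collapsed[(u, v)] = w     # edge untouched by the collapse
    return collapsed, origin
```

Restoring a subtree would then replace each root edge `(root, u)` that appears in the Steiner tree with `origin[(root, u)]`, which has the same weight by construction.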